We live in a data-intensive, data-driven world! Organizations of all types collect, store, process, and analyze data, and use it to inform and improve their decision-making processes. The AWS Cloud is well-suited to all of these activities; it offers vast amounts of storage, access to any conceivable amount of compute power, and many different types of analytical tools.
In addition to generating and working with data internally, many organizations generate and then share data sets with the general public or within their industry. We made some initial steps to encourage this back in 2008 with the launch of AWS Public Data Sets (Paging Researchers, Analysts, and Developers). That effort has evolved into the Registry of Open Data on AWS (New – Registry of Open Data on AWS (RODA)), which currently contains 118 interesting datasets, with more added all the time.
New AWS Data Exchange
Today, we are taking the next step forward, and are launching AWS Data Exchange. This addition to AWS Marketplace contains over one thousand licensable data products from over 80 data providers. There’s a diverse catalog of free and paid offerings, in categories such as financial services, health care / life sciences, geospatial, weather, and mapping.
If you are a data subscriber, you can quickly find, procure, and start using these products. If you are a data provider, you can easily package, license, and deliver products of your own. Let’s take a look at Data Exchange from both vantage points, and then review some important details.
Let’s define a few important terms before diving in:
Data Provider – An organization that has one or more data products to share.
Data Subscriber – An AWS customer that wants to make use of data products from Data Providers.
Data Product – A collection of data sets.
Data Set – A container for data assets that belong together, grouped by revision.
Revision – A container for one or more data assets as of a point in time.
Data Asset – The actual data, in any desired format.
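To make the containment relationship between these terms concrete, here is a minimal sketch of the hierarchy as Python dataclasses. The class and field names are my own illustration and are not part of any AWS SDK; the only thing taken from the definitions above is the nesting (product, data set, revision, asset).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataAsset:
    # The actual data, in any desired format (here identified only by a name).
    name: str


@dataclass
class Revision:
    # A container for one or more data assets as of a point in time.
    comment: str
    assets: List[DataAsset] = field(default_factory=list)


@dataclass
class DataSet:
    # A container for assets that belong together, grouped by revision.
    name: str
    revisions: List[Revision] = field(default_factory=list)

    def latest_revision(self) -> Revision:
        # Illustrative assumption: revisions are appended in chronological order.
        return self.revisions[-1]


@dataclass
class DataProduct:
    # A collection of data sets offered by a data provider.
    name: str
    provider: str
    data_sets: List[DataSet] = field(default_factory=list)


def count_assets(product: DataProduct) -> int:
    """Count every asset across all data sets and revisions of a product."""
    return sum(
        len(revision.assets)
        for data_set in product.data_sets
        for revision in data_set.revisions
    )
```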
AWS Data Exchange for Data Subscribers
As a data subscriber, I click View product catalog and start out in the Discover data section of the AWS Data Exchange Console:
Products are available from a long list of vendors:
I can enter a search term, click Search, and then narrow down my results to show only products that have a Free pricing plan:
I can also search for products from a specific vendor, that match a search term, and that have a Free pricing plan:
The second one looks interesting and relevant, so I click on 5 Digit Zip Code Boundaries US (TRIAL) to learn more:
I think I can use this in my app, and want to give it a try, so I click Continue to subscribe. I review the details, read the Data Subscription Agreement, and click Subscribe:
The subscription is activated within a few minutes, and I can see it in my list of Subscriptions:
Then I can download the set to my S3 bucket, and take a look. I click into the data set, and find the Revisions:
I click into the revision, and I can see the assets (containing the actual data) that I am looking for:
I select the asset(s) that I want, and click Export to Amazon S3. Then I choose a bucket, and click Export to proceed:
This creates a job that will copy the data to my bucket (extra IAM permissions are required here; read the Access Control documentation for more info):
The jobs run asynchronously and copy data from Data Exchange to the bucket. Jobs can be created interactively, as I just showed you, or programmatically. Once the data is in the bucket, I can access and process it in any desired way. I could, for example, use an AWS Lambda function to parse the ZIP file and use the results to update an Amazon DynamoDB table. Or, I could run an AWS Glue crawler to get the data into my Glue catalog, run an Amazon Athena query, and visualize the results in an Amazon QuickSight dashboard.
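Here is a minimal sketch of what the programmatic route might look like in Python. The helper names (build_export_job_request, run_export_job), the key prefix, and the polling limits are my own; the client is assumed to be a boto3 AWS Data Exchange client, and the request shape follows the CreateJob, StartJob, and GetJob operations as I understand them, so check the API reference before relying on it.

```python
import time

# Job states after which polling should stop.
TERMINAL_STATES = {"COMPLETED", "ERROR", "CANCELLED", "TIMED_OUT"}


def build_export_job_request(data_set_id, revision_id, asset_ids, bucket, prefix=""):
    """Build the arguments for a job that exports assets to an S3 bucket."""
    return {
        "Type": "EXPORT_ASSETS_TO_S3",
        "Details": {
            "ExportAssetsToS3": {
                "DataSetId": data_set_id,
                "RevisionId": revision_id,
                # One destination per asset; the object key layout is an
                # arbitrary choice made for this sketch.
                "AssetDestinations": [
                    {"AssetId": asset_id, "Bucket": bucket, "Key": f"{prefix}{asset_id}"}
                    for asset_id in asset_ids
                ],
            }
        },
    }


def run_export_job(client, request, poll_seconds=5.0, max_polls=120):
    """Create and start the job, then poll until it reaches a terminal state."""
    job_id = client.create_job(**request)["Id"]
    client.start_job(JobId=job_id)
    for _ in range(max_polls):
        state = client.get_job(JobId=job_id)["State"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")
```

With boto3 installed and credentials configured, the client would be created with boto3.client("dataexchange") and passed to run_export_job along with a request built from my own data set, revision, and asset IDs.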
Subscriptions can last from 1 to 36 months with an auto-renew option; subscription fees are billed to my AWS account each month.
AWS Data Exchange for Data Providers
Now I am going to put on my “data provider” hat and show you the basics of the publication process (the User Guide contains a more detailed walk-through). In order to be able to license data, I must agree to the terms and conditions, and my application must be approved by AWS.
After I apply and have been approved, I start by creating my first data set. I click Data sets in the navigation, and then Create data set:
I describe my data set, and have the option to tag it, then click Create:
Next, I click Create revision to create the first revision to the data set:
I add a comment, and have the option to tag the revision before clicking Create:
I can copy my data from an existing S3 location, or I can upload it from my desktop:
I choose the second option, select my file, and it appears as an Imported asset after the import job completes. I review everything, and click Finalize for the revision:
My data set is ready right away, and now I can use it to create one or more products:
The console outlines the principal steps:
I can set up public pricing information for my product:
AWS Data Exchange lets me create private pricing plans for individual customers, and it also allows my existing customers to bring their existing (pre-AWS Data Exchange) licenses for my products along with them by creating a Bring Your Own Subscription offer.
I can use the Data Subscription Agreement (DSA) provided by AWS Data Exchange, use it as the basis for my own, or upload an existing one:
I can use the AWS Data Exchange API to create, update, list, and manage data sets and revisions to them. Functions include CreateDataSet, UpdateDataSet, ListDataSets, CreateRevision, UpdateAsset, and CreateJob.
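As a rough illustration, the console steps above (create a data set, create a revision, import a file, finalize) might map onto the API like this. This is a sketch under assumptions: the function names are mine, the client is assumed to be a boto3 AWS Data Exchange client, and StartJob and UpdateRevision are operations I am assuming from the boto3 client rather than ones named in this post.

```python
def publish_first_revision(client, name, description, bucket, key, comment):
    """Create a data set and a first revision, then start an import from S3.

    Returns (data_set_id, revision_id, job_id). The import job runs
    asynchronously, so the revision is not finalized here.
    """
    data_set = client.create_data_set(
        AssetType="S3_SNAPSHOT",
        Name=name,
        Description=description,
    )
    revision = client.create_revision(DataSetId=data_set["Id"], Comment=comment)
    job = client.create_job(
        Type="IMPORT_ASSETS_FROM_S3",
        Details={
            "ImportAssetsFromS3": {
                "DataSetId": data_set["Id"],
                "RevisionId": revision["Id"],
                "AssetSources": [{"Bucket": bucket, "Key": key}],
            }
        },
    )
    client.start_job(JobId=job["Id"])
    return data_set["Id"], revision["Id"], job["Id"]


def finalize_revision(client, data_set_id, revision_id):
    """Mark a revision as finalized; call this once the import job has completed."""
    client.update_revision(
        DataSetId=data_set_id,
        RevisionId=revision_id,
        Finalized=True,
    )
```

Splitting the work into two functions keeps the asynchronous import separate from the finalize step, which mirrors the console flow where I review the imported asset before clicking Finalize.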
Things to Know
Here are a couple of things that you should know about Data Exchange:
Subscription Verification – The data provider can also require additional information in order to verify my subscription. If that is the case, the console will ask me to supply the info, and the provider will review and approve or decline within 45 days:
Here is what the provider sees:
Revisions & Notifications – The Data Provider can revise their data sets at any time. The Data Subscriber receives a CloudWatch Event each time a product that they are subscribed to is updated; this can be used to launch a job to retrieve the latest revision of the assets. If you are implementing a system of this type and need some test events, find and subscribe to the Heartbeat product:
Data Categories & Types – Certain categories of data are not permitted on AWS Data Exchange. For example, your data products may not include information that can be used to identify any person, unless that information is already legally available to the public. See the Publishing Guidelines for details on which categories of data are permitted.
Data Provider Location – Data providers must be a valid legal entity domiciled either in the United States or in a member state of the EU.
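For the revision notifications described above, a handler (for example, in a Lambda function) needs to work out which data set and revisions an event refers to before it can launch a retrieval job. Here is a minimal sketch of that parsing step. The event shape is an assumption on my part: I am assuming a source of aws.dataexchange, a detail-type of "Revision Published To Data Set", the data set ID as the first entry in resources, and a RevisionIds list in detail. Verify this against a real Heartbeat event before building on it.

```python
# Assumed values for the event fields; confirm against a real event.
EXPECTED_SOURCE = "aws.dataexchange"
EXPECTED_DETAIL_TYPE = "Revision Published To Data Set"


def revisions_from_event(event):
    """Return (data_set_id, revision_id) pairs for a revision event.

    Events that do not match the assumed shape yield an empty list, so the
    caller can safely ignore unrelated notifications.
    """
    if event.get("source") != EXPECTED_SOURCE:
        return []
    if event.get("detail-type") != EXPECTED_DETAIL_TYPE:
        return []
    resources = event.get("resources") or []
    if not resources:
        return []
    data_set_id = resources[0]
    revision_ids = (event.get("detail") or {}).get("RevisionIds") or []
    return [(data_set_id, revision_id) for revision_id in revision_ids]
```

Each returned pair identifies one revision whose assets could then be exported to S3 with a job.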
Available Now
AWS Data Exchange is available now and you can start using it today. If you own some interesting data and would like to publish it, start here. If you are a developer, browse the product catalog and look for data that will add value to your product.
— Jeff;
Source: AWS News