Machine learning offers a wide range of exciting topics to work on, but there’s nothing quite like personalization and recommendation.
At first glance, matching users to items they may like sounds like a simple problem. However, the task of developing an efficient recommender system is challenging: years ago, Netflix even ran a movie recommendation competition with a $1 million award! Indeed, building, optimizing, and deploying real-time personalization today requires specialized expertise in analytics, applied machine learning, software engineering, and systems operations. Few organizations have the knowledge, skills, and experience to overcome these challenges, so they either abandon the idea of using recommendations or build under-performing models.
For over 20 years, Amazon.com has built recommender systems at scale, integrating personalized recommendations across the buying experience – from product discovery to checkout.
To help all AWS customers do the same, we are very happy to announce Amazon Personalize, a fully-managed service that puts personalization and recommendation in the hands of developers with little machine learning experience.
Introducing Amazon Personalize
How does Amazon Personalize simplify personalization and recommendation? As explained in a previous blog post, you could already build recommendation models on Amazon SageMaker using algorithms such as Factorization Machines. However, it’s fair to say that this requires extensive data preparation and expert tuning in order to get good results.
Creating a recommendation model with Amazon Personalize is much simpler. Using AutoML, a new process that automates complex machine learning tasks, Personalize performs and accelerates the difficult work required to design, train, and deploy a machine learning model.
Amazon Personalize supports both datasets stored in Amazon S3 and streaming datasets, e.g. events sent in real time from a JavaScript tracker or server-side. The high-level process looks like this:
With data stored in Amazon S3, sending data to Personalize simply means adding your data files to the dataset group. Ingestion is triggered automatically.
Working with streaming data is different. One way to send events would be to use the AWS Amplify JavaScript library, which is integrated with the event tracking service in Personalize. Another way would be to send them server-side via the AWS SDK in your favorite language: ingestion can happen from any source, whether the code is hosted inside AWS (e.g. in Amazon EC2 or AWS Lambda) or outside.
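Here’s a quick sketch of what server-side ingestion could look like with boto3. The tracking ID comes from the event tracker you create in Personalize, and the user, session, item, and event names below are hypothetical values for illustration:

import json
from datetime import datetime

import boto3

# Client for the Personalize event tracking service.
personalize_events = boto3.client('personalize-events')

# Record a single user-item interaction. 'TRACKING_ID' is the ID of an
# event tracker created in Personalize; IDs and event type are made up.
personalize_events.put_events(
    trackingId='TRACKING_ID',
    userId='1',
    sessionId='session-1',
    eventList=[{
        'sentAt': datetime.now(),
        'eventType': 'watched',
        'properties': json.dumps({'itemId': '2'})
    }]
)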
Time for an example. Let’s build a solution based on the MovieLens dataset!
The MovieLens dataset
MovieLens is a well-known dataset of movie ratings. It comes in different sizes and formats: here, we will use ml-20m, which contains 20 million ratings applied to 27,000 movies by 138,000 users.
This dataset contains a file named ‘ratings.csv’ storing user-item interactions. Its first lines look like this:
userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
1,50,3.5,1112484580
It reads like this: user 1 gave movie 2 a 3.5 rating. Same for movies 29, 32, 47, 50 and so on! This is exactly what we need to build a recommendation model. Let’s get to work.
Creating a schema for the dataset
The first step is to create an Avro schema for this dataset. This is pretty straightforward: we just need to use some of the keywords defined in Amazon Personalize. We’ll save the schema to a file named jsimon-ml20m-schema.json, which we’ll reference from the CLI later on.
{"type": "record",
"name": "Interactions",
"namespace": "com.amazonaws.personalize.schema",
"fields":[
{"name": "ITEM_ID", "type": "string"},
{"name": "USER_ID", "type": "string"},
{"name": "TIMESTAMP", "type": "long"}
],
"version": "1.0"}
Preparing the dataset
Once we’ve downloaded and unzipped the dataset, let’s load the ‘ratings.csv’ file and apply the following processing:
1. Shuffle the reviews.
2. Keep only movies rated above 3.6, and drop the rating column: the model will learn from positive interactions alone.
3. Rename the columns to the USER_ID, ITEM_ID and TIMESTAMP names expected by the schema.
4. Keep only the first 100,000 interactions to keep this example quick.
All of this is easily achieved with the Pandas Python library, the Swiss Army knife for columnar data processing. While we’re at it, we’ll also upload the processed file to an Amazon S3 bucket.
import boto3
import pandas
from sklearn.utils import shuffle

# Load and shuffle the MovieLens ratings.
ratings = pandas.read_csv('ratings.csv')
ratings = shuffle(ratings)
# Keep only highly-rated movies, then drop the rating column:
# the model will learn from positive interactions alone.
ratings = ratings[ratings['rating'] > 3.6]
ratings = ratings.drop(columns='rating')
# Rename columns to match the schema, and keep 100,000 interactions.
ratings.columns = ['USER_ID', 'ITEM_ID', 'TIMESTAMP']
ratings = ratings[:100000]
# Save the processed file and upload it to Amazon S3.
ratings.to_csv('ratings.processed.csv', index=False)
s3 = boto3.client('s3')
s3.upload_file('ratings.processed.csv', 'jsimon-ml20m', 'ratings.processed.csv')
Creating the dataset group
First, we need to create a dataset group containing the user-item dataset as well as its schema. Let’s do this with the AWS CLI: as you’ll see, a lot of these CLI operations require Amazon Resource Names (ARNs) output by a previous call, so make sure you keep track of everything when you experiment.
$ aws personalize create-dataset-group --name jsimon-ml20m-dataset-group
$ aws personalize create-schema --name jsimon-ml20m-schema \
  --schema file://jsimon-ml20m-schema.json

$ aws personalize create-dataset --schema-arn $SCHEMA_ARN \
  --dataset-group-arn $DATASET_GROUP_ARN \
  --dataset-type INTERACTIONS
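If you prefer to script these steps, here’s a minimal boto3 sketch of the same three calls, keeping each ARN in a variable as we go. Parameter names mirror the CLI above (I’ve added a dataset name, which the SDK may require), so treat this as an outline rather than a definitive implementation:

import boto3

personalize = boto3.client('personalize')

# Create the dataset group and keep its ARN for later calls.
dataset_group_arn = personalize.create_dataset_group(
    name='jsimon-ml20m-dataset-group')['datasetGroupArn']

# Register the Avro schema we defined earlier.
with open('jsimon-ml20m-schema.json') as f:
    schema_arn = personalize.create_schema(
        name='jsimon-ml20m-schema', schema=f.read())['schemaArn']

# Create the user-item interaction dataset inside the group.
# In practice, you may need to wait for the dataset group to
# become ACTIVE before this call succeeds.
dataset_arn = personalize.create_dataset(
    name='jsimon-ml20m-dataset',
    schemaArn=schema_arn,
    datasetGroupArn=dataset_group_arn,
    datasetType='INTERACTIONS')['datasetArn']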
Importing datasets
In this simple example, we’ll import data on demand. It’s also possible to schedule import jobs in order to load new data regularly. We also need to pass an IAM role allowing Personalize to read data from the Amazon S3 bucket.
$ aws personalize create-dataset-import-job --job-name jsimon-ml20m-job \
  --role-arn $ROLE_ARN \
  --dataset-arn $DATASET_ARN \
  --data-source dataLocation=s3://jsimon-ml20m/ratings.processed.csv
This will take a little while and we can use the describe-dataset-import-job API to check for completion. Plenty of information is returned, but let’s just query the import status.
$ aws personalize describe-dataset-import-job \
  --dataset-import-job-arn $DATASET_IMPORT_JOB_ARN \
  --query "datasetImportJob.latestDatasetImportJobRun.status"
"CREATE IN_PROGRESS"
Putting it all together: creating a solution
Once datasets have been imported, we need to select a recipe to cook our recommendation model. A recipe is much more than an algorithm: it also includes predefined feature transformations, initial parameters for the algorithm, and automatic model tuning. Thus, recipes remove the need for deep expertise in personalization.
Amazon Personalize comes with several recipes suitable for different use cases, and advanced users can also add their own recipes.
Here’s the list of available recipes.
arn:aws:personalize:::recipe/awspersonalizehrnnmodel
arn:aws:personalize:::recipe/awspersonalizehrnnmodel-for-coldstart
arn:aws:personalize:::recipe/awspersonalizehrnnmodel-for-metadata
arn:aws:personalize:::recipe/awspersonalizeffnnmodel
arn:aws:personalize:::recipe/awspersonalizedeepfmmodel
arn:aws:personalize:::recipe/awspersonalizesimsmodel
arn:aws:personalize:::recipe/search-personalization
arn:aws:personalize:::recipe/popularity-baseline
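You can also fetch this list programmatically. A short boto3 sketch, assuming the list_recipes call (pagination omitted for brevity):

import boto3

personalize = boto3.client('personalize')

# Print the ARN of every available recipe.
for recipe in personalize.list_recipes()['recipes']:
    print(recipe['recipeArn'])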
Recommendation experts will certainly enjoy the flexibility that they bring, but what about developers who are new to the topic?
As mentioned earlier, Amazon Personalize supports AutoML, a new technique that automatically searches for the optimal recipe, so let’s enable it. Hyperparameter optimization is enabled by default. Last but not least, Amazon Personalize solutions can scale automatically according to incoming traffic: we simply need to define the minimum number of transactions per second (TPS) that we want to support.
Thus, we can create the solution like so:
$ aws personalize create-solution --name jsimon-ml20m-solution \
  --minTPS 10 --perform-auto-ml \
  --dataset-group-arn $DATASET_GROUP_ARN \
  --query 'solution.status'
"CREATE IN_PROGRESS"
This will take a little while as the optimal recipe is selected, trained and tuned. Once all of this is complete, we can look at solution metrics.
$ aws personalize get-metrics --solution-arn $SOLUTION_ARN
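Since training takes a while, a small loop can tell us when the solution is ready for that metrics call. A minimal sketch, assuming describe_solution reports an ACTIVE status once training and tuning complete (the solution ARN is a placeholder):

import time

import boto3

personalize = boto3.client('personalize')

# Placeholder: use the ARN returned by create-solution.
solution_arn = 'arn:aws:personalize:...'

# Wait until training and tuning are complete.
while personalize.describe_solution(
        solutionArn=solution_arn)['solution']['status'] != 'ACTIVE':
    time.sleep(60)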
Recommending new items in real time
If we’re happy with the model, we can now create a campaign in order to deploy it. With the update mode set to AUTO, the campaign will be updated automatically every time the solution is retrained.
$ aws personalize create-campaign --name jsimon-ml20m-solution \
  --solution-arn $SOLUTION_ARN --update-mode AUTO
Now, let’s recommend some movies.
$ aws personalize-rec get-recommendations --campaign-arn $CAMPAIGN_ARN \
  --user-id $USER_ID --query "itemList[*].itemId"
["1210", "260", "2571", "110", "296", "1193", ...]
That’s it! As you can see, we successfully built a recommendation model with just a few API calls. All we had to do was define a schema and upload the dataset. We relied on Amazon Personalize to select the best recipe with AutoML, and to optimize its hyperparameters. The solution was trained and deployed on fully-managed infrastructure, letting us focus even more on building our application.
Sign up for the preview now!
I hope this post was informative. We just scratched the surface of what Amazon Personalize can do. The service is available in preview in US East (N. Virginia) and US West (Oregon).
There is no charge for the service during the preview. Once the preview is complete, the service will be part of the AWS free tier. For the first two months after sign-up, you will be offered:
1. Data processing and storage: Up to 20 GB per month
2. Training: Up to 100 training hours per month
3. Inference: Up to 50 TPS-hours of real-time recommendations per month
To get started, visit aws.amazon.com/personalize/. Now it’s your turn to try it and let us know what you think.
— Julien;