Launched at AWS re:Invent 2017, Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for AWS customers to add speech-to-text capabilities to their applications. At the time of writing, Transcribe supports 31 languages, 6 of which can be transcribed in real-time.
A popular use case for Transcribe is the automatic transcription of customer calls (call centers, telemarketing, etc.), in order to build data sets for downstream analytics and natural language processing tasks, such as sentiment analysis. Thus, any Personally Identifiable Information (PII) should be removed to protect privacy, and comply with local laws and regulations.
As you can imagine, doing this manually is quite tedious, time-consuming, and error-prone, which is why we’re extremely happy to announce that Amazon Transcribe now supports automatic redaction of PII.
Introducing Content Redaction in Amazon Transcribe
If instructed to do so, Transcribe will automatically identify the following pieces of PII:
They will be replaced with a ‘[PII]’ tag in the transcribed text. You also get a redaction confidence score (instead of the usual ASR score), as well as start and end timestamps. These timestamps will help you locate PII in your audio files for secure storage and sharing, or for additional audio processing to redact it at the source.
This feature is extremely easy to use, let’s do a quick demo.
Redacting Personal Information with Amazon Transcribe
First, I’ve recorded a short sound file full of personal information (of course, it’s all fake). I’m using the mp3 format here, but we recommend that you use lossless formats like FLAC or WAV for maximum accuracy.
$ aws s3 cp julien.mp3 s3://jsimon-transcribe-us-east-1
<?php require 'aws.phar'; use AwsTranscribeServiceTranscribeServiceClient; $client = new TranscribeServiceClient([ 'profile' => 'default', 'region' => 'us-east-1', 'version' => '2017-10-26' ]); $result = $client->startTranscriptionJob([ 'LanguageCode' => 'en-US', 'Media' => [ 'MediaFileUri' => 's3://jsimon-transcribe-us-east-1/julien.mp3', ], 'MediaFormat' => 'mp3', 'OutputBucketName' => 'jsimon-transcribe-us-east-1', 'ContentRedaction' => [ 'RedactionType' => 'PII', 'RedactionOutput' => 'redacted' ], 'TranscriptionJobName' => 'redaction' ]); ?>
A single API call is really all it takes. The
RedactionOutput parameter lets me control whether I want both the full and the redacted output, or just the redacted output. I go for the latter. Now, let’s run this script.
$ php transcribe.php
Immediately, I can see the job running in the Transcribe console.
$ aws s3 cp s3://jsimon-transcribe-us-east-1/redacted-redactiontest.json .
The transcription is a JSON document containing detailed information about each word. Here, I’m only interested in the full transcript, so I use a nice open source tool called jq to filter the document.
$ cat redacted-redactiontest.json| jq '.results.transcripts'
"transcript": "Good morning, everybody. My name is [PII], and today I feel like sharing a whole lot of personal information with you. Let's start with my Social Security number [PII]. My credit card number is [PII] And my C V V code is [PII] My bank account number is [PII] My email address is [PII], and my phone number is [PII]. Well, I think that's it. You know a whole lot about me. And I hope that Amazon transcribe is doing a good job at redacting that personal information away. Let's check."
Well done, Amazon Transcribe. My privacy is safe.
The content redaction feature is available for US English in the following regions:
Source: AWS News