Glen Knight: NYC Based IT Professional

Now available in Amazon Transcribe: Automatic Redaction of Personally Identifiable Information

February 26, 2020 by knightglen_sruobz

Launched at AWS re:Invent 2017, Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for AWS customers to add speech-to-text capabilities to their applications. At the time of writing, Transcribe supports 31 languages, 6 of which can be transcribed in real time.

A popular use case for Transcribe is the automatic transcription of customer calls (call centers, telemarketing, etc.), in order to build data sets for downstream analytics and natural language processing tasks, such as sentiment analysis. Because these calls often contain personal details, any Personally Identifiable Information (PII) should be removed from the transcripts to protect privacy and comply with local laws and regulations.

As you can imagine, doing this manually is quite tedious, time-consuming, and error-prone, which is why we’re extremely happy to announce that Amazon Transcribe now supports automatic redaction of PII.

Introducing Content Redaction in Amazon Transcribe
If instructed to do so, Transcribe will automatically identify the following pieces of PII:

Social Security Number,
Credit card/Debit card number,
Credit card/Debit card expiration date,
Credit card/Debit card CVV code,
Bank account number,
Bank routing number,
Debit/Credit card PIN,
Name,
Email address,
Phone number (10 digits),
Mailing address.

They will be replaced with a ‘[PII]’ tag in the transcribed text. You also get a redaction confidence score (instead of the usual ASR score), as well as start and end timestamps. These timestamps will help you locate PII in your audio files for secure storage and sharing, or for additional audio processing to redact it at the source.
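To make the timestamps concrete, here is a minimal Python sketch that lists the time ranges of redacted items in a transcript. It is an illustration under stated assumptions, not the service's reference code: the field names (`items`, `alternatives`, `redactions`, `start_time`, `end_time`) reflect my understanding of the output layout and should be checked against your own job output, and `SAMPLE` and `pii_spans` are hypothetical names introduced for this example.

```python
import json

# Hypothetical excerpt of a redacted transcript. The field layout is an
# assumption made for illustration; verify it against a real job output.
SAMPLE = """
{
  "results": {
    "items": [
      {"start_time": "0.84", "end_time": "1.12", "type": "pronunciation",
       "alternatives": [{"confidence": "1.0", "content": "My"}]},
      {"start_time": "1.12", "end_time": "1.45", "type": "pronunciation",
       "alternatives": [{"confidence": "1.0", "content": "name"}]},
      {"start_time": "1.45", "end_time": "1.60", "type": "pronunciation",
       "alternatives": [{"confidence": "1.0", "content": "is"}]},
      {"start_time": "1.60", "end_time": "2.35", "type": "pronunciation",
       "alternatives": [{"redactions": [{"confidence": "0.9999"}],
                         "content": "[PII]"}]},
      {"type": "punctuation",
       "alternatives": [{"confidence": "0.0", "content": "."}]}
    ]
  }
}
"""


def pii_spans(transcript):
    """Return (start, end, redaction_confidence) for each '[PII]' item."""
    spans = []
    for item in transcript["results"]["items"]:
        for alternative in item.get("alternatives", []):
            if alternative.get("content") != "[PII]":
                continue
            redactions = alternative.get("redactions", [])
            confidence = float(redactions[0]["confidence"]) if redactions else None
            spans.append(
                (float(item["start_time"]), float(item["end_time"]), confidence)
            )
            break  # one span per item is enough
    return spans


for start, end, confidence in pii_spans(json.loads(SAMPLE)):
    print(f"PII between {start:.2f}s and {end:.2f}s (confidence {confidence})")
```

The resulting ranges could then be handed to an audio tool of your choice to silence or bleep the matching segments of the recording.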

This feature is extremely easy to use. Let’s do a quick demo.

Redacting Personal Information with Amazon Transcribe
First, I’ve recorded a short sound file full of personal information (of course, it’s all fake). I’m using the mp3 format here, but we recommend that you use lossless formats like FLAC or WAV for maximum accuracy.

Then, I upload this file to an S3 bucket using the AWS CLI.

$ aws s3 cp julien.mp3 s3://jsimon-transcribe-us-east-1

The next step is to transcribe this sound file using the StartTranscriptionJob API: why not use the AWS SDK for PHP this time?

<?php
require 'aws.phar';

use Aws\TranscribeService\TranscribeServiceClient;

$client = new TranscribeServiceClient([
    'profile' => 'default',
    'region' => 'us-east-1',
    'version' => '2017-10-26'
]);

$result = $client->startTranscriptionJob([
    'LanguageCode' => 'en-US',
    'Media' => [
        'MediaFileUri' => 's3://jsimon-transcribe-us-east-1/julien.mp3',
    ],
    'MediaFormat' => 'mp3',
    'OutputBucketName' => 'jsimon-transcribe-us-east-1',
    'ContentRedaction' => [
        'RedactionType' => 'PII',
        'RedactionOutput' => 'redacted'
    ],
    'TranscriptionJobName' => 'redactiontest'
]);
?>

A single API call is really all it takes. The RedactionOutput parameter lets me control whether I want both the full and the redacted output ('redacted_and_unredacted'), or just the redacted output ('redacted'). I go for the latter. Now, let’s run this script.

$ php transcribe.php

Immediately, I can see the job running in the Transcribe console.

I could also use the GetTranscriptionJob and ListTranscriptionJobs APIs to check that content redaction has been applied. Once the job is complete, I simply fetch the transcription from my S3 bucket.
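For readers who would rather script this check than watch the console, here is a rough polling sketch in Python. It assumes a boto3-style Transcribe client, which is a different SDK from the PHP one used above, and `wait_for_job` is a hypothetical helper name. The client is passed in as an argument, so the function itself imports nothing beyond the standard library.

```python
import time


def wait_for_job(client, job_name, poll_seconds=5.0, max_polls=120):
    """Poll GetTranscriptionJob until the job completes or fails.

    `client` is assumed to expose a boto3-style
    get_transcription_job(TranscriptionJobName=...) method.
    """
    for _ in range(max_polls):
        response = client.get_transcription_job(TranscriptionJobName=job_name)
        job = response["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            return job
        if status == "FAILED":
            raise RuntimeError(job.get("FailureReason", "transcription job failed"))
        time.sleep(poll_seconds)  # still QUEUED or IN_PROGRESS
    raise TimeoutError(f"job {job_name!r} did not finish after {max_polls} polls")
```

With boto3 installed, I would expect a call such as `wait_for_job(boto3.client("transcribe", region_name="us-east-1"), job_name)` to return the job description, whose `ContentRedaction` entry should echo the redaction settings passed to StartTranscriptionJob.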

$ aws s3 cp s3://jsimon-transcribe-us-east-1/redacted-redactiontest.json .

The transcription is a JSON document containing detailed information about each word. Here, I’m only interested in the full transcript, so I use a nice open source tool called jq to filter the document.

$ cat redacted-redactiontest.json | jq '.results.transcripts'
[{
"transcript": "Good morning, everybody. My name is [PII], and today I feel like sharing a whole lot of personal information with you. Let's start with my Social Security number [PII]. My credit card number is [PII] And my C V V code is [PII] My bank account number is [PII] My email address is [PII], and my phone number is [PII]. Well, I think that's it. You know a whole lot about me. And I hope that Amazon transcribe is doing a good job at redacting that personal information away. Let's check."
}]

Well done, Amazon Transcribe. My privacy is safe.
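If jq is not at hand, the same filter can be sketched in a few lines of Python. The snippet below works on a shortened, made-up sample rather than on the real output file, and `redacted_text` is a hypothetical helper name. It also counts the '[PII]' tags as a quick sanity check.

```python
import json

# Shortened, made-up sample that mirrors the '.results.transcripts' structure
# shown in the jq output above.
SAMPLE = """
{
  "results": {
    "transcripts": [
      {"transcript": "My name is [PII], and my phone number is [PII]."}
    ]
  }
}
"""


def redacted_text(document):
    """Join every transcript entry into a single string."""
    return " ".join(entry["transcript"] for entry in document["results"]["transcripts"])


text = redacted_text(json.loads(SAMPLE))
print(text)
print(text.count("[PII]"), "redacted items")
```

To run it against a real job, replace the sample with the contents of the JSON file downloaded from the S3 bucket.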

Now available!
The content redaction feature is available for US English in the following regions:

US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), AWS GovCloud (US-West),
Canada (Central), South America (São Paulo),
Europe (Ireland), Europe (London), Europe (Paris), Europe (Frankfurt),
Middle East (Bahrain),
Asia Pacific (Mumbai), Asia Pacific (Hong Kong), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo).

Take a look at the pricing page, give the feature a try, and please send us feedback either in the AWS forum for Amazon Transcribe or through your usual AWS support contacts.

– Julien

Source: AWS News
