I worked at AWS when this post and video were published, but the views expressed here are my own and may not reflect those of my employer. Only publicly available material was used to put together this article and video. The website and the YouTube videos are NOT monetized.

You can scroll straight down to the YouTube video. The documents and regular expressions used in the video are listed at the end of this article.

The first step of protecting your data is to know about your data.

With any data protection strategy, you should start by classifying the data into the following categories.

  1. Low risk - data that is publicly available
  2. Moderate risk - data that is not public and is important for the functioning of the application or organization, for example product or company documentation.
  3. High risk - data that contains confidential and sensitive information about the organization, project, product or, most importantly, the customers of the organization.

This classification then helps the organization decide what access controls to place around each class of data, which makes it step 1 of your data protection strategy.

Nowadays organizations store huge amounts of data about their customers. Ideally this data should be classified as soon as possible. In practice, however, this is rarely the case, and there are good reasons for it.

Pretty soon your organization can end up with millions of files in hundreds of folders (or buckets), potentially sitting on the next data breach because of inappropriate access. However daunting the classification was at the start, it is now even more so.

Doing this manually is not practical: you would have to open every file, check whether it contains sensitive data and, if it does, move it into a more tightly controlled bucket.

While it could be done, it can take days if not weeks, and the whole exercise is error-prone.

And by the time you finish, you already have thousands of new files to go through.

This is where Amazon Macie comes into the picture.

Amazon Macie is a fully managed data security and data privacy service that uses machine learning and pattern matching to discover and protect your sensitive data in AWS.

With Macie, all you have to do is:

  1. Enable Macie
  2. Create a job (one time or scheduled) to find sensitive data
  3. Review the findings Macie provides, which help you classify and protect your data and put proper access controls around it.
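
For readers who prefer the API over the console, the three steps above can be roughly sketched in Python. This is only a sketch under assumptions, not the exact flow shown in the video: it expects you to pass in a client created with `boto3.client("macie2")`, and the helper names (`build_job_request`, `start_discovery`, `fetch_findings`), the account ID, the bucket names and the job name are all hypothetical.

```python
import uuid


def build_job_request(account_id, bucket_names, job_name, custom_identifier_ids=()):
    """Build keyword arguments for the macie2 create_classification_job call."""
    request = {
        "clientToken": str(uuid.uuid4()),  # idempotency token for the request
        "jobType": "ONE_TIME",  # a recurring job would use "SCHEDULED" plus a schedule
        "name": job_name,
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": list(bucket_names)}
            ]
        },
    }
    if custom_identifier_ids:
        # Optional: IDs of custom data identifiers (for example, regex based ones)
        request["customDataIdentifierIds"] = list(custom_identifier_ids)
    return request


def start_discovery(macie, account_id, bucket_names, job_name):
    """Steps 1 and 2: enable Macie, then create a one-time discovery job."""
    # Assumption: Macie is not enabled yet; this call may fail if it already is.
    macie.enable_macie()
    response = macie.create_classification_job(
        **build_job_request(account_id, bucket_names, job_name)
    )
    return response["jobId"]


def fetch_findings(macie, max_results=25):
    """Step 3: list finding IDs, then fetch the full finding details."""
    finding_ids = macie.list_findings(maxResults=max_results)["findingIds"]
    if not finding_ids:
        return []
    return macie.get_findings(findingIds=finding_ids)["findings"]
```

With real credentials, usage might look like `start_discovery(boto3.client("macie2"), "111122223333", ["my-example-bucket"], "pii-scan")`, followed by `fetch_findings(...)` once the job has had time to run. Passing the client in as a parameter keeps the sketch easy to try against a stub before pointing it at a real account.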

It’s really that simple.

Check out the video below, where Macie helps us classify documents based on whether they contain PAN and Aadhaar numbers along with standard PII data.
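
To give a feel for the pattern matching involved, here is a small Python sketch that flags text containing PAN-like or Aadhaar-like numbers. The patterns are assumptions based on the commonly documented formats (a PAN as five letters, four digits and a letter; an Aadhaar number as 12 digits, often written in groups of four and not starting with 0 or 1) and are not necessarily the expressions used in the video. The names `PATTERNS`, `find_identifiers` and `is_high_risk` are hypothetical.

```python
import re

# Assumed formats, not the exact expressions from the video.
PATTERNS = {
    # PAN: five uppercase letters, four digits, one uppercase letter
    "PAN": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    # Aadhaar: 12 digits, optionally space-separated in groups of four
    "AADHAAR": re.compile(r"\b[2-9][0-9]{3}\s?[0-9]{4}\s?[0-9]{4}\b"),
}


def find_identifiers(text):
    """Return every match for each pattern, keyed by identifier name."""
    return {
        name: [match.group(0) for match in pattern.finditer(text)]
        for name, pattern in PATTERNS.items()
    }


def is_high_risk(text):
    """Treat text as high risk if any identifier pattern matches."""
    return any(find_identifiers(text).values())


sample = "Employee PAN: ABCDE1234F, Aadhaar: 2345 6789 0123"
print(find_identifiers(sample))
print(is_high_risk(sample))
```

Simple patterns like these will produce false positives (any 12-digit number looks like an Aadhaar number), so treat a match as a prompt for review. Macie supports its own subset of regex syntax for custom data identifiers, so check the documentation before reusing these patterns there as-is.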


Please watch in full screen or directly on YouTube.

Documents used in this video

  1. sample salary-slip
  2. sample passport number file

Regular expressions used in this video


Thank you for reading through. Please share this if you think it could be useful to someone.

