Structured Data Discovery for Databases: Do's and Don'ts
Posted by Prat Moghe on Thu, Nov 06, 2008
Teams running enterprise data discovery projects are just beginning to realize that database discovery (of structured data) is the next big challenge.
Two use cases for discovery I have encountered commonly are:
(1) Identifying when production data moves over to test databases so that this data can be masked.
(2) Inventorying databases with sensitive data (credit cards, SSNs), particularly unencrypted or unmasked data.
The structured data discovery problem is hard because it involves many unknowns and has classic manageability issues around it. First, many enterprises don't even know where their databases are, how many there are, or what type they are (Oracle, DB2, SQL Server, Sybase, etc.). This is the database inventory problem. Second, enterprises don't understand how to scan these databases for credit cards or production data patterns (e.g. fund information) in a manageable way. For instance, you can't scan a database without credentials to log in as an administrator. Additionally, a verbose scan of a million-row database table or column can bring the database to its knees, not to mention the challenge of where to store the result set.
I have a few suggestions on tackling this problem by making it simple. Break the project into mini-steps and crawl before you walk (literally in this case!):
(1) Start with database discovery. Do a schema discovery and a role discovery while you are at it.
(2) For data discovery scans, do a network-based discovery first, since it does not need credentials. It will help narrow the search space to the databases most likely to hold critical information. It should also give you some usage discovery: how data is being accessed by user, time, location, etc. Rarely accessed databases carry lower risk, so you may be able to defer their discovery until later.
(3) Once you understand roles (see step 1 above) and obtain credentials, you are ready for a data-at-rest scan. Start with a "sampled" data-at-rest scan to keep the performance impact manageable. For example, a scan might sample X% of the rows and check whether any of them match a pattern. You can follow up the sampled scan with a more detailed data inventory scan.
(4) After data discovery has completed, translate the results into risk assessment metrics. For instance, a database with a large number of unencrypted credit card records has a high risk.
(5) Depending on the risk metrics, specific risk mitigation techniques may need to be adopted. For instance, deletion, masking, or encryption of sensitive fields could be required. Alternatively, monitoring access to sensitive data fields may be an acceptable risk management strategy. Check with your assessor or auditor, or consult the relevant best practices.
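To make the "sampled" scan in step 3 concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a production scanner: it uses SQLite from the standard library as a stand-in for a real database, the function names (luhn_valid, contains_card_number, sampled_scan) and the customers/notes demo table are hypothetical, and the table and column names are assumed to come from a trusted schema discovery step rather than user input. It samples rows by picking random row ids, looks for 13 to 16 digit sequences, and uses the Luhn checksum to weed out digit strings that cannot be card numbers.

```python
import random
import re
import sqlite3

# Candidate card numbers: 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def contains_card_number(value) -> bool:
    """Return True if the value holds a Luhn-valid 13 to 16 digit sequence."""
    if value is None:
        return False
    for match in CARD_PATTERN.finditer(str(value)):
        if luhn_valid(re.sub(r"[ -]", "", match.group())):
            return True
    return False


def sampled_scan(conn, table, column, sample_pct, seed=None):
    """Check roughly sample_pct percent of rows; return (rows_checked, hits).

    Assumption: table and column are trusted identifiers (they are
    interpolated into the SQL text), and the table has SQLite row ids.
    """
    rng = random.Random(seed)
    max_rowid = conn.execute(f"SELECT MAX(rowid) FROM {table}").fetchone()[0]
    if max_rowid is None:  # empty table
        return 0, 0
    sample_size = min(max_rowid, max(1, int(max_rowid * sample_pct / 100)))
    checked = hits = 0
    for rowid in rng.sample(range(1, max_rowid + 1), sample_size):
        row = conn.execute(
            f"SELECT {column} FROM {table} WHERE rowid = ?", (rowid,)
        ).fetchone()
        if row is None:  # gap left by a deleted row
            continue
        checked += 1
        if contains_card_number(row[0]):
            hits += 1
    return checked, hits


if __name__ == "__main__":
    # Hypothetical demo data: every tenth row holds a well-known test card number.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (notes TEXT)")
    conn.executemany(
        "INSERT INTO customers (notes) VALUES (?)",
        [("card 4111 1111 1111 1111" if i % 10 == 0 else f"order {i}",)
         for i in range(1000)],
    )
    checked, hits = sampled_scan(conn, "customers", "notes", sample_pct=5, seed=1)
    print(f"checked {checked} rows, {hits} look like card numbers")
```

Looking rows up by id keeps each probe cheap, which is the point of sampling. On a production engine you would more likely use its native sampling clause (for example SAMPLE in Oracle or TABLESAMPLE in SQL Server and DB2) and throttle the scan. A non-zero hit count from the sample is the signal to schedule the more detailed inventory scan for that table.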
A recent article covers this problem from a life-cycle perspective. Check out http://www.scmagazineus.com/The-data-discovery-challenge/article/120467/