Being our DevOps guy I get to be involved in a variety of projects. I was recently tasked with rebuilding our demo database to bring it up to date with our production system. Our production database is ~15T but we only want a subset of that data for demonstration purposes. The task is to pick a client or two and export a referentially correct set of data from production and import into a demo database. Once we have the data, it needs to be scrubbed so there are no traceable demographic information or client specific details.
Side Note: We previously performed this task using Oracle’s DataPump utility (EXPDP and IMPDP) to export Schema + Data based on a configuration file (INCLUDE, EXLUDE, QUERY rules). That worked pretty well but was very time consuming to generate the conf file and run the export/import. This approach requires DBA level access to both source and target databases.
What is Database Subsetting?
It is the process of creating a referentially correct cut down “subset” from a larger database. Ideal for creating test, development, or demo databases from a much larger staging or production instance.
I was hoping the open source Jailer project would work but it would have been very impractical with a database of our size.
I then came across www.grid-tools.com and downloaded a demo version. I quickly found out that all of the features I wanted to evaluate were all disabled in their demo version. I stopped there since it was going to involve sales people and live demos just to get a full version I could play with.
Then I came across databee.com and I was hooked. DataBee offered a full featured, 30 day trial of their software. You have to request an evaluation license but that’s acceptable to me. I was contacted by a rep from the company who was 1) very friendly, 2) straightforward with pricing, 3) genuinely wanted to help, and 4) not pushy about the sale.
The software was a quick install (Windows only) and has a great interface and is very intuitive (at least to a technical person). I ended up doing a live demo with them a couple days later and was even more impressed by the software. The part I love is how easy it is to look under the hood. Log files are readily available, all the PL/SQL code that is being executed can be found with a click of a button, and they provide a number of tools that show you exactly how your database relationships are setup, and how to generate the appropriate rules.
Once I got DataBee down and was able to create a subset of data (over Oracle database links.. yay!), I wanted to try masking/scrubbing this data. The rep pointed out DataMasker to me and I went through a demo on that as well. It was so easy to identify schemas, tables, and columns using a set of rules that would substitute and then synchronize data between tables. You can even run arbitrary PL/SQL code as a masking rule. They provide around 60 different data sets: First/Last Names, Date Variance, SSN, Phone numbers, Address, City, State, Zip, etc.
Database Subsetting is non-trivial but the DataBee product makes it look easy. I highly value a software vendor who is honestly willing to help and not be pushy when it comes to the sales side of things.
One last point. The thing that hooked me was how responsive the reps were to our needs. I used DataBee on a table of ours with non-standard column names (lower case and Oracle reserved works — bad idea I know, it’s a legacy table). I was able to grab the exact code causing the problem, submitted it to them with a description of what was happening, and they sent me a new install with the problem fixed in about 24 hours. You know a software company is doing something right when they can patch and ship in a short amount of time.