Ask HN: What tools to grab slices of production?

11 points by was_boring 5 years ago

Where I work, it has become infeasible to grab the entire production database for local testing, debugging, or dev work. Not to mention the changing regulatory climate, which discourages this.

What tools do you use to grab just slices of it? We use Postgres as our db.

jrowley 5 years ago

I have this problem too and don’t have a good solution. One approach I’ve been considering is to automatically deploy yesterday’s RDS snapshot to a test instance that rolls over and is replaced every 24 hours. It would be dedicated to testing tricky migrations and support cases.
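
A rough sketch of what that rollover could look like with boto3 (the instance identifiers and instance class are made up; you'd run it from cron or a scheduled Lambda):

    # Restore the latest automated RDS snapshot into a throwaway test instance.
    # Instance identifiers and instance class below are hypothetical.
    import boto3

    rds = boto3.client("rds")

    PROD_INSTANCE = "prod-postgres"        # hypothetical production instance id
    TEST_INSTANCE = "prod-postgres-test"   # hypothetical rollover test instance id

    # Find the most recent automated snapshot of the prod instance.
    snapshots = rds.describe_db_snapshots(
        DBInstanceIdentifier=PROD_INSTANCE,
        SnapshotType="automated",
    )["DBSnapshots"]
    latest = max(snapshots, key=lambda s: s["SnapshotCreateTime"])

    # Drop yesterday's test instance so the identifier is free again.
    try:
        rds.delete_db_instance(
            DBInstanceIdentifier=TEST_INSTANCE,
            SkipFinalSnapshot=True,
        )
        rds.get_waiter("db_instance_deleted").wait(DBInstanceIdentifier=TEST_INSTANCE)
    except rds.exceptions.DBInstanceNotFoundFault:
        pass  # first run, nothing to delete

    # Restore the snapshot into a small, isolated instance for testing.
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=TEST_INSTANCE,
        DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
        DBInstanceClass="db.t3.medium",
        PubliclyAccessible=False,
    )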

devm0de 5 years ago

We take a weekly snapshot and run a script against it to drop most customer data, sanitize the little that is left, add our internal users and data, and apply the grants each person needs internally. The result is built into an image that everyone can pull down as a container. New grants preferably wait until the next weekly snapshot.
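
The sanitize-and-trim script can stay pretty small. A sketch of the idea, run against a restored copy of the snapshot (table names, columns, and the role are hypothetical; never point this at prod):

    # Trim and sanitize a restored copy of the weekly snapshot.
    # Table and column names below are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=snapshot_copy user=postgres")  # assumed DSN
    cur = conn.cursor()

    # Keep only a small, recent slice of customer data.
    cur.execute("DELETE FROM orders WHERE created_at < now() - interval '30 days'")
    cur.execute(
        "DELETE FROM customers c "
        "WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)"
    )

    # Sanitize what is left.
    cur.execute("""
        UPDATE customers
           SET email     = 'user' || id || '@example.com',
               full_name = 'Customer ' || id,
               phone     = NULL
    """)

    # Add internal users and the grants baked into the dev image.
    cur.execute("INSERT INTO app_users (email, role) VALUES (%s, %s)", ("dev@example.com", "admin"))
    cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA public TO dev_readonly")  # hypothetical role

    conn.commit()
    conn.close()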

bradwood 5 years ago

If your firm has the cash, you could look at Delphix. It only works on RDBMSs, but it allows databases to be “forked”, branched, committed to, rolled back, and data-masked.

davismwfl 5 years ago

It depends on whether you have a shared dev environment or a truly local laptop environment, and whether you have any restrictions on using production data. I have had to deal with this a lot over my career, as most of the time the production databases have been quite large, and in many cases privacy or security rules prevented us from simply copying the data. If you have no restrictions, it is easiest to take snapshots of prod at regular intervals and restore a snapshot into staging, local environments, etc.
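
When a straight copy is allowed, even pg_dump/pg_restore driven from a small script gets you most of the way. A minimal sketch (connection strings and the excluded table pattern are placeholders):

    # Minimal dump-and-restore sketch. --exclude-table-data keeps the schema of
    # the biggest tables but skips their rows, which shrinks the copy a lot.
    # Connection strings and table patterns are placeholders.
    import subprocess

    PROD_URL = "postgresql://readonly@prod-host/appdb"   # assumed read-only prod URL
    DEV_URL = "postgresql://dev@localhost/appdb_dev"     # assumed local target

    subprocess.run(
        [
            "pg_dump", "-Fc",
            "--exclude-table-data=events*",   # hypothetical: skip giant audit/event tables
            "-f", "appdb.dump",
            PROD_URL,
        ],
        check=True,
    )

    subprocess.run(
        ["pg_restore", "--no-owner", "--clean", "--if-exists", "-d", DEV_URL, "appdb.dump"],
        check=True,
    )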

Outside of that, the other methods I have seen and used have varied over the years, but I'll share the most common pattern, which seemed to work the best. It does require some work to set up and maintain.

For SQL specifically, we used to write a number of functions (stored procedures or the equivalent) where, for each record type, a function would anonymize the data and return the new record. Then we would combine these procedures into modules, so we could run a single function and specify the number of records to pull and recreate in the destination. We would also always write a special-case procedure: if certain records had broken production in the past, we would recreate those records in the dev/test data to make sure new code doesn't fail on them. This worked really well. Yes, it takes time to set up initially and some time to maintain, but it means you can reproduce a meaningful dataset into any environment for testing or development quickly and automatically.
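
To make that concrete, here is a toy Python version of the per-record-type pattern (the real thing was stored procedures; the anonymizer, table, columns, and edge-case rows here are invented):

    # Toy anonymize-and-copy loader for one record type. Column names,
    # the edge-case rows, and the DSNs are invented.
    import psycopg2

    def anonymize_customer(row):
        """Return a scrubbed copy of one customer record."""
        cid, _email, _name, plan = row
        return (cid, f"user{cid}@example.com", f"Customer {cid}", plan)

    # Records that once broke production; always recreate them in dev/test.
    EDGE_CASES = [
        (999001, "unicode-名前@example.com", "Name With 'quote", "legacy"),
    ]

    def load_slice(prod_dsn, dev_dsn, n_records):
        src = psycopg2.connect(prod_dsn)
        dst = psycopg2.connect(dev_dsn)
        insert = "INSERT INTO customers (id, email, full_name, plan) VALUES (%s, %s, %s, %s)"
        with src.cursor() as read, dst.cursor() as write:
            read.execute("SELECT id, email, full_name, plan FROM customers LIMIT %s", (n_records,))
            for row in read:
                write.execute(insert, anonymize_customer(row))
            for row in EDGE_CASES:   # known nasty records always go in
                write.execute(insert, row)
        dst.commit()
        src.close()
        dst.close()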

The best one I was involved with had a SQL Server backend, so we wrote a C#.NET component to take the place of the majority of the stored procedures, which made it faster and let us use RAM more efficiently to handle some of the translations and lookups. The same could be done for Postgres, MySQL, Mongo, or just about any database, in a language of your choosing.

For what it is worth, I have tried most of the database tools out there that claim to solve this problem and never found them to be as good as spending a little time building a custom toolset. Our .NET version (which could easily be done in Java or another language for Postgres) could be run from a stored procedure itself, so even repopulating staging environments could be done on a schedule and be 100% automated (plus it gave the DBAs comfort). We made the .NET one sophisticated enough that it could import tables by enumerating them through the SQL APIs and then pull data from them generically. This worked great for all the lookup tables and things that didn't need extensive translations, so as we got better, more of the code became generic and we moved the translations and details into metadata about the tables, stored in SQL.

In one case this ran against a database spread across 4 physical DB servers and ~700 tables, with huge amounts of data. We could pull a reasonably sized local dev set and be working in roughly 20 minutes, and that included all the weird edge cases that had to be created. This was the closest we came to making something generic. For our shared staging environment we could use the same code and just specify a larger dataset, which typically took many hours to run, but we'd run it overnight. We also built in ways to pull in a specific client's dataset if we were struggling with that client having unique issues.
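
A stripped-down Python sketch of the generic piece, for lookup tables that need no translation (the schema name, row cap, DSNs, and foreign-key ordering are all hand-waved here):

    # Enumerate tables and copy a capped slice of each into the dev database.
    # Fine for lookup tables; anything with FK dependencies or PII needs the
    # metadata-driven handling described above.
    import io
    import psycopg2

    def copy_lookup_tables(prod_dsn, dev_dsn, max_rows=1000):
        src = psycopg2.connect(prod_dsn)
        dst = psycopg2.connect(dev_dsn)
        with src.cursor() as read, dst.cursor() as write:
            read.execute(
                "SELECT table_name FROM information_schema.tables "
                "WHERE table_schema = 'public' AND table_type = 'BASE TABLE'"
            )
            tables = [r[0] for r in read.fetchall()]
            for table in tables:
                buf = io.StringIO()
                # COPY a capped slice of the table out of prod and into dev.
                read.copy_expert(f"COPY (SELECT * FROM {table} LIMIT {max_rows}) TO STDOUT", buf)
                buf.seek(0)
                write.copy_expert(f"COPY {table} FROM STDIN", buf)
        dst.commit()
        src.close()
        dst.close()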

FWIW, at the startup I'm at now we primarily use Postgres, and I have been rebuilding a system similar to the .NET one I built before, because we have strict requirements around the data given it is medical data. We can't just let people have access to production, or copy data to staging or local machines; it would create a compliance and legal nightmare. I tried a couple of tools recently that again claimed they could do this, but the time and money you have to invest in them, plus the features you still have to supplement, just don't make sense to me. I'd much rather buy than build in most cases, but this is one where it is hard to do that.