How Test Data Builders Help You Sleep

This isn’t a full-on sales pitch for data builders. Rather, it is only a testimonial of how they make my life easier.

Some developers have to start with an empty database. Others have some minimally-populated db that they add to on an as-needed basis. Still others can use a scrubbed copy of production data, which is excellent for judging what the execution plans will be.

If you’re in one of those first 2 scenarios, you maybe have already built scripts to create the test data you need. Perhaps you have a generic script to create a single record for a given table and then miscellaneous UPADTEs to make each new row fit what you need. If you have production-grade data, you more likely have SELECT scripts to locate the data you need for a particular test.

Unfortunately, what all of that has is common is that none of those approaches are manageable or consistent enough to support a suite of automated test cases for any development team with more than a couple of people. Let’s tackle this by looking at how data builders can help you out here.


This isn’t about the ACID principles, it’s about staying DRY (Don’t Repeat Yourself).

At the lowest level of the stack, the values that can be placed into a test row of data are constrained by what the table allows. Data types, foreign keys, nullability, check constraints, etc. You may have business logic that resides in your procs or application code that adds to what is allowed in. But you need to resign yourself to the fact that the table’s constraints are the final arbiter as to what’s allowed, even if you don’t expect particular data values to end up in there.

What the data builder does for you here is allow you to define what is and isn’t needed for inserting data by creating a single pinch point from which you can react to any changes to the underlying table. Consider the alternative: you have multiple “scaffolding” scripts – probably embedded in various test cases – which would all need to be modified if the table gets a new non-nullable, non-defaulted column. With a data builder in place, you only need to account for the new column there and the rest of your test code can continue on its merry way. The only places where you have to make any additional edits are for test cases that specifically deal with this new column.


Remember from my intro post on test data builders that you only need to supply values for the columns that are material to the test at hand. Most production INSERT procs still expect you to provide valid data for most of the parameters. They just save you the trouble of having to keep the INSERT statement updated with table changes.

By removing irrelevant parameters from test code, you can immediately see which data elements are relevant to a test method.


If you have some pet generator scripts for complex objects or are using “found data” for your tests, creating builders for that same complex data may seem burdensome. It doesn’t have to be.

First, consider that when you are using the other methods of making your test data, you’re still committing to supplying all those values and maintaining all of those INSERTs. Do the same work in a test data builder, and you will have less code to edit: only 1 “script”/builder per table. If you need any easy way to make a complex object, create a wrapper proc that, in one call, produces an entity and all of its requisite “child” data.

Following this approach allows you to ignore any child details that you don’t care about. For example, you want to make test orders that have been paid for with a credit card (instead of a gift card). Each payment method used has its own business rules about refunds. You need this test data so you can later assert that the dollar amount on the order available for refund is valid.

Put a data builder in place for the whole order which only takes in a payment method and an item to purchase. What this upper-level builder will do is call all of the necessary table-level builders to make sure you not only get a correct order header record, but order detail, payment processing, shipping information, etc. that the application would need to have in place for the code under test to fully operate.

With this higher-level builder, you only had to specifiy the details you need. And since it’s a data builder that uses other data builders, all parameters are optional, which allows for maximum reuse. Sure this breaks the single pinch point concept for dealing with structural table changes. But even half-decent code analysis tools can help you out with finding dependency chains when it’s time to refactor.


The problem with found data for testing is that it can run out. We run an ecommerce platform and in order to keep our databases as lean as possible, we remove all data older than N days from the front-end databases, leaving the information in an archive area should it be needed.

Sure, with any successful business, there should be enough orders paid by credit card to keep test code happy in between data refreshes to non-prod systems. But what if you have to keep rerunning a test or have a long set of test cases that require data which has 9 criteria elements? Even if you find a handful of these “Chupacabra” rows, you will likely use them up before your next data refresh.

Another problem is if you do most of your sales around a holiday, what are the odds that you have the situation occur in the opposite end of the calendar year? This type of data skew can happen with other dimensions besides time. It could be the data collected by different types of sensors, the country you’re doing business in, etc.

You may have to simulate the result of multiple, asynchronous activities. To do this by triggering each step in an accelerated fashion can be tricky. To do this in support of a db unit test can be close to impossible. This is because a lot of these activities are carried out in application code. In order to go down this road, you’d have to figure out how to call/trigger that application code from T-SQL.

With a test data builder, missing or hard-to-reproduce data isn’t a problem. You know what qualities the test data for these cases need to have and you make it. No more troubleshooting a test failure to only find out that it ran out of data.


Not the “I” in ACID. This is more about the “P” in “Pain reliever”. Just kidding. The “P” is for “Parallel execution.”

Whether you use production data or test data generated outside the scope of the test code, it’s still “found data”. The number one approach to speeding up the execution of automated test cases is to run them in parallel. With found data, you now have to worry that 2 tests which modify records will find the same test data and cause a failure for the test case because they both tried to operate on the same records.

With each test case leveraging a test data builder, the data it needs is generated within the “scope” of the test code. That means you don’t have to require that tests execute in serial or worry that parallel execution will see intermittent failures.

Some data is NOT on an island

There are 2 classes of data that don’t work this way: queuing and reporting.

The “get work to do” procs for handling a queuing table usually work off the principle that “I want the next N records WHERE ready = true and unassigned = true” (I’m paraphrasing). That WHERE clause doesn’t usually have anything to do with the primary key values of the rows you just created nor the queue item’s payload that you tailored for the test case.

When you run an integration test, it’ll just pick up the next N rows, regardless of which test case created the data, now all of the custom-crafted data each test case made was picked up by 1 of those executions and the rest are left with an empty result set.

The only real solution is to run those particular tests in serial. Typically, these don’t represent tests which are a large portion of the test suite. Running them serially shouldn’t be a big performance hit. Also, such tests can be included in only nightly runs rather than per-commit runs.

For reporting, the test assertions usually follow a pattern of “when the table contains X, the report should show Y”. If you happen to be able to assert at a granularity you can control, such as reporting on a single customer’s orders, then data builders will work just fine. However, if your reports are aggregating larger sets of data, your only real choice is to have a copy of the database dedicated for report testing. This allows you to have what is mostly a read-only database for report testing and a read-write system for most of the application and stored proc testing. I’m using “report” in a broad sense here, not just SSRS reports.

Edge Cases

Let’s say your application already does ALL of its data access via stored procedures. You may then ask “why wouldn’t I just use my existing INSERT procs to create test data?” Well, for application integration tests, that may be a way to go. The limitation with that approach shows up when you attempt to perform database unit tests or need to assert failures on dirty data.

Common QA practice is to create test cases that ensure fixed bugs remain fixed (no regressions down the line). Let’s say you had a bug which allowed bad data into your table. Now, you need to test the existing procs to know that they respond correctly should they encounter bad data. Sure, you ran a “fix it” script, but can you guarantee that it’ll never happen again? Of course not, even if it’s unlikely. That’s why you run automated tests: to make sure you account for situations that have occurred (99% of the time).

You fixed your INSERT proc by adding some conditional logic to respond to the bad value. You need to make sure that your SELECT and UPDATE procs behave correctly in the face of bad data. The data builder doesn’t have that protection logic. It just creates what you need for the test. Without it, you’re back to using your production INSERT proc, followed by a custom UPDATE meant to dirty up the data. Taking 2 steps instead of 1 just to set up the data you need isn’t the way to go here.

What About EF?

I definitely don’t want to get into the Entity Framework vs. stored proc religious argument here. Yes, Entity Framework can operate as a data builder of sorts because of its ability to read the table metadata and create INSERT methods from that. However, you still have to provide default values for required columns, just as you would with a SQL-based data builder proc. On this point, both options work out to a similar level of effort.

EF doesn’t really help you when creating db unit tests. There’s a better than average chance that if you’re writing db unit tests, you’re implementing the tests in T-SQL, regardless of whether you’re using SSDT SQL Unit tests, tSQLt, T.S.T., or some other test framework. Calling EF methods from T-SQL code is cumbersome just to do something T-SQL can handle well anyway.

Lastly, if you’re working on database team that isn’t as familiar with EF/.Net, the long term success of your test code will be harder to achieve.

Thanks for stopping by. I do enjoy hearing from you...