I'm working on a data migration of several hundred nodes from a Drupal 6 to a Drupal 7 site. I've got the data exported to the new site and I want to check it. Harkening back to my statistics classes, I recall that there is some way to determine how many randomly selected nodes to check to give me a certain level of confidence that the whole process was correct. Can anyone enlighten me as to this practical application of statistics? For any given number of units, how big must the sample be to have a given confidence interval?
Migration Statistics – Checking Imported Data
Related Solutions
Data migrations are my bread and butter, and data cleansing is indeed a hugely important matter. One strategy we use to migrate 100% of our customers' data is asymptotic data cleansing with pre-migration tools:
Developing tens of data-sanity checks (mostly SQL queries).
Exchanging cleansing tools with the customer (since it is their data, we design the patching utilities; they validate and execute them).
Refining the tools over iterations and reaching KPI-backed, measurable quality as soon as possible.
Checking data consistency after the migration has happened. This helps make the GO/NO-GO decision on D-Day.
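The data-sanity checks described above could be sketched as a small harness of SQL queries where any non-zero count flags a defect. This is a minimal illustration only; the schema, table names and checks are assumptions, not taken from any real customer database (SQLite in memory is used so the sketch is self-contained):

```python
import sqlite3

# Hypothetical schema: orders referencing customers (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'a@example.com'), (2, NULL);
    INSERT INTO orders VALUES (10, 1, 99.5), (11, 3, 20.0), (12, 2, -5.0);
""")

# Each sanity check is a (description, SQL) pair; a non-zero count is a defect.
CHECKS = [
    ("orders referencing missing customers",
     "SELECT COUNT(*) FROM orders o LEFT JOIN customers c "
     "ON o.customer_id = c.id WHERE c.id IS NULL"),
    ("customers without an email address",
     "SELECT COUNT(*) FROM customers WHERE email IS NULL"),
    ("orders with a negative total",
     "SELECT COUNT(*) FROM orders WHERE total < 0"),
]

def run_checks(conn):
    """Return {description: defect_count} for every sanity check."""
    return {desc: conn.execute(sql).fetchone()[0] for desc, sql in CHECKS}

results = run_checks(conn)
for desc, count in results.items():
    print(f"{desc}: {count}")
```

Running such a harness before and after the migration, and tracking the defect counts per iteration, is what makes the quality "KPI-backed" and the GO/NO-GO decision measurable rather than a gut call.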
In the end, a data migration is an immensely beneficial exercise that needs to happen every 3 to 5 years.
It boosts the platform's ability to support the business.
It streamlines the database.
It prepares the IT platform for next-generation business tools (ESB/EAI, portals, self-care platforms, reporting and data mining, you name it).
It reorganises the DIY data flows between platforms that have accumulated over the years in quick-and-dirty "temporary" ways to fulfil "urgent requirements".
Above all, it empowers the IT production team, who come to know their platform better and develop a 'can-do' attitude. These kinds of benefits are difficult to measure, but once you have come to know many clients, this consideration becomes obvious. Companies that shy away from migrations remain in the trailing tier; bold ones lead the pack.
It's a little bit like when the basement of your house becomes cluttered with lumber. One morning, you have to take everything out and put back only the things you need and throw the rest away. After that, you can use your basement again ;-)
Another fundamental consideration is that nowadays customer expectations are always on the move, as in "customers are always more demanding". There will always be a significant proportion of a given company's competitors on the lookout for these new trends, with the obvious intent of increasing their market share. They will do so by adapting their offering to follow, or even drive, the trends, and that entails constant business re-engineering. If your IT platform is too rigid, it will be a drag on your own ability to espouse or precede market trends and, ultimately, to maintain your own market share. In other words, in a moving market, inertia is a recipe for irrelevance.
In contrast, a data migration to a newer system rolls out a more modern and more versatile productivity tool that makes the best of newer technologies and is more attractive to employees; this in turn supports, or even leads, the company's internal innovation process, thereby securing or increasing its relative market share.
The considerations above actually answer only half of the question asked in the title, "Data Migration - dangerous or essential". Yes, data migrations are essential, but are they also dangerous? By that standard, many things in IT are dangerous. By definition, anything where the stakes are high is dangerous, especially if you do not take the matter seriously; and that is actually the most common pattern in IT. Not taking data centres, high availability or disaster tolerance seriously is dangerous too.
Does that mean that today's companies should opt out of these pillars of today's Information Technology landscape? Surely not!
To make your point jokingly, you could argue that "flying is dangerous if you don't use a plane made by professionals". It's the same for data migrations: when executed and conducted by professionals, they are no more dangerous than flying in a well-designed and well-operated plane, and the ROI compares just as favourably to terrestrial means of transport.
When entrusted to professionals, most migrations are well-controlled successes, and the rate of failure or abandonment is extremely low.
Your managers should be led to ask themselves: "While most companies go through data migration projects successfully, what would make our company so different that it would instead experience a failure? And can it fare well without one?"
The Observer pattern might be a good fit here. The Transform class defines a set of events that might be of interest to the statistics engine, and the statistics engine registers itself with the relevant Transform instance to gather the statistics on that transformation (or the statistics that include that transformation).
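A minimal sketch of that wiring in Python follows. The class, event and method names are assumptions for illustration, not taken from any actual codec; the point is only the registration and notification mechanics of the Observer pattern:

```python
from collections import defaultdict

class Transform:
    """A codec transformation that notifies registered observers of events."""
    def __init__(self, name):
        self.name = name
        self._observers = []

    def register(self, observer):
        self._observers.append(observer)

    def _notify(self, event, **payload):
        for obs in self._observers:
            obs.on_event(self.name, event, payload)

    def run(self, block):
        # Illustrative events; a real codec would define its own set.
        self._notify("block_started", size=len(block))
        result = bytes(b ^ 0xFF for b in block)  # stand-in for the real transform
        self._notify("block_finished", size=len(result))
        return result

class StatisticsEngine:
    """Observer that aggregates counts per (transform, event) pair."""
    def __init__(self):
        self.counts = defaultdict(int)
        self.bytes_seen = 0

    def on_event(self, transform_name, event, payload):
        self.counts[(transform_name, event)] += 1
        if event == "block_finished":
            self.bytes_seen += payload["size"]

stats = StatisticsEngine()
dct = Transform("dct")
dct.register(stats)
dct.run(bytes([1, 2, 3]))
dct.run(bytes([4, 5]))
print(dict(stats.counts), stats.bytes_seen)
```

The transform stays ignorant of what is done with the events; several engines (or none) can be registered, which keeps the statistics gathering out of the codec's core logic.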
Update
As stated in a comment, the basic problem is how the statistics engine knows that something of interest has happened.
You could execute the codec in a virtual machine to keep track of everything (Valgrind uses this approach to check memory accesses), but then you have the problem of deciding what it means that the codec accessed address 0x12345678.
All other methods of statistics gathering intrude into the codec's code base in one way or another. The least invasive is probably to add copious amounts of logging and let the statistics engine analyse it. All logging packages also provide means to disable log generation at minimal cost, sometimes even compiling those statements down to no-ops.
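Using Python's standard logging module as one concrete example, the "copious logging plus after-the-fact analysis" approach might look like the sketch below. The `encode_block` function and its log format are hypothetical; the custom handler stands in for the statistics engine's collector, and raising the log level suppresses generation cheaply:

```python
import logging

class StatsHandler(logging.Handler):
    """Handler that feeds formatted log messages to a statistics collector."""
    def __init__(self):
        super().__init__()
        self.events = []

    def emit(self, record):
        self.events.append(record.getMessage())

log = logging.getLogger("codec")
log.setLevel(logging.DEBUG)
handler = StatsHandler()
log.addHandler(handler)

def encode_block(block):
    # Hypothetical codec step, instrumented only through logging.
    log.debug("encode_block size=%d", len(block))
    return bytes(b ^ 0xFF for b in block)

encode_block(b"abc")
encode_block(b"de")

# The statistics engine analyses the log stream after the fact.
sizes = [int(msg.split("=")[1]) for msg in handler.events]
print(sizes)

# Raising the level disables log generation at minimal cost:
log.setLevel(logging.CRITICAL)
encode_block(b"fgh")
print(len(handler.events))
```

The third call produces no event because the logger short-circuits before formatting the message, which is the "minimal cost when disabled" property mentioned above.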
Best Answer
I found this sample size calculator. For my population of 215 items, if I want 95% confidence with a ±5% margin of error, I'll need to randomly sample 138 items.
Edit: Here's the actual formula that I was looking for.
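The 138 figure above is consistent with the standard sample-size formula (Cochran's formula with a z-score of 1.96 for 95% confidence and the worst-case proportion p = 0.5) combined with the finite-population correction; a minimal sketch, assuming those conventional defaults:

```python
def sample_size(population, z=1.96, p=0.5, margin=0.05):
    """Sample size for a given confidence level (z-score) and margin of
    error, with the finite-population correction applied."""
    # Cochran's formula: sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite-population correction shrinks n for small populations.
    n = n0 / (1 + (n0 - 1) / population)
    return round(n)

print(sample_size(215))  # matches the 138 items quoted above
```

Note how the correction matters here: without it, n0 is about 384, which is larger than the whole population of 215.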