Data Processing – Right Way to Process Inconsistent Data Files

design, etl, python

I'm working at a company that uses Excel files to store product data, specifically test results from products before they are shipped out. There are a few thousand spreadsheets with anywhere from 50-100 relevant data points per file. Over the years the schema for the spreadsheets has changed significantly, but not unidirectionally: changes often get reverted and then re-added within the space of a few dozen to a few hundred files. My project is to convert about 8000 of these spreadsheets into a database that can be queried. I'm using Python, and MongoDB to deal with the inconsistency in the data.

My question is, what is the "right" or canonical way to deal with the huge variance in my source files? I've written a data structure that stores the data I want for the latest template, which will be the final template used going forward, but that only covers a few hundred files historically. Brute-forcing a solution would mean writing a similar data structure for each version/template, which means potentially writing hundreds of schemas with dozens of fields each. That seems very inefficient, especially when a change in the template is sometimes as small as moving a single line of data one row down, or splitting what used to be one data field into two.

A slightly more elegant solution I have in mind would be to write schemas for all the variants I can find for pre-defined groups in the source files, and then write a function that matches a particular series of files with the series of variants that fits that set of files. This is because, more often than not, most of a file stays consistent over a long stretch, marred by only one or two errant sections; within that stretch, though, which section is the inconsistent one keeps changing.

For example, say a file has four sections of three data fields each, represented as four Python dictionaries with three keys each.

- Files 7000-7250: sections 1-3 are consistent, but section 4 is shifted one row down.
- Files 7251-7500: sections 1-3 are consistent and section 4 is still one row down, but a section 5 appears.
- Files 7501-7635: sections 1 and 3 are consistent, but section 2 has five data fields instead of three, section 5 disappears, and section 4 is still shifted down one row.
- Files 7636-7800: section 1 is consistent, section 4 shifts back up, section 2 returns to three fields, but section 3 is removed entirely.
- Files 7801-8000: everything is in order.

The proposed function would take the file number and match it to a dictionary representing the data mappings for the different variants of each section. For example, a section_four_variants dictionary might have two members, one for the shifted-down version and one for the normal version; a section_two_variants dictionary might have one member for the three-field version and one for the five-field version; and so on. The script would then read the matchings, load the correct mapping, extract the data, and insert it into the database.
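
Roughly, I imagine something like the sketch below (the section names, field names, cell addresses, and range boundaries are made up for illustration, and only two sections are shown):

```python
# Sketch only: section names, field names, cell addresses, and ranges are invented.
SECTION_TWO_VARIANTS = {
    "three_fields": {"serial": "B10", "batch": "B11", "operator": "B12"},
    "five_fields":  {"serial": "B10", "batch": "B11", "operator": "B12",
                     "station": "B13", "shift": "B14"},
}

SECTION_FOUR_VARIANTS = {
    "normal":       {"voltage": "B40", "current": "B41", "temp": "B42"},
    "shifted_down": {"voltage": "B41", "current": "B42", "temp": "B43"},
}

# Which variant each section uses, keyed by file-number range
# (simplified: only sections 2 and 4 are shown).
VARIANTS_BY_RANGE = [
    (range(7000, 7501), {"section_2": "three_fields", "section_4": "shifted_down"}),
    (range(7501, 7636), {"section_2": "five_fields",  "section_4": "shifted_down"}),
    (range(7636, 8001), {"section_2": "three_fields", "section_4": "normal"}),
]

def variants_for(file_number):
    """Return the section -> variant mapping that applies to a given file number."""
    for file_range, mapping in VARIANTS_BY_RANGE:
        if file_number in file_range:
            return mapping
    raise LookupError(f"no variant mapping covers file {file_number}")
```

The extractor would then resolve something like `SECTION_FOUR_VARIANTS[variants_for(7300)["section_4"]]` into cell addresses before reading the workbook, and the resulting dictionary would go straight into MongoDB.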

Is this an accepted/right way to go about solving this problem? Should I structure things differently? I'm also not sure what to search Google for to see what other solutions might exist, though I believe the problem lies in the domain of ETL processing. I have no formal CS training aside from what I've taught myself over the years. If this is not the right forum for this question, please tell me where to move it, if anywhere. Any help is most appreciated.

Thank you.

Best Answer

Most IT "patterns" deal with repeatable, reproducible situations. Yours is exactly the opposite. You've got truly dirty data: Not just dirty instances, but changing, inconsistent, and occasionally erroneous schema, varying over time. Oy vey!

I don't know of a canonical, widely-accepted "best practice" name for dealing with such an irregular dataset, but cleaning up a largely- but incompletely-structured set of files is a common ETL problem. I call the pattern I use The Spice Must Flow. If you want to be very fancy, it's a variation on Management by Exception, augmented with agile/incremental development and an understanding of successive approximation.

Good News / Bad News

The bad news: By definition, you don't know exactly which parts of the files are good, how they're structured, whether the individual data elements are internally consistent, or where errors may crop up.

But, there's plenty of good news: People who collect data usually intend for it to be useful, even if they're somewhat haphazard when recording it. Large swaths of your files will be comprehensible. You didn't provide examples of the data, but from your description, you have maybe at most 12 different schema? And most of those seem to be variations, rather than wholly new or alien entities.

The Process

First, create parsers/handlers for one (or a few) of the larger swaths of data that you can readily handle. When you encounter parts of the dataset you can't yet handle, write those parts to "remainders" or "leftovers" files for subsequent handling. Ideally this determination can be made mechanically, by the program itself. Occasionally it's human judgment that dictates that some sections are wrong, different, or need special handling; in those cases you will have to explicitly route certain files, or parts of files, either to a handler or to the leftovers.

Next, write a parser for one or two of the odd cases you couldn't manage earlier. Run it over the original leftovers, producing a second-generation leftovers file. Iterate this "write parser/handler, run, create leftovers" cycle until you have no further leftovers.
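
A bare-bones sketch of one such pass might look like this (the handler protocol, exception type, and leftovers format here are invented; the point is only the shape of the loop):

```python
# One "parse what you can, set the rest aside" pass.
# Handler protocol, exception type, and leftovers format are illustrative only.
import json
from pathlib import Path

class UnhandledFormat(Exception):
    """Raised by a handler when a file doesn't match the layout it understands."""

def process_pass(input_files, handlers, leftovers_path):
    """Try every known handler on each file; record the files nobody could parse."""
    parsed, leftovers = [], []
    for path in input_files:
        for handler in handlers:
            try:
                parsed.extend(handler(path))
                break                      # this handler coped; move to the next file
            except UnhandledFormat:
                continue                   # try the next handler
        else:                              # no handler accepted the file
            leftovers.append(str(path))
    Path(leftovers_path).write_text(json.dumps(leftovers, indent=2))
    return parsed, leftovers

# A later pass re-reads the previous leftovers with newly written handlers, e.g.:
#   files = json.loads(Path("leftovers_pass1.json").read_text())
#   parsed, remaining = process_pass(files, [parse_v1, parse_v2, parse_v3],
#                                    "leftovers_pass2.json")
```

Each generation of leftovers should shrink until only the files that genuinely need human judgment remain.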

It's appealing to think that you could create enough parsers/handlers to deal with your entire dataset in one fell swoop. In my experience, that's rarely feasible or desirable. The process is fundamentally one of discovery. Unless your dataset is very small and regular, you probably have not observed/understood all the possible quirks, variations, and errors up-front. Even if you know a lot about them, you probably won't catch them all. Theory says there are an infinite number of possible variations; while the practical variations are fewer, they're often confounding. An agile/adaptive approach that assumes from the start that you don't know all the possible variations/errors is more realistic and less frustrating. It also encourages you to add many more checks and assertions into your handlers, which usually lead to further discoveries about the data formatting and structure, and bring to light other possible errors. Management by exception is all about handling the cases you can, with the lowest overhead and highest automation possible, setting aside the exceptions for further scrutiny and higher-touch fixes.
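
To make the "checks and assertions" point concrete, a handler for one section might look something like the following (assuming openpyxl to read the workbooks; the cell addresses and expected label are invented):

```python
# Illustrative defensive checks inside one handler; any surprise raises,
# which the driving loop treats as "send this file to the leftovers".
from openpyxl import load_workbook

class UnhandledFormat(Exception):   # same role as in the earlier sketch
    pass

def parse_section_four(path):
    """Read the hypothetical section 4, refusing anything that looks unexpected."""
    ws = load_workbook(path, data_only=True).active
    label = ws["A40"].value
    if label != "Output voltage":                 # layout check
        raise UnhandledFormat(f"{path}: expected 'Output voltage' at A40, found {label!r}")
    voltage = ws["B40"].value
    if not isinstance(voltage, (int, float)):     # sanity check on the value itself
        raise UnhandledFormat(f"{path}: non-numeric voltage in B40: {voltage!r}")
    return [{"section": 4, "voltage": voltage}]
```

Every check that trips not only protects the clean dataset, it also tells you something new about how the files actually vary.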

As you go through this process, you may decide to commit the data you have successfully parsed to your final dataset, then incrementally extend your "clean" dataset with subsequent processing passes. Alternatively, you may decide to treat the processing passes as primarily exploratory steps, then run the whole shebang from scratch once you've achieved zero leftovers. The decision depends on how much data you have, whether it's possible and/or valuable to use the initial data even in incomplete form, and whether any pass's discoveries have invalidated parsing/processing assumptions made on previous passes. If you have a very large amount of data (thousands of files is getting there), and it is independently sequenceable (i.e., it does not need to be processed from start to finish to make sense of it), the more incremental approach is good. If, however, the items are not easily processed out of order, or later discoveries invalidate earlier runs, re-runs from scratch become more necessary.

It is often the case in such projects that one does not create official schema for all the variations seen. The N different schema and variations get encoded into the parsing/handling code that translates existing data files into the new, cleaned file/database. That may not be ideal, but the general idea is that you'll create one clean new dataset, and then have a better, cleaner, and genuine schema for new additions to the dataset.
