Content-Based ETL Process – Designing with .NET and SFDC

application-design, architecture, c#, etl, .net

As my firm makes the transition to using SFDC as our main operational system, we've spun up a couple of SFDC portals where we can post customer-specific documents to be viewed at will. As a result, we've needed pseudo-ETL applications that can extract metadata from the documents our analysts generate internally (mostly industry-standard PDF, XML, or MS Office formats) and place in networked "queue" folders. From there, our applications scoop up the queued documents and upload them to the appropriate SFDC CRM Content Library along with some select pieces of metadata. I've mostly used DbAmp to broker communication with SFDC (DbAmp is a Linked Server provider that lets you use SQL conventions to interact with your SFDC Org data).
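For context, querying SFDC through DbAmp looks like ordinary linked-server SQL from .NET. A minimal sketch, assuming a linked server named SALESFORCE and a local connection string (both hypothetical; adjust for your environment):

```csharp
using System;
using System.Data.SqlClient;

class DbAmpQueryExample
{
    static void Main()
    {
        // Hypothetical connection string; point it at the SQL Server
        // instance that hosts the DbAmp linked server.
        using (var conn = new SqlConnection(
            "Server=localhost;Database=SalesforceStaging;Integrated Security=true;"))
        {
            conn.Open();

            // DbAmp exposes SFDC objects via four-part linked-server names.
            using (var cmd = new SqlCommand(
                "SELECT Id, Title FROM SALESFORCE...ContentVersion WHERE IsLatest = 'true'",
                conn))
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine($"{reader["Id"]}: {reader["Title"]}");
            }
        }
    }
}
```

This is only illustrative of the SQL-conventions point above; it requires a live SQL Server with the DbAmp linked server configured.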

I've been able to create [console] applications in C# that work pretty well, and they're usually structured something like this:

static void Main()
{
    // Load parameters from app.config.
    // Get documents from queue.
    var files = someInterface.GetFiles(someFilterOrRegexPattern);

    foreach (var file in files)
    {
        // Extract metadata from the file.
        // Validate some attributes of the file; add any validation errors to an in-memory
        // structure (e.g. List<ValidationError>).
        if (isValid)
        {
            var fileData = File.ReadAllBytes(file);
            // Upload using some wrapper for an ORM or DAL
            someInterface.Upload(fileData, meta.Param1, meta.Param2, ...);
        }
        else
        {
            // Bounce the file
        }
    }

    // Report any validation errors (via message bus or SMTP or some such).
}

And that's pretty much it. Most of the time I wrap all these operations in a "Worker" class that takes the needed interfaces as constructor parameters.
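A minimal sketch of that "Worker" arrangement, with hypothetical interface names (IFileQueue, IContentUploader, INotifier) standing in for whatever the real dependencies are:

```csharp
using System.Collections.Generic;

public interface IFileQueue { IEnumerable<string> GetFiles(string pattern); }
public interface IContentUploader { void Upload(byte[] data, IDictionary<string, string> metadata); }
public interface INotifier { void Report(IEnumerable<string> errors); }

public class Worker
{
    private readonly IFileQueue _queue;
    private readonly IContentUploader _uploader;
    private readonly INotifier _notifier;

    // Dependencies come in through the constructor, so each can be
    // swapped for a fake in tests.
    public Worker(IFileQueue queue, IContentUploader uploader, INotifier notifier)
    {
        _queue = queue;
        _uploader = uploader;
        _notifier = notifier;
    }

    public void Run(string pattern)
    {
        var errors = new List<string>();
        foreach (var file in _queue.GetFiles(pattern))
        {
            // Extract and validate metadata here; on failure, record
            // the error and skip the upload (the "bounce" branch).
            var data = System.IO.File.ReadAllBytes(file);
            _uploader.Upload(data, new Dictionary<string, string>());
        }
        _notifier.Report(errors);
    }
}
```

The payoff of this shape is testability: each interface can be faked, so the orchestration logic can be exercised without a real queue folder or a live SFDC Org.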

This approach has worked reasonably well, but I just get this feeling in my gut that there's something awful about it and would love some feedback. Is writing an ETL process as a C# Console app a bad idea? I'm also wondering if there are some design patterns that would be useful in this scenario that I'm clearly overlooking.

Thanks in advance!

Best Answer

Well... just focusing on the practical aspects of the general ETL problem: without much additional effort you could separate the backing logic into one or more referenceable assemblies, and create a simple Windows service that calls into your ETL process in an identical way.

It's trivial to set up a Windows service application so that it's also runnable as a console app (via a command-line argument), so you don't lose the ability to run it from the console if you want to.

In return you gain the ability to run the ETL process as a Windows service; a System.Threading.Timer (or the scheduling mechanism of your choice) inside the service object gives you an automated/scheduled solution that should rarely require manual intervention.
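A rough sketch of that shape, assuming .NET Framework's System.ServiceProcess; the names (EtlService, EtlRunner) and the "/console" switch are hypothetical:

```csharp
using System;
using System.ServiceProcess;
using System.Threading;

public class EtlService : ServiceBase
{
    private Timer _timer;

    protected override void OnStart(string[] args)
    {
        // Kick off an ETL pass immediately, then every 5 minutes.
        _timer = new Timer(_ => EtlRunner.RunOnce(), null,
                           TimeSpan.Zero, TimeSpan.FromMinutes(5));
    }

    protected override void OnStop() => _timer?.Dispose();

    public static void Main(string[] args)
    {
        if (args.Length > 0 && args[0] == "/console")
            EtlRunner.RunOnce();                 // same logic, interactive run
        else
            ServiceBase.Run(new EtlService());   // normal service start
    }
}

public static class EtlRunner
{
    public static void RunOnce()
    {
        // Call into the shared ETL assembly here.
    }
}
```

Note the single entry point: the console path and the service path both funnel into the same RunOnce, which is the "identical way" point above.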

If your ETL process already works as a console app, it should only take you somewhere between a few hours and a couple of days (depending on how familiar you are with Windows services, and whether you want to embellish the solution with extra bells and whistles) to evolve it into a Windows service and start testing it.

Because the actual ETL logic is available independently of this means of running it, you can also drive the logic from other applications, such as a workflow engine or one or more GUIs.

Console apps are handy for convenient "utility" dev tools, one-off batch processes, command-line-style UIs (a rare thing these days), getting something off the ground, or playing around with some logic without worrying about wiring up a GUI, and so forth.

But mission-critical, repetitive, and/or operational processes deserve a more robust mechanism of execution and interaction, in my opinion.

From a "Patterns" perspective, more logic and context would be necessary to make a real recommendation, but from a general ETL perspective the Pipeline (a.k.a. Pipes and Filters) pattern is often a key consideration. You may already have a simple version of it in place even if you didn't deliberately set out to implement it, as it's very natural to this kind of problem.
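To make the pattern concrete, here's a minimal pipes-and-filters sketch: each filter is just a function from one stream of items to another, composed over the document list (the file names and filters are made up):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PipelineDemo
{
    // A "pipe" connects filters: each filter transforms one stream into another.
    public static IEnumerable<T> Pipe<T>(this IEnumerable<T> source,
        Func<IEnumerable<T>, IEnumerable<T>> filter) => filter(source);

    public static void Main()
    {
        var files = new[] { "a.pdf", "b.tmp", "c.xml" };

        var processed = files
            .Pipe(fs => fs.Where(f => !f.EndsWith(".tmp")))    // filter: drop temp files
            .Pipe(fs => fs.Select(f => f.ToUpperInvariant())); // filter: normalize names

        Console.WriteLine(string.Join(",", processed)); // A.PDF,C.XML
    }
}
```

Your extract → validate → upload loop already has this shape; making each stage an explicit filter just lets you reorder, test, and reuse stages independently.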

If you are ETL'ing a lot of data, you'll probably want to consider splitting it into pieces that can be processed in parallel... but that leads into a bigger discussion.
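As a starting point for that discussion, the per-file loop parallelizes naturally with Parallel.ForEach; a sketch (the validation rule and counters are made up, and shared state uses thread-safe constructs since the body runs concurrently):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelEtl
{
    public static int ProcessAll(string[] files)
    {
        // Thread-safe error collection, since iterations run on multiple threads.
        var errors = new ConcurrentBag<string>();
        int uploaded = 0;

        Parallel.ForEach(files, file =>
        {
            if (file.EndsWith(".bad"))
                errors.Add(file);                          // bounce invalid files
            else
                Interlocked.Increment(ref uploaded);       // upload would go here
        });

        Console.WriteLine($"Uploaded {uploaded}, bounced {errors.Count}");
        return uploaded;
    }

    public static void Main() => ProcessAll(new[] { "a.pdf", "b.bad", "c.xml" });
}
```

The caveat is that the uploader side (SFDC API limits, DbAmp's linked-server throughput) may become the bottleneck before the CPU does, so measure before parallelizing.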

Hope something in there helps...
