Design Pattern for Complex Data Modeling

database-designdesign-patternsrelational-database

I'm developing a program that has a SQL database as a backing store. As a very broad description, the program itself allows a user to generate records in any number of user-defined tables and make connections between them. As for specs:

  1. Any record generated must be able to be connected to any other record in any other user table (excluding itself…the record, not the table). Yes, this creates a arbitrary user-generated relational map.
  2. These "connections" are directional: Incoming, outgoing, and bi-directional.
  3. The list of connections a record has is user ordered.
  4. A record must "know" of connections made from it to others as well as connections made to it from others.
  5. A record's field can also include aggregate information from it's connections (like obtaining average, sum, etc) that must be updated on change from another record it's connected to.
  6. To conserve memory, only relevant information must be loaded at any one time (can't load the entire database in memory at load and go from there).
  7. I cannot assume the backing store is local. Right now it is, but eventually this program will include syncing to a remote db.
  8. I cannot depend on a specific implementation of a backing store. This means I cannot be certain the backing store is, in fact, relational. Thus all relational data is stored explicitly.

The connections are kind of the point of this program, so there is a strong possibility that the number of connections made is very high, especially if the user is using the software as intended. Neither the user tables, connections or records are known at design time as they are user generated.

I've spent a lot of time trying to figure out how to design the backing store and the object model to best fit these specs. In my first design attempt on this, I had one object managing all a table's records and connections. I attempted this first because it kept the memory footprint smaller (records and connections were simple dicts), but maintaining aggregate and link information between tables became….onerous (ie…a huge spaghettified mess). Tracing dependencies using this method almost became impossible. Instead, I've settled on a distributed smart graph where each record and connection is 'aware' of what's around it by managing it own data and connections to other records. Doing this increases my memory footprint but also let me create a faulting system so connections/records aren't loaded into memory until they're needed. It's also much easier to code: trace dependencies, eliminate cycling recursive updates, etc. My biggest problem is storing/loading the connections. I'm not happy with any of my current solutions/ideas so I wanted to ask and see if anybody else has any ideas of how this should be structured. Connections are fairly simple. They contain: fromRecordID, fromTableID, fromRecordOrder, toRecordID, toTableID, toRecordOrder. Here's what I've come up with so far:

  1. Store all the connections in one big table. If I do this, either I load all connections at once (one big db call) or make a call every time a user table is loaded. The big issue here: the size of the connections table has the potential to be huge, and I'm afraid it would slow things down.
  2. Store in separate tables all the outgoing connections for each user table. This is probably the worst idea I've had. Now my connections are 'spread out' over multiple tables (one for each user table), which means I have to load each table just to find all the incoming connections for a particular user table. I've avoided making "one big ass table", but I'm not sure the cost is worth it.
  3. Store in separate tables all outgoing AND incoming connections for each user table (using a flag to distinguish between incoming vs outgoing). This is the idea I'm leaning towards, but it will essentially double the total storage for all the connections (as each connection will be stored in two tables). It also means I have to make sure connection information is kept in sync in both places (each 'from' connection has a corresponding 'to' connection in the appropriate table. This is obviously not ideal but it does mean that when I load a user table, I only need to load one 'connection' table and have all the information I need. This also presents a separate problem, that of connection object creation. Since each user table has a list of all connections, there are two opportunities for a connection object to be made. However, connections objects (designed to facilitate communication between records) should only be created once. This means I'll have to devise a common caching/factory object to make sure only one connection object is made per connection.

Does anybody have any ideas of a better way to do this? Once I've committed to a particular design pattern I'm pretty much stuck with it, so I want to make sure I've come up with the best one possible.

I'll also note that my current backing store is SQLite. I've written the program to not depend on a specific backing store, though, by introducing an adapter layer to disassociate the data retrieval from the the rest of the program. So the backing store could be anything, even a simple text file. I use the backing store exclusively for storage and data retrieval, nothing else. I don't use any specific database/SQL relational features and I don't use the database to perform calculations (aggregate or otherwise). All of that is handled in program logic.

Best Answer

All your requirements, save for #8, are implemented by relational databases (consider INFORMATION_SCHEMA, which is a standard relational database feature- the information schema contains views which describe the tables in the database, which columns they have, etc.- your app would work a lot with the INFORMATION_SCHEMA data).

Thanks to #8, you have basically cornered yourself into writing your own relational database engine.

It sounds like you are implementing Access which... implements its own relational database (Jet); it also sucks because implementing a relational engine is quite hard (more if you are subject to the constraints that Access has), and in the end, Access doesn't really make hard things easy.

Implementing a relational database engine on top of another datastore might simplify things, of course (i.e. Berkeley DB is useful for implementing an RDBMS, and Berkeley DB is everywhere. See a related answer I gave recently), but I would think hard about requirement #8 and the viability of your project.

Related Topic