Web-development – Should I use BLOB or Tables for storing large data

Tags: architecture, blob, performance, rdbms, web-development

Problem

I am currently investigating a solution to improve the performance of a web application. The application works well for small projects, but the UI faces performance issues when working with large projects.

The use case is the following:

A user has to submit an Excel document that contains 10,000 line items. Each line item contains approximately 50 terms, and each term can have one or more attributes.
The system should support a project that can handle 200 users uploading such documents.
At most 10 users can be active concurrently. There can be multiple such large projects.

The database currently used is Oracle. We also need to ensure that the chosen solution will work well with an in-memory columnar RDBMS.

The existing functionality, which offers both a web UI and an Excel interface, works well for small projects. The web UI has performance issues with large projects, however, so for those we will rely solely on the Excel interface.

The operations on the data involve upload/import, download/export, editing and reports.

All actions have to be transactional, since there are other updates within the RDBMS that occur as part of the upload. So this cannot be put in a non-transactional data source. There is at least one main operation where we need to load all the data. This operation can be done asynchronously.

Existing solution

Our existing solution, which runs on Tomcat and Oracle, uses wide tables. It works well up to 1,000 line items and then runs into performance issues on the application server.
The issues stem from Java object hydration: the wide table has a large number of null columns, so the hydrated Java objects carry many empty fields, are large as a result, and cause memory pressure on the application server.

Options

In order to handle a larger number of line items we need to reduce the memory footprint of the existing solution.
We are trying to decide between the following approaches:

  1. BLOB
  2. Narrow table
  3. Redesigned Java object (New)

BLOB solution

One way to avoid the null values is to transform the Excel document into a concise key-value format that can be compressed and stored in the database as one BLOB per user (a sketch follows the drawbacks list below).
The advantage of this approach is:

  1. It uses considerably less space in the DB.

The drawbacks are:

  1. We are limited in what we can do, since some operations need to
     process data across all users.
  2. A small edit causes the whole BLOB to be rewritten, leading to redo
     log growth.
  3. It will be difficult to retrofit the existing UI onto this model in the future.
  4. We would have to maintain a new model for large projects.
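For concreteness, here is a minimal Java sketch of the BLOB approach, assuming a hypothetical PROJECT_UPLOAD table with (project_id, user_id, payload) columns. Only populated terms are written out, as tab-separated rows, then gzipped and stored as a single BLOB per user:

    import java.io.ByteArrayOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.Map;
    import java.util.zip.GZIPOutputStream;

    public class BlobUploadSketch {

        // Serialises only the populated terms of each line item as
        // "lineItemId<TAB>term<TAB>value" rows, gzips the result, and
        // stores it as one BLOB row per user and project.
        public void storeUserUpload(Connection conn, long projectId, long userId,
                                    Map<Long, Map<String, String>> termsByLineItem)
                throws Exception {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (Writer out = new OutputStreamWriter(
                    new GZIPOutputStream(buffer), StandardCharsets.UTF_8)) {
                for (Map.Entry<Long, Map<String, String>> lineItem
                        : termsByLineItem.entrySet()) {
                    for (Map.Entry<String, String> term
                            : lineItem.getValue().entrySet()) {
                        // Empty terms are simply never written, which is where
                        // the space saving over the wide table comes from.
                        out.write(lineItem.getKey() + "\t" + term.getKey()
                                + "\t" + term.getValue() + "\n");
                    }
                }
            }
            byte[] compressed = buffer.toByteArray();
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO project_upload (project_id, user_id, payload) "
                  + "VALUES (?, ?, ?)")) {
                ps.setLong(1, projectId);
                ps.setLong(2, userId);
                ps.setBytes(3, compressed);
                ps.executeUpdate();
            }
        }
    }

Note that drawback 2 is visible right in this sketch: any edit means rebuilding and rewriting the entire compressed payload.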

Narrow table

This approach avoids the null values by using a table with only a few columns and one row per term. The number of null columns drops drastically, the Java objects hydrated from these rows have no empty fields and stay small, and so the memory issue is alleviated (a hydration sketch follows this section's drawbacks).
The advantages are:

  1. A narrow table is well suited for an in-memory columnar approach
  2. Keeps open the possibility of reworking the UI to work against the new table structure

The drawbacks are:

  1. An order-of-magnitude increase in the number of rows: a single
     project ends up with 10,000 x 50 x 200 rows, i.e., 100 million rows.
  2. We would have to maintain a new model, since the UI will not be
     touched and will continue to run off the old model.
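To illustrate why hydration stays cheap, here is a hedged sketch, assuming a hypothetical narrow table LINE_ITEM_TERM with one row per populated term (project_id, line_item_id, term_name, term_value). Each hydrated object holds exactly three populated values, never 50 mostly-null fields:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class NarrowTableSketch {

        // One hydrated row: no empty fields, so the object stays small.
        public record TermRow(long lineItemId, String term, String value) {}

        // Loads only the populated terms of a project and groups them
        // by line item.
        public Map<Long, List<TermRow>> loadProject(Connection conn, long projectId)
                throws Exception {
            Map<Long, List<TermRow>> byLineItem = new HashMap<>();
            String sql = "SELECT line_item_id, term_name, term_value "
                       + "FROM line_item_term WHERE project_id = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, projectId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        TermRow row = new TermRow(
                                rs.getLong(1), rs.getString(2), rs.getString(3));
                        byLineItem.computeIfAbsent(
                                row.lineItemId(), k -> new ArrayList<>()).add(row);
                    }
                }
            }
            return byLineItem;
        }
    }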

Redesigned Java class

We had not initially considered this approach, but it looks like a good option.
We keep the existing data model but revamp our Java class so that it is backed by a map, holding only the populated fields. This avoids having a class with a large number of mostly empty fields and hence reduces the memory footprint for a sparsely populated object (a sketch follows).
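A minimal sketch of what such a map-backed class could look like (the shape of the class is an assumption; only the idea of holding populated fields in a map comes from the option above):

    import java.util.HashMap;
    import java.util.Map;

    // A line item backed by a map instead of ~50 named fields. Only
    // terms actually populated in the spreadsheet occupy memory, so a
    // sparsely filled row costs a few map entries rather than a
    // full-width object.
    public class SparseLineItem {

        private final long lineItemId;
        private final Map<String, String> terms = new HashMap<>();

        public SparseLineItem(long lineItemId) {
            this.lineItemId = lineItemId;
        }

        public long getLineItemId() {
            return lineItemId;
        }

        // Store a term only if it carries a value; nulls are never kept.
        public void setTerm(String name, String value) {
            if (value != null && !value.isEmpty()) {
                terms.put(name, value);
            }
        }

        // Returns null for unset terms, mirroring an empty wide-table column.
        public String getTerm(String name) {
            return terms.get(name);
        }
    }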

The advantages

  1. Solves the application memory issue with the least impact of all three options
  2. Uses the existing data model

Drawbacks

  1. Does not get rid of empty columns in the DB. But I think we can live with this for now.
  2. May not be the best format for a columnar in-memory RDBMS

Question

What is the best approach to take?


Update
As I was cleaning up the description, a potential third option (the redesigned Java class) dawned on me. I am going to investigate it further, as it looks promising and has no model impact. Let me know if this is not a good option for the use case, and whether you see any issues with it.

Best Answer

The challenge is: how should this information be stored efficiently in an RDBMS?

The question should be: why should this information be stored in an RDBMS at all?

What are you going to do with it once it's there?

If all you're going to do is "save" a spreadsheet into the database and then pull it back out again, then I'd suggest you're wasting your time. It's a file; put it in a file system where it belongs and from where you can [far more] easily retrieve it.

However ...

If you want to interrogate the "uploaded" data and "slice and dice" it, drawing summaries across the data uploaded by many users, then the database is most definitely the way to go.

OK, 100M rows is a lot, but with proper indexing (and partitioning, if you have the option), your database will cope with it; a sketch of what that might look like follows.
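For what that might look like in practice, here is a hedged Oracle sketch (table and column names are hypothetical, borrowed from the narrow-table option above): a hash-partitioned table plus a local index, so per-project queries prune down to a single partition.

    import java.sql.Connection;
    import java.sql.Statement;

    public class PartitionSketch {

        // Creates a hash-partitioned narrow table with a local index so
        // that per-project queries only touch one partition.
        public void createPartitionedTable(Connection conn) throws Exception {
            try (Statement st = conn.createStatement()) {
                st.execute(
                    "CREATE TABLE line_item_term ("
                  + "  project_id   NUMBER        NOT NULL,"
                  + "  line_item_id NUMBER        NOT NULL,"
                  + "  term_name    VARCHAR2(100) NOT NULL,"
                  + "  term_value   VARCHAR2(4000)"
                  + ") PARTITION BY HASH (project_id) PARTITIONS 16");
                // A local index is partitioned the same way as the table,
                // so pruning by project_id keeps scans to one partition.
                st.execute(
                    "CREATE INDEX ix_term_project ON line_item_term "
                  + "(project_id, line_item_id) LOCAL");
            }
        }
    }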
