Data Parsing – Approach for Parsing and Indexing Very Large Files

Tags: data, indexing, parsing

I have been tasked with developing a web-based (i.e. runs in the browser) viewer for a proprietary log file format.

I have no control over the format of the logs; I just consume them. Each line of the log file consists of binary data followed by a text message, so part of each line must be de-serialized on read.

What I am asking is: what is the preferred approach for quickly reading the file and making it available for searching and for retrieving pages of text?

My approaches:

I have a web service read the file and bulk-insert its contents (a sketch of the SQLite option follows the list):

  1. Into a SQL Server installation
  2. Into a serverless NoSQL store such as LiteDB
  3. Into a serverless SQLite instance, with one SQLite file per log file.
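For the SQLite variant, what typically matters for write speed is batching all inserts into a single transaction and reusing one prepared statement. A minimal sketch is below; the choice of Microsoft.Data.Sqlite, the two-column `entries` table, and the method name are my own, not something from the question.

```csharp
using System.Collections.Generic;
using Microsoft.Data.Sqlite;

public static class SqliteLogStore
{
    // Bulk-inserts parsed records into one SQLite file per log file.
    // A single transaction plus a reused command is what keeps this fast.
    public static void BulkInsert(string dbPath, IEnumerable<(byte[] Blob, string Message)> records)
    {
        using var connection = new SqliteConnection($"Data Source={dbPath}");
        connection.Open();

        using var create = connection.CreateCommand();
        create.CommandText =
            "CREATE TABLE IF NOT EXISTS entries (id INTEGER PRIMARY KEY, blob BLOB, message TEXT)";
        create.ExecuteNonQuery();

        using var transaction = connection.BeginTransaction();
        using var insert = connection.CreateCommand();
        insert.Transaction = transaction;
        insert.CommandText = "INSERT INTO entries (blob, message) VALUES ($blob, $message)";
        var blobParam = insert.Parameters.Add("$blob", SqliteType.Blob);
        var msgParam  = insert.Parameters.Add("$message", SqliteType.Text);

        foreach (var (blob, message) in records)
        {
            blobParam.Value = blob;      // binary prefix of the line
            msgParam.Value  = message;   // text portion of the line
            insert.ExecuteNonQuery();    // reuses the same prepared statement
        }

        transaction.Commit();
    }
}
```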

The data is wiped once a log file has not been viewed for a week.

The problem is that these approaches work fine on files smaller than 100 MB but quickly break down with log files that sometimes grow to 2 GB.

Even on a 2 GB file, parsing is relatively fast and seemingly not the bottleneck.
On the serverless options, writing the parsed data is also relatively quick.

Full-text querying is not great with any of the options, though once an index has had time to build on SQL Server, it's decent.

I am wondering if anyone has advice or experience with this sort of project: quickly de-serialize a large file, bulk-insert it into some sort of data store (maybe an in-memory store would be better), and make it available for interaction via a web service.

CLARIFICATIONS:

@Basile: Sorry, I didn't mean to imply that; I just wanted to make sure I was looking down the best avenue and had the best-suited solution.

@Basile: The log files are dead simple: a 45-byte binary blob that I read as bytes and copy into a struct (in .NET), then I read the rest of the line as text until a newline character. Repeat until EOF.
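For reference, a minimal sketch of that parse loop. The `LogRecord` shape, the UTF-8 assumption for the text part, and reading the blob into a byte array rather than copying it into a struct are all simplifications of mine:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical record type; the real 45-byte layout is not shown in the question.
public sealed class LogRecord
{
    public byte[] Blob { get; set; } = new byte[45]; // raw binary prefix
    public string Message { get; set; } = "";        // text up to the newline
}

public static class LogParser
{
    // Reads the whole file: 45 binary bytes, then text until '\n', repeated until EOF.
    public static IEnumerable<LogRecord> Parse(string path)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, bufferSize: 1 << 16))
        {
            var blob = new byte[45];
            var text = new List<byte>();

            while (true)
            {
                // Read the fixed-size binary prefix; stop cleanly at EOF.
                int read = stream.Read(blob, 0, blob.Length);
                if (read == 0) yield break;
                if (read < blob.Length)
                    throw new InvalidDataException("Truncated binary prefix.");

                // Read the text portion until the newline (buffered by FileStream).
                text.Clear();
                int b;
                while ((b = stream.ReadByte()) != -1 && b != '\n')
                    text.Add((byte)b);

                yield return new LogRecord
                {
                    Blob = (byte[])blob.Clone(),
                    Message = Encoding.UTF8.GetString(text.ToArray()).TrimEnd('\r')
                };
            }
        }
    }
}
```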

@DocBrown: Sorry, I read this and thought it was appropriate; see here.

@Christophe: I don't have three DBs; those are just the three approaches I've tested so far.

Best Answer

I don't think it is a great idea to preload 2 GB of log data. That would make for an unbearable user experience, and if you run this thing on a production server you will set off a bunch of alarms in the NOC.

I would focus on keeping a small memory footprint and reading as little of the file as possible. There are ways to search the file without actually loading it all. Some common use cases:

User wants to see log data from a certain portion of the file, as indicated by dragging a scrollbar (see the sketch after this list):

  1. Open a file stream on the file and compute the target of a Seek operation by multiplying [total file length] * [scrollbar percentage].
  2. Seek to the target location.
  3. Read until the next newline character and discard what was read (you almost certainly landed mid-record).
  4. Start reading and displaying records until the screen is full.
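A sketch of those four steps in C#. For brevity it treats the file as newline-terminated text lines (ignoring the possibility of a newline byte occurring inside the binary prefix); `ReadPage` and its parameters are my own naming:

```csharp
using System.Collections.Generic;
using System.IO;

public static class LogViewer
{
    // Returns roughly one screenful of raw lines starting at a scrollbar position (0.0 to 1.0).
    public static List<string> ReadPage(string path, double scrollbarPercent, int linesPerScreen)
    {
        var lines = new List<string>(linesPerScreen);
        using var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
        using var reader = new StreamReader(stream);

        // Steps 1-2: compute the target offset and seek to it.
        long target = (long)(stream.Length * scrollbarPercent);
        stream.Seek(target, SeekOrigin.Begin);
        reader.DiscardBufferedData();

        // Step 3: read and discard the (probably partial) line we landed in.
        reader.ReadLine();

        // Step 4: read whole records until the screen is full or the file ends.
        string? line;
        while (lines.Count < linesPerScreen && (line = reader.ReadLine()) != null)
            lines.Add(line);

        return lines;
    }
}
```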

User wants to see log data for a certain time period (see the sketch after this list):

  1. Open a file stream on the file.
  2. Divide the file in half; seek to the halfway point.
  3. Skip to the next newline character, then read the next record.
  4. Parse the date/time stamp.
  5. If the date/time is just right, stop
  6. If the date/time is too low, repeat the above operations on the second half of the file (so in step 2 you're actually finding the 3/4th mark)
  7. If the date/time is too high, repeat the above operations on the first half (so in step 2 you're actually finding the 1/4th mark)
  8. Keep recursing until you find the time range that is desired. This is known as a binary search.
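A sketch of that binary search. Because the exact record layout and timestamp format are not given, the format-specific part (skipping to the next record after an offset and parsing its timestamp) is left as a callback; all names here are mine:

```csharp
using System;
using System.IO;

public static class LogSearch
{
    // Binary search over a chronologically ordered log file.
    // 'readTimestampAfter' is a format-specific callback (not shown): it should skip to the
    // start of the next record after the given offset and return that record's timestamp.
    public static long FindOffsetForTime(
        string path, DateTime target, Func<FileStream, long, DateTime> readTimestampAfter)
    {
        using var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
        long low = 0, high = stream.Length;

        // Halve the window until it is small enough to read sequentially.
        while (high - low > 64 * 1024)
        {
            long mid = low + (high - low) / 2;                 // step 2: halfway point
            DateTime stamp = readTimestampAfter(stream, mid);  // steps 3-4: read and parse
            if (stamp < target)
                low = mid;    // step 6: too early, keep the second half
            else
                high = mid;   // step 7: too late (or just right), keep the first half
        }
        return low;           // caller reads forward from here and filters by time
    }
}
```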