PHP – Help parsing a long (3.5 million lines) text file line by line and storing data; need a strategy

design, PHP

This is a question about a particular problem I am struggling with. I am parsing a long list of text data, line by line, for a business app in PHP (a cron script run on the CLI). The file follows this format:

    HD: Some text here {text here too}

    DC: A description here
    DC: the description continues here
    DC: and it ends here.

    DT: 2012-08-01

    HD: Next header here {supplemental text}

    ... this repeats over and over for a few hundred megs

I have to read each line, pick out the HD: lines, and grab the text on each one. I then compare this text against data stored in a database. When a match is found, I want to record the DC: lines that follow the matched HD: line.

Pseudo code:

    while ( the_file_pointer_isnt_end_of_file) {
        line = getCurrentLineFromFile
        title = parseTitleFrom(line)
        matched = searchForMatchInDB(title)
        if ( matched ) {
            recordTheDCLines  // <- Best way to do this?
        }
    }

My problem is this: because I am reading line by line, what is the best way to trigger the script to start saving DC: lines and then, once they are finished, save them to the database?

I have a vague idea but have yet to properly implement it. I would love to hear the community's ideas and suggestions!

Thank you.

Best Answer

Separate the problem into two scripts. The first plows through the file and stuffs the interesting records into some sort of data store. The second pulls from that data store and processes the records. I suspect this will be much faster than doing everything in one script, if only because running the second script as its own process lets the parsing and the processing proceed in parallel, which effectively multi-threads the app.
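As for knowing when to start and stop collecting DC: lines, one possible approach for the first script is a small state machine: an HD: line opens a record, DC: and DT: lines are appended to the open record, and the next HD: line (or end of file) closes it. The sketch below assumes the HD:/DC:/DT: prefixes shown in the question. The names `readRecords` and `collectMatches` are made up for illustration, and the `$isMatch` callback is only a stand-in for the real database lookup.

```php
<?php
declare(strict_types=1);

/**
 * Yields one record per HD: block as
 * ['title' => string, 'description' => string[], 'date' => ?string].
 * Lines without a known prefix (including blank lines) are ignored.
 */
function readRecords($handle): Generator
{
    $record = null;
    while (($line = fgets($handle)) !== false) {
        $line = trim($line);
        if (strncmp($line, 'HD:', 3) === 0) {
            // A new header closes the previous record, if there is one.
            if ($record !== null) {
                yield $record;
            }
            $record = ['title' => trim(substr($line, 3)), 'description' => [], 'date' => null];
        } elseif ($record !== null && strncmp($line, 'DC:', 3) === 0) {
            $record['description'][] = trim(substr($line, 3));
        } elseif ($record !== null && strncmp($line, 'DT:', 3) === 0) {
            $record['date'] = trim(substr($line, 3));
        }
    }
    // End of file closes the last open record.
    if ($record !== null) {
        yield $record;
    }
}

/** Collects the records whose title passes $isMatch (hypothetical stand-in for the DB lookup). */
function collectMatches($handle, callable $isMatch): array
{
    $matches = [];
    foreach (readRecords($handle) as $record) {
        if ($isMatch($record['title'])) {
            $matches[] = $record;
        }
    }
    return $matches;
}

// Demo on in-memory sample data; a real run would fopen() the large file instead.
$sample = "HD: First {a}\n\nDC: line one\nDC: line two\n\nDT: 2012-08-01\n\nHD: Second {b}\nDC: other\n";
$handle = fopen('php://memory', 'r+');
fwrite($handle, $sample);
rewind($handle);
$matches = collectMatches($handle, static function (string $title): bool {
    return strncmp($title, 'First', 5) === 0;
});
fclose($handle);
print_r($matches);
```

Because the generator holds only one record at a time, memory use stays flat regardless of file size. In a two-script setup, the first script could write each matched record to a staging table or queue instead of an array, and the second script would read from there.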

Related Solutions

Database vs Text File – When to Prefer Database Over Parsing Data from a Text File?

When should a database be preferred for storing the data over storing the data in a text file?

Wikipedia tells us that a database is an organized collection of data. By that measure, your text file is a database. It goes on to say:

"The data are typically organized to model relevant aspects of reality in a way that supports processes requiring this information. For example, modeling the availability of rooms in hotels in a way that supports finding a hotel with vacancies."

That part is subjective -- it doesn't tell us specifically how the data should be modeled or what operations need to be optimized. Your text file consists of a number of distinct records, one for each day, so you're modeling an aspect of reality in a way that's relevant to your problem.

I realize that when you say "database" you're probably thinking of some sort of relational database management system, but thinking of your text file as a database changes your question from "when should I use a database?" to "what kind of database should I use?" Seeing things in that light makes the answer easier to see: use a better database when the one you've got no longer meets your requirements.

If your Python script and simple text file work well enough, there's no need to change. With only one new record per day and computers getting faster each year, I suspect that your current solution could be viable for a long time. A decade's worth of data would give you only 3650 records that, once parsed, would probably require less than 75 kilobytes.

Imagine that instead of one small record per day, you decided to record every question asked on CodeReview, who asked it, and when. Furthermore, you also collect all the answers and the relevant metadata. You could store all that in a text file, but a flat file would make it difficult to find information when you needed it. There'd be too much data to read the whole thing into memory, so whenever you wanted to find a question or answer, you'd have to scan through the file until you found what you were looking for. When you wanted to find all the questions asked by a given user, you'd have to scan through the entire file. If you wanted to find all the questions that have "bugs" as a tag, you'd have to scan through the file.

That'd be horribly slow, so you might decide to speed things up by building some indexes that tell you where to look in the file to find a given record. You could have an index for questions, another for users, a third for answers, and so on. When you wanted to find a question you'd search the (much smaller) question index, get the position of the question in the main data file, and jump quickly to the right spot in the file. That'd be a big performance improvement. Indeed, that's pretty much what a database management system is.
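The index idea can be sketched in a few lines. This is a toy illustration, not how any particular DBMS is implemented: it assumes a hypothetical flat file with one record per line in the form `key<TAB>text`, and it keeps the index as a plain in-memory array mapping each key to the byte offset of its line. The function names `buildOffsetIndex` and `fetchByKey` are invented for this example.

```php
<?php
declare(strict_types=1);

/** Scans the file once and maps each record key to the byte offset where its line starts. */
function buildOffsetIndex($handle): array
{
    $index = [];
    rewind($handle);
    while (true) {
        $offset = ftell($handle);          // position where the next line starts
        $line = fgets($handle);
        if ($line === false) {
            break;
        }
        $key = strstr($line, "\t", true);  // text before the first tab, or false if no tab
        if ($key !== false) {
            $index[$key] = $offset;
        }
    }
    return $index;
}

/** Jumps straight to a record through the index instead of scanning the whole file. */
function fetchByKey($handle, array $index, string $key): ?string
{
    if (!isset($index[$key])) {
        return null;
    }
    fseek($handle, $index[$key]);
    $line = fgets($handle);
    return $line === false ? null : rtrim($line, "\r\n");
}

// Demo on in-memory sample data standing in for a large data file.
$data = "q1\tHow do I parse a file?\nq2\tWhen should I use a database?\nq3\tWhat is an index?\n";
$handle = fopen('php://memory', 'r+');
fwrite($handle, $data);

$index = buildOffsetIndex($handle);
$found = fetchByKey($handle, $index, 'q2');
$missing = fetchByKey($handle, $index, 'q9');
fclose($handle);

var_dump($found, $missing);
```

Building the index still costs one full scan, but every lookup after that is a single seek and read. Keeping such an index correct as records are added, changed, or deleted is the kind of work a DBMS takes off your hands.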

So, use a DBMS when it's what you need. Use it when you have a lot of data and when you need to access that data quickly, perhaps in ways that you can't entirely predict at the outset. If you have different kinds of data -- different types of records -- that are connected to each other, use an RDBMS so that you can relate the various records appropriately.

Node.js – Preserving Pre-formatted Multi-Line Strings in Scripts

JavaScript now supports multi-line strings via the backtick syntax (template literals). For example:

    const foo = `$query = <<<EOT
    select
         field1
        ,field2
        ,field3
    from tableName
    where
        field1 = 123
    EOT;`
