Data Structures – Most Efficient Way to Store Data

algorithmsdata structuresefficiencypseudocode

I am in charge of rewriting some old VB code. I understand how it works, but I feel like there is a far-more efficient way to do what they did. I just can't figure out what it is. Here is a contrived example that in terms of data requirements is really similar to what I need to do.

The user has to pick out the manufacturer, make, model and color of their car in a GUI. I have a large text file that looks something like this:

Ford Truck F150 red
Ford Truck F150 blue
Ford Truck F150 black
Ford Truck F150 silver
Ford Truck F250 red
Ford Truck F250 green
Ford Sedan Taurus red
Ford Sedan Taurus green
Ford Sedan Taurus white
Ford...
...

Subaru SUV Forester blue
Subaru SUV Forester red
Subaru SUV Outback Black
Subaru SUV Outback Green
Subaru SUV Outback Blue
Subaru SUV Outback Red
Subaru...
...

etc.

So if the first selection is Subaru, the second box (make) should not have an option to select Truck because none of the Subarus are trucks. Similarly, if they select Ford, Sedan and Taurus, then the last box (color) should not show an option to select blue. Or Black. Or anything other than red, green or white.

The people who wrote the code before me came up with this (in python-y psuedocode):

def getValidOptions():
    items = []
    for i from 0 to numRows:
        options = getLine().split()
        if selectingManufacturer:
            if options[0] not in items:
                items.append(options[0])
        else if selectingMake:
            if selectedManufacturer == options[0] and options[1] not in items:
               items.append(options[1])
        else if selectingModel:
            if selectedManufacturer == options[0] and selectedMake == options[1] and options[2] not in items:
                items.append(options[2])
        else if selectingColor:
            if selectedManufacturer == options[0] and selectedMake == options[1] and selectedModel == options[2] and options[3] not in items:
                items.append(options[3])
    return items

I think that is just hideous, both on an algorithm level, and on a syntax level. For one, it parses through the entire file, when it only needs to read through a couple of lines if done right. To make this even more inefficient, my real data has 6 options to select through, rather than just 4. This is also storing more data then it needs to, given the amount of data duplication.

I'm looking for either a different way of storing the data in the file, or a different way of parsing it to make the getValidOptions function both more pretty and more efficient. Are there any ways I could do this?

Best Answer

All the other answers I read seem to ignore two very basic rules of software development:

  • clarify the requirements first (especially the performance and storage requirements)

  • Keep It Simple, Stupid (see KISS)

You wrote "the text file is large", but did not write too large, so I assume there is actually nothing wrong with its size except your gut feeling. So if loading the file does actually not take too long, and your IT department or someone else does not complain about the wasted disk space, and nobody complains about having problems maintaining the file, let the file format as it is - do not underestimate the value of simplicity.

You are also complaining about the efficiency of the algorithm, which is actually not as efficient as it could be, but it has one very big advantage: it is brain-dead simple and works. So as long as it is efficient enough, do not apply any premature optimization.

So lets assume the program works fast enough, what you should ask first is how can you improve the code in terms of simplicity and the DRY principle? And that is indeed a point which should be improved, since the current code is not DRY. In your example, there appears four times the almost same code block testing if the options on the "higher levels" match the current line, which results in four times the same kind of "append" call (in your real code, the repetition occurs at least six times, as you wrote). You can easily avoid that by introducing a numeric selection level, or passing the already selected options in an ordered list. Using your pseudo-code, this leads to something along the lines of

# selectedOptions is a list, containing either nothing, or "selectedManufacturer"
# or [selectedManufacturer, selectedMake], ..., and so on
def getValidOptions(selectedOptions):
    items = []
    level = selectedOptions.size()
    for i from 0 to numRows:
        options = getLine().split()
        if selectedOptions == options[0:level-1] and options[level] not in item:
            items.append(options[level])
    return items

So this is essentially the same algorithm with no repeated code any more.

Since it is obvious getValidOptions has to be called more than once (at least once per level), I suggest to apply only one optimization (if this is not already the case): make sure the getLine function pulls its data from main memory, and does not read the file from disk again and again.

Related Topic