insert premature-discussion-is-the-root-of-all-evil lecture
That said, here are some habits I've gotten into to avoid unnecessary inefficiency and, in some cases, to make my code simpler and more correct as well. This isn't a discussion of general principles, but of a few things to be aware of so you don't introduce needless inefficiencies into your code.
This should probably be merged into the lengthy discussion above. It's pretty much common sense that a loop inside a loop, where the inner loop repeats a calculation, is going to be slower. For example:
for (i = 0; i < strlen(str); i++) {
    ...
}
This will take a horrendous amount of time if the string is really long, because the length is recalculated on every iteration of the loop. Note that GCC actually optimizes this case because strlen() is marked as a pure function.
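Even so, a minimal fix is to compute the length once, outside the loop. The sketch below assumes the loop body doesn't change the string's length; print_positions is just a made-up stand-in for whatever the loop actually does:

#include <stdio.h>
#include <string.h>

/* Hypothetical example: print each character's position,
   computing the length once instead of on every iteration. */
void print_positions(const char *str)
{
    size_t len = strlen(str);   /* computed once, outside the loop */
    size_t i;

    for (i = 0; i < len; i++)
        printf("%zu: %c\n", i, str[i]);
}

int main(void)
{
    print_positions("hello");
    return 0;
}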
When sorting a million 32-bit integers, bubble sort would be the wrong way to go. In general, sorting can be done in O(n * log n) time (or better, in the case of radix sort), so unless you know your data is going to be small, look for an algorithm that's at least O(n * log n).
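In C, for example, the obvious move is to reach for qsort() from <stdlib.h>, which is typically O(n log n), instead of hand-rolling something quadratic. A quick sketch:

#include <stdio.h>
#include <stdlib.h>

/* Comparator for qsort(); written this way to avoid the overflow
   that "return x - y" could cause. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int values[] = { 42, 7, 19, 3, 23 };
    size_t n = sizeof values / sizeof values[0];
    size_t i;

    qsort(values, n, sizeof values[0], cmp_int);

    for (i = 0; i < n; i++)
        printf("%d\n", values[i]);
    return 0;
}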
Likewise, when dealing with databases, be aware of indexes. If you SELECT * FROM people WHERE age = 20 and you don't have an index on people(age), it'll require an O(n) sequential scan rather than a much faster O(log n) index scan.
Integer arithmetic hierarchy
When programming in C, bear in mind that some arithmetic operations are more expensive than others. For integers, the hierarchy goes something like this (least expensive first):
- addition, subtraction, and bitwise operations (including shifts)
- multiplication
- division and modulus
Granted, the compiler will usually optimize things like n / 2 to n >> 1 automatically if you're targeting a mainstream computer, but if you're targeting an embedded device, you might not get that luxury.
Also, % 2 and & 1 have different semantics. Division and modulus round toward zero (implementation-defined in C89, but required by C99 and later), while good ol' >> and & round toward negative infinity (on the usual two's-complement machines, anyway), which (in my opinion) makes a lot more sense. For instance, on my computer:

printf("%d\n", -1 % 2); // -1 (maybe)
printf("%d\n", -1 & 1); // 1

Hence, use what makes sense. Don't think you're being a good boy by using % 2 when you were originally going to write & 1.
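If you want to see the difference for yourself, here's a small complete program; the results in the comments are what you'd typically get on a two's-complement machine (right-shifting a negative value is technically implementation-defined):

#include <stdio.h>

int main(void)
{
    /* Division and modulus truncate toward zero. */
    printf("%d %d\n", -3 / 2, -1 % 2);   /* prints: -1 -1 */

    /* Arithmetic right shift and bitwise AND round toward negative
       infinity (typical two's-complement behavior). */
    printf("%d %d\n", -3 >> 1, -1 & 1);  /* prints: -2 1 */

    return 0;
}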
Expensive floating point operations
Avoid heavy floating point operations like pow() and log() in code that doesn't really need them, especially when dealing with integers. Take, for example, reading a number:
int parseInt(const char *str)
{
    const char *p;
    int digits;
    int number;
    int position;

    // Count the number of digits
    for (p = str; isdigit(*p); p++)
        {}
    digits = p - str;

    // Sum the digits, multiplying them by their respective power of 10.
    number = 0;
    position = digits - 1;
    for (p = str; isdigit(*p); p++, position--)
        number += (*p - '0') * pow(10, position);

    return number;
}
Not only is this use of pow() (and the int <-> double conversions needed to use it) rather expensive, but it creates an opportunity for precision loss (incidentally, the code above doesn't have precision issues). That's why I wince when I see this type of function used in a non-mathematical context.
Also, notice how the "clever" algorithm below, which multiplies by 10 on each iteration, is actually more concise than the code above:
int parseInt(const char *str)
{
    const char *p;
    int number;

    number = 0;
    for (p = str; isdigit(*p); p++) {
        number *= 10;
        number += *p - '0';
    }
    return number;
}
Appendix A on p. 46 of NIST SP 800-63 draws on the work of Claude Shannon to estimate password entropy as a number of bits. Indeed, this is the document that the XKCD cartoon uses to calculate its entropy bits. Specifically:
- the entropy of the first character is taken to be 4 bits;
- the entropy of the next 7 characters is 2 bits per character; this is roughly consistent with Shannon’s estimate that “when statistical effects extending over not more than 8 letters are considered the entropy is roughly 2.3 bits per character;”
- for the 9th through the 20th character the entropy is taken to be 1.5 bits per character;
- for characters 21 and above the entropy is taken to be 1 bit per character;
- a “bonus” of 6 bits of entropy is assigned for a composition rule that requires both upper case and non-alphabetic characters. This forces the use of these characters, but in many cases these characters will occur only at the beginning or the end of the password, and it reduces the total search space somewhat, so the benefit is probably modest and nearly independent of the length of the password;
- a bonus of up to 6 bits of entropy is added for an extensive dictionary check. If the attacker knows the dictionary, he can avoid testing those passwords, and will in any event be able to guess much of the dictionary, which will, however, be the most likely selected passwords in the absence of a dictionary rule. The assumption is that most of the guessing entropy benefits for a dictionary test accrue to relatively short passwords, because any long password that can be remembered must necessarily be a “pass-phrase” composed of dictionary words, so the bonus declines to zero at 20 characters.
The idea is that an authentication system would pick certain entropy levels as thresholds. For example, 10 bits may be weak, 20 medium and 30 strong (numbers picked arbitrarily as an example, not a recommendation). Unfortunately, the document does not recommend such thresholds, probably because the computational power available to brute force or guess passwords increases over time:
As an alternative to imposing some arbitrary specific set of rules, an
authentication system might grade user passwords, using the rules
stated above, and accept any that meet some minimum entropy standard.
For example, suppose passwords with at least 24-bits of entropy were
required. We can calculate the entropy estimate of
“IamtheCapitanofthePina4” by observing that the string has 23
characters and would satisfy a composition rule requiring upper case
and non-alphabetic characters.
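To finish that arithmetic, here is a small sketch in C (matching the other examples in this document) that implements the per-character rules quoted above; the dictionary-check bonus is omitted, and the original appendix may present the worked example in a slightly different form:

#include <stdio.h>
#include <string.h>

/* Entropy estimate per the NIST SP 800-63 Appendix A rules quoted above
   (dictionary-check bonus omitted). */
static double nist_entropy_bits(size_t len, int satisfies_composition_rule)
{
    double bits = 0.0;
    size_t i;

    for (i = 1; i <= len; i++) {
        if (i == 1)
            bits += 4.0;    /* first character */
        else if (i <= 8)
            bits += 2.0;    /* characters 2 through 8 */
        else if (i <= 20)
            bits += 1.5;    /* characters 9 through 20 */
        else
            bits += 1.0;    /* characters 21 and above */
    }

    if (satisfies_composition_rule)
        bits += 6.0;        /* bonus for upper case + non-alphabetic characters */

    return bits;
}

int main(void)
{
    /* 23 characters, composition rule satisfied:
       4 + 7*2 + 12*1.5 + 3*1 + 6 = 45 bits under these rules. */
    printf("%.1f\n", nist_entropy_bits(strlen("IamtheCapitanofthePina4"), 1));
    return 0;
}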
The NIST document may or may not be what you are looking for, but it is not a bad reference point, if nothing else.
[Edit: Added the following.]
The paper Testing Metrics for Password Creation Policies by Attacking Large Sets of Revealed Passwords (by Matt Weir, Sudhir Aggarwal, Michael Collins and Henry Stern) demonstrated that the Shannon model described above is not an accurate model of the entropy of human-generated passwords. I would recommend looking at "Section 5 Generating New Password Creation Policies" for more accurate proposals.
Best Answer
My suggestion would be to slurp the large list into a hash set, then use that to match items from the small list.
A hash set is a structure that stores its elements in an indexable memory structure, like an array, where each element's position is derived from a hash value calculated from the object. That makes looking up a value in the hash set a relatively fast operation: calculate the hash of the object you're looking for, go to that index, and check the actual objects stored there, which for a good implementation will be a very small number (hash set implementations have to strike a balance between the size of the hash, and therefore the number of first-dimension elements, and the number of collisions, and therefore the average number of items in each element).
Ideally, hash sets approach constant lookup time (specifically it's O(log_(2^H) N), where H is the bit size of the hash function, so for all N < 2^H it's effectively constant), so overall your matching algorithm would approach linear complexity. There are two major downsides. First, unless you have access to a built-in, efficient implementation (Java's HashMap is built on this structure, as is .NET's Dictionary class), you have to roll your own, which is quite a bit of code. Second, hash sets are real memory hogs, because there are virtually guaranteed to be a lot of empty spaces in the array unless your implementation varies its hash function based on expected or actual capacity (which could, if done naively, involve re-hashing every element several times as the first dimension is extended to limit growth in the second dimension).
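To make the shape of the idea concrete, here's a minimal sketch in C (matching the rest of this document's examples) of a fixed-size, chained hash set. The names and data are hypothetical; there's no resizing, error handling, or freeing, and a production implementation would do considerably more:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096   /* fixed first dimension; a real implementation would grow this */

struct node {
    char *key;
    struct node *next;
};

static struct node *buckets[NBUCKETS];

/* Simple string hash (djb2-style), reduced to a bucket index. */
static size_t hash(const char *s)
{
    size_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

static void set_add(const char *key)
{
    size_t h = hash(key);
    struct node *n;

    for (n = buckets[h]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return;                          /* already present */

    n = malloc(sizeof *n);
    n->key = malloc(strlen(key) + 1);
    strcpy(n->key, key);
    n->next = buckets[h];
    buckets[h] = n;
}

static int set_contains(const char *key)
{
    struct node *n;

    for (n = buckets[hash(key)]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return 1;
    return 0;
}

int main(void)
{
    /* Hypothetical stand-ins for the "large list" and the "small list". */
    const char *large[] = { "alice", "bob", "carol", "dave" };
    const char *small[] = { "carol", "mallory" };
    size_t i;

    /* Slurp the large list into the hash set... */
    for (i = 0; i < sizeof large / sizeof large[0]; i++)
        set_add(large[i]);

    /* ...then do a near-constant-time membership test for each small-list item. */
    for (i = 0; i < sizeof small / sizeof small[0]; i++)
        printf("%s: %s\n", small[i],
               set_contains(small[i]) ? "match" : "no match");

    return 0;
}

Chaining keeps the sketch short; open addressing or a table that resizes with the number of elements would address the memory-hog concern mentioned above.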