insert premature-discussion-is-the-root-of-all-evil lecture
That said, here are some habits I've gotten into to avoid unnecessary inefficiency and, in some cases, to make my code simpler and more correct as well. This isn't a discussion of general principles, but of a few things to be aware of so you don't introduce needless inefficiencies into your code.
This should probably be merged into the lengthy discussion above. It's pretty much common sense that a loop inside a loop, where the inner loop repeats a calculation, is going to be slower. For example:
for (i = 0; i < strlen(str); i++) {
    ...
}
This will take a horrendous amount of time if the string is really long, because the length is recalculated on every iteration of the loop. Note that GCC actually optimizes this case because strlen() is marked as a pure function.
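Even so, a minimal fix is to compute the length once, outside the loop. The sketch below assumes the loop body doesn't change the string's length; print_positions is just a made-up stand-in for whatever the loop actually does:

#include <stdio.h>
#include <string.h>

/* Hypothetical example: print each character's position,
   computing the length once instead of on every iteration. */
void print_positions(const char *str)
{
    size_t len = strlen(str);   /* computed once, outside the loop */
    size_t i;

    for (i = 0; i < len; i++)
        printf("%zu: %c\n", i, str[i]);
}

int main(void)
{
    print_positions("hello");
    return 0;
}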
When sorting a million 32-bit integers, bubble sort would be the wrong way to go. In general, sorting can be done in O(n * log n) time (or better, in the case of radix sort), so unless you know your data is going to be small, look for an algorithm that's at least O(n * log n).
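In C, for example, the obvious move is to reach for qsort() from <stdlib.h>, which is typically O(n log n), instead of hand-rolling something quadratic. A quick sketch:

#include <stdio.h>
#include <stdlib.h>

/* Comparator for qsort(); written this way to avoid the overflow
   that "return x - y" could cause. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int values[] = { 42, 7, 19, 3, 23 };
    size_t n = sizeof values / sizeof values[0];
    size_t i;

    qsort(values, n, sizeof values[0], cmp_int);

    for (i = 0; i < n; i++)
        printf("%d\n", values[i]);
    return 0;
}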
Likewise, when dealing with databases, be aware of indexes. If you SELECT * FROM people WHERE age = 20 and you don't have an index on people(age), it'll require an O(n) sequential scan rather than a much faster O(log n) index scan.
Integer arithmetic hierarchy
When programming in C, bear in mind that some arithmetic operations are more expensive than others. For integers, the hierarchy goes something like this (least expensive first):
- addition, subtraction, and bitwise operations (including shifts)
- multiplication
- division and modulus
Granted, the compiler will usually optimize things like n / 2 to n >> 1 automatically if you're targeting a mainstream computer, but if you're targeting an embedded device, you might not get that luxury.
Also, % 2 and & 1 have different semantics. Division and modulus round toward zero (implementation-defined in C89, but required by C99 and later), while good ol' >> and & round toward negative infinity (on the usual two's-complement machines, anyway), which (in my opinion) makes a lot more sense. For instance, on my computer:

printf("%d\n", -1 % 2); // -1 (maybe)
printf("%d\n", -1 & 1); // 1

Hence, use what makes sense. Don't think you're being a good boy by using % 2 when you were originally going to write & 1.
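If you want to see the difference for yourself, here's a small complete program; the results in the comments are what you'd typically get on a two's-complement machine (right-shifting a negative value is technically implementation-defined):

#include <stdio.h>

int main(void)
{
    /* Division and modulus truncate toward zero. */
    printf("%d %d\n", -3 / 2, -1 % 2);   /* prints: -1 -1 */

    /* Arithmetic right shift and bitwise AND round toward negative
       infinity (typical two's-complement behavior). */
    printf("%d %d\n", -3 >> 1, -1 & 1);  /* prints: -2 1 */

    return 0;
}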
Expensive floating point operations
Avoid heavy floating point operations like pow() and log() in code that doesn't really need them, especially when dealing with integers. Take, for example, reading a number:
int parseInt(const char *str)
{
    const char *p;
    int digits;
    int number;
    int position;

    // Count the number of digits
    for (p = str; isdigit(*p); p++)
        {}
    digits = p - str;

    // Sum the digits, multiplying them by their respective power of 10.
    number = 0;
    position = digits - 1;
    for (p = str; isdigit(*p); p++, position--)
        number += (*p - '0') * pow(10, position);

    return number;
}
Not only is this use of pow() (and the int <-> double conversions needed to use it) rather expensive, but it creates an opportunity for precision loss (incidentally, the code above doesn't have precision issues). That's why I wince when I see this type of function used in a non-mathematical context.
Also, notice how the "clever" algorithm below, which multiplies by 10 on each iteration, is actually more concise than the code above:
int parseInt(const char *str)
{
    const char *p;
    int number;

    number = 0;
    for (p = str; isdigit(*p); p++) {
        number *= 10;
        number += *p - '0';
    }
    return number;
}
Appendix A on p. 46 of NIST SP 800-63 draws on the work of Claude Shannon to estimate password entropy as a number of bits. Indeed, this is the document that the XKCD cartoon uses to calculate its entropy bits. Specifically:
- the entropy of the first character is taken to be 4 bits;
- the entropy of the next 7 characters is 2 bits per character; this is roughly consistent with Shannon’s estimate that “when statistical effects extending over not more than 8 letters are considered the entropy is roughly 2.3 bits per character;”
- for the 9th through the 20th character the entropy is taken to be 1.5 bits per character;
- for characters 21 and above the entropy is taken to be 1 bit per character;
- a “bonus” of 6 bits of entropy is assigned for a composition rule that requires both upper case and non-alphabetic characters. This forces the use of these characters, but in many cases these characters will occur only at the beginning or the end of the password, and it reduces the total search space somewhat, so the benefit is probably modest and nearly independent of the length of the password;
- a bonus of up to 6 bits of entropy is added for an extensive dictionary check. If the attacker knows the dictionary, he can avoid testing those passwords, and will in any event be able to guess much of the dictionary, which will, however, be the most likely selected passwords in the absence of a dictionary rule. The assumption is that most of the guessing entropy benefits for a dictionary test accrue to relatively short passwords, because any long password that can be remembered must necessarily be a “pass-phrase” composed of dictionary words, so the bonus declines to zero at 20 characters.
The idea is that an authentication system would pick certain entropy levels as thresholds. For example, 10 bits may be weak, 20 medium and 30 strong (numbers picked arbitrarily as an example, not a recommendation). Unfortunately, the document does not recommend such thresholds, probably because the computational power available to brute force or guess passwords increases over time:
As an alternative to imposing some arbitrary specific set of rules, an
authentication system might grade user passwords, using the rules
stated above, and accept any that meet some minimum entropy standard.
For example, suppose passwords with at least 24-bits of entropy were
required. We can calculate the entropy estimate of
“IamtheCapitanofthePina4” by observing that the string has 23
characters and would satisfy a composition rule requiring upper case
and non-alphabetic characters.
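To finish that arithmetic, here is a small sketch in C (matching the other examples in this document) that implements the per-character rules quoted above; the dictionary-check bonus is omitted, and the original appendix may present the worked example in a slightly different form:

#include <stdio.h>
#include <string.h>

/* Entropy estimate per the NIST SP 800-63 Appendix A rules quoted above
   (dictionary-check bonus omitted). */
static double nist_entropy_bits(size_t len, int satisfies_composition_rule)
{
    double bits = 0.0;
    size_t i;

    for (i = 1; i <= len; i++) {
        if (i == 1)
            bits += 4.0;    /* first character */
        else if (i <= 8)
            bits += 2.0;    /* characters 2 through 8 */
        else if (i <= 20)
            bits += 1.5;    /* characters 9 through 20 */
        else
            bits += 1.0;    /* characters 21 and above */
    }

    if (satisfies_composition_rule)
        bits += 6.0;        /* bonus for upper case + non-alphabetic characters */

    return bits;
}

int main(void)
{
    /* 23 characters, composition rule satisfied:
       4 + 7*2 + 12*1.5 + 3*1 + 6 = 45 bits under these rules. */
    printf("%.1f\n", nist_entropy_bits(strlen("IamtheCapitanofthePina4"), 1));
    return 0;
}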
The NIST document may or may not be what you are looking for, but it is not a bad reference point, if nothing else.
[Edit: Added the following.]
The paper Testing Metrics for Password Creation Policies by Attacking Large Sets of Revealed Passwords (by Matt Weir, Sudhir Aggarwal, Michael Collins and Henry Stern) demonstrated that the Shannon model described above is not an accurate model of the entropy of human-generated passwords. I would recommend looking at "Section 5 Generating New Password Creation Policies" for more accurate proposals.
Best Answer
My suggestion would be to slurp the large list into a hash set, then use that to match items from the small list.
A hash set is a structure that stores its elements in an indexable memory structure, like an array, where each element's position is derived from a hash value calculated from the object. That makes looking up a value in the hash set a relatively fast operation: calculate the hash of the object you're looking for, go to that index, and check the actual objects stored there, which for a good implementation will be a very small number (hash set implementations have to strike a balance between the size of the hash, and therefore the number of first-dimension elements, and the number of collisions, and therefore the average number of items in each element).
Ideally, hash sets approach constant lookup time (specifically it's O(log_(2^H) N), where H is the bit size of the hash function, so for all N < 2^H it's effectively constant), so overall your matching algorithm would approach linear complexity. There are two major downsides. First, unless you have access to a built-in, efficient implementation (Java's HashMap is built on this structure, as is .NET's Dictionary class), you have to roll your own, which is quite a bit of code. Second, hash sets are real memory hogs, because there are virtually guaranteed to be a lot of empty spaces in the array unless your implementation varies its hash function based on expected or actual capacity (which could, if done naively, involve re-hashing every element several times as the first dimension is extended to limit growth in the second dimension).
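To make the shape of the idea concrete, here's a minimal sketch in C (matching the rest of this document's examples) of a fixed-size, chained hash set. The names and data are hypothetical; there's no resizing, error handling, or freeing, and a production implementation would do considerably more:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096   /* fixed first dimension; a real implementation would grow this */

struct node {
    char *key;
    struct node *next;
};

static struct node *buckets[NBUCKETS];

/* Simple string hash (djb2-style), reduced to a bucket index. */
static size_t hash(const char *s)
{
    size_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

static void set_add(const char *key)
{
    size_t h = hash(key);
    struct node *n;

    for (n = buckets[h]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return;                          /* already present */

    n = malloc(sizeof *n);
    n->key = malloc(strlen(key) + 1);
    strcpy(n->key, key);
    n->next = buckets[h];
    buckets[h] = n;
}

static int set_contains(const char *key)
{
    struct node *n;

    for (n = buckets[hash(key)]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return 1;
    return 0;
}

int main(void)
{
    /* Hypothetical stand-ins for the "large list" and the "small list". */
    const char *large[] = { "alice", "bob", "carol", "dave" };
    const char *small[] = { "carol", "mallory" };
    size_t i;

    /* Slurp the large list into the hash set... */
    for (i = 0; i < sizeof large / sizeof large[0]; i++)
        set_add(large[i]);

    /* ...then do a near-constant-time membership test for each small-list item. */
    for (i = 0; i < sizeof small / sizeof small[0]; i++)
        printf("%s: %s\n", small[i],
               set_contains(small[i]) ? "match" : "no match");

    return 0;
}

Chaining keeps the sketch short; open addressing or a table that resizes with the number of elements would address the memory-hog concern mentioned above.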