I was playing around with a PEG parser to do what you wanted (and may post that as a separate answer later) when I noticed that there's a very simple algorithm that does a remarkably good job with common forms of numbers in English, Spanish, and German, at the very least.
Working with English for example, you need a dictionary that maps words to values in the obvious way:
"one" -> 1, "two" -> 2, ... "twenty" -> 20,
"dozen" -> 12, "score" -> 20, ...
"hundred" -> 100, "thousand" -> 1000, "million" -> 1000000
...and so forth
The algorithm is just:
total = 0
prior = null
for each word w
    v <- value(w) or next if no value defined
    prior <- case
        when prior is null: v
        when prior > v:     prior + v
        else                prior * v
    if w in {thousand, million, billion, trillion, ...}
        total <- total + prior
        prior <- null
total = total + prior unless prior is null
For example, this progresses as follows:
total  prior    v   unconsumed string
    0      _        four score and seven
                4   score and seven
    0      4
               20   and seven
    0     80
                _   seven
    0     80
                7
    0     87
   87
  total   prior        v   unconsumed string
      0       _            two million four hundred twelve thousand eight hundred seven
                       2   million four hundred twelve thousand eight hundred seven
      0       2
                 1000000   four hundred twelve thousand eight hundred seven
2000000       _
                       4   hundred twelve thousand eight hundred seven
2000000       4
                     100   twelve thousand eight hundred seven
2000000     400
                      12   thousand eight hundred seven
2000000     412
                    1000   eight hundred seven
2000000  412000
                    1000   eight hundred seven
2412000       _
                       8   hundred seven
2412000       8
                     100   seven
2412000     800
                       7
2412000     807
2412807
And so on. I'm not saying it's perfect, but for a quick-and-dirty approach it does quite well.
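Here is a minimal Python sketch of the loop above. The dictionary is a small illustrative one, the names `VALUES`, `GROUP_WORDS` and `words_to_number` are made up for this example, and I'm assuming that anything other than a run of letters separates words.

```python
import re

# Illustrative dictionary only (an assumption); extend it for real use.
VALUES = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
    "twenty": 20, "thirty": 30, "forty": 40, "fourty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
    "dozen": 12, "score": 20,
    "hundred": 100, "thousand": 1000, "million": 10**6,
    "billion": 10**9, "trillion": 10**12,
}

# Words that close a group: after them, prior is flushed into total.
GROUP_WORDS = {"thousand", "million", "billion", "trillion"}


def words_to_number(text):
    """Apply the total/prior algorithm to a phrase and return an int."""
    total = 0
    prior = None
    for w in re.findall(r"[a-z]+", text.lower()):
        v = VALUES.get(w)
        if v is None:
            continue  # words with no value ("and", "something") are skipped
        if prior is None:
            prior = v
        elif prior > v:
            prior += v  # smaller value after a larger one: add
        else:
            prior *= v  # larger (or equal) value after a smaller one: multiply
        if w in GROUP_WORDS:
            total += prior
            prior = None
    if prior is not None:
        total += prior
    return total


print(words_to_number("four score and seven"))
print(words_to_number("two million four hundred twelve thousand eight hundred seven"))
```

The two calls at the end reproduce the traces above (87 and 2412807).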
Addressing your specific list on edit:
1. cardinal/nominal or ordinal: "one" and "first" -- just put them in the dictionary
2. English/British: "fourty"/"forty" -- ditto
3. hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand and one hundred" -- works as is
4. separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and whatnot -- just define "next word" to be the longest prefix that matches a defined word, or up to the next non-word if none do, for a start
5. colloquialisms: "thirty-something" -- works
6. fragments: 'one third', 'two fifths' -- uh, not yet...
7. common names: 'a dozen', 'half' -- works; you can even do things like "a half dozen"
Number 6 is the only one I don't have a ready answer for, and that's because of the ambiguity between ordinals and fractions (in English at least) added to the fact that my last cup of coffee was many hours ago.
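For the separators item, the "longest prefix" rule could look roughly like this in Python. The function name `tokenize` and the `vocabulary` argument are made up for this example; it's a sketch, not a tested tokenizer.

```python
import re


def tokenize(text, vocabulary):
    """Split text into words, preferring the longest dictionary prefix."""
    tokens = []
    # Hyphens, spaces and other non-letters always separate words.
    for chunk in re.findall(r"[a-z]+", text.lower()):
        while chunk:
            # Try the longest prefix first, then progressively shorter ones.
            for end in range(len(chunk), 0, -1):
                if chunk[:end] in vocabulary:
                    tokens.append(chunk[:end])
                    chunk = chunk[end:]
                    break
            else:
                # No defined word starts here: take everything up to the
                # next non-word character as a single unknown token.
                tokens.append(chunk)
                chunk = ""
    return tokens


vocabulary = {"eleven", "hundred", "fifty", "two"}
print(tokenize("elevenhundred fiftytwo", vocabulary))
```

Greedy matching like this can mis-split a word that merely starts with a number word, so treat it as a starting point only.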
I have found David Eppstein's "find rational approximation to given real number" C code to be exactly what you are asking for. It's based on the theory of continued fractions and is very fast and fairly compact.
I have used versions of this customized for specific numerator and denominator limits.
/*
** find rational approximation to given real number
** David Eppstein / UC Irvine / 8 Aug 1993
**
** With corrections from Arno Formella, May 2008
**
** usage: a.out r d
** r is real number to approx
** d is the maximum denominator allowed
**
** based on the theory of continued fractions
** if x = a1 + 1/(a2 + 1/(a3 + 1/(a4 + ...)))
** then best approximation is found by truncating this series
** (with some adjustments in the last term).
**
** Note the fraction can be recovered as the first column of the matrix
** ( a1 1 ) ( a2 1 ) ( a3 1 ) ...
** ( 1 0 ) ( 1 0 ) ( 1 0 )
** Instead of keeping the sequence of continued fraction terms,
** we just keep the last partial product of these matrices.
*/
#include <stdio.h>
#include <stdlib.h>   /* atof, atoi, exit */

int main(int ac, char **av)
{
    long m[2][2];
    double x, startx;
    long maxden;
    long ai;

    /* read command line arguments */
    if (ac != 3) {
        fprintf(stderr, "usage: %s r d\n", av[0]);  // AF: argument missing
        exit(1);
    }
    startx = x = atof(av[1]);
    maxden = atoi(av[2]);

    /* initialize matrix */
    m[0][0] = m[1][1] = 1;
    m[0][1] = m[1][0] = 0;

    /* loop finding terms until denom gets too big */
    while (m[1][0] * ( ai = (long)x ) + m[1][1] <= maxden) {
        long t;
        t = m[0][0] * ai + m[0][1];
        m[0][1] = m[0][0];
        m[0][0] = t;
        t = m[1][0] * ai + m[1][1];
        m[1][1] = m[1][0];
        m[1][0] = t;
        if (x == (double)ai) break;           // AF: division by zero
        x = 1/(x - (double) ai);
        if (x > (double)0x7FFFFFFF) break;    // AF: representation failure
    }

    /* now remaining x is between 0 and 1/ai */
    /* approx as either 0 or 1/m where m is max that will fit in maxden */
    /* first try zero */
    printf("%ld/%ld, error = %e\n", m[0][0], m[1][0],
           startx - ((double) m[0][0] / (double) m[1][0]));

    /* now try other possibility */
    ai = (maxden - m[1][1]) / m[1][0];
    m[0][0] = m[0][0] * ai + m[0][1];
    m[1][0] = m[1][0] * ai + m[1][1];
    printf("%ld/%ld, error = %e\n", m[0][0], m[1][0],
           startx - ((double) m[0][0] / (double) m[1][0]));
    return 0;
}
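If you'd rather experiment without compiling, here is a rough Python port of the program's main loop as a function. The name `rational_candidates` is hypothetical, the port assumes `x >= 0` and `maxden >= 1`, and it returns both candidates the C program prints instead of choosing between them.

```python
def rational_candidates(x, maxden):
    """Return the two (numerator, denominator) candidates for x.

    Mirrors the C program: the first pair is the last continued-fraction
    convergent whose denominator fits in maxden, the second uses the
    largest final term that still fits.
    Assumes x >= 0 and maxden >= 1 (not checked here).
    """
    # m holds the running product of the (a_i 1; 1 0) matrices.
    m00, m01, m10, m11 = 1, 0, 0, 1
    ai = int(x)
    while m10 * ai + m11 <= maxden:
        m00, m01 = m00 * ai + m01, m00
        m10, m11 = m10 * ai + m11, m10
        if x == ai:
            break  # exact value reached; avoids division by zero
        x = 1 / (x - ai)
        if x > 0x7FFFFFFF:
            break  # remainder too large to be meaningful
        ai = int(x)
    first = (m00, m10)
    ai = (maxden - m11) // m10
    second = (m00 * ai + m01, m10 * ai + m11)
    return first, second


print(rational_candidates(3.14159265, 100))
```

For 3.14159265 with a maximum denominator of 100 the two candidates are 22/7 and 311/99. Python's standard library also has `fractions.Fraction.limit_denominator`, which solves the same bounded-denominator problem if you only need the closest fraction.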
There was already a question about this: Convert integers to written numbers
The answer is for C#, but I think you can figure it out.
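The linked answer isn't reproduced here. To give a rough idea of what that conversion involves, here is a short Python sketch for non-negative integers. The function and table names are made up for this example, and it is not a translation of the C# code.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
# Largest scale first, so the biggest applicable unit is peeled off first.
SCALES = [(10**9, "billion"), (10**6, "million"),
          (1000, "thousand"), (100, "hundred")]


def number_to_words(n):
    """Spell out a non-negative integer in English words."""
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    for size, name in SCALES:
        if n >= size:
            head, rest = divmod(n, size)
            words = number_to_words(head) + " " + name
            return words + (" " + number_to_words(rest) if rest else "")


print(number_to_words(2412807))
```

The call at the end prints "two million four hundred twelve thousand eight hundred seven".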