Regexp/perl code for handling both dots and commas as valid decimal separators

perlregex

I'm trying to create a method that provides "best effort" parsing of decimal inputs in cases where I do not know which of these two mutually exclusive ways of writing numbers the end-user is using:

  • "." as thousands separator and "," as decimal separator
  • "," as thousands separator and "." as decimal separator

The method is implemented as parse_decimal(..) in the code below. Furthermore, I've defined 20 test cases that show how the heuristics of the method should work.

While the code below passes the tests it is quite horrible and unreadable. I'm sure there is a more compact and readable way to implement the method. Possibly including smarter use of regexpes.

My question is simply: Given the code below and the test-cases, how would you improve parse_decimal(…) to make it more compact and readable while still passing the tests?

Clarifications:

  • Clarification #1: As pointed out in the comments the case ^\d{1,3}[\.,]\d{3}$ is ambiguous in that one cannot determine logically which character is used as thousands separator and which is used as a decimal separator. In ambiguous cases we'll simply assume that US-style decimals are used: "," as thousands separator and "." as decimal separator.
  • Clarification #2: If you believe that any of test cases is wrong, then please state which of the tests that should be changed and how.

The code in question including the test cases:

#!/usr/bin/perl -wT

use strict;
use warnings;
use Test::More tests => 20;

ok(&parse_decimal("1,234,567") == 1234567);
ok(&parse_decimal("1,234567") == 1.234567);
ok(&parse_decimal("1.234.567") == 1234567);
ok(&parse_decimal("1.234567") == 1.234567);
ok(&parse_decimal("12,345") == 12345);
ok(&parse_decimal("12,345,678") == 12345678);
ok(&parse_decimal("12,345.67") == 12345.67);
ok(&parse_decimal("12,34567") == 12.34567);
ok(&parse_decimal("12.34") == 12.34);
ok(&parse_decimal("12.345") == 12345);
ok(&parse_decimal("12.345,67") == 12345.67);
ok(&parse_decimal("12.345.678") == 12345678);
ok(&parse_decimal("12.34567") == 12.34567);
ok(&parse_decimal("123,4567") == 123.4567);
ok(&parse_decimal("123.4567") == 123.4567);
ok(&parse_decimal("1234,567") == 1234.567);
ok(&parse_decimal("1234.567") == 1234.567);
ok(&parse_decimal("12345") == 12345);
ok(&parse_decimal("12345,67") == 12345.67);
ok(&parse_decimal("1234567") == 1234567);

sub parse_decimal($) {
    my $input = shift;
    $input =~ s/[^\d,\.]//g;
    if ($input !~ /[,\.]/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d,\d+\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d\.\d+,\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d\.\d+\.\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d,\d+,\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d{4},\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d{4}\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d,\d{3}$/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d\.\d{3}$/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d,\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } else {
        return &parse_with_separators($input, '.', ',');
    }
}

sub parse_with_separators($$$) {
    my $input = shift;
    my $decimal_separator = shift;
    my $thousand_separator = shift;
    my $output = $input;
    $output =~ s/\Q${thousand_separator}\E//g;
    $output =~ s/\Q${decimal_separator}\E/./g;
    return $output;
}

Best Answer

This is like programs that automatically guess the character encoding of the input -- it might work sometimes, but in general is a very bad strategy that leads to buggy and confusing non-deterministic behavior.

For example, if you see "123,456", you don't have enough information to guess what that means.

So I would approach this with caution, and never use this technique for anything important.

Related Topic