R – How to generate a set of ranges from the first letters of a list of words in Perl

algorithmperlrange

I'm not sure exactly how to explain this, so I'll just start with an example.

Given the following data:

Apple
Apricot
Blackberry
Blueberry
Cherry
Crabapple
Cranberry
Elderberry
Grapefruit
Grapes
Kiwi
Mulberry
Nectarine
Pawpaw
Peach
Pear
Plum
Raspberry
Rhubarb
Strawberry

I want to generate an index based on the first letter of my data, but I want the letters grouped together.

Here is the frequency of the first letters in the above dataset:

   2 A
   2 B
   3 C
   1 E
   2 G
   1 K
   1 M
   1 N
   4 P
   2 R
   1 S

Since my example data set is small, let's just say that the maximum number to combine the letters together is 3. Using the data above, this is what my index would come out to be:

A B C D-G H-O P Q-Z

Clicking the "D-G" link would show:

Elderberry
Grapefruit
Grapes

In my range listing above, I am covering the full alphabet – I guess that is not completely neccessary – I would be fine with this output as well:

A B C E-G K-N P R-S

Obviously my dataset is not fruit, I will have more data (around 1000-2000 items), and my "maximum per range" will be more than 3.

I am not too worried about lopsided data either – so if I 40% of my data starts with an "S", then S will just have its own link – I don't need to break it down by the second letter in the data.

Since my dataset won't change too often, I would be fine with a static "maximum per range", but it would be nice to have that calculated dynamically too. Also, the dataset will not start with numbers – it is guaranteed to start with a letter from A-Z.

I've started building the algorithm for this, but it keeps getting so messy I start over. I don't know how to search google for this – I'm not sure what this method is called.

Here is what I started with:

#!/usr/bin/perl

use strict;
use warnings;

my $index_frequency = { map { ( $_, 0 ) } ( 'A' .. 'Z' ) };
my $ranges = {};

open( $DATASET, '<', 'mydata' ) || die "Cannot open data file: $!\n";

while ( my $item = <$DATASET> ) {
    chomp($item);
    my $first_letter = uc( substr( $item, 0, 1 ) );
    $index_frequency->{$first_letter}++;
}

foreach my $letter ( sort keys %{$index_frequency} ) {
    if ( $index_frequency->{$letter} ) {

        # build $ranges here
    }
}

My problem is that I keep using a bunch of global variables to keep track of counts and previous letters examined – my code gets very messy very fast.

Can someone give me a step in the right direction? I guess this is more of an algorithm question, so if you don't have a way to do this in Perl, pseudo code would work too, I guess – I can convert it to Perl.

Thanks in advance!

Best Answer

Basic approach:

#!/usr/bin/perl -w
use strict;
use autodie;

my $PAGE_SIZE = 3;
my %frequencies;

open my $fh, '<', 'data';
while ( my $l = <$fh> ) {
    next unless $l =~ m{\A([a-z])}i;
    $frequencies{ uc $1 }++;
}
close $fh;

my $current_sum = 0;
my @letters     = ();
my @pages       = ();

for my $letter ( "A" .. "Z" ) {
    my $letter_weigth = ( $frequencies{ $letter } || 0 );

    if ( $letter_weigth + $current_sum > $PAGE_SIZE ) {
        if ( $current_sum ) {
            my $title = $letters[ 0 ];
            $title .= '-' . $letters[ -1 ] if 1 < scalar @letters;
            push @pages, $title;
        }
        $current_sum = $letter_weigth;
        @letters     = ( $letter );
        next;
    }
    push @letters, $letter;
    $current_sum += $letter_weigth;
}
if ( $current_sum ) {
    my $title = $letters[ 0 ];
    $title .= '-' . $letters[ -1 ] if 1 < scalar @letters;
    push @pages, $title;
}

print "Pages : " . join( " , ", @pages ) . "\n";

Problem with it is that it outputs (from your data):

Pages : A , B , C-D , E-J , K-O , P , Q-Z

But I would argue this is actually good approach :) And you can always change the for loop into:

for my $letter ( sort keys %frequencies ) {

if you need.