R – How to generate a set of ranges from the first letters of a list of words in Perl

algorithmperlrange

I'm not sure exactly how to explain this, so I'll just start with an example.

Given the following data:

Apple
Apricot
Blackberry
Blueberry
Cherry
Crabapple
Cranberry
Elderberry
Grapefruit
Grapes
Kiwi
Mulberry
Nectarine
Pawpaw
Peach
Pear
Plum
Raspberry
Rhubarb
Strawberry

I want to generate an index based on the first letter of my data, but I want the letters grouped together.

Here is the frequency of the first letters in the above dataset:

Since my example data set is small, let's just say that the maximum number to combine the letters together is 3. Using the data above, this is what my index would come out to be:

A B C D-G H-O P Q-Z

Clicking the "D-G" link would show:

Elderberry
Grapefruit
Grapes

In my range listing above, I am covering the full alphabet – I guess that is not completely neccessary – I would be fine with this output as well:

A B C E-G K-N P R-S

Obviously my dataset is not fruit, I will have more data (around 1000-2000 items), and my "maximum per range" will be more than 3.

I am not too worried about lopsided data either – so if I 40% of my data starts with an "S", then S will just have its own link – I don't need to break it down by the second letter in the data.

Since my dataset won't change too often, I would be fine with a static "maximum per range", but it would be nice to have that calculated dynamically too. Also, the dataset will not start with numbers – it is guaranteed to start with a letter from A-Z.

I've started building the algorithm for this, but it keeps getting so messy I start over. I don't know how to search google for this – I'm not sure what this method is called.

Here is what I started with:

#!/usr/bin/perl

use strict;
use warnings;

my $index_frequency = { map { ( $_, 0 ) } ( 'A' .. 'Z' ) };
my $ranges = {};

open( $DATASET, '<', 'mydata' ) || die "Cannot open data file: $!\n";

while ( my $item = <$DATASET> ) {
    chomp($item);
    my $first_letter = uc( substr( $item, 0, 1 ) );
    $index_frequency->{$first_letter}++;
}

foreach my $letter ( sort keys %{$index_frequency} ) {
    if ( $index_frequency->{$letter} ) {

        # build $ranges here
    }
}

My problem is that I keep using a bunch of global variables to keep track of counts and previous letters examined – my code gets very messy very fast.

Can someone give me a step in the right direction? I guess this is more of an algorithm question, so if you don't have a way to do this in Perl, pseudo code would work too, I guess – I can convert it to Perl.

Thanks in advance!

Best Answer

Basic approach:

#!/usr/bin/perl -w
use strict;
use autodie;

my $PAGE_SIZE = 3;
my %frequencies;

open my $fh, '<', 'data';
while ( my $l = <$fh> ) {
    next unless $l =~ m{\A([a-z])}i;
    $frequencies{ uc $1 }++;
}
close $fh;

my $current_sum = 0;
my @letters     = ();
my @pages       = ();

for my $letter ( "A" .. "Z" ) {
    my $letter_weigth = ( $frequencies{ $letter } || 0 );

    if ( $letter_weigth + $current_sum > $PAGE_SIZE ) {
        if ( $current_sum ) {
            my $title = $letters[ 0 ];
            $title .= '-' . $letters[ -1 ] if 1 < scalar @letters;
            push @pages, $title;
        }
        $current_sum = $letter_weigth;
        @letters     = ( $letter );
        next;
    }
    push @letters, $letter;
    $current_sum += $letter_weigth;
}
if ( $current_sum ) {
    my $title = $letters[ 0 ];
    $title .= '-' . $letters[ -1 ] if 1 < scalar @letters;
    push @pages, $title;
}

print "Pages : " . join( " , ", @pages ) . "\n";

Problem with it is that it outputs (from your data):

Pages : A , B , C-D , E-J , K-O , P , Q-Z

But I would argue this is actually good approach :) And you can always change the for loop into:

for my $letter ( sort keys %frequencies ) {

if you need.

Related Solutions

Python – How to generate all permutations of a list

There's a function in the standard-library for this: itertools.permutations.

import itertools
list(itertools.permutations([1, 2, 3]))

If for some reason you want to implement it yourself or are just curious to know how it works, here's one nice approach, taken from http://code.activestate.com/recipes/252178/:

def all_perms(elements):
    if len(elements) <=1:
        yield elements
    else:
        for perm in all_perms(elements[1:]):
            for i in range(len(elements)):
                # nb elements[0:1] works in both string and list contexts
                yield perm[:i] + elements[0:1] + perm[i:]

A couple of alternative approaches are listed in the documentation of itertools.permutations. Here's one:

def permutations(iterable, r=None):
    # permutations('ABCD', 2) --> AB AC AD BA BC BD CA CB CD DA DB DC
    # permutations(range(3)) --> 012 021 102 120 201 210
    pool = tuple(iterable)
    n = len(pool)
    r = n if r is None else r
    if r > n:
        return
    indices = range(n)
    cycles = range(n, n-r, -1)
    yield tuple(pool[i] for i in indices[:r])
    while n:
        for i in reversed(range(r)):
            cycles[i] -= 1
            if cycles[i] == 0:
                indices[i:] = indices[i+1:] + indices[i:i+1]
                cycles[i] = n - i
            else:
                j = cycles[i]
                indices[i], indices[-j] = indices[-j], indices[i]
                yield tuple(pool[i] for i in indices[:r])
                break
        else:
            return

And another, based on itertools.product:

def permutations(iterable, r=None):
    pool = tuple(iterable)
    n = len(pool)
    r = n if r is None else r
    for indices in product(range(n), repeat=r):
        if len(set(indices)) == r:
            yield tuple(pool[i] for i in indices)

How to count the number of set bits in a 32-bit integer

This is known as the 'Hamming Weight', 'popcount' or 'sideways addition'.

Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. Instructions like x86's popcnt (on CPUs where it's supported) will almost certainly be fastest for a single integer. Some other architectures may have a slow instruction implemented with a microcoded loop that tests a bit per cycle (citation needed - hardware popcount is normally fast if it exists at all.).

The 'best' algorithm really depends on which CPU you are on and what your usage pattern is.

Your compiler may know how to do something that's good for the specific CPU you're compiling for, e.g. C++20 std::popcount(), or C++ std::bitset<32>::count(), as a portable way to access builtin / intrinsic functions (see another answer on this question). But your compiler's choice of fallback for target CPUs that don't have hardware popcnt might not be optimal for your use-case. Or your language (e.g. C) might not expose any portable function that could use a CPU-specific popcount when there is one.

Portable algorithms that don't need (or benefit from) any HW support

A pre-populated table lookup method can be very fast if your CPU has a large cache and you are doing lots of these operations in a tight loop. However it can suffer because of the expense of a 'cache miss', where the CPU has to fetch some of the table from main memory. (Look up each byte separately to keep the table small.) If you want popcount for a contiguous range of numbers, only the low byte is changing for groups of 256 numbers, making this very good.

If you know that your bytes will be mostly 0's or mostly 1's then there are efficient algorithms for these scenarios, e.g. clearing the lowest set with a bithack in a loop until it becomes zero.

I believe a very good general purpose algorithm is the following, known as 'parallel' or 'variable-precision SWAR algorithm'. I have expressed this in a C-like pseudo language, you may need to adjust it to work for a particular language (e.g. using uint32_t for C++ and >>> in Java):

GCC10 and clang 10.0 can recognize this pattern / idiom and compile it to a hardware popcnt or equivalent instruction when available, giving you the best of both worlds. (https://godbolt.org/z/qGdh1dvKK)

int numberOfSetBits(uint32_t i)
{
     // Java: use int, and use >>> instead of >>. Or use Integer.bitCount()
     // C or C++: use uint32_t
     i = i - ((i >> 1) & 0x55555555);        // add pairs of bits
     i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // quads
     i = (i + (i >> 4)) & 0x0F0F0F0F;        // groups of 8
     return (i * 0x01010101) >> 24;          // horizontal sum of bytes
}

For JavaScript: coerce to integer with |0 for performance: change the first line to i = (i|0) - ((i >> 1) & 0x55555555);

This has the best worst-case behaviour of any of the algorithms discussed, so will efficiently deal with any usage pattern or values you throw at it. (Its performance is not data-dependent on normal CPUs where all integer operations including multiply are constant-time. It doesn't get any faster with "simple" inputs, but it's still pretty decent.)

References:

How this SWAR bithack works:

i = i - ((i >> 1) & 0x55555555);

The first step is an optimized version of masking to isolate the odd / even bits, shifting to line them up, and adding. This effectively does 16 separate additions in 2-bit accumulators (SWAR = SIMD Within A Register). Like (i & 0x55555555) + ((i>>1) & 0x55555555).

The next step takes the odd/even eight of those 16x 2-bit accumulators and adds again, producing 8x 4-bit sums. The i - ... optimization isn't possible this time so it does just mask before / after shifting. Using the same 0x33... constant both times instead of 0xccc... before shifting is a good thing when compiling for ISAs that need to construct 32-bit constants in registers separately.

The final shift-and-add step of (i + (i >> 4)) & 0x0F0F0F0F widens to 4x 8-bit accumulators. It masks after adding instead of before, because the maximum value in any 4-bit accumulator is 4, if all 4 bits of the corresponding input bits were set. 4+4 = 8 which still fits in 4 bits, so carry between nibble elements is impossible in i + (i >> 4).

So far this is just fairly normal SIMD using SWAR techniques with a few clever optimizations. Continuing on with the same pattern for 2 more steps can widen to 2x 16-bit then 1x 32-bit counts. But there is a more efficient way on machines with fast hardware multiply:

Once we have few enough "elements", a multiply with a magic constant can sum all the elements into the top element. In this case byte elements. Multiply is done by left-shifting and adding, so a multiply of x * 0x01010101 results in x + (x<<8) + (x<<16) + (x<<24). Our 8-bit elements are wide enough (and holding small enough counts) that this doesn't produce carry into that top 8 bits.

A 64-bit version of this can do 8x 8-bit elements in a 64-bit integer with a 0x0101010101010101 multiplier, and extract the high byte with >>56. So it doesn't take any extra steps, just wider constants. This is what GCC uses for __builtin_popcountll on x86 systems when the hardware popcnt instruction isn't enabled. If you can use builtins or intrinsics for this, do so to give the compiler a chance to do target-specific optimizations.

With full SIMD for wider vectors (e.g. counting a whole array)

This bitwise-SWAR algorithm could parallelize to be done in multiple vector elements at once, instead of in a single integer register, for a speedup on CPUs with SIMD but no usable popcount instruction. (e.g. x86-64 code that has to run on any CPU, not just Nehalem or later.)

However, the best way to use vector instructions for popcount is usually by using a variable-shuffle to do a table-lookup for 4 bits at a time of each byte in parallel. (The 4 bits index a 16 entry table held in a vector register).

On Intel CPUs, the hardware 64bit popcnt instruction can outperform an SSSE3 PSHUFB bit-parallel implementation by about a factor of 2, but only if your compiler gets it just right. Otherwise SSE can come out significantly ahead. Newer compiler versions are aware of the popcnt false dependency problem on Intel.

https://github.com/WojciechMula/sse-popcount state-of-the-art x86 SIMD popcount for SSSE3, AVX2, AVX512BW, AVX512VBMI, or AVX512 VPOPCNT. Using Harley-Seal across vectors to defer popcount within an element. (Also ARM NEON)
Counting 1 bits (population count) on large data using AVX-512 or AVX-2
related: https://github.com/mklarqvist/positional-popcount - separate counts for each bit-position of multiple 8, 16, 32, or 64-bit integers. (Again, x86 SIMD including AVX-512 which is really good at this, with vpternlogd making Harley-Seal very good.)