Algorithms – Finding Groups of Common Relations

algorithms

I have a very large dataset with two columns of interest: Name and Cousin.

Each row has a unique name and each name has a cousin.

The data is mutually inclusive, so the cousin's name will also appear as another row in the data.

The effect is therefore a big dataset with lots of names of people who are related to each other.

The relations, however, form themselves into separate groups. These are groups that have no relation with any person in any other group.

E.g (using -> to denote a relation)

A->C->E->G = Group 1
B->D->F->H = Group 2

None of the people in Group 1 shares a relation with any person in Group 2.

So, I'm looking for an approach to find these groups of related people using only their names and relations. This could be some pseudo-code or a conceptual approach, anything to point me down the right path.

Example dataset:

Name Cousin  
A    C  
B    D 
C    E  
D    F  
E    G  
F    H  
G    E
H    K
I    M  
K    H  
M    I

This data would be aggregated into three groups: Group 1 (A,C,E,G); Group 2 (B,D,F,H,K); Group 3 (I,M).

Best Answer

You are building a set of graphs.

You start out and start walking the list of relationships you have gotten back.

A -> C

So, lets make a graph that represents that.

A - C 

And then you've got:

B -> D

Neither B nor D are represented in any of the graphs currently. So now you have:

A - C       

B - D

And then C -> E, so we hunt up C and link to to E. Now we've got:

A - C - E   

B - D       

And so on. That will give you the give you all of the related graphs.

"Fine," you say, "but that doesn't really get any closer to doing an implementation"

The real problem here is how to represent the graphs.

Lets write some pseudocode. I speak a perlish dialect - but its not perl.

$nextGroup = 1
for each ($key, $link) in (@row) {
    if(not defined $group{$key}) {
        $group{$key} = $nextGroup++;
    }
    if(not defined $group{$link} || $group{$link} == $group{$key}) {
        $group{$link} = $group{$key}
    } else {
        for each ($k, $v) in (entries %group) {
            if($v == $group{$link}) {
                $group{$k} = $group{$key};
            }
        }
    }
}

What is it doing? It iterates over each row returned. If the key isn't a known item, it assigns it to a new group. It then assigns the group the key belongs to to the link. If the link already has a different value, then it finds all the links with the same value as it and assigns them to the value of the key.

So, we start out again.

A = 1 (nextGroup = 2)
  C = 1
B = 2 (nextGroup = 3)
  D = 2
C is 1
  E = 1
D = 3 (nextGroup = 4)
  F = 3
E is 1
  G = 1
...

At the end of this, the groups map/dictionary/hash (whatever your language calls it) is going to have each individual in there with a corresponding value. All entries with the same value belong to the same graph being depicted.

Related Topic