Algorithms – Generating Hierarchical Structure from Relational Data

algorithmsdata structuresparsing

I have a csv of employee ids, names, and a reference column with the id of their direct manager, say something like this

 emp_id, emp_name, mgr_id
1,The Boss,,
2,Manager Joe,1
3,Manager Sally,1
4,Peon Frank,2
5,Peon Jill,2
6,Peon Rodger,3
7,Peon Ralph,3

I would like to be able to generate a (json) object representing this structure, something along the lines of

DATA = {
  "parent": "The Boss",
  "children: [
    {
      "parent": "Manager Joe",
      "children": [ {"parent": "Peon Frank"}, {"parent": "Peon Jill"} ]
    },
    {
      "parent": "Manager Sally",
      "children": [ {"parent": "Peon Rodger"}, {"parent": "Peon Ralph" } ]
    }]
}

So from the data, an entry with no mgr_id represents something like the CEO or Leader.

So just collecting some thoughts.. I know this could be represented by some tree data structure with the tree traversal generating the correct output. The parser would have to be responsible for inserting into the tree. Maybe the number of children would give you a weight? Just gathering thoughts. Not sure how I would pivot around the fact that multiple objects can be constructed with children.

Is there an algorithm defined that can parse this structure I am not thinking of? Descending down into the children seems relatively straightforward. I am not seeing this in my head, and could use some help. Thank you

Best Answer

There are several ways to create structure from flatland. Recursion is one of the best. This program uses it, with a preliminary step of figuring out which items are parents.

While the strategy is language-agnostic, every language has different details of data structures and building blocks. I've rendered the approach in Python.

NB, about half this code is for display and demonstration purposes, so you can follow along and see how the algorithm is working. For production, fee free to remove those parts.

Oh...and I changed the labels from 'parent' to 'name' and 'children' to 'reports' because I found that more agreeable. You can, of course, choose whatever you like.

from pprint import pprint
from random import shuffle
from collections import defaultdict
import json

def show_val(title, val):
    """
    Debugging print helpler.
    """
    sep = '-' * len(title)
    print "\n{0}\n{1}\n{2}\n".format(sep, title, sep)
    pprint(val)


# ORIGINAL NAMES:   emp_id, emp_name, mgr_id
# SIMPLIFIED NAMES: eid,    name,     mid
text = """
323,The Boss,
4444,Manager Joe,323
3,Manager Sally,323
4,Peon Frank,4444
33,Peon Dave,3
5,Peon Jill,4444
6,Peon Rodger,3
7,Peon Ralph,3
233,Clerk Jane,99
99,Supervisor Henri,3
"""

# parse text into lines
lines = [ l.strip() for l in text.strip().splitlines() ]

# construct list of people tuples
people = [ tuple(l.split(',')) for l in lines ]

# for demonstration and testing only, shuffle the results
shuffle(people)
show_val("randomized people", people)

# contstruct list of parents
parents = defaultdict(list)
for p in people:
    parents[p[2]].append(p)
show_val("parents", parents)

def buildtree(t=None, parent_eid=''):
    """
    Given a parents lookup structure, construct
    a data hierarchy.
    """
    parent = parents.get(parent_eid, None)
    if parent is None:
        return t
    for eid, name, mid in parent:
        report = { 'name': name }
        if t is None:
            t = report
        else:
            reports = t.setdefault('reports', [])
            reports.append(report)
        buildtree(report, eid)
    return t

data = buildtree()
show_val("data", data)

show_val("JSON", json.dumps(data))

Running this shows the following output:

-----------------
randomized people
-----------------

[('233', 'Clerk Jane', '99'),
 ('4444', 'Manager Joe', '323'),
 ('33', 'Peon Dave', '3'),
 ('6', 'Peon Rodger', '3'),
 ('99', 'Supervisor Henri', '3'),
 ('3', 'Manager Sally', '323'),
 ('5', 'Peon Jill', '4444'),
 ('323', 'The Boss', ''),
 ('4', 'Peon Frank', '4444'),
 ('7', 'Peon Ralph', '3')]

-------
parents
-------

defaultdict(<type 'list'>, {'99': [('233', 'Clerk Jane', '99')], '323': [('4444', 'Manager Joe', '323'), ('3', 'Manager Sally', '323')], '3': [('33', 'Peon Dave', '3'), ('6', 'Peon Rodger', '3'), ('99', 'Supervisor Henri', '3'), ('7', 'Peon Ralph', '3')], '4444': [('5', 'Peon Jill', '4444'), ('4', 'Peon Frank', '4444')], '': [('323', 'The Boss', '')]})

----
data
----

{'name': 'The Boss',
 'reports': [{'name': 'Manager Joe',
              'reports': [{'name': 'Peon Jill'}, {'name': 'Peon Frank'}]},
             {'name': 'Manager Sally',
              'reports': [{'name': 'Peon Dave'},
                          {'name': 'Peon Rodger'},
                          {'name': 'Supervisor Henri',
                           'reports': [{'name': 'Clerk Jane'}]},
                          {'name': 'Peon Ralph'}]}]}

----
JSON
----

'{"name": "The Boss", "reports": [{"name": "Manager Joe", "reports": [{"name": "Peon Jill"}, {"name": "Peon Frank"}]}, {"name": "Manager Sally", "reports": [{"name": "Peon Dave"}, {"name": "Peon Rodger"}, {"name": "Supervisor Henri", "reports": [{"name": "Clerk Jane"}]}, {"name": "Peon Ralph"}]}]}'

Some preliminaries: It also uses print to help show nested data structures. We'd normally get data through a database connection, here we just parse it out of static text. Finally, while the data you presented is beautifully ordered, with the bosses at the top and with the lowest employeed id numbers (simplifying the) problem, I'd like to confirm that the code works in any order. So I've modified some of the id numbers to reflect a non-sequential allocation, and brought in random.shuffle to randomize the order of data. You wouldn't do this in production, but as a part of testing, it increases confidence the logic is working by design, not accident.

Related Topic