Algorithms – Generating Hierarchical Structure from Relational Data

algorithms, data structures, parsing

I have a CSV of employee ids, names, and a reference column holding the id of each employee's direct manager, something like this:

emp_id,emp_name,mgr_id
1,The Boss,
2,Manager Joe,1
3,Manager Sally,1
4,Peon Frank,2
5,Peon Jill,2
6,Peon Rodger,3
7,Peon Ralph,3

I would like to be able to generate a (json) object representing this structure, something along the lines of

DATA = {
  "parent": "The Boss",
  "children": [
    {
      "parent": "Manager Joe",
      "children": [ {"parent": "Peon Frank"}, {"parent": "Peon Jill"} ]
    },
    {
      "parent": "Manager Sally",
      "children": [ {"parent": "Peon Rodger"}, {"parent": "Peon Ralph" } ]
    }]
}

So from the data, an entry with no mgr_id represents something like the CEO or Leader.

Just collecting some thoughts: I know this could be represented by some tree data structure, with a tree traversal generating the correct output. The parser would have to be responsible for inserting into the tree. Maybe the number of children would give you a weight? Not sure how I would work around the fact that multiple objects can be constructed with children.

Is there a well-known algorithm for parsing this structure that I am not thinking of? Descending down into the children seems relatively straightforward. I am not seeing this in my head, and could use some help. Thank you

Best Answer

There are several ways to create structure from flatland. Recursion is one of the best. This program uses it, with a preliminary step of figuring out which items are parents.

While the strategy is language-agnostic, every language has different details of data structures and building blocks. I've rendered the approach in Python.

NB, about half this code is for display and demonstration purposes, so you can follow along and see how the algorithm is working. For production, feel free to remove those parts.

Oh...and I changed the labels from 'parent' to 'name' and 'children' to 'reports' because I found that more agreeable. You can, of course, choose whatever you like.

from pprint import pprint
from random import shuffle
from collections import defaultdict
import json

def show_val(title, val):
    """
    Debugging print helper.
    """
    sep = '-' * len(title)
    print("\n{0}\n{1}\n{2}\n".format(sep, title, sep))
    pprint(val)


# ORIGINAL NAMES:   emp_id, emp_name, mgr_id
# SIMPLIFIED NAMES: eid,    name,     mid
text = """
323,The Boss,
4444,Manager Joe,323
3,Manager Sally,323
4,Peon Frank,4444
33,Peon Dave,3
5,Peon Jill,4444
6,Peon Rodger,3
7,Peon Ralph,3
233,Clerk Jane,99
99,Supervisor Henri,3
"""

# parse text into lines
lines = [ l.strip() for l in text.strip().splitlines() ]

# construct list of people tuples
people = [ tuple(l.split(',')) for l in lines ]

# for demonstration and testing only, shuffle the results
shuffle(people)
show_val("randomized people", people)

# construct lookup of manager id -> list of direct reports
parents = defaultdict(list)
for p in people:
    parents[p[2]].append(p)
show_val("parents", parents)

def buildtree(t=None, parent_eid=''):
    """
    Given a parents lookup structure, construct
    a data hierarchy.
    """
    parent = parents.get(parent_eid, None)
    if parent is None:
        return t
    for eid, name, mid in parent:
        report = { 'name': name }
        if t is None:
            t = report
        else:
            reports = t.setdefault('reports', [])
            reports.append(report)
        buildtree(report, eid)
    return t

data = buildtree()
show_val("data", data)

show_val("JSON", json.dumps(data))

Running this shows the following output:

-----------------
randomized people
-----------------

[('233', 'Clerk Jane', '99'),
 ('4444', 'Manager Joe', '323'),
 ('33', 'Peon Dave', '3'),
 ('6', 'Peon Rodger', '3'),
 ('99', 'Supervisor Henri', '3'),
 ('3', 'Manager Sally', '323'),
 ('5', 'Peon Jill', '4444'),
 ('323', 'The Boss', ''),
 ('4', 'Peon Frank', '4444'),
 ('7', 'Peon Ralph', '3')]

-------
parents
-------

defaultdict(<type 'list'>, {'99': [('233', 'Clerk Jane', '99')], '323': [('4444', 'Manager Joe', '323'), ('3', 'Manager Sally', '323')], '3': [('33', 'Peon Dave', '3'), ('6', 'Peon Rodger', '3'), ('99', 'Supervisor Henri', '3'), ('7', 'Peon Ralph', '3')], '4444': [('5', 'Peon Jill', '4444'), ('4', 'Peon Frank', '4444')], '': [('323', 'The Boss', '')]})

----
data
----

{'name': 'The Boss',
 'reports': [{'name': 'Manager Joe',
              'reports': [{'name': 'Peon Jill'}, {'name': 'Peon Frank'}]},
             {'name': 'Manager Sally',
              'reports': [{'name': 'Peon Dave'},
                          {'name': 'Peon Rodger'},
                          {'name': 'Supervisor Henri',
                           'reports': [{'name': 'Clerk Jane'}]},
                          {'name': 'Peon Ralph'}]}]}

----
JSON
----

'{"name": "The Boss", "reports": [{"name": "Manager Joe", "reports": [{"name": "Peon Jill"}, {"name": "Peon Frank"}]}, {"name": "Manager Sally", "reports": [{"name": "Peon Dave"}, {"name": "Peon Rodger"}, {"name": "Supervisor Henri", "reports": [{"name": "Clerk Jane"}]}, {"name": "Peon Ralph"}]}]}'

Some preliminaries: The code uses pprint to help show nested data structures. We'd normally get the data through a database connection; here we just parse it out of static text. Finally, while the data you presented is beautifully ordered, with the bosses at the top and with the lowest employee id numbers (simplifying the problem), I'd like to confirm that the code works in any order. So I've modified some of the id numbers to reflect a non-sequential allocation, and brought in random.shuffle to randomize the order of the data. You wouldn't do this in production, but as part of testing, it increases confidence that the logic is working by design, not by accident.
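As a variation on the above, here is a minimal Python 3 sketch that reads the rows with the standard csv module and links the nodes in two plain loops instead of recursing. It is not the code from this answer: the function name build_hierarchy and the sample CSV_TEXT are made up for illustration, and it assumes exactly one row has an empty mgr_id and that every other mgr_id refers to an existing emp_id.

```python
import csv
import io
import json

# Hypothetical sample data in the question's column layout, deliberately out of order.
CSV_TEXT = """emp_id,emp_name,mgr_id
4,Peon Frank,2
2,Manager Joe,1
1,The Boss,
3,Manager Sally,1
6,Peon Rodger,3
"""

def build_hierarchy(csv_text):
    """Return the root node of the reporting tree (assumes a single root)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Pass 1: one node per employee, keyed by id, so row order does not matter.
    nodes = {row["emp_id"]: {"name": row["emp_name"]} for row in rows}

    # Pass 2: attach each node to its manager; an empty mgr_id marks the root.
    root = None
    for row in rows:
        node = nodes[row["emp_id"]]
        if row["mgr_id"] == "":
            root = node
        else:
            manager = nodes[row["mgr_id"]]  # raises KeyError if the manager is missing
            manager.setdefault("reports", []).append(node)
    return root

tree = build_hierarchy(CSV_TEXT)
print(json.dumps(tree, indent=2))
```

Because every node is created before any linking happens, a report can appear in the file before its manager, which is the same property the shuffle in the answer's code is meant to exercise.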

Related Solutions

Storing in-text metadata in a discrete data structure

Why XML

If you think about it, it's exactly how the whole web is structured: content (the actual text) carries semantics (what you're calling metadata) through HTML tags.

This approach brings several advantages:

Free parser
Battle tested way to add metadata to content
Ease of use (depending on which users you are targeting)
You can easily extract the raw text, without the metadata, as it's a standard feature of XML parsers. That is very useful for building an indexable version of your content, so that Lorem <note>ipsum</note> is matched when you search for lorem ips*, for example.
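The raw-text extraction mentioned in the last point can be sketched in Python with the standard library's xml.etree.ElementTree. This is only an illustration: the helper name strip_tags and the <p>/<note> tag names are assumptions, and fnmatch stands in for whatever search or indexing layer would really be used.

```python
import fnmatch
import xml.etree.ElementTree as ET

def strip_tags(xml_fragment):
    """Return only the text content of a well-formed XML fragment."""
    root = ET.fromstring(xml_fragment)
    # itertext() yields the element's text and its descendants' text in document order.
    return "".join(root.itertext())

# Hypothetical fragment: <p> and <note> are illustrative tag names.
fragment = "<p>Lorem <note>ipsum</note> dolor sit amet</p>"
plain = strip_tags(fragment)

# Wildcard search against the tag-free, lower-cased text.
matches = fnmatch.fnmatchcase(plain.lower(), "lorem ips*")
print(plain, matches)
```

Here plain holds the text with the note markup removed, so the wildcard query matches across what used to be a tag boundary.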

Why XML over Markdown

A website like Stack Exchange uses Markdown because the semantics its content conveys are rather basic: emphasis, links/URLs, images, headers, etc. The semantics you're adding to your content seem to be

More complex
Subject to change or must be extensible

Thus I sense Markdown wouldn't be a really good idea. Also, Markdown isn't really standardized, and parsing/dumping it might be a real pain, even more so for a Markdown-like syntax; see Jeff Atwood's post about the WTFs he met when parsing Markdown.

On separation between data and metadata

Per se, such separation isn't mandatory. I assume you are looking for the advantage it brings:

Possibility to have the raw content without the metadata
Separation of concerns: I don't want side effects or complexity overhead when manipulating metadata because of the data, and vice versa.

All these concerns are addressed by the use of XML. From the XML, you can easily dump the tag-stripped content, and data and metadata are kept apart, just as attributes and actual text are separate in XML.
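To make the attribute/text split concrete, here is a small hedged sketch, again using xml.etree.ElementTree. The <note> element and its author and kind attributes are invented for the example and are not taken from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: metadata lives in attributes, data lives in the text.
fragment = '<p>Lorem <note author="alice" kind="todo">ipsum</note> dolor</p>'
root = ET.fromstring(fragment)

# Collect the metadata of every <note> without touching the surrounding text.
annotations = [
    {"tag": el.tag, "attributes": dict(el.attrib), "text": el.text}
    for el in root.iter("note")
]
print(annotations)
```

The metadata can be read or changed through el.attrib while the annotated text stays where it is in the document.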

Also, I don't think your metadata can really be totally unbound from your data. From what you describe, your metadata is a composition of your data, i.e. deleting the data leads to deleting the metadata. This is where your metadata diverges from the usual HTML/CSS. CSS doesn't disappear when an HTML element is removed, because it can be applied to other elements. I don't feel this is the case for your metadata.

Having metadata close to the data, as in XML or Markdown, allows an easy understanding (and maybe debugging) of the data. Also, the example you give in your second thought adds some complexity, because for each piece of data I'm reading, I need to query the metadata table to get its metadata. If the relation between your data and your metadata is 1:1 or 1:N, then it's IMO clearly useless, and only brings complexity (a good case of YAGNI).
