As per Should I ask my question about which data structure to use here? I'm asking this here. Hopefully this isn't too implementation specific.

I'm currently developing a program that will represent chemical structures (and eventually allow manipulations/reactions of them). I've hashed out two different systems of storing this data but I'm not sure how to pick which one.

Option 1: Nested matrices

Chemicals have (more or less) definite structures and relative positions of items, and it seems as though it would be easier to represent something like that with a matrix – then I can assign definite 'coordinates' to each atom or bond. Unfortunately, you end up with some overlap if you have a structure that branches out a fair bit. To solve this I ended up working out a solution that used nested, 5×5 matrices.

[[None,  None, [CH3], None, None ],
 [None,  None, Bond,  None, None ],
 [[CH3], Bond, C,     Bond, [CH3]],
 [None,  None, Bond,  None, None ],
 [None,  None, [CH3], None, None ]]

Here C represents the element carbon (this could be represented a couple different ways, and isn't particularly relevant here). Bond represents a bond, and [CH3] represents another 5×5 matrix. This is the uppermost ([0][2]) CH3

[[None, None, H,    None, None],
 [None, None, Bond, None, None],
 [H,    Bond, C,    Bond, H   ],
 [None, None, Bond, None, None],
 [None, None, Root, None, None]]

Here the lowermost Bond is a reference to the same object as the uppermost Bond in the first structure, while Root is (some undetermined) way of relating that this is where the matrix connects to the next level up. If this were a more complex structure, one of the Hs would be replaced by another matrix, such as an additional [CH3].

Advantages of this method:

Everything is explicitly laid out, including structure, which is good
according to The Zen of Python.
When these chemicals are (eventually) represented graphically it will (theoretically) be much easier to make sure everything is in its place

Disadvantages:

This isn't readable at all if you have a more complex structure
This is nested, not flat
This is going to require a lot of recursion to walk through

Option 2: Dictionaries

Another way I could represent these with significantly less nesting and better readability would be using dictionaries. Each molecule would be a key in a dictionary {Molecule1 : Data, Molecule2 : Data} and the data would be dictionaries as well Atoms: Data, Bonds: Data. This requires quite a bit less nesting, but would require a little more work when I eventually want to display them.

Closing thoughts

Before asking this question, and indeed while writing it, I've been pretty set on dictionaries. It seems like a cleaner and easier method of storing this data, and won't require figuring out a really nasty recursive technique to walk through a molecule. However I wouldn't be surprised if I missed something which is why I'm asking here.

If this seems too implementation specific please let me know what you think I should change and I'll do my best to make it more general

Best Answer

The "matrix method" has several problems:

Why only two dimensions? Very few molecules are flat, two dimensional structures (unless you work exclusively with graphene). Pretty much any program deals with chemical structures will want to work in 3D (excluding ones whose sole purpose is to draw structural formulas). By collapsing everything into 2D, you are losing valuable information.
What benefit would this method give? For one thing, finding adjacent atoms is rather complex: you have to look at adjacent neighbors in the matrix, of which you only get a maximum of 4 before you have to branch off using "Root". The artificial restriction of 4 neighbors in a matrix will make your code very ugly.

My only guess as to why you want to use this is because you want to eventually display the molecules. The display is the complicated part (flattening a 3D molecule into a 2D formula is a bit of an art even for a human), but you should seek to separate the display logic from the internal data that stores the molecular structure.

I'm not entirely clear what your "dictionary method" means: for example, what does Atoms denote? Are they just arbitrary labels, IDs, or the element's symbol? (Keep in mind that dictionary keys must be unique so element symbol won't work.)

You can take advantage of the fact that Python allows you to have cyclic references and construct graphs of the atoms. This allows you easily figure out the adjacent atoms.

Here's an example:

class Atom(object):
    def __init__(self, element, position):
        self.element = element
        self.position = position
        # each one for a different bond order, starting at 1
        self.bonds = [set(), set(), set(), set()]
    def bond_with(self, other, order=1):
        self.bonds[order].add(other)
        other.bonds[order].add(self)

# construct hydrogen fluoride
H = Atom("H", (0.0, 0.0, 0.0))
F = Atom("F", (0.0, 0.0, 1.0))
H.bond_with(F)
HF = set((H, F))

Python – Should I use nested matrices or dictionaries

Option 1: Nested matrices

Option 2: Dictionaries

Closing thoughts

Best Answer

Related Topic

Option 1: Nested matrices

Option 2: Dictionaries

Closing thoughts

Best Answer

Related Solutions

Appropriate data structure for deeply nested data

Related Topic