Storing in-text metadata in a discrete data structure

concepts, data structures, markup, separation-of-concerns

I am developing an application that needs to store inline, in-text metadata. By that I mean the following: say we have a long text, and we want to store some metadata attached to a specific word or sentence in it.

What would be the best way to store this information?

My first thought was to include in the text some kind of Markdown syntax that would then be parsed on retrieving. Something looking like this:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit,
sed diam __nonummy nibh__[@note this sounds really funny latin]
euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.
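As a rough illustration of what "parsed on retrieving" could mean, here is a minimal sketch in Python. The regular expression and the function name `extract_annotations` are assumptions made up for this example, and the offsets it reports are 0-based and end-exclusive.

```python
import re

# Hypothetical pattern for the ad-hoc syntax above: __annotated text__[@type content]
ANNOTATION = re.compile(r"__(?P<span>.+?)__\[@(?P<type>\w+) (?P<content>[^\]]*)\]")

def extract_annotations(marked_up):
    """Return (plain_text, annotations) with offsets into plain_text."""
    plain_parts = []
    annotations = []
    cursor = 0      # read position in the marked-up text
    plain_len = 0   # length of the plain text emitted so far
    for m in ANNOTATION.finditer(marked_up):
        before = marked_up[cursor:m.start()]
        plain_parts.append(before)
        plain_len += len(before)
        span = m.group("span")
        annotations.append({
            "type": m.group("type"),
            "start": plain_len,              # 0-based, inclusive
            "end": plain_len + len(span),    # 0-based, exclusive
            "content": m.group("content"),
        })
        plain_parts.append(span)
        plain_len += len(span)
        cursor = m.end()
    plain_parts.append(marked_up[cursor:])
    return "".join(plain_parts), annotations

sample = (
    "Lorem ipsum dolor sit amet, consectetuer adipiscing elit,\n"
    "sed diam __nonummy nibh__[@note this sounds really funny latin]\n"
    "euismod tincidunt ut laoreet dolore magna aliquam erat volutpat."
)
plain, notes = extract_annotations(sample)
```

Any text that happens to contain `__...__[@...]` by accident would be picked up as an annotation too, which is the first problem listed below.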

This would introduce two problems I can think of:

  1. A relatively small one: if that syntax happens to occur in the text by chance, it can break the parsing.
  2. The most important one: this doesn't keep the metadata separate from the text itself.

I would like to have a discrete data structure to hold this data, such as a separate DB table in which the metadata is stored, so that I could use it in its own right: querying, statistics, sorting, and so on.


EDIT: Since the answerer deleted his answer, I think it is worth adding his suggestion here, since it was a workable one that expanded on this first concept. The poster suggested using a similar syntax, but linking the metadata to the PRIMARY KEY of the metadata database table.

Something that would look like this:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit,
sed diam __nonummy nibh__[15432]
euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.

Where 15432 would be the ID of a table row containing the necessary, queryable information, as in the example below.
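A minimal sketch of resolving such references, assuming a plain Python dict stands in for the metadata table. The names `METADATA` and `resolve_references` are hypothetical and only serve the illustration.

```python
import re

# Hypothetical stand-in for the metadata table, keyed by primary key.
METADATA = {
    15432: {"type": "note", "content": "this sounds really funny latin"},
}

# Matches __annotated text__[numeric id]
REFERENCE = re.compile(r"__(?P<span>.+?)__\[(?P<id>\d+)\]")

def resolve_references(marked_up, table):
    """Return (annotated_text, metadata_row) pairs for every __text__[id] marker."""
    resolved = []
    for m in REFERENCE.finditer(marked_up):
        row = table.get(int(m.group("id")))  # None if no such row exists
        resolved.append((m.group("span"), row))
    return resolved

line = "sed diam __nonummy nibh__[15432]"
pairs = resolve_references(line, METADATA)
```

The metadata itself becomes queryable on its own, but the marker still lives inside the text, so the text remains the only place that knows which words a row belongs to.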


My second thought was to store information of this kind in a DB Table looking like this:

TABLE: metadata

ID    TEXT_ID    TYPE    OFFSET_START    OFFSET_END    CONTENT
1     lipsum     note    68              79            this sounds really funny latin

This way the metadata would have a unique id and a text_id as a foreign key pointing to the table that stores the texts, and it would be tied to the text itself by a simple character offset range.
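A sketch of this table using Python's built-in `sqlite3` module. It assumes the offsets in the example row are 1-based and inclusive and that a line break counts as one character, which is one reading under which 68 to 79 covers "nonummy nibh"; the `texts` table and its `body` column are assumptions as well.

```python
import sqlite3

TEXT = (
    "Lorem ipsum dolor sit amet, consectetuer adipiscing elit,\n"
    "sed diam nonummy nibh\n"
    "euismod tincidunt ut laoreet dolore magna aliquam erat volutpat."
)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE texts (id TEXT PRIMARY KEY, body TEXT NOT NULL);
    CREATE TABLE metadata (
        id           INTEGER PRIMARY KEY,
        text_id      TEXT NOT NULL REFERENCES texts(id),
        type         TEXT NOT NULL,
        offset_start INTEGER NOT NULL,  -- assumed 1-based, inclusive
        offset_end   INTEGER NOT NULL,  -- assumed 1-based, inclusive
        content      TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO texts VALUES (?, ?)", ("lipsum", TEXT))
conn.execute(
    "INSERT INTO metadata VALUES (1, 'lipsum', 'note', 68, 79, ?)",
    ("this sounds really funny latin",),
)

# SQLite's substr() is 1-based, so the stored range maps onto it directly.
rows = conn.execute("""
    SELECT substr(t.body, m.offset_start, m.offset_end - m.offset_start + 1),
           m.content
    FROM metadata AS m
    JOIN texts AS t ON t.id = m.text_id
    WHERE m.type = 'note'
""").fetchall()
```

With the rows in their own table, filtering by type, counting, or sorting by position becomes an ordinary query.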

This would do the trick of keeping the data separate from the metadata, but a problem I can immediately see with this approach is that the text would be fundamentally not editable. Or, if I wanted to allow editing the text after metadata has been assigned, I would basically have to calculate the characters added or removed compared to the previous version, and check whether each of these modifications adds or removes characters before or after each of the associated metadata ranges.

Which, to me, sounds like a really inelegant approach.
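For a sense of what that bookkeeping involves, here is a minimal sketch of adjusting one range after one edit. It assumes 0-based, end-exclusive offsets and models an edit as "delete some characters at a position, then insert some there"; the function name `shift_range` and the policy for text inserted exactly at a boundary (it stays outside the annotated span) are choices made for this example only.

```python
def shift_range(start, end, edit_pos, deleted, inserted):
    """Adjust the range [start, end) after an edit.

    The edit removes `deleted` characters starting at `edit_pos` and then
    inserts `inserted` characters at that position. Returns the new
    (start, end), or None if the whole annotated span was deleted.
    """
    edit_end = edit_pos + deleted
    delta = inserted - deleted

    if start < edit_pos:
        new_start = start                  # edit happens after the start
    elif start >= edit_end:
        new_start = start + delta          # edit happens entirely before the start
    else:
        new_start = edit_pos + inserted    # start was deleted: snap past the insertion

    if end <= edit_pos:
        new_end = end                      # edit happens at or after the end
    elif end >= edit_end:
        new_end = end + delta              # edit happens entirely before the end
    else:
        new_end = edit_pos                 # end was deleted: snap to the edit position

    return (new_start, new_end) if new_start < new_end else None

text = "sed diam nonummy nibh euismod"
span = (9, 21)                              # text[9:21] == "nonummy nibh"
edited = text[:4] + "really " + text[4:]    # insert 7 characters at position 4
moved = shift_range(*span, edit_pos=4, deleted=0, inserted=7)
```

Every edit would have to be run through something like this for every annotation on the text, which is the cost described above.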

Do you have any pointers or suggestions for how I could approach the problem?


Edit 2: some XML problems

Here is another case that makes this separation of data and metadata quite necessary.

  • Let's say I want different users to be able to have different metadata sets for the same text, with or without the possibility of each user displaying the other users' metadata.

Any solution of the Markdown kind (or HTML, or XML) would be difficult to implement at this point. The only solution I can think of in this case would be yet another DB table containing each user's version of the original text, connected to the original text table through a FOREIGN KEY.

Not sure if this is very elegant either.
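By contrast, with the separate offset table from my second thought, per-user sets could be sketched by adding an owner column. The `user_id` column, the user names and the helper `notes_for` are hypothetical, and the offsets are again assumed to be 1-based and inclusive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metadata (
        id           INTEGER PRIMARY KEY,
        text_id      TEXT NOT NULL,
        user_id      TEXT NOT NULL,     -- hypothetical column for the owner
        type         TEXT NOT NULL,
        offset_start INTEGER NOT NULL,
        offset_end   INTEGER NOT NULL,
        content      TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO metadata (text_id, user_id, type, offset_start, offset_end, content) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("lipsum", "alice", "note", 68, 79, "this sounds really funny latin"),
        ("lipsum", "bob", "comment", 23, 26, "I like the sound of amet/elit"),
    ],
)

def notes_for(text_id, user_ids):
    """Return (user_id, content) rows for the given users, in text order."""
    placeholders = ",".join("?" for _ in user_ids)
    query = (
        "SELECT user_id, content FROM metadata "
        f"WHERE text_id = ? AND user_id IN ({placeholders}) "
        "ORDER BY offset_start"
    )
    return conn.execute(query, [text_id, *user_ids]).fetchall()

own_only = notes_for("lipsum", ["alice"])
shared = notes_for("lipsum", ["alice", "bob"])
```

Showing or hiding another user's metadata is then just a matter of which user ids go into the query, while the text itself is stored once.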

  • XML has a hierarchical data model: any element that falls within the boundaries of another element is considered its child, which is most often not the case in the data model I'm looking for. In XML, any child element must be closed before its parent tag can be closed, which allows no overlapping of elements.

Example:

<note content="the beginning of the famous placeholder"> Lorem ipsum
dolor sit
<comment content="I like the sound of amet/elit"> amet </note>,
consectetuer adipiscing elit </comment> , <note content="adversative?"> sed
diam
<note content="funny latin"> nonummy </note> nibh </note> euismod
tincidunt ut laoreet dolore magna aliquam erat volutpat.

Here we have two different problems:

  1. Different elements overlapping: The first comment starts within the first note, but ends after the end of the first note, i.e. it's not its child.

  2. Same elements overlapping: The last two notes overlap; however, since they are the same kind of element, the parser would match the first closing tag to the most recently opened element, and the last closing tag to the first opened element, which, in this circumstance, is not what is intended.
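The first problem can be checked quickly with Python's standard `xml.etree.ElementTree` parser. The wrapping `<text>` root element is an assumption added only because an XML document needs a single root.

```python
import xml.etree.ElementTree as ET

# The comment opens inside the note but closes after it.
overlapping = (
    '<text>'
    '<note content="the beginning of the famous placeholder">Lorem ipsum dolor sit '
    '<comment content="I like the sound of amet/elit">amet</note>, '
    'consectetuer adipiscing elit</comment>'
    '</text>'
)

try:
    ET.fromstring(overlapping)
    parsed = True
except ET.ParseError as err:
    # The parser rejects the document as not well-formed.
    parsed = False
    error_message = str(err)
```

The overlapping markup is rejected outright rather than interpreted, so it cannot be stored as plain XML in this form.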

Best Answer

I'd go for a mix of your solutions, but I'd use a standard instead: XML. You'd have a syntax like this:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit,
sed diam <note content="It sounds really funny in Latin">nonummy nibh</note>
euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.

Why XML

If you think about it, this is exactly how the whole web is structured: content (the actual text) that carries semantics, which is what you're calling metadata, through HTML tags.

This way a whole world of nice things opens up:

  • Free parser
  • Battle tested way to add metadata to content
  • Ease of use (depending on which users you are targeting)
  • You can easily extract the raw text, without the metadata, as this is a standard feature of XML parsers. That is very useful for keeping an indexable version of your content, so that Lorem <note>ipsum</note> comes up when you search for lorem ips*, for example.
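As a small illustration of the last point, here is a sketch using Python's standard `xml.etree.ElementTree`. The `<text>` root element and the trailing words are assumptions added to make the fragment a complete document.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<text>Lorem <note>ipsum</note> dolor sit amet</text>")

# itertext() walks the tree in document order and yields only the text
# content, so joining it gives the tag-stripped version.
raw_text = "".join(doc.itertext())

# A naive prefix search standing in for a real index lookup of "lorem ips*".
matches = raw_text.lower().startswith("lorem ips")
```

The tag-stripped string is what would be handed to a search index, while the XML stays the stored form.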

Why XML over Markdown

A website like Stack Exchange uses Markdown because the semantics its content conveys are rather basic: emphasis, links/URLs, images, headers, etc. It seems the semantics you're adding to your content are

  1. More complex
  2. Subject to change or must be extensible

Thus I sense Markdown wouldn't be a really good idea. Also, Markdown isn't really standardized, and parsing/dumping it might be a pain in the ass, even more so for a Markdown-ish syntax; see Jeff Atwood's post about the WTFs he ran into when parsing Markdown.

On separation between data and metadata

Per se, such separation isn't mandatory. I assume you are looking for the advantages it brings:

  • Possibility to have the raw content without the metadata
  • Separation of concerns: I don't want side effects or complexity overhead when manipulating metadata because of the data, and vice versa.

All these concerns are addressed by the use of XML. From the XML, you can easily dump the tag-stripped content, and data and metadata are separated, just as attributes and actual text are separated in XML.
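A rough sketch of that split, again with Python's standard `xml.etree.ElementTree`: it walks the tree once and returns the plain text plus one record per note, with 0-based, end-exclusive offsets into the plain text. The function name `split_text_and_notes`, the `<text>` root and the shortened sample are assumptions for the example.

```python
import xml.etree.ElementTree as ET

def split_text_and_notes(xml_source):
    """Return (plain_text, notes) extracted from an annotated XML string."""
    root = ET.fromstring(xml_source)
    parts = []
    notes = []
    length = 0  # length of the plain text collected so far

    def emit(chunk):
        nonlocal length
        if chunk:
            parts.append(chunk)
            length += len(chunk)

    def walk(elem):
        start = length
        emit(elem.text)
        for child in elem:
            walk(child)
            emit(child.tail)   # text following the child, inside elem
        if elem.tag == "note":
            notes.append({
                "start": start,
                "end": length,
                "content": elem.get("content"),
            })

    walk(root)
    return "".join(parts), notes

source = (
    '<text>sed diam <note content="funny latin">nonummy nibh</note> '
    'euismod</text>'
)
plain, notes = split_text_and_notes(source)
```

The records could then be loaded into a table for querying or statistics, with the XML remaining the single source of truth.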

Also, I don't think your metadata can really be completely unbound from your data. From what you describe, your metadata is in a composition relationship with your data, i.e. deleting the data leads to deleting the metadata. This is where your metadata diverges from the usual HTML/CSS. CSS doesn't disappear when an HTML element is removed, because it can be applied to other elements. I don't feel this is the case for your metadata.

Having the metadata close to the data, as in XML or Markdown, makes the data easy to understand (and maybe to debug). Also, the example in your second thought adds some complexity, because for each piece of data I read, I need to query the metadata table to get its metadata. If the relation between your data and your metadata is 1:1 or 1:N, then it's IMO clearly useless and only brings complexity (a good case of YAGNI).