Database – the best data model to represent mathematical range (in database, xml, json…)

data modelingdatabasemathxml

For example,

greater or equal to 50 and smaller than 100 (>=50 && < 100)

smaller than 10 or greater than 40 (<10 || >40)

I have been thinking about how to represent mathematical range in a file and database, the range may be input by non programmer and I need to keep the input simple, but at another side, it also need to keep the input easy to convert to data and easy to check error input e.g.: <10 || >100 seems the most simple but it is harder for me to parse the string to get the data, also need to consider input format error

I have been considering some input methods, using >=50 && < 100 as example, which is in key value form:

  1. Using one string to represent whole range:

    <rangeInString>=50 && < 100</rangeInString>
    
  2. Separate two strings, one represent lower bound and another one represent upper bound, then parse each string in program:

    <lowerBound> >=50 </lowerBound>
    <upperBound> <100 </upperBound>
    
  3. Separate lower and upper bound, also separate the sign from number:

    <lowerBound>
      <sign> >= </sign>
      <data>50</data>
    </lowerBound>
    <upperBound>
      <sign> < </sign>
      <data>100</data>
    </upperBound>
    
  4. Separate lower bound and upper bound,also separate sign, and also separate the case that if includes the equal condition:

    <lowerBound>
      <sign> > </sign>
      <isIncludeEqual>true</isIncludeEqual>
      <data>50</data>
    </lowerBound>
    <upperBound>
      <sign> < </sign>
      <isIncludeEqual>false</isIncludeEqual>
      <data>100</data>
    </upperBound>
    
  5. auto detect using && or ||, e.g. :>= A with < B,if A < B, must be && e.g. (>= 50 && <100), otherwise it is || e.g. (>= 100 || <50):

    <A>
      <sign> > </sign>
      <isIncludeEqual>true</isIncludeEqual>
      <data>50</data>
    </A>
    <B>
      <sign> < </sign>
      <isIncludeEqual>false</isIncludeEqual>
      <data>100</data>
    </B>
    
  6. Use a field "isAnd" to separate >=50 && < 100 (true) and <=50 || > 100 (false) instead of using field sign "<" and ">" :

    <lowerBound>
      <isIncludeEqual>true</isIncludeEqual>
      <data>50</data>
    </lowerBound>
    <upperBound>
      <isIncludeEqual>false</isIncludeEqual>
      <data>100</data>
    </upperBound>
    <isAnd>true</isAnd>
    
  7. other data model…

I need to consider somethings:

  1. easy for non programmer to input

  2. easy to convert or parse to data into program

  3. easy to check error, for example,parse string increase the complexity of converting data and checking incorrect format, also there may have other incorrect format,e.g.: <=50 && >100 should not be valid, I may allow auto detect using && or || by the sign of input, but it may increase the complexity of the code

Best Answer

If you're working solely with integer ranges, you're probably over-thinking the problem. All you need to store is the lower bound and the upper bound, and use a fixed rule about whether the bounds are inclusive. For example, Python ranges tend to use a half-closed interval, so that:

range(5,10)

Is what mathematicians would write as:

[5, 10)         # interval notation
5 <= x < 10     # inequality notation
5, 6, 7, 8, 9   # list notation

This is easily encoded into XML, JSON, or database values. E.g.:

<range low="5" high="10"/>   # XML
[5, 10]                      # simple list (brackets are data delimiters, 
                                            not mathematical interval notation)
{ "low": 5, "high": 10 }     # JSON fragment 

It's very easy to take a fixed set of rules, which seems very rigid, and get back whatever you want. For example, if you really wanted the 10 high bound included, then don't provide lots of additional flexibility machinery and complexity (which costs extra data encoding bytes, testing requirements, and code complexity). Just encode it as:

<range low="5" high="11"/>  # XML

range(5, 11)                # Python

If you don't like the asymmetry of a half-open interval, there's no reason you can't choose a fully closed / inclusive interval, or any other option (i.e., fully open interval, or half-open at the start interval). The trick is choosing a rule that is consistent throughout, and is most consistent / convenient with your programming environment.

This representation nicely handles finite ranges. It's very compact, can be implemented in a very performance-efficient way, and if you'd like extended operations, can be easily implemented as a neat class in Java, Python, and many other languages.

You also seem to want to represent discontinuous, infinite sets such as x < 10 ∪ x > 40. Those are generally not considered "ranges" by most programming languages / systems, and indeed aren't even intervals in the mathematical sense. But they're obviously interesting adjacent concepts to consider--though it will add notable complexity to the implementation.

For example. your disjoint set can be specified as a "multi-range," a combination of individual ranges:

<union>
    <range low="-Infinity" high="10"/>
    <range low="41" high="Infinity"/>
</union>

Now you've had to add handling for abstract values (negative and positive infinity) as well as a set combining operation (union) as the multi-range constructor. But, I've read the source of a number of numerical set and multi-range modules, and this is actually how many of them do it.

When you get going down this path of generalizing from individual intervals / ranges to multi-interval, multi-range sets, you may also start to generalize the set operations into other constructive formulations. If you do union, why not intersection, difference, and symmetric-difference as well? Your disjoint range could also be specified alternatively as:

<difference>
    <range low="-Infinity" high="Infinity"/>
    <range low="10" high="41"/>
</difference>

But you've got to be careful, because the more operations and the more complexity, the greater chance of errors creeping in.