Algorithms – Data Structures for Pattern Matching


Let's say you have an input file with many entries like these:

date, ticker, open, high, low, close, <and some other values>

And you want to execute a pattern matching routine on the entries (rows) in that file, using, for example, a candlestick pattern such as a Doji. And that pattern can appear on any uniform time interval (let t = 1s, 5s, 10s, 1d, 7d, 2w, 2y, and so on…).

Say a pattern matching routine can take an arbitrary number of rows to perform an analysis and can contain an arbitrary number of subpatterns. In other words, some patterns may require 4 entries to operate on; others may require more or fewer.

Say also that the routine may later have to find and classify extrema (local and global maxima and minima, as well as inflection points) for the ticker over a closed interval. For example, the cubic function x^3 has its global extrema at the endpoints of the interval [-1, 1] and an inflection point at x = 0.
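For sampled data I'd expect that classification to reduce to inspecting discrete differences. Here's a rough C++ sketch of the idea (all names are mine and purely illustrative; it assumes uniformly sampled, reasonably smooth data and ignores ties):

#include <cstddef>
#include <vector>

enum class Extremum { Min, Max };
struct Classified { std::size_t index; Extremum kind; };

// Classify interior local extrema of a sampled series by the sign
// pattern of the discrete first differences. (Inflection points could
// be found the same way from the second differences; noisy market
// data would need smoothing first.)
std::vector<Classified> classify(const std::vector<double>& y) {
    std::vector<Classified> out;
    for (std::size_t i = 1; i + 1 < y.size(); ++i) {
        double before = y[i] - y[i - 1];
        double after  = y[i + 1] - y[i];
        if (before > 0 && after < 0) out.push_back({i, Extremum::Max});
        if (before < 0 && after > 0) out.push_back({i, Extremum::Min});
    }
    return out;
}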

Given that I don't know the size of the input file when I read it, what would be the most natural choice of data structure? And what about an interface that conforms a Ticker object (one row of data) to a collection of Tickers, so that an arbitrary pattern can be applied to the data? For example, say I have a pattern that requires n elements to form a match; I need to apply the pattern to those elements and then analyze each of those chunks.
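To make the interface question concrete, here's roughly the shape I have in mind (C++; Ticker and Pattern are my working names, nothing more):

#include <cstddef>
#include <vector>

// One row of the input file (field selection abbreviated).
struct Ticker {
    double open, high, low, close;
    // date, ticker symbol, and the other values would go here too
};

// A pattern declares how many consecutive rows it needs, and is
// handed exactly that many rows to test for a match.
class Pattern {
public:
    virtual ~Pattern() = default;
    virtual std::size_t required_elements() const = 0;
    virtual bool match(const std::vector<Ticker>& window) const = 0;
};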

What's the first thing that comes to mind?

I chose a doubly-linked circular linked list that has the following methods:

push_front()
push_back()
pop_front()
pop_back()
[] //overloaded, can be used with negative parameters

But that data structure seems very clumsy: with so much pushing and popping going on, I have to make a deep copy of the whole structure before running an analysis on it.

So, I don't know if I've made my question very clear, but the main points are:

  1. What kind of data structures should be considered when analyzing sequential data points to conform to a pattern that does NOT require random access?
  2. What kind of data structures should be considered when classifying extrema of a set of data points?

Updated

Here's the main analyzer loop for my program (in some pseudocode that's a polyglot of a few languages):

data = new circular_linkedList()
// in my case, data is a circular linked list of elements.
foreach(element in dataFeed) {
    data.push_front(element)
}

step_value = Pattern.numberRequiredElementsForMatch()
buffer = new circular_linkedList()
// at this point, you'll have a circular linked list
// where pop_back() will return the oldest entry
// and pop_front() will return the newest

// initialize the buffer
for (value in range(step_value)) {
    buffer.push_back(data.pop_back())
    // fill a buffer with the number of
    // values required for a match. 
    // so now buffer will contain
    // [element_k, ..., element_{k+n-1}] where n = step_value,
    // ordered oldest (buffer[0]) to newest
}
do {
    last = buffer[len(buffer) - 1]
    ...
    first = buffer[0]

    Pattern.match(first, ... , last)
    buffer.pop_front() // remove the first (oldest) element
    buffer.push_back(data.pop_back()) // add the next element to the back

} while (!data.isEmpty())

...

Anyway, that's the basic idea of what I have going on. The problem is that, the way I'm doing it now, I have to reiterate through the whole loop to apply another pattern (buffer sizes differ between patterns). That just seems inefficient. Also, what happens when there's no more data left and the total number of entries isn't evenly divisible by the number of elements needed for a match?

Best Answer

Vector instead of Linked List

Whatever you do, don't use a doubly linked list for this: it has a lot of allocation and space overhead. The data is read and appended sequentially, and there is no random access or random removal of items, so a vector-like data structure would be much more suitable for storing it. But even that is overkill for the raw data, as you'll see below.
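For instance, here is a minimal sketch of the vector approach in C++ (the simplified Ticker and the scan helper are illustrative, not prescriptive). The window is just a pair of pointers into the vector, so nothing is popped and nothing is deep-copied:

#include <cstddef>
#include <vector>

struct Ticker { double open, high, low, close; }; // simplified row

// Slide a window of n consecutive rows over the data by index.
template <typename Matcher>
void scan(const std::vector<Ticker>& data, std::size_t n, Matcher&& match) {
    if (n == 0 || data.size() < n) return;  // fewer rows than the pattern needs
    for (std::size_t i = 0; i + n <= data.size(); ++i) {
        match(&data[i], &data[i] + n);      // half-open window [first, last)
    }
}

Note that this also answers the divisibility worry from the question: the loop simply stops when fewer than n rows remain, and applying a second pattern is just another call to scan with a different n.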

Data flows through Listeners, then disappears

To do linear pattern matching you don't have to store the data at all, and you'll only have to traverse it once. The idea is to have several matchers listening for their patterns as the data hums along. These store only the data needed to detect a pattern. Any items that cannot be part of a pattern anymore will be forgotten.

I'll describe one way of achieving that. I must warn you that the task you want to perform is not trivial to do efficiently. Judging from your proposal of using a linked list, it may take some time to wrap your head around the principles involved.

Continuously register Matchers listening to the data

Let's start by adding some entities to listen for your patterns in the data. Register a matcher factory for every pattern/resolution combination you want to recognize. Typically each pattern would be in its own matcher class, parametrized by the resolution it is looking for.

Now start reading in the data, and feed each item to each matcher factory as you read it. Then have that factory instantiate/reuse a matcher for every place that could be the starting point of a pattern. For example, a matcher with a 7-day resolution could be instantiated each time the first data point of a new week comes in.
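In code, the shape might look something like this (C++; every name here is illustrative, and I'm deliberately leaving out how a factory decides what counts as a starting point):

#include <memory>
#include <vector>

struct Ticker { long timestamp; double close; }; // simplified row

class Matcher {                    // one in-flight match attempt
public:
    virtual ~Matcher() = default;
    virtual void feed(const Ticker& t) = 0;  // update internal state
    virtual bool done() const = 0;           // accepted or rejected
};

class MatcherFactory {             // one per pattern/resolution combination
public:
    virtual ~MatcherFactory() = default;
    // Spawn a matcher if this item could start the pattern (e.g. the
    // first data point of a new week for a 7-day resolution).
    virtual std::unique_ptr<Matcher> maybe_spawn(const Ticker& t) = 0;
};

void run(const std::vector<Ticker>& feed,
         std::vector<std::unique_ptr<MatcherFactory>>& factories) {
    std::vector<std::unique_ptr<Matcher>> active;
    for (const Ticker& t : feed) {           // a single pass over the data
        for (auto& f : factories)
            if (auto m = f->maybe_spawn(t))
                active.push_back(std::move(m));
        for (auto& m : active) m->feed(t);
        // Matchers that have accepted or rejected are forgotten,
        // along with whatever data they were holding.
        std::erase_if(active, [](const auto& m) { return m->done(); });
    }
}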

Matchers update internal state until they reject/accept a pattern

The ticker items are also fed to each active matcher. Each matcher tracks its own internal matching state. For example, a matcher with a 7-day resolution may be accumulating ticker values to calculate the 7-day average. After each 7 days pass, it stores the average in the next position of an array and starts a new average accumulation for subsequent incoming ticker items. This continues until it has seen enough weeks to either reject or confirm the presence of the pattern. To get some ideas on how to do this, look into 'Finite State Machines'.
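For example, a matcher for a made-up "rising weekly averages" pattern could look like the sketch below. The pattern itself is just a stand-in; the point is the internal state: a running sum, a day counter, and the completed weekly averages.

#include <cstddef>
#include <vector>

class RisingWeeksMatcher {
    double sum = 0.0;              // accumulating the current week
    int days_in_week = 0;
    std::vector<double> weekly;    // completed 7-day averages
    std::size_t weeks_needed;
public:
    explicit RisingWeeksMatcher(std::size_t weeks) : weeks_needed(weeks) {}

    void feed(double close) {
        sum += close;
        if (++days_in_week == 7) { // a week is complete
            weekly.push_back(sum / 7.0);
            sum = 0.0;
            days_in_week = 0;
        }
    }

    // +1 = confirmed, -1 = rejected, 0 = still undecided.
    int state() const {
        for (std::size_t i = 1; i < weekly.size(); ++i)
            if (weekly[i] <= weekly[i - 1]) return -1; // reject early
        return weekly.size() >= weeks_needed ? +1 : 0;
    }
};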

Efficiency gains by eliminating duplicate calculations

Of course, if there are multiple matchers that need the data at a 7-day resolution, it is not efficient to have each one calculate it on its own. You may build a hierarchy of matchers so intermediate patterns only have to be calculated once. Look into ring buffers for ideas on how to maintain rolling averages (or other aggregate functions).
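Such a ring buffer can be very small. Here is one way (again, just a sketch) to keep a rolling average at O(1) cost per update:

#include <cstddef>
#include <vector>

class RollingAverage {
    std::vector<double> buf;       // fixed-size ring of the last N values
    std::size_t next = 0, count = 0;
    double sum = 0.0;
public:
    explicit RollingAverage(std::size_t n) : buf(n, 0.0) {}

    void push(double x) {
        sum += x - buf[next];      // subtract the value being overwritten
        buf[next] = x;
        next = (next + 1) % buf.size();
        if (count < buf.size()) ++count;
    }

    bool full() const { return count == buf.size(); }
    double average() const { return count ? sum / count : 0.0; }
};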

Related: Parser Generators

So-called 'Parser Generators' do a similar thing automatically for formal grammars. The generated parsers employ a finite state machine to detect hundreds of patterns with about the same effort it would take to recognize just one, and in just one pass of the source data. I imagine such tools may also exist for continuous time-series data, or you could transform their ideas to apply them to your problem.

Hope this helps!
