Algorithms – Data Structures for Pattern Matching


Let's say you have an input file with many entries like these:

date, ticker, open, high, low, close, <and some other values>

And you want to execute a pattern matching routine on the entries (rows) in that file, using, for example, a candlestick pattern such as a Doji. And that pattern can appear on any uniform time interval (let t = 1s, 5s, 10s, 1d, 7d, 2w, 2y, and so on…).

Say a pattern matching routine can take an arbitrary number of rows to perform an analysis and can contain an arbitrary number of subpatterns. In other words, some patterns may require 4 entries to operate on; others may require more or fewer.

Say also that the routine may later have to find and classify extrema (local and global maxima and minima, as well as inflection points) for the ticker over a closed interval. For example, the cubic function x^3 has its global extrema at the endpoints of the interval [-1, 1] and an inflection point at x = 0.
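For sampled data I'd expect that classification to reduce to inspecting discrete differences. Here's a rough C++ sketch of the idea (all names are mine and purely illustrative; it assumes uniformly sampled, reasonably smooth data and ignores ties):

#include <cstddef>
#include <vector>

enum class Extremum { Min, Max };
struct Classified { std::size_t index; Extremum kind; };

// Classify interior local extrema of a sampled series by the sign
// pattern of the discrete first differences. (Inflection points could
// be found the same way from the second differences; noisy market
// data would need smoothing first.)
std::vector<Classified> classify(const std::vector<double>& y) {
    std::vector<Classified> out;
    for (std::size_t i = 1; i + 1 < y.size(); ++i) {
        double before = y[i] - y[i - 1];
        double after  = y[i + 1] - y[i];
        if (before > 0 && after < 0) out.push_back({i, Extremum::Max});
        if (before < 0 && after > 0) out.push_back({i, Extremum::Min});
    }
    return out;
}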

Given that I don't know the size of the input file when I read it, what would be the most natural choice of data structure? And what about an interface that conforms a Ticker object (one row of data) to a collection of Tickers, so that an arbitrary pattern can be applied to the data? For example, say I have a pattern that requires n elements to form a match; I need to apply the pattern to those elements and then analyze each of those chunks.
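To make the interface question concrete, here's roughly the shape I have in mind (C++; Ticker and Pattern are my working names, nothing more):

#include <cstddef>
#include <vector>

// One row of the input file (field selection abbreviated).
struct Ticker {
    double open, high, low, close;
    // date, ticker symbol, and the other values would go here too
};

// A pattern declares how many consecutive rows it needs, and is
// handed exactly that many rows to test for a match.
class Pattern {
public:
    virtual ~Pattern() = default;
    virtual std::size_t required_elements() const = 0;
    virtual bool match(const std::vector<Ticker>& window) const = 0;
};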

What's the first thing that comes to mind?

I chose a doubly-linked circular linked list that has the following methods:

push_front()
push_back()
pop_front()
pop_back()
[] //overloaded, can be used with negative parameters

But that data structure seems very clumsy: with so much pushing and popping going on, I have to make a deep copy of the whole structure before running an analysis on it.

So, I don't know if I've made my question very clear, but the main points are:

  1. What kind of data structures should be considered when analyzing sequential data points to conform to a pattern that does NOT require random access?
  2. What kind of data structures should be considered when classifying extrema of a set of data points?

Updated

Here's the main analyzer loop for my program (in some pseudocode that's a polyglot of a few languages):

data = new circular_linkedList()
// in my case, data is a circular linked list of elements.
foreach(element in dataFeed) {
    data.push_front(element)
}

step_value = Pattern.numberRequiredElementsForMatch()
buffer = new circular_linkedList()
// at this point, you'll have a circular linked list
// where pop_back() will return the oldest entry
// and pop_front() will return the newest

// initialize the buffer
for (value in range(step_value)) {
    buffer.push_back(data.pop_back())
    // fill a buffer with the number of
    // values required for a match. 
    // so now buffer will contain
    // [element_k, ..., element_{k+n-1}] where n = step_value,
    // ordered oldest (buffer[0]) to newest
}
do {
    last = buffer[len(buffer) - 1]
    ...
    first = buffer[0]

    Pattern.match(first, ... , last)
    buffer.pop_front() // remove the first (oldest) element
    buffer.push_back(data.pop_back()) // add the next element to the back

} while (!data.isEmpty())

...

Anyway, that's the basic idea of what I have going on. The problem is that, the way I'm doing it now, I have to reiterate through the whole loop to apply another pattern (buffer sizes differ between patterns). That just seems inefficient. Also, what happens when there's no more data left and the total number of entries isn't evenly divisible by the number of elements needed for a match?

Best Answer

Vector instead of Linked List

Whatever you do, don't use a doubly linked list for this: it has a lot of allocation and space overhead. The data is read and appended sequentially, and there is no random access or random removal of items, so a vector-like data structure would be much more suitable for storing it. But even that is overkill for the raw data, as you'll see below.
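For instance, here is a minimal sketch of the vector approach in C++ (the simplified Ticker and the scan helper are illustrative, not prescriptive). The window is just a pair of pointers into the vector, so nothing is popped and nothing is deep-copied:

#include <cstddef>
#include <vector>

struct Ticker { double open, high, low, close; }; // simplified row

// Slide a window of n consecutive rows over the data by index.
template <typename Matcher>
void scan(const std::vector<Ticker>& data, std::size_t n, Matcher&& match) {
    if (n == 0 || data.size() < n) return;  // fewer rows than the pattern needs
    for (std::size_t i = 0; i + n <= data.size(); ++i) {
        match(&data[i], &data[i] + n);      // half-open window [first, last)
    }
}

Note that this also answers the divisibility worry from the question: the loop simply stops when fewer than n rows remain, and applying a second pattern is just another call to scan with a different n.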

Data flows through Listeners, then disappears

To do linear pattern matching you don't have to store the data at all, and you'll only have to traverse it once. The idea is to have several matchers listening for their patterns as the data hums along. These store only the data needed to detect a pattern. Any items that cannot be part of a pattern anymore will be forgotten.

I'll describe one way of achieving that. I must warn you that the task you want to perform is not trivial to do efficiently. Judging from your proposal of using a linked list, it may take some time to wrap your head around the principles involved.

Continuously register Matchers listening to the data

Let's start by adding some entities to listen for your patterns in the data. Register a matcher factory for every pattern/resolution combination you want to recognize. Typically each pattern would be in its own matcher class, parametrized by the resolution it is looking for.

Now start reading in the data, and feed each item to each matcher factory as you read it. Then have that factory instantiate/reuse a matcher for every place that could be the starting point of a pattern. For example, a matcher with a 7-day resolution could be instantiated each time the first data point of a new week comes in.
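In code, the shape might look something like this (C++; every name here is illustrative, and I'm deliberately leaving out how a factory decides what counts as a starting point):

#include <memory>
#include <vector>

struct Ticker { long timestamp; double close; }; // simplified row

class Matcher {                    // one in-flight match attempt
public:
    virtual ~Matcher() = default;
    virtual void feed(const Ticker& t) = 0;  // update internal state
    virtual bool done() const = 0;           // accepted or rejected
};

class MatcherFactory {             // one per pattern/resolution combination
public:
    virtual ~MatcherFactory() = default;
    // Spawn a matcher if this item could start the pattern (e.g. the
    // first data point of a new week for a 7-day resolution).
    virtual std::unique_ptr<Matcher> maybe_spawn(const Ticker& t) = 0;
};

void run(const std::vector<Ticker>& feed,
         std::vector<std::unique_ptr<MatcherFactory>>& factories) {
    std::vector<std::unique_ptr<Matcher>> active;
    for (const Ticker& t : feed) {           // a single pass over the data
        for (auto& f : factories)
            if (auto m = f->maybe_spawn(t))
                active.push_back(std::move(m));
        for (auto& m : active) m->feed(t);
        // Matchers that have accepted or rejected are forgotten,
        // along with whatever data they were holding.
        std::erase_if(active, [](const auto& m) { return m->done(); });
    }
}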

Matchers update internal state until they reject/accept a pattern

The ticker items are also fed to each active matcher. Each matcher tracks its own internal matching state. For example, a matcher with a 7-day resolution may be accumulating ticker values to calculate the 7-day average. After each 7 days pass, it stores the average in the next position of an array and starts a new average accumulation for subsequent incoming ticker items. This continues until it has seen enough weeks to either reject or confirm the presence of the pattern. To get some ideas on how to do this, look into 'Finite State Machines'.
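For example, a matcher for a made-up "rising weekly averages" pattern could look like the sketch below. The pattern itself is just a stand-in; the point is the internal state: a running sum, a day counter, and the completed weekly averages.

#include <cstddef>
#include <vector>

class RisingWeeksMatcher {
    double sum = 0.0;              // accumulating the current week
    int days_in_week = 0;
    std::vector<double> weekly;    // completed 7-day averages
    std::size_t weeks_needed;
public:
    explicit RisingWeeksMatcher(std::size_t weeks) : weeks_needed(weeks) {}

    void feed(double close) {
        sum += close;
        if (++days_in_week == 7) { // a week is complete
            weekly.push_back(sum / 7.0);
            sum = 0.0;
            days_in_week = 0;
        }
    }

    // +1 = confirmed, -1 = rejected, 0 = still undecided.
    int state() const {
        for (std::size_t i = 1; i < weekly.size(); ++i)
            if (weekly[i] <= weekly[i - 1]) return -1; // reject early
        return weekly.size() >= weeks_needed ? +1 : 0;
    }
};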

Efficiency gains by eliminating duplicate calculations

Of course, if there are multiple matchers that need the data at a 7-day resolution, it is not efficient to have each one calculate it on its own. You may build a hierarchy of matchers so intermediate patterns only have to be calculated once. Look into ring buffers for ideas on how to maintain rolling averages (or other aggregate functions).
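Such a ring buffer can be very small. Here is one way (again, just a sketch) to keep a rolling average at O(1) cost per update:

#include <cstddef>
#include <vector>

class RollingAverage {
    std::vector<double> buf;       // fixed-size ring of the last N values
    std::size_t next = 0, count = 0;
    double sum = 0.0;
public:
    explicit RollingAverage(std::size_t n) : buf(n, 0.0) {}

    void push(double x) {
        sum += x - buf[next];      // subtract the value being overwritten
        buf[next] = x;
        next = (next + 1) % buf.size();
        if (count < buf.size()) ++count;
    }

    bool full() const { return count == buf.size(); }
    double average() const { return count ? sum / count : 0.0; }
};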

Related: Parser Generators

So-called 'Parser Generators' do a similar thing automatically for formal grammars. The generated parsers employ a finite state machine to detect hundreds of patterns with about the same effort it would take to recognize just one, and in just one pass of the source data. I imagine such tools may also exist for continuous time-series data, or you could transform their ideas to apply them to your problem.

Hope this helps!
