Vector instead of Linked List
Whatever you do, don't use a doubly linked list for this. It has a lot of allocation and space overhead. The data is read and appended sequentially; there is no random access or random removal of items, so a vector-like data structure would be much more suitable for storing the data. But even that is overkill for the raw data, as you'll see below.
Data flows through Listeners, then disappears
To do linear pattern matching you don't have to store the data at all, and you'll only have to traverse it once. The idea is to have several matchers listening for their patterns as the data hums along. These store only the data needed to detect a pattern. Any items that cannot be part of a pattern anymore will be forgotten.
I'll describe one way of achieving that. I must warn you that the task you want to perform is not trivial to do efficiently. Judging from your proposal of using a linked list, it may take some time to wrap your head around the principles involved.
Continuously register Matchers listening to the data
Let's start by adding some entities to listen for your patterns in the data. Register a matcher factory for every combination you want to recognize. Typically each pattern would be in its own matcher class, parametrized by the resolution it is looking for.
Now start reading in the data, and feed each item to each matcher factory as you read it. Then have that factory instantiate/reuse a matcher for every place that could be the starting point of a pattern. For example, a matcher with a 7-day resolution could be instantiated each time the first data point of a new week comes in.
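Here is a minimal sketch of that factory/dispatch loop, assuming the ticker arrives as (timestamp, value) pairs. The names (`WeeklyMatcherFactory`, `feed`, the verdict convention) are purely illustrative, not from any existing library:

```python
from datetime import datetime

class WeeklyMatcherFactory:
    """Spawns a new matcher whenever an item starts a new calendar week."""
    def __init__(self, matcher_cls):
        self.matcher_cls = matcher_cls
        self.active = []            # matchers that have not decided yet
        self.matches = []           # matchers that confirmed their pattern
        self.current_week = None

    def feed(self, timestamp: datetime, value: float):
        week = timestamp.isocalendar()[:2]            # (ISO year, ISO week)
        if week != self.current_week:
            self.current_week = week
            self.active.append(self.matcher_cls())    # possible pattern start
        still_active = []
        for matcher in self.active:
            verdict = matcher.feed(timestamp, value)  # None = still undecided
            if verdict is None:
                still_active.append(matcher)
            elif verdict:                             # True = pattern confirmed
                self.matches.append(matcher)
        self.active = still_active                    # rejected matchers are dropped
```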
Matchers update internal state until they reject/accept a pattern
The ticker items are also fed to each active matcher. Each matcher should track its own internal matching state. For example, a matcher with a 7-day resolution may be accumulating ticker values to calculate the 7-day average. Each time 7 days have passed, it stores the average in the next position of an array and starts accumulating a fresh average for the subsequent incoming ticker items. This continues until it has seen enough weeks to either reject or confirm the presence of the pattern. To get some ideas on how to do this, look into 'Finite State Machines'.
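A sketch of one such matcher, compatible with the factory above. The 8-week span and the final shape test are placeholder assumptions; substitute your actual pattern check:

```python
class WeeklyAverageMatcher:
    WEEKS_NEEDED = 8                       # assumption: the pattern spans 8 weeks

    def __init__(self):
        self.week_sum = 0.0
        self.days_seen = 0
        self.weekly_averages = []

    def feed(self, timestamp, value):
        """Return None while undecided, True on a match, False on rejection."""
        self.week_sum += value
        self.days_seen += 1
        if self.days_seen == 7:                        # one week accumulated
            self.weekly_averages.append(self.week_sum / 7)
            self.week_sum, self.days_seen = 0.0, 0
            if len(self.weekly_averages) == self.WEEKS_NEEDED:
                return self.pattern_present(self.weekly_averages)
        return None

    def pattern_present(self, averages):
        # Toy placeholder: "steadily rising". Replace with the real shape/FSM test.
        return averages == sorted(averages)
```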
Efficiency gains by eliminating duplicate calculations
Of course, if there are multiple matchers that need the data at a 7-day resolution, it is not efficient to have each one calculate it on its own. You may build a hierarchy of matchers so intermediate patterns only have to be calculated once. Look into ring buffers for ideas on how to maintain rolling averages (or other aggregate functions).
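For instance, a shared rolling-average stage built on a ring buffer (here `collections.deque` with a `maxlen`) can compute the average once and fan it out to every interested subscriber. Names are again illustrative:

```python
from collections import deque

class RollingAverage:
    """Computes a rolling average once and feeds it to all subscribers."""
    def __init__(self, window: int = 7):
        self.buffer = deque(maxlen=window)    # ring buffer: oldest items fall off
        self.subscribers = []                 # downstream matchers / stages

    def feed(self, timestamp, value):
        self.buffer.append(value)
        if len(self.buffer) == self.buffer.maxlen:
            # For large windows, keep a running sum instead of re-summing each time.
            average = sum(self.buffer) / len(self.buffer)
            for subscriber in self.subscribers:
                subscriber.feed(timestamp, average)   # derived value computed once
```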
Related: Parser Generators
So-called 'Parser Generators' do a similar thing automatically for formal grammars. The generated parsers employ a finite state machine to detect hundreds of patterns with about the same effort it would take to recognize just one, and in just one pass of the source data. I imagine such tools may also exist for continuous time-series data, or you could transform their ideas to apply them to your problem.
Hope this helps!
Data Structures are, for the most part:
- Memory resident,
- Transient,
- Limited in size,
- Not re-entrant without adding concurrency mechanisms like locks or immutability,
- Not ACID compliant,
- Fast, if chosen carefully.
Databases are, for the most part:
- Disk-bound,
- Persistent,
- Large,
- Safely concurrent,
- ACID compliant, with transactional capabilities,
- Slower than data structures.
Data structures are meant to be passed from one place to another, and used internally within a program. When was the last time you sent data from a web page to a web server using a database, or performed a calculation on a database that was entirely resident in memory?
Database systems use data structures as part of their internal implementation. It's a question of size and scope; you use data structures within your program, but a database system is a program in its own right.
For dates of birth, and for people in general, you will need to validate that the person exists and that their address is valid.
You can do both of these using 3rd-party validation services. You could do it yourself using the algorithms you have mentioned, but why reinvent the wheel? A home-grown implementation will most likely be flawed.
Most of these services are configurable and allow you to send information like name, DOB, and phone number; they will then attempt a lookup to see whether that person exists.
The same holds true for address information. The service will attempt to standardize the address, locate it (to see if it really exists), and then return the standardized address to you.
In both of these scenarios, the service can return no hits, which means that the person or the address was not found. In those cases you probably have a fake individual or address, like the one in your sample data.
Also, it might sometimes return multiple matches, or matches at less than 100% confidence. For example, if the only piece of valid information is a phone number, it might return multiple people at that phone number.
For less-than-absolute matches, you might set a confidence threshold to match on, such as 90%. Anything below that number goes into a report that will require manual intervention. People who don't match at all should be eliminated from your datastore, as they are junk data.
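A small sketch of that triage rule, assuming a hypothetical validation service whose response exposes `matches` and a `confidence` score (adapt the field names to whatever provider you actually use):

```python
CONFIDENCE_THRESHOLD = 0.90       # below this, route to manual review

def triage(record, service):
    """Classify a record using a hypothetical 3rd-party validation service."""
    result = service.lookup(record)            # hypothetical API call
    if not result.matches:
        return "reject"                        # no hit: likely a fake person/address
    if result.confidence >= CONFIDENCE_THRESHOLD:
        return "accept"                        # confident match: keep the record
    return "manual_review"                     # uncertain: put it in the report
```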