Root cause analysis in event correlation


I'm to write an event correlator module for the device we produce. When a fault occurs, the resulting avalanche of logs about derivative conditions from various modules is readable only to a person skilled in the task – a layman will be utterly lost. The event correlator is supposed to solve this problem: find the root cause and present a friendly message guiding the user to the origin of the problem.

I can collect tokenized events from all the modules and observe the internal state of the main module. I can even record a history reaching a few seconds back.

Now, the hard part and my actual question: writing the analysis algorithm. I tried the Wikipedia page, but it's awfully sketchy on the specific Root Cause Analysis part, and the RCA article linked is more about a business practice than about something that can be phrased as a computer algorithm. The net is full of event correlators as software to be bought or used, but I have trouble finding anything about making one.

So, summarizing:

  • a history of events and states dumped at the moment of error (the "black box")
  • a set of rules (preferably in human-editable form in some external file)
  • a set of root causes, derived from the events according to the rules.
  • a program interpreting these rules, then processing the black box through them and producing the set of matching root causes.

How do I even approach writing something like that?

  • What would be the format of the ruleset file and a parser for the rules – lexical parsing alone is easy enough but what would it produce?
  • How to represent a rule internally? As some kind of object probably, but what would that object look like? How to chain the rules together in a decision tree?
  • Processing black box data through these rules. Seems an awful lot like running data through a script, but the data contains a lot of noise. How to make sure it yields some results?

Now, some caveats:

  • I'd prefer to avoid a Bayesian or other statistical approach – I'd prefer a fully deterministic solution, so that a given set of conditions will produce the same result in laboratory analysis, and the precise set of rules is fully human-readable and human-editable.

  • Not all components are directly monitored, and some faults can be derived only indirectly. The root cause will not always be one particular message in the black box. Sometimes it will be something that led to a certain set of logs.

  • The sequence of events is not always preserved, as messages are sometimes delayed on the message bus.

  • If the main power goes out, we're not guaranteed to capture all the data.

Examples:

1.

  • Module 1 reports halt; blockade line is active.
  • Module 2 reports halt; blockade line is active.
  • Module 4 reports halt; blockade line is active.
  • PSU module reports external power is okay, internal power has been disabled (per blockade line.)

Root cause: Module 3 failed gracefully: a serious hardware condition activated the blockade line correctly, but the module was unable to send its message out.

2.

  • Module 1 reports halt; blockade line is active.
  • Module 2 reports halt; blockade line is active.
  • Module 4 reports halt; blockade line is active.
  • PSU module reports external power is okay, internal power has been disabled (per blockade line.)
  • Module 3 reports halt; No power in its output line. Activating the blockade line.

Root cause: Burnt fuse of output line of Module 3

Note that Module 3 was mean here: it sent its message out last, even though it pulled the blockade line first.

3.

  • Module 3 reports halt; No power in its output line. Activating the blockade line.
  • Module 1 reports halt; blockade line is active.
  • PSU module reports external power is okay, internal power has been disabled (per blockade line.)
  • Module 2 reports halt; No power in its output line. Activating the blockade line.
  • Module 4 reports halt; No power in its output line. Activating the blockade line.

Root cause: Main fuse of internal power disengaged. Likely reason: short circuit in the PSU or fuse disengaged manually.

Note how Module 1 was late in detecting the lack of power: say its output was inactive and the capacitors kept it alive long enough that it detected the blockade line (engaged by other modules) first, before detecting the power failure. Also, the PSU would normally report a fault condition when internal power goes out, but it is simply slower (less sensitive) than the output modules: they pull the blockade line before it detects that power is out, and by then it assumes this is correct behavior. The difference between cases 2 and 3 is only the number of simultaneous "no power" faults: it's extremely unlikely that several fuses blow at the same time, while using the main fuse to disable power for service work is the "traditional" approach, and it is to be expected that at least a few modules will detect "power out" if the main fuse is disengaged.

Best Answer

While this is a large and involved topic and hardly answerable as such, let me give you a few pointers for your main questions:

What would be the format of the ruleset file and a parser for the rules - lexical parsing alone is easy enough but what would it produce?

The parser depends entirely on your events. Do you have access to in-memory events? Do you have to parse the resulting log files? In either case, you can consider this a preparation of input for the algorithm.

The most precise would be in-memory access to the running software/events, but from what you described, this may be impossible due to real-time requirements. On the other hand, if you have to parse the log files and then derive some meaning from them, you ought to ensure that the log files are meaningfully parseable, i.e. developers should know that the log files will be parsed by a tool and what the log messages should look like. If you want to support arbitrary log messages, you'll get into deep trouble.

Also take care not to bind yourself too closely to the actual log message. Keep in mind the amount of change required if someone changes the spelling/wording of a log message and how that affects your tool. Just think: "Oh btw, there was a typo in the message and I fixed it." - "Oh thanks.. so that's why we're getting the RCA wrong". You don't wanna go there.
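One way to picture that decoupling, as a minimal sketch only: events carry a stable code, and the human-readable wording lives in a separate table. The names `EventCode`, `MESSAGES`, `Event` and `is_power_loss` are made up for illustration and are not taken from your device or any library.

```python
from dataclasses import dataclass
from enum import Enum


class EventCode(Enum):
    # Hypothetical codes; the real set depends on the device.
    HALT_BLOCKADE_ACTIVE = "HALT_BLOCKADE_ACTIVE"
    HALT_NO_OUTPUT_POWER = "HALT_NO_OUTPUT_POWER"


# Human-readable wording lives in one table and may be reworded freely.
MESSAGES = {
    EventCode.HALT_BLOCKADE_ACTIVE: "reports halt; blockade line is active.",
    EventCode.HALT_NO_OUTPUT_POWER: "reports halt; no power in its output line.",
}


@dataclass(frozen=True)
class Event:
    source: str       # e.g. "Module 3"
    code: EventCode   # what the correlator actually looks at

    def render(self) -> str:
        # Only used for display, never for correlation.
        return f"{self.source} {MESSAGES[self.code]}"


def is_power_loss(event: Event) -> bool:
    # The correlator compares codes, never message text.
    return event.code is EventCode.HALT_NO_OUTPUT_POWER
```

With this split, fixing a typo in `MESSAGES` changes what the user reads but cannot change what the correlator concludes.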

How to represent a rule internally? As some kind of object probably, but how would that object look like? How to chain the rules together in a decision tree?

Why reinvent the wheel? There are several industry-strength rule engines available which have that covered pretty well already (e.g. JBoss' Drools Expert). Of course, you still need to do the usual evaluation cycle to find out if a rule engine is suitable for your project's specific needs.

A completely different approach would be to use a programming language that is already based on rules. However, this requires that you/your team/your organisation are open-minded towards polyglot programming. Also bear in mind that these languages are not exactly mainstream. We're talking about things like Prolog, Datalog, or CHR here.
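To get a feeling for what a rule could look like as data before committing to an engine, here is a hand-rolled sketch (it is neither Drools nor Prolog). It assumes events have already been reduced to `(source, code)` pairs; the codes, the `Rule` class and the `diagnose` function are hypothetical names. Each rule only states how many events of a given code must be present, so the arrival order of messages does not matter, and rules are tried in a fixed priority order so the result is deterministic.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical event codes.
BLOCKADE = "HALT_BLOCKADE_ACTIVE"
NO_POWER = "HALT_NO_OUTPUT_POWER"


@dataclass
class Rule:
    name: str
    # code -> (min_count, max_count); max_count None means "no upper bound"
    bounds: Dict[str, Tuple[int, Optional[int]]]
    root_cause: str

    def matches(self, counts: Counter) -> bool:
        for code, (low, high) in self.bounds.items():
            n = counts[code]  # a Counter returns 0 for codes never seen
            if n < low or (high is not None and n > high):
                return False
        return True


# Listed in priority order: the first matching rule wins.
RULES: List[Rule] = [
    Rule("main_fuse", {NO_POWER: (2, None)},
         "Main fuse of internal power disengaged"),
    Rule("module_fuse", {NO_POWER: (1, 1)},
         "Burnt fuse on a module's output line"),
    Rule("silent_module", {NO_POWER: (0, 0), BLOCKADE: (1, None)},
         "A module pulled the blockade line but could not report"),
]


def diagnose(events: Iterable[Tuple[str, str]]) -> Optional[str]:
    # Counting by code discards ordering, which the bus does not guarantee.
    counts = Counter(code for _source, code in events)
    for rule in RULES:
        if rule.matches(counts):
            return rule.root_cause
    return None
```

Because a rule is just a name, a table of bounds and a message, a parser for an external rules file would only need to produce a list of such objects. Anything more expressive (time windows, dependencies between modules) is where a real rule engine starts to pay off.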

Processing black box data through these rules. Seems an awful lot like running data through a script, but the data contains a lot of noise. How to make sure it yields some results?

First off, rule engines are not the most performant things. They are supposed to work on a very high abstraction level, so it makes sense to filter the input first. As mentioned above, you shouldn't work with black box data, because it simply is way too complex (think: full blown natural language processing!). There needs to be some order or structure to the data you have to process.

It will be part of the input preparation to ensure that you filter out ineligible or otherwise uninteresting parts of the data. You may also want to consider an intermediate language where the log message / event data is represented in a standardized way that is fully under your control. In that case, changes to the original data only need to be accommodated in the parser, and your tool's core remains unaffected.
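As a rough sketch of such an input preparation step: the raw wordings below are borrowed from the examples in the question, while the pattern table, the code names and the `normalize`/`prepare` functions are assumptions made for illustration. Lines that match no known pattern are treated as noise and dropped.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class Event(NamedTuple):
    source: str
    code: str


# Hypothetical table: raw wording on the left, stable code on the right.
# Only this table needs to change when a log message is reworded.
PATTERNS = [
    (re.compile(r"^(?P<src>[\w ]+?) reports halt; blockade line is active", re.I),
     "HALT_BLOCKADE_ACTIVE"),
    (re.compile(r"^(?P<src>[\w ]+?) reports halt; no power in its output line", re.I),
     "HALT_NO_OUTPUT_POWER"),
]


def normalize(line: str) -> Optional[Event]:
    """Translate one raw line into a standardized event, or None if unknown."""
    text = line.strip()
    for pattern, code in PATTERNS:
        match = pattern.match(text)
        if match:
            return Event(match.group("src"), code)
    return None


def prepare(lines: Iterable[str]) -> List[Event]:
    """Keep only the lines that map to a known event; everything else is noise."""
    candidates = (normalize(line) for line in lines)
    return [event for event in candidates if event is not None]
```

The rules then only ever see `Event` values, so the correlator's core never touches the original message text.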
