Database – Data durability guarantees in Kafka

apache-kafka database enterprise-architecture

Is it wise to use Kafka as the 'source of truth' for mission-critical data?

The setup is:

  • Kafka is the underlying source of truth for the data.
    - Querying is done on caches (e.g. Redis, KTables) hydrated from Kafka.
  • Kafka is configured for durability (infinite topic retention, replication factor of 3 or more, etc.).
  • The architecture follows the CQRS pattern (writes go to Kafka, reads come from the caches).
  • The architecture allows for eventual consistency between reads and writes.
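To make "configured for durability" a bit more concrete, here is a rough sketch of the settings usually meant by that phrase, plus a small sanity check. The config keys are real Kafka topic and producer settings; the specific values, the `durability_problems` helper, and its rules are my own assumptions for illustration, not an official checklist.

```python
# Settings commonly associated with a durability-focused Kafka setup.
# Keys are real Kafka config names; values are illustrative assumptions.
TOPIC_CONFIG = {
    "retention.ms": "-1",                       # no time-based deletion
    "retention.bytes": "-1",                    # no size-based deletion
    "min.insync.replicas": "2",                 # paired with replication factor 3
    "unclean.leader.election.enable": "false",  # never elect an out-of-sync replica
}
PRODUCER_CONFIG = {
    "acks": "all",                 # wait for all in-sync replicas to acknowledge
    "enable.idempotence": "true",  # avoid duplicates on producer retries
}


def durability_problems(replication_factor, topic, producer):
    """Hypothetical helper: list settings that weaken durability (empty = OK)."""
    problems = []
    if replication_factor < 3:
        problems.append("replication factor below 3")
    if topic.get("retention.ms") != "-1":
        problems.append("retention.ms is not infinite")
    if topic.get("retention.bytes", "-1") != "-1":
        problems.append("retention.bytes allows size-based deletion")
    if int(topic.get("min.insync.replicas", "1")) < 2:
        problems.append("min.insync.replicas below 2")
    if topic.get("unclean.leader.election.enable", "false") != "false":
        problems.append("unclean leader election is enabled")
    if producer.get("acks") not in ("all", "-1"):
        problems.append("producer does not use acks=all")
    return problems


print(durability_problems(3, TOPIC_CONFIG, PRODUCER_CONFIG))
```

Note that even with all of these set, the guarantees cover broker and disk failures, not bugs or operator mistakes that get replicated to every copy.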

We are not allowed to lose data under any circumstances.

In theory the replication should guarantee durability & resiliency. Confluent themselves encourage the above pattern.

The only flaws I can think of are:

  • cache blows up and needs to be rehydrated from scratch -> query
  • broker disk gets wiped/corrupted -> Kafka rebalance, resulting in prolonged downtime if topics contain mountains of data
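The first flaw can be sketched like this: rebuilding a cache means replaying the whole log from the first offset, so recovery time grows with the amount of data retained. In this sketch an in-memory list stands in for a Kafka topic, and `rehydrate` is a hypothetical helper, not a Kafka API. Treating a `None` value as a delete mirrors Kafka's tombstone convention.

```python
def rehydrate(log):
    """Rebuild a key-value cache by replaying every record in order."""
    cache = {}
    for key, value in log:
        if value is None:
            # Tombstone: the key was deleted at this point in the log.
            cache.pop(key, None)
        else:
            # Later records for the same key overwrite earlier ones.
            cache[key] = value
    return cache


# A list of (key, value) pairs standing in for a topic read from offset 0.
log = [("a", 1), ("b", 2), ("a", 3), ("b", None)]
print(rehydrate(log))  # {'a': 3}
```

Every record has to be read even though only the latest value per key survives, which is why a full rehydration from a very large topic can take a long time.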

Has anyone run and battle-tested this kind of setup in production? I.e. encountered disk corruption or brokers going down, but still retained the data?

My gut feeling is that this is a bad setup, because Kafka wasn't originally designed for RDBMS levels of durability, but I can't point to a concrete reason why this would be the case.

Best Answer

We are not allowed to lose data in any circumstances

Another possible flaw could be in Kafka itself. I mean that in some rare circumstances (a potential future version of) Kafka could corrupt its own data.

It happened to Postgres several times. I remember a recent issue which led to corrupted data with different glibc versions on master and replica. Also, until Postgres started to handle errors during fsync with panic (shutdown), it could also corrupt data. That's for a product that is dedicated to persisting your data, albeit with more features, therefore larger bug surface.

In my opinion, the lesson is to keep an archive of your data stored separately in colder storage, where there's a lower chance that some process will mess it up. Also, having two storage systems, e.g. Kafka and S3, means that if one of them messes up your data, it's less likely that the other one does so at the same time.
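As a minimal sketch of that idea, assuming a local directory stands in for the cold store (S3 or similar) and that records arrive as JSON-serializable dicts: each batch is written together with a SHA-256 checksum, so corruption in the archive can be detected later. The names `archive_batch` and `verify_batch` are hypothetical, not part of any Kafka or S3 API.

```python
import hashlib
import json
from pathlib import Path


def archive_batch(archive_dir, name, records):
    """Write records as JSON lines plus a checksum file next to them."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in records).encode("utf-8")
    (archive_dir / f"{name}.jsonl").write_bytes(payload)
    (archive_dir / f"{name}.sha256").write_text(hashlib.sha256(payload).hexdigest())


def verify_batch(archive_dir, name):
    """Return True if the archived batch still matches its stored checksum."""
    archive_dir = Path(archive_dir)
    payload = (archive_dir / f"{name}.jsonl").read_bytes()
    expected = (archive_dir / f"{name}.sha256").read_text().strip()
    return hashlib.sha256(payload).hexdigest() == expected
```

In a real setup the batches would be read from Kafka by a separate consumer and the verification would run periodically, so a problem in either system is noticed while the other copy is still intact.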

I guess it all depends on your definition of any circumstances.
