MySQL – Is it possible to manage 20 TB of data using MySQL?

database, hadoop, hbase, MySQL

I am working on a project, and my job is to build a database system to manage about 60,000,000,000 data entries.

The project background: I have to store, in real time, a large number of messages read from about 30,000 RFID readers every second. Assuming every RFID reader generates 6,000 messages per day, I have to insert 180,000,000 entries into the database each day.
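
For reference, the arithmetic behind these numbers (the 365-day total lands close to the 60 billion entries mentioned above), expressed as a quick MySQL query:

    -- Back-of-the-envelope load check using the figures stated above:
    -- 30,000 readers x 6,000 messages per reader per day.
    SELECT
        30000 * 6000         AS rows_per_day,      -- 180,000,000
        30000 * 6000 / 86400 AS avg_rows_per_sec,  -- ~2,083 on average
        30000 * 6000 * 365   AS rows_per_year;     -- ~65,700,000,000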

A possible data entry looks like "time_stamp, Reader_ID, Tag_ID, other_msg_content".

There will be queries (SELECT) based on time range, Reader_ID, and Tag_ID. The queries won't be very complicated.
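
For illustration only, here is a minimal sketch of what such a table and the typical queries might look like in MySQL/InnoDB; the column types, index choices, and example values are assumptions, not a finished design:

    -- Hypothetical table for the entry format above (types are guesses).
    CREATE TABLE rfid_message (
        time_stamp        DATETIME        NOT NULL,
        reader_id         INT UNSIGNED    NOT NULL,
        tag_id            BIGINT UNSIGNED NOT NULL,
        other_msg_content VARCHAR(255),
        KEY idx_time        (time_stamp),
        KEY idx_reader_time (reader_id, time_stamp),
        KEY idx_tag_time    (tag_id, time_stamp)
    ) ENGINE=InnoDB;

    -- The three query shapes described above: by time range, reader, and tag.
    SELECT * FROM rfid_message
    WHERE time_stamp >= '2012-01-01' AND time_stamp < '2012-01-02';

    SELECT * FROM rfid_message
    WHERE reader_id = 12345
      AND time_stamp >= '2012-01-01' AND time_stamp < '2012-01-02';

    SELECT * FROM rfid_message
    WHERE tag_id = 9876543210
      AND time_stamp >= '2012-01-01' AND time_stamp < '2012-01-02';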

Now I am designing the database system, and I plan to use MySQL. My dumb questions are:

  1. Is it wise to use MySQL, or should I resort to Oracle (which is expensive) or HBase?

  2. If I must use MySQL, any ideas on how I can build the cluster?

  3. If I insert the messages into a single table, it will soon grow very long. I would like to use sharding techniques to split the long table into many shorter tables (see the sketch after this list).

    3.a. I want to know the proper length for a MySQL InnoDB table, i.e., after how many data entries have been inserted should I start sharding?

    3.b. Is there any good sharding proxy solution out there? I know of Spock Proxy and some others, but I need recommendations.

  4. Do I have to use MySQL Cluster? Or can I just use MySQL masters with sharding, plus replication slaves, to achieve high availability?

  5. Assuming I have to handle 20 TB of data in MySQL (for 1 year), I plan to use 20 nodes (cheap PC servers) and store 1 TB of data per node. Is this possible? Any comments are welcome.
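
As referenced in question 3, here is a hedged sketch of the kind of splitting I have in mind; the shard count, table names, and partition boundaries are made-up examples, and MySQL's built-in range partitioning is shown only as a possible alternative to a sharding proxy:

    -- Hypothetical application-level sharding (question 3): split the big
    -- table into N smaller tables keyed on the reader, e.g. N = 16.
    CREATE TABLE rfid_message_00 LIKE rfid_message;
    CREATE TABLE rfid_message_01 LIKE rfid_message;
    -- ... one table per shard, up to rfid_message_15 ...

    -- The application routes each insert using reader_id MOD 16:
    SELECT 12345 MOD 16 AS shard_no;  -- reader 12345 goes to rfid_message_09

    -- Alternative: MySQL's native range partitioning splits one logical
    -- table by time without any proxy (monthly partitions shown as examples).
    ALTER TABLE rfid_message
    PARTITION BY RANGE (TO_DAYS(time_stamp)) (
        PARTITION p2012_01 VALUES LESS THAN (TO_DAYS('2012-02-01')),
        PARTITION p2012_02 VALUES LESS THAN (TO_DAYS('2012-03-01')),
        PARTITION p_max    VALUES LESS THAN MAXVALUE
    );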

Many thanks.

Best Answer

Thoughts:

  • If you are asking these questions on a public forum, hire experts to do it for you
  • Consider Postgres and SQL Server, which will scale to this volume too
  • Do you need ACID? No = consider NoSQL
  • Design and hardware matter more than the platform
  • Don't virtualise or cut other hardware corners
  • What is your RPO/RTO?
  • Maintenance window? That is, are you really 24/7/365, i.e. 30k rows per second all the time?
  • Archiving?
  • Do you need older (say 6 months) online?
  • Budget?
  • Realistic testing required to validate architecture and design for stated load
  • 20 TB is probably an underestimate
  • 6k messages per RFID per day but 30k rows per second? There are 86.4k seconds in a day, so only 1 in 14 RFIDs writes in any given second: what about potential peak loads of 420k+ rows per second? (worked out below)
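
Worked out with the question's own figures (the 14x factor is simply the gap between the two stated rates):

    -- Rate check: 6,000 msgs/reader/day vs. "30k rows per second".
    SELECT
        86400 / 6000                AS seconds_between_writes_per_reader, -- ~14.4
        30000 / (86400 / 6000)      AS readers_writing_per_second,        -- ~2,083 on average
        30000 * ROUND(86400 / 6000) AS possible_peak_rows_per_sec;        -- 420,000 if 30k/s itself bursts by 14x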

Finally

  • This isn't a database question but an architecture question
  • You're asking the wrong questions, too soon for this requirement