Are NoSQL databases the best choice for more efficiently querying large amounts of data

cassandradatabase-designelasticsearchmongodbredis

We are planning to build a travel website in which we will be integrating multiple APIs (eg. DOTW, GTA, Expedia) for Hotels. I have initially tried to use MySQL but since there are huge amounts of data in hotels and it may contain numerous "one to many" relationships with Images, Amenities and Rooms, the search becomes very slow when we have data for around 200000 Hotels. Even fetching all details for just one hotel may results in a JOIN query from at least four tables, and scanning over all hotels records. So we are planning to migrate our product schema to any NoSQL database to make our search as fast as possible.

Also sometimes we need to run certain schedulers on our database for eliminating duplicates from our database and also updating the newly added hotels which are sent by our providers.

Our tech stack is basically on Java, J2EE along with Springs and Hibernate.

I have read about about MongoDB, Cassandra, Redis and ElasticSearch but I am now confused if simply using these tools can optimize the website search performance. If so then what features differ between these tools that could help me make a determination?

Best Answer

I think that your search results can greatly improve through a number of techniques or database design approaches that will improve performance in your typical RDBMS. I suggest looking into and possibly prototyping the following improvements to see if they help you in performance testing first before you commit to an entirely new database technology that will require a great deal of new learning and experience to master.

Essentially you want to avoid the mindset of a "Magic Bullet". There is a misconception that NoSQL can somehow magically solve all of our problems and performance issues with the RDBMS and that might be true sometimes but you really should try improving your database design first.

Identify your Non-Functional Requirements

Specifically identify your acceptable non-functional requirements for performance. Determine what the maximum average query wait time and use that as your goal. If you can tweak your database design to achieve this then you do not need to rearchitect your software to a NonSQL solution.

Avoid binary columns

It sounds like with the Image table that you have binary type media being stored within your database tables. While this is highly dependent on your chosen database implements binary columns, it is generally accepted that binary columns can hurt the performance of your queries. Binary columns typically invalidate the benefits that an index on a table column can provide. If you don't believe me, join to the Image table and run an explain plan and notice how the index is probably not being used.

Use a Content Delivery Network

Instead of storing images and media in database records, store a URL that an application can use to retrieve that image, maybe in a browser. That URL can point to a unique image that is being stored and managed in a Content Delivery Network. There are a number of cloud services that can provide this or you can build your own with a number of tools. This should make all aspects of your application much more efficient.

Evaluate your table indexes

Make sure that if you are not using indexes that you build them for columns that you typically filter on or join against. To be honest, 4 tables is not a great deal of joins for a table so if you follow these guidelines you should see at least a modest improvement of performance.

If you follow these guidelines and you still cannot achieve your performance requirements then perhaps you can evaluate various NoSQL solutions and look for features that might help you.

Related Topic