Data Structures – How to Store and Tag Millions of Images

data structures, image, nosql

I am building an application where I need to store millions of images and later tag them. The tags attributed to the images could change over time as the tagging system evolves. Images will then be searched for by tags.

For storing the files, I have ruled out an RDBMS: I tried this in the past and ran into scaling and performance problems. I have likewise ruled out a plain file system, which also gave me performance, scalability, and backup issues. I am now considering a NoSQL key-value data store or something like Amazon S3. Is a key-value store an appropriate choice for this type of data?

For the tag data of each image, since the tag types are unknown up front, I am looking to leverage the schemaless nature of NoSQL and use either a document data store or perhaps a column-family one. What would be the key factors in deciding which type of store to use? Are there other options that I should consider?

Lastly, does it make sense to split the image data and metadata into separate stores, or is there a technology that can do both? Perhaps something like a key-value store that also allows adding metadata and querying against it?

Update: I have seen the previous answers, but they are a few years old and do not seem to leverage contemporary technologies. Can someone please comment on whether RDBMS + filesystem is still the best way to do this, or are there newer and improved solutions?

Best Answer

The answer depends on scale, where the application will be hosted, cost, and management overhead. If you know you are going to host in AWS, you can take advantage of the distributed services that make the cloud easier to scale.

First Decision: Self Hosted vs the Cloud

The old answers (circa 2014) reflect the mindset of a time when self-hosting was still predominant. However, there are reasons to look outside of an RDBMS for tag-related queries.

Filesystem hosting requires that you manage your NAS or SAN yourself and ensure you have enough capacity provisioned, plus the expertise to improve performance and capacity as necessary. It can be very expensive if the costs are not amortized across several applications.

The cloud allows you to use AWS S3 or your cloud provider's equivalent blob storage. You are charged only for the storage you use, and cloud blob storage provides the capacity and performance needed as your application grows.
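As a rough sketch of how images could be laid out in a blob store, the snippet below derives an object key from the image bytes. The content-hash naming scheme, the `images/` prefix, and the function name `object_key_for` are illustrative assumptions, not something S3 requires. The upload call itself is omitted because it depends on the SDK you use.

```python
import hashlib


def object_key_for(image_bytes: bytes, extension: str = "jpg") -> str:
    """Derive a blob-store key from the image content (hypothetical scheme).

    Hashing the bytes means the same image always maps to the same key,
    so re-uploading a duplicate overwrites itself instead of adding a copy.
    """
    digest = hashlib.sha256(image_bytes).hexdigest()
    return f"images/{digest}.{extension}"


# The key is what the tag/metadata store would keep as its pointer to the blob.
key = object_key_for(b"raw image bytes go here", "png")
print(key)
```

The metadata store then only needs to hold this key next to the tags, which keeps the large binary data out of the searchable store.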

Second Decision: RDBMS or Search

Storing tags in a relational database, as opposed to a document store, makes it harder to query for the records associated with those tags. This is even more true when you are looking for intersections between tags (i.e. documents that have two or more tags in common). The queries slow down as they get more complicated.
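To make the relational case concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The three-table schema (`image`, `tag`, `image_tag`) and the function name `images_with_all_tags` are assumptions chosen for illustration. The intersection needs a join, a `GROUP BY`, and a `HAVING` count, which is the kind of query that gets heavier as tables and tag lists grow.

```python
import sqlite3


def images_with_all_tags(conn, tags):
    """Return ids of images that carry every tag in `tags`."""
    tags = sorted(set(tags))
    if not tags:
        return []
    placeholders = ",".join("?" for _ in tags)
    sql = f"""
        SELECT it.image_id
        FROM image_tag AS it
        JOIN tag AS t ON t.id = it.tag_id
        WHERE t.name IN ({placeholders})
        GROUP BY it.image_id
        HAVING COUNT(DISTINCT t.name) = ?   -- must match all requested tags
        ORDER BY it.image_id
    """
    return [row[0] for row in conn.execute(sql, (*tags, len(tags)))]


# Hypothetical schema and sample data, held in memory for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE image (id INTEGER PRIMARY KEY, storage_key TEXT NOT NULL);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE image_tag (
        image_id INTEGER NOT NULL REFERENCES image(id),
        tag_id INTEGER NOT NULL REFERENCES tag(id),
        PRIMARY KEY (image_id, tag_id)
    );
    INSERT INTO image VALUES (1, 'a.jpg'), (2, 'b.jpg'), (3, 'c.jpg');
    INSERT INTO tag VALUES (1, 'beach'), (2, 'sunset'), (3, 'dog');
    INSERT INTO image_tag VALUES (1, 1), (1, 2), (2, 1), (3, 1), (3, 2), (3, 3);
""")

print(images_with_all_tags(conn, ["beach", "sunset"]))
```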

ElasticSearch, Solr, and similar search servers that can double as a document store provide an ideal middle ground. Many cloud providers offer hosted versions of them. They are designed to scale to very large sizes and perform searches very quickly. In fact, this site (softwareengineering.stackexchange.com) uses ElasticSearch to do queries like this. NOTE: ElasticSearch is also a NoSQL DB in addition to being a search server.
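For comparison, a document-store version of a tag intersection might look like the sketch below. It only builds the JSON bodies and does not talk to a server. The field names `storage_key` and `tags` and both helper functions are hypothetical, and the query assumes `tags` is mapped as a `keyword` field so that `term` matches exact tag values.

```python
import json


def build_image_document(storage_key, tags):
    """One document per image: a pointer to the blob plus its tags."""
    return {"storage_key": storage_key, "tags": sorted(set(tags))}


def build_all_tags_query(tags):
    """Query body matching documents that carry every tag in `tags`.

    Each `term` clause in a bool `filter` must match, so the clauses
    combine as a logical AND without relevance scoring.
    """
    return {
        "query": {
            "bool": {
                "filter": [{"term": {"tags": tag}} for tag in sorted(set(tags))]
            }
        }
    }


doc = build_image_document("images/example.jpg", ["beach", "sunset"])
query = build_all_tags_query(["sunset", "beach"])
print(json.dumps(doc))
print(json.dumps(query, indent=2))
```

Because the tags live on the document as a plain list, changing the tagging scheme later means rewriting that list rather than migrating a schema.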

I will say that you can't think in relational terms when you are doing document searches, so there is a learning curve.

An added bonus is that, at least on AWS, ElasticSearch costs less than an RDBMS at the same size tier.

Bottom Line

Millions of records is not astronomical for today's RDBMSs. However, you will eventually reach a saturation point. Many websites still use an RDBMS as the system of record and then synchronize it with a search server that does the heavy lifting. That decision really depends on things outside the scope of this question.
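A minimal sketch of that synchronization step, assuming a hypothetical two-table schema (`image`, `image_tag`) in the database of record: the rows are collapsed into one document per image, ready to be sent to whatever search server you choose. The indexing call itself is left out because it depends on the client library.

```python
import sqlite3
from collections import defaultdict


def search_documents(conn):
    """Collapse (image, tag) rows into one search document per image."""
    rows = conn.execute(
        "SELECT i.id, i.storage_key, t.name "
        "FROM image AS i LEFT JOIN image_tag AS t ON t.image_id = i.id "
        "ORDER BY i.id, t.name"
    )
    keys = {}
    tags_by_image = defaultdict(list)
    for image_id, storage_key, tag in rows:
        keys[image_id] = storage_key
        if tag is not None:  # LEFT JOIN yields NULL for untagged images
            tags_by_image[image_id].append(tag)
    return [
        {"id": image_id, "storage_key": key, "tags": tags_by_image[image_id]}
        for image_id, key in keys.items()
    ]


# Hypothetical database of record, held in memory for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE image (id INTEGER PRIMARY KEY, storage_key TEXT NOT NULL);
    CREATE TABLE image_tag (image_id INTEGER NOT NULL, name TEXT NOT NULL);
    INSERT INTO image VALUES (1, 'a.jpg'), (2, 'b.jpg');
    INSERT INTO image_tag VALUES (1, 'sunset'), (1, 'beach');
""")

for document in search_documents(conn):
    print(document)
```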

The ElasticSearch/S3 route will scale well beyond that. However, do your research. There are tradeoffs that you have to weigh. In my case this choice was the right one.