Database Design – Preventing Duplicate Records


A web service that I call returns a list of data. The data from the web service is updated periodically, so a call made in one hour could return the same data as a call made an hour later. The data is also returned based on a start and end date.

We have multiple users who can run the web-service search, so duplicate data is very likely to be returned (especially for historical data). However, I don't want to insert this duplicate data into the database.

I've created a database table in which the data is stored. The most important columns are:

Id int autoincrement PK
Date date not null        -- The date to which the data set belongs.
LastUpdate date not null  -- The date the data set was last updated.
UserName varchar(50)      -- The name of the user doing the search.
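
For reference, here is a minimal T-SQL sketch of that table. The table name SearchResult and the NOT NULL on UserName are assumptions; only the columns listed above are shown.

DECLARE @sketch int; -- placeholder so this block is self-contained when pasted
GO

-- Sketch only: names other than the listed columns are assumed.
CREATE TABLE dbo.SearchResult
(
    Id         int IDENTITY(1,1) NOT NULL
                   CONSTRAINT PK_SearchResult PRIMARY KEY, -- surrogate key
    [Date]     date NOT NULL,         -- the date the data set belongs to
    LastUpdate date NOT NULL,         -- the date the data set was last updated
    UserName   varchar(50) NOT NULL   -- the user who ran the search (nullability assumed)
);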

I'm using SQL Server 2008 Express with C# 4.0 and Visual Studio 2010, and Entity Framework as the ORM. If stored procedures can be avoided in the proposed solution, that would be a plus.

Another way of interpreting what I'm asking for is this:
I have a million unique records in my table. A user runs a new search, and the results contain around 300k records that are already in the database. I need an efficient way to find and insert only the records that aren't already there.

A combination of the Date, LastUpdate and UserName makes a record unique.

Best Answer

Well, the obvious solution is to have a unique key on the columns that make the row unique.

A combination of the Date, LastUpdate and UserName makes a record unique.

Alternatively, you might just get rid of the surrogate key and use the above as the primary key (depends upon where else you are using it).
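
A sketch of both options, assuming the SearchResult table and constraint names from the question's schema (the constraint names themselves are arbitrary):

-- Option 1: keep the surrogate key and add a unique constraint
-- on the natural key (Date, LastUpdate, UserName).
ALTER TABLE dbo.SearchResult
    ADD CONSTRAINT UQ_SearchResult_Date_LastUpdate_UserName
    UNIQUE ([Date], LastUpdate, UserName);

-- Option 2: drop the surrogate key and promote the natural key to primary key.
-- Only do this if nothing else references Id.
-- ALTER TABLE dbo.SearchResult DROP CONSTRAINT PK_SearchResult;
-- ALTER TABLE dbo.SearchResult DROP COLUMN Id;
-- ALTER TABLE dbo.SearchResult
--     ADD CONSTRAINT PK_SearchResult PRIMARY KEY ([Date], LastUpdate, UserName);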

Inserts should be done using MERGE, which will allow you to insert a record only when it doesn't already exist.
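
A rough sketch of such a MERGE, again assuming the SearchResult table above; @NewRows is an assumed table variable standing in for the freshly fetched web-service results:

-- @NewRows would be populated from the web-service response
-- (e.g. via a table-valued parameter or a staging table).
DECLARE @NewRows TABLE
(
    [Date]     date NOT NULL,
    LastUpdate date NOT NULL,
    UserName   varchar(50) NOT NULL
);

-- Insert only the rows that are not already present,
-- matching on the natural key (Date, LastUpdate, UserName).
MERGE dbo.SearchResult AS target
USING @NewRows AS source
    ON  target.[Date]     = source.[Date]
    AND target.LastUpdate = source.LastUpdate
    AND target.UserName   = source.UserName
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([Date], LastUpdate, UserName)
    VALUES (source.[Date], source.LastUpdate, source.UserName);

With the unique constraint also in place, a race between two users inserting the same rows will at worst fail on the constraint rather than silently create duplicates.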
