From what I've seen, the "did you profile?" question always comes after "why does this run so slow?". In other words, the "benchmarking" has already been done, the result was "too slow", and now we're trying to figure out why it runs so slowly, so we go and profile it.
Real life is usually more complicated. How fast your software is depends on the architectural decisions you make, the algorithms you choose, and whether you've correctly identified and dealt with various bottlenecks and system constraints. Getting stuck optimizing a system that isn't designed for performance is an easy trap to fall into and can suck away huge amounts of time for little reward. On the other hand, not every piece of software has high performance as a requirement.
Profiling and optimizing before you benchmark, i.e. before you know whether the performance is adequate at all, is truly premature optimization.
I like this quote from Wikipedia:
"The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet." - Michael A. Jackson
Short answer
Yes, you can write a meaningful benchmark of a studied case, if you do it with care, and understand that while it may be relevant for that particular case, it may not be for other cases. This is equally true when comparing databases of the same type (one relational database vs. another) or databases of different types.
No, you cannot write a benchmark which will magically prove that a specific database is far better than another in every case, for every application.
Long answer
It's definitely possible to say that "moving from one database to another improved our site performance".
You measure the performance of the previous database through profiling or runtime statistics, gathering enough information about the queries and how fast they run.
You move the application to the new database.
You take the same measurements.
You compare.
For example, if the full list of 3 182 432 products loaded in 2.834 s. on the old database and loads in 0.920 s. on the new one, given that in both cases the application starts with an empty cache, it's a win: the new database improved your site performance for this query.
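As a minimal sketch of that before/after measurement: the two load_products_* functions below are hypothetical placeholders standing in for the real "list all products" query against each database; only the timing harness matters.

```python
import statistics
import time

def load_products_old():
    # Placeholder for the real "list all products" query against the old database.
    time.sleep(0.01)

def load_products_new():
    # Placeholder for the same query against the new database.
    time.sleep(0.005)

def time_call(func, repetitions=5):
    """Call func repeatedly and return the elapsed times in seconds."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings

old_times = time_call(load_products_old)
new_times = time_call(load_products_new)
print(f"old median: {statistics.median(old_times):.3f} s")
print(f"new median: {statistics.median(new_times):.3f} s")
```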
Now, like any performance metric, this one is biased:
Agreed, the new query is faster. But wait, your DBA didn't know how to use the database you had before, so the query which loads all the products was never optimized. Had you rewritten it properly, you would have been able to load those products in 0.855 s. instead of 2.834 s. on the old database.
Ok, you have a better result. But don't you think it's unfair to compare a database with freshly loaded data against a ten-year-old database whose last maintenance plan ran three years ago? By the way, don't you think you should have updated the database product at least once during the last four years?
Some queries are faster. Some are slower. How do you compute an overall result to know whether you gained performance when moving to the new database? Sure, loading all 3 182 432 products is faster. But does it matter, when that query only runs on the rare occasion that an administrator performs a specific task he has done twice in the last ten years? On the other hand, running all the queries on the home page for a fresh visitor takes 0.281 s. with the new database, compared to 0.207 s. with the old one. That result matters much more, especially since those queries cannot be cached for long and are executed tens of thousands of times per day (see the frequency-weighted sketch below).
Both databases must be tested on the same servers, with the same hardware and the same structure. For example, you can't test one database on a single hard drive and the other on a RAID 1 array of two SSDs. When you migrate a large project to a new database, chances are you'll host the new database on a hundred newly deployed rack servers, while the previous database remains on the old machines.
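One way to give the individual numbers an overall meaning is to weight each query by how often it actually runs. The figures below are purely illustrative (the two timings quoted above plus an invented third query), but they show why a rarely run admin query should barely count:

```python
# Hypothetical per-query timings (seconds) and daily execution counts.
# The first and last rows reuse the figures quoted above; the middle one is invented.
measurements = {
    #                    (old_db_s, new_db_s, runs_per_day)
    "home_page":          (0.207,    0.281,    40_000),
    "product_search":     (0.120,    0.095,    15_000),
    "full_product_list":  (2.834,    0.920,    1),      # rare admin task
}

def daily_seconds(column):
    """Total time per day spent in these queries for one database (0 = old, 1 = new)."""
    return sum(row[column] * row[2] for row in measurements.values())

print(f"old database: {daily_seconds(0):,.0f} s per day")
print(f"new database: {daily_seconds(1):,.0f} s per day")
```

With these illustrative numbers, the new database wins the spectacular admin query but loses overall, because the home page dominates the daily workload.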
To summarize, you can benchmark the database queries of an application and obtain precise metrics. But then you have to give those numbers a meaning. At this point, it's tempting to say that you gained site performance: otherwise, management would be angry to learn that you've spent thousands of dollars and months of work just to make things slower.
The most terrible mistake is to take those benchmark results and draw some nonsense conclusion like "Microsoft SQL Server is three times faster than Oracle": saying this is like saying that "Java is better than PHP". Define better. Better in what cases? For what sort of applications? For what team of developers?
The more you interpret and generalize, the more irrelevant and meaningless the result becomes.
The query select [...] that you can find in revision #832, file ProductFactory.cs, line 117, executes in under 0.5 s. with the new database when tested under the conditions specified in non-functional requirements annex M, case 3. This allows passing non-functional requirement 527 (see page 80, revision 9). The same requirement was not satisfied with the previous database, where the test results were in the range 0.9 to 1.3 s. under the same conditions.
is meaningful for a developer, and precise enough to know what was tested, how, and what the results were. This answers your question number 2.
Sadly, it doesn't make any sense for the management. Instead:
Migrating our product from MySQL to the newest version of Microsoft SQL Server improved the overall performance of our product by a factor of five, while cutting costs in half and the environmental footprint by a factor of three. We believe that migrating all our applications to Microsoft SQL Server next year will give even better results and increase our market competitiveness.
is pure marketing jibber-jabber and, technically, doesn't mean anything, but surprisingly has value for the management and marketing departments.
Finally, can we compare different types of databases? I would say it's totally possible. Let's say I have a website hosting large photos. Those photos are stored in varbinary(max) columns in Microsoft SQL Server 2005 (so I can't use filestream). I am concerned about the performance when loading those photos, so I decide to store the photos as files instead, using the file system as my new database. At first, those files are stored on the same machine as the database. I profile the new solution, and the results show that in my case, files load 4% faster from the file system than from Microsoft SQL Server. The benchmark is very clear. Now I can think about deploying a dedicated server optimized for direct file storage, rather than using a server optimized for Microsoft SQL Server.
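A rough sketch of such a comparison might look like the code below. It assumes a hypothetical Photos table with Id and Data columns reachable through the pyodbc driver, the same photos exported as files on local disk, and a hypothetical "photos_db" data source name; none of these names come from the answer itself.

```python
import time
from pathlib import Path

import pyodbc  # third-party SQL Server driver; any DB-API module would do

def load_from_sql(photo_id, conn):
    # Read one photo's bytes from the (hypothetical) varbinary(max) column.
    cursor = conn.cursor()
    cursor.execute("SELECT Data FROM Photos WHERE Id = ?", photo_id)
    return cursor.fetchone()[0]

def load_from_disk(photo_id, root=Path("/var/photos")):
    # Read the same photo exported as a plain file.
    return (root / f"{photo_id}.jpg").read_bytes()

def total_time(load, ids, *args):
    start = time.perf_counter()
    for photo_id in ids:
        load(photo_id, *args)
    return time.perf_counter() - start

conn = pyodbc.connect("DSN=photos_db")  # hypothetical data source name
ids = range(1, 1001)
print("SQL Server :", total_time(load_from_sql, ids, conn), "s")
print("File system:", total_time(load_from_disk, ids), "s")
```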
Best Answer
The lowest timing might indeed represent the "true" timing without outside interference, or it might be a measurement error. E.g. the boosting behaviour of a CPU might speed up the first run in a larger benchmarking suite, or a less congested network might speed up a network-dependent benchmark at certain times of day. Similarly, the highest timing might represent the true worst case, or non-representative interference. E.g. a background service might have started during the benchmark, or an SMR hard drive's cache might be flushed during an IO-bound benchmark.
Such interference indicates a flawed experimental design that fails to control for these influences, but it's not always possible or economical to design the perfect experiment. So we have to deal with the messy real-world data that we have.
Statistics like the mean (average) are very sensitive to outliers. It is thus common to use a trimmed mean, where we remove outliers in the hope of getting closer to the "true" mean. Various methods for identifying outliers exist; the simplest approach is to remove the top and bottom p% of values, for some value of p. Another option is to use techniques like bootstrapping, which let us estimate how reliable the estimate is: instead of removing the top and bottom observations, we resample the observations at random (with replacement) and repeat the calculation many times to see how much the result varies.
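Here is a small illustration of both ideas, a trimmed mean and a simple bootstrap of the mean; the timings are made up for the example:

```python
import random
import statistics

# Made-up timings in seconds, with one obvious outlier.
timings = [0.92, 0.95, 0.93, 0.91, 0.94, 0.96, 0.93, 0.90, 0.92, 2.10]

def trimmed_mean(values, proportion=0.1):
    """Drop the lowest and highest `proportion` of values, then average the rest."""
    values = sorted(values)
    k = int(len(values) * proportion)
    return statistics.mean(values[k:len(values) - k] if k else values)

def bootstrap_means(values, resamples=1000):
    """Resample with replacement and recompute the mean, to see how much it varies."""
    return sorted(statistics.mean(random.choices(values, k=len(values)))
                  for _ in range(resamples))

print("plain mean  :", round(statistics.mean(timings), 3))
print("trimmed mean:", round(trimmed_mean(timings), 3))
means = bootstrap_means(timings)
print("bootstrap 95% interval:", round(means[25], 3), "to", round(means[975], 3))
```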
However, it is not generally necessary to calculate the mean run time when benchmarking. For comparing typical behaviour, we can use measures like the median or other quantiles. Especially when measuring latencies, quantiles such as the 95th percentile are often used (meaning: 95% of measurements were this fast or faster).
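For instance, with Python's standard library (the latency values are illustrative):

```python
import statistics

# Illustrative request latencies in milliseconds, including a few slow outliers.
latencies_ms = [12, 14, 13, 15, 12, 90, 13, 14, 16, 12, 13, 250, 14, 13, 15]

median = statistics.median(latencies_ms)
p95 = statistics.quantiles(latencies_ms, n=100)[94]  # 95th percentile

print(f"median: {median} ms, p95: {p95} ms")
```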
It is also unnecessary to calculate the mean when trying to determine whether one program is significantly faster than another. Instead of parametric statistical tests, which require such parameters to be estimated from the sample, it is possible to use non-parametric tests. E.g. the Mann–Whitney rank-sum test only considers the order of values from the two samples, not their actual values. While this is more flexible and makes fewer assumptions, it does lose some statistical power (you might not be able to detect a significant difference even if it exists).
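With SciPy, such a test is a few lines; the run times below are illustrative, not measurements from the answer:

```python
from scipy.stats import mannwhitneyu

# Illustrative run times (seconds) for the same workload on two systems.
old_db = [2.83, 2.79, 2.91, 2.85, 2.88, 2.80, 2.86]
new_db = [0.92, 0.95, 0.90, 0.97, 0.93, 0.91, 0.94]

stat, p_value = mannwhitneyu(old_db, new_db, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The difference in run times is statistically significant.")
else:
    print("No significant difference detected at the 5% level.")
```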