Simple Popularity Algorithm

algorithmpopularity

Summary

As Ted Jaspers wisely pointed out, the methodology I described in the original proposal back in 2012 is actually a special case of an exponential moving average. The beauty of this approach is that it can be calculated recursively, meaning you only need to store a single popularity value with each object and then you can recursively adjust this value when an event occurs. There's no need to record every event.

This single popularity value represents all past events (within the limits of the data type being used), but older events begin to matter exponentially less as new events are factored in. This algorithm will adapt to different time scales and will respond to varying traffic volumes. Each time an event occurs, the new popularity value can be calculated using the following formula:

(a * t) + ((1 - a) * p)

  • a — coefficient between 0 and 1 (higher values discount older events faster)
  • t — current timestamp
  • p — current popularity value (e.g. stored in a database)

Reasonable values for a will depend on your application. A good starting place is a=2/(N+1), where N is the number of events that should significantly affect the outcome. For example, on a low-traffic website where the event is a page view, you might expect hundreds of page views over a period of a few days. Choosing N=100 (a≈0.02) would be a reasonable choice. For a high-traffic website, you might expect millions of page views over a period of a few days, in which case N=1000000 (a≈0.000002) would be more reasonable. The value for a will likely need to be gradually adjusted over time.

To illustrate how simple this popularity algorithm is, here's an example of how it can be implemented in Craft CMS in 2 lines of Twig markup:

{% set popularity = (0.02 * date().timestamp) + (0.98 * entry.popularity) %}
{% do entry.setFieldValue("popularity", popularity) %}

Notice that there's no need to create new database tables or store endless event records in order to calculate popularity.

One caveat to keep in mind is that exponential moving averages have a spin-up interval, so it takes a few recursions before the value can be considered accurate. This means the initial condition is important. For example, if the popularity of a new item is initialized using the current timestamp, the item immediately becomes the most popular item in the entire set before eventually settling down into a more accurate position. This might be desirable if you want to promote new content. Alternatively, you may want content to work its way up from the bottom, in which case you could initialize it with the timestamp of when the application was first launched. You could also find a happy medium by initializing the value with an average of all popularity values in the database, so it starts out right in the middle.


Original Proposal

There are plenty of suggested algorithms for calculating popularity based on an item's age and the number of votes, clicks, or purchases an item receives. However, the more robust methods I've seen often require overly complex calculations and multiple stored values which clutter the database. I've been contemplating an extremely simple algorithm that doesn't require storing any variables (other than the popularity value itself) and requires only one simple calculation. It's ridiculously simple:

p = (p + t) / 2

Here, p is the popularity value stored in the database and t is the current timestamp. When an item is first created, p must be initialized. There are two possible initialization methods:

  1. Initialize p with the current timestamp t
  2. Initialize p with the average of all p values in the database

Note that initialization method (1) gives recently added items a clear advantage over historical items, thus adding an element of relevance. On the other hand, initialization method (2) treats new items as equals when compared to historical items.

Let's say you use initialization method (1) and initialize p with the current timestamp. When the item receives its first vote, p becomes the average of the creation time and the vote time. Thus, the popularity value p still represents a valid timestamp (assuming you round to the nearest integer), but the actual time it represents is abstracted.

With this method, only one simple calculation is required and only one value needs to be stored in the database (p). This method also prevents runaway values, since a given item's popularity can never exceed the current time.

An example of the algorithm at work over a period of 1 day: http://jsfiddle.net/q2UCn/
An example of the algorithm at work over a period of 1 year: http://jsfiddle.net/tWU9y/

If you expect votes to steadily stream in at sub-second intervals, then you will need to use a microsecond timestamp, such as the PHP microtime() function. Otherwise, a standard UNIX timestamp will work, such as the PHP time() function.

Now for my question: do you see any major flaws with this approach?

Best Answer

I think this is a very good approach, given its simplicity. A very interesting result.

I made a quick set of calculations and found that this algorithm does seem to understand what "popularity" means. Its problem is that it has a clear tendency to favor recent votes like this:

Imagine we take the time and break it into discrete timestamp values ranging from 100 to 1000. Assume that at t=100 both A and B (two items) have the same P = 100.

    A gets voted 7 times on 200, 300, 400, 500, 600, 700 and 800
resulting on a final Pa(800) = 700 (aprox).

    B gets voted 4 times on 300, 500, 700 and 900 
resulting on a final Pb(900) = 712 (aprox).

When t=1000 comes, both A and B receive votes, so:

Pa(1000) = 850 with 8 votes
Pb(1000) = 856 with 5 votes

Why? because the algorithm allows an item to quickly beat historical leaders if it receives more recent votes (even if the item has fewer votes in total).

EDIT INCLUDING SIMULATION

The OP created a nice fiddle that I changed to get the following results:

http://jsfiddle.net/wBV2c/6/

Item A receives one vote each day from 1970 till 2012 (15339 votes)
Item B receives one vote each month from Jan to Jul 2012 (7 votes)

The result: B is more popular than A.