R – Interpreting Density Plot in R

ggplot2, kernel-density, r

I have a list of ages in days and I am looking to display them in years on a density plot.

I did this two ways: by changing the labels on the x axis to years, and by dividing the data by 365. These methods give me different density estimates:

library(ggplot2)
df <- data.frame(id = 1:80000, age = rnorm(80000, 46, 5) * 365)

The first plot is generated using:

breaks <- seq(from = min(df$age), to = max(df$age), by = 10*365)
ggplot(data = df, aes(x = age)) + 
    geom_density(aes(y = ..density..)) + 
    scale_x_continuous(breaks= breaks, labels = floor(breaks/365))

The density displayed on the y-axis ranges from 0 to 0.0002

When I do this however (divide the ages by 365 to get years – not just change the x labels like above):

ggplot(data = df, aes(x = age/365)) + 
    geom_density(aes(y = ..density..))

The plot looks the same but the density ranges from 0 to 0.08
I am struggling to understand what is going on – why is the density different between the two plots?


Best Answer

The density is different in the two plots because in one case you have 365 times as many units horizontally, so the vertical units must be 1/365th those of the other plot, given that the area under a probability density curve must equal one.
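
As a rough check of that claim, here is a minimal sketch using base R's density() on data simulated the same way as in the question. The seed and the helper name area_under are my own choices for illustration, not part of the original post. It approximates the area under each estimate with the trapezoid rule and compares the two peaks:

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
age_days  <- rnorm(80000, 46, 5) * 365
age_years <- age_days / 365

# Hypothetical helper: trapezoid-rule area under a kernel density estimate
area_under <- function(d) {
  n <- length(d$x)
  sum(diff(d$x) * (d$y[-1] + d$y[-n]) / 2)
}

d_days  <- density(age_days)
d_years <- density(age_years)

area_days  <- area_under(d_days)    # close to 1
area_years <- area_under(d_years)   # close to 1
peak_ratio <- max(d_years$y) / max(d_days$y)  # close to 365
```

Both areas should come out very close to one, while the heights differ by a factor of about 365.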

This is easier to think about in terms of bins rather than density curves. If you have one bin replacing 365 bins, the probability of landing in the one bin is 365 times the average probability of landing in one of the individual bins.
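
The bin argument can be sketched directly. This is only an illustration on simulated data (the seed and the choice of the 46-to-47-year range are assumptions of mine): it compares the share of observations in one year-wide bin with the shares in the 365 day-wide bins covering the same range.

```r
set.seed(1)  # arbitrary seed for reproducibility
age_days <- rnorm(80000, 46, 5) * 365

# One year-wide bin: ages from 46 up to (but not including) 47 years
in_year_bin <- mean(age_days >= 46 * 365 & age_days < 47 * 365)

# The same range cut into 365 day-wide bins
day_edges <- seq(46 * 365, 47 * 365, by = 1)
day_bin   <- findInterval(age_days, day_edges)  # 0 below the range, 366 at or above the top edge
inside    <- day_bin >= 1 & day_bin <= 365
in_day_bins <- tabulate(day_bin[inside], nbins = 365) / length(age_days)

sum(in_day_bins)   # same share as the single year-wide bin
mean(in_day_bins)  # the year-wide share divided by 365
```

The day-wide shares add up to the year-wide share, so each one is on average 365 times smaller, which is the same rescaling you see on the y-axis.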

For the specific sample data you provide, we can see the conversion between the vertical units by looking at the peaks of both functions:

> max(density(df$age)$y) # max of density in days, more horizontal units
[1] 0.0002178977
> df$ageinyears <- df$age/365 # create an age-in-years variable
> max(density(df$ageinyears)$y) # max density in years, fewer horizontal units
[1] 0.07953267
> max(density(df$age)$y)*365 
[1] 0.07953267
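
These numbers also line up with the theoretical normal density that the sample was drawn from. As a quick sanity check (not part of the original answer), the peak of a normal density is 1 / (sd * sqrt(2 * pi)), so it shrinks by exactly the factor by which sd grows:

```r
# Peak of the true density with age measured in years: N(46, 5)
peak_years <- dnorm(46, mean = 46, sd = 5)

# Peak of the true density with age measured in days: N(46 * 365, 5 * 365)
peak_days <- dnorm(46 * 365, mean = 46 * 365, sd = 5 * 365)

peak_years / peak_days  # the same factor of 365
```

The kernel estimates above sit slightly below these theoretical peaks, which is expected because kernel smoothing flattens the curve a little.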

The practical reason this matters for plotting (and possibly the main thrust of your question) is that the function estimating the density for ggplot inherits the x aesthetic from the parent aes(), so it sees ages in days. It does not know anything about the custom x-axis you are using, which only changes how the tick marks are labeled. Rather than just changing the x-axis in your first plot, you could explicitly tell geom_density not to use the inherited x values:

ggplot(data = df, aes(x = age)) + 
    geom_density(aes(x = age/365, y = ..density..))
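
If you would rather keep x in days with relabeled ticks, another option is to rescale the computed density instead. The following is a sketch rather than a definitive recipe: it assumes a ggplot2 version that provides after_stat() (3.3.0 or later, where it replaces the older ..density.. notation), and the seed and axis titles are my own additions.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed for reproducibility
df <- data.frame(id = 1:80000, age = rnorm(80000, 46, 5) * 365)

breaks <- seq(from = min(df$age), to = max(df$age), by = 10 * 365)

# x stays in days; the density is multiplied by 365 to express it per year
p <- ggplot(data = df, aes(x = age)) +
    geom_density(aes(y = after_stat(density * 365))) +
    scale_x_continuous(breaks = breaks, labels = floor(breaks / 365)) +
    labs(x = "age (years)", y = "density (per year)")

# Inspect the computed layer data to confirm the rescaled peak
peak <- max(layer_data(p, 1)$y)
```

Here peak should be on the same scale as the years-based plot (around 0.08 for this kind of sample) even though the underlying x values are still in days.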