R – ggplot2 box-whisker plot: show 95% confidence intervals & remove outliers

ggplot2, plot, r

I'd like a box plot that looks like the one below, but with two changes from the default: (1) it should show 95% confidence intervals, and (2) it should omit the outliers.

The 95% confidence intervals could mean (i) extending the boxes and removing the whiskers, or (ii) having just a mean and whiskers, and removing the boxes. If people have other ideas for presenting 95% confidence intervals in a plot like this, I'm open to suggestions. The final goal is to show the mean and confidence intervals for data across multiple categories on the same plot.

library(ggplot2)

set.seed(1234)
# Two groups of 200 normal draws; group B is shifted up by 0.8
df <- data.frame(cond = factor(rep(c("A", "B"), each = 200)),
                 rating = c(rnorm(200), rnorm(200, mean = 0.8)))
ggplot(df, aes(x = cond, y = rating, fill = cond)) + geom_boxplot() +
    guides(fill = "none") + coord_flip()

Image and code source: http://www.cookbook-r.com/Graphs/Plotting_distributions_(ggplot2)/

Best Answer

I've used the following to show a 95% interval. Based on what I've read, this is not an uncommon use of a box-and-whisker plot, but it is not the default, so you need to make clear what the graph shows.

quantiles_95 <- function(x) {
  r <- quantile(x, probs=c(0.05, 0.25, 0.5, 0.75, 0.95))
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}

ggplot(df, aes(x = cond, y = rating, fill = cond)) +
    guides(fill = "none") +   # hide the redundant fill legend
    coord_flip() +
    stat_summary(fun.data = quantiles_95, geom = "boxplot")

Instead of using geom_boxplot, use stat_summary with a custom function that specifies the limits you want to use:

  • "ymin" is the lower limit of the lower whisker
  • "lower" is the lower limit of the lower box
  • "middle" is the middle of the box (typically the median)
  • "upper" is the upper limit of the upper box
  • "ymax" is the upper limit of the upper whisker.

In the provided function (quantiles_95), the built-in quantile function is used with a custom probs argument. As given, the whiskers will span 90% of your data: from the 5th percentile to the 95th percentile. The boxes will span the middle two quartiles, as usual, from the 25th to the 75th percentile.
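If the whiskers should enclose the central 95% of the observations rather than 90%, the same pattern works with the 2.5th and 97.5th percentiles. The sketch below is only an illustration: quantiles_central_95 is a hypothetical name, not part of ggplot2, and the data are the simulated values from the question.

```r
library(ggplot2)

# Same simulated data as in the question.
set.seed(1234)
df <- data.frame(cond = factor(rep(c("A", "B"), each = 200)),
                 rating = c(rnorm(200), rnorm(200, mean = 0.8)))

# Hypothetical variant of quantiles_95: the whiskers run from the 2.5th to
# the 97.5th percentile, so they enclose the central 95% of the observations.
# The box still spans the 25th to 75th percentile, with the median inside.
quantiles_central_95 <- function(x) {
  r <- quantile(x, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}

ggplot(df, aes(x = cond, y = rating, fill = cond)) +
  stat_summary(fun.data = quantiles_central_95, geom = "boxplot") +
  guides(fill = "none") +
  coord_flip()
```

Note that this is still a range of the data, not a confidence interval for the mean, so the caption should say which percentiles the whiskers represent.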

You can always change the custom function to choose different quantiles (or even to not use quantiles), but you need to be very careful with this. As pointed out in a comment, there is a certain expectation when one sees a box and whisker plot. If you're using the same shape plot to convey different information, you're likely to confuse people.

If you want to get rid of the whiskers, make "ymin" equal to "lower" and "ymax" equal to "upper". If you want to have only whiskers and no box, set "upper" and "lower" both equal to "middle" (or just use geom_errorbar).
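If what is wanted is a confidence interval for the mean rather than a range of the data, both layouts from the question can be sketched along these lines. The helpers mean_ci_95 and ci_box are hypothetical names introduced here, and the interval is a t-based one, which assumes roughly normal data or a reasonably large sample in each group.

```r
library(ggplot2)

# Same simulated data as in the question.
set.seed(1234)
df <- data.frame(cond = factor(rep(c("A", "B"), each = 200)),
                 rating = c(rnorm(200), rnorm(200, mean = 0.8)))

# Hypothetical helper: mean with a t-based 95% confidence interval.
# Assumption: roughly normal data (or a large sample) within each group.
mean_ci_95 <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  half_width <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  data.frame(y = mean(x),
             ymin = mean(x) - half_width,
             ymax = mean(x) + half_width)
}

# Option (i): box only. The box spans the interval and the centre line is
# the mean; the whiskers collapse because ymin == lower and ymax == upper.
ci_box <- function(x) {
  ci <- mean_ci_95(x)
  data.frame(ymin = ci$ymin, lower = ci$ymin, middle = ci$y,
             upper = ci$ymax, ymax = ci$ymax)
}

ggplot(df, aes(x = cond, y = rating, fill = cond)) +
  stat_summary(fun.data = ci_box, geom = "boxplot") +
  guides(fill = "none") +
  coord_flip()

# Option (ii): no box. A point marks the mean and an error bar the interval.
ggplot(df, aes(x = cond, y = rating, colour = cond)) +
  stat_summary(fun.data = mean_ci_95, geom = "errorbar", width = 0.2) +
  stat_summary(fun.data = mean_ci_95, geom = "point", size = 3) +
  guides(colour = "none") +
  coord_flip()
```

With 200 observations per group the intervals are narrow, so option (ii) is usually easier to read. Either way, the plot should be labelled as showing the mean and its 95% confidence interval, since readers will otherwise assume the usual median and quartiles.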
