R – “Adding missing grouping variables” message in dplyr in R

dplyr, r

I have a portion of my script that was running fine before, but it has recently started producing an odd message, after which many of my other functions do not work properly. I am trying to select the 8th and 23rd positions in a ranked list of values for each site, to find the 25th and 75th percentile values for each day of the year at each site over 30 years. My approach was as follows (adapted for the four-line dataset; slice(3) would usually be slice(23) for my full 30-year dataset):

library("dplyr")

mydata <- structure(list(station_number = structure(c(1L, 1L, 1L, 1L), .Label = "01AD002", class = "factor"), 
year = 1981:1984, month = c(1L, 1L, 1L, 1L), day = c(1L, 
1L, 1L, 1L), value = c(113, 8.329999924, 15.60000038, 149
)), .Names = c("station_number", "year", "month", "day", "value"), class = "data.frame", row.names = c(NA, -4L))

  value <- mydata$value
  qu25 <- mydata %>% 
          group_by(month, day, station_number) %>% 
          arrange(desc(value)) %>% 
          slice(3) %>% 
          select(value)

Before, I would be left with a table that had one value per site to describe the 25th percentile (since the arrange function seems to order them highest to lowest). However, now when I run these lines, I get a message:

Adding missing grouping variables: `month`, `day`, `station_number`

This message doesn’t make sense to me, as the grouping variables are clearly present in my table. Also, again, this was working fine until recently. I have tried:

detach("plyr"), since I have it loaded before dplyr
dplyr::group_by, placing this directly in the group_by line
uninstalling and reinstalling dplyr, although this was for another issue I was having

Any idea why I might be receiving this message and why it may have stopped working?

Thanks for any help.

Update: Added dput example with one site, but values for January 1st for multiple years. The hope would be that the positional value is returned once grouped, for instance slice(3) would hopefully return the 15.6 value for this smaller subset.

Best Answer

For consistency's sake, dplyr always keeps the grouping variables of a grouped data frame, so they are added back when select(value) is executed. Calling ungroup() before select() should resolve it:

qu25 <- mydata %>% 
  group_by(month, day, station_number) %>%
  arrange(desc(value)) %>% 
  slice(2) %>% 
  ungroup() %>%
  select(value)

This returns the requested result without the message:

> mydata %>% 
+   group_by(month, day, station_number) %>%
+   arrange(desc(value)) %>% 
+   slice(2) %>% 
+   ungroup() %>%
+   select(value)
# A tibble: 1 x 1
  value
  <dbl>
1   113
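
As a minimal sketch of the behaviour described above (assuming a reasonably recent dplyr is installed; the names `ranked`, `kept`, `only_value` and `third_largest` are introduced here purely for illustration), selecting from a grouped data frame keeps the grouping columns, while ungrouping first, or extracting the column with pull(), returns only the value:

```r
library(dplyr)

# Same four rows as the dput() example in the question
mydata <- data.frame(
  station_number = factor(rep("01AD002", 4)),
  year  = 1981:1984,
  month = 1L,
  day   = 1L,
  value = c(113, 8.329999924, 15.60000038, 149)
)

ranked <- mydata %>%
  group_by(month, day, station_number) %>%
  arrange(desc(value)) %>%
  slice(3)

# select() on a grouped data frame keeps the grouping columns
# (this is the step that prints "Adding missing grouping variables")
kept <- ranked %>% select(value)
names(kept)

# Ungrouping first leaves only the selected column
only_value <- ranked %>% ungroup() %>% select(value)
names(only_value)

# pull() extracts the column as a plain vector
third_largest <- ranked %>% pull(value)
third_largest
```

If a plain vector is all that is needed downstream, pull() may be the simpler choice, since it sidesteps the grouping question entirely.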

Related Solutions

R – Grouping functions (tapply, by, aggregate) and the *apply family

R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.

Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.

This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.

apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.

 # Two dimensional matrix
 M <- matrix(seq(1,16), 4, 4)

 # apply min to rows
 apply(M, 1, min)
 [1] 1 2 3 4

 # apply max to columns
 apply(M, 2, max)
 [1]  4  8 12 16

 # 3 dimensional array
 M <- array( seq(32), dim = c(4,4,2))

 # Apply sum across each M[*, , ] - i.e Sum across 2nd and 3rd dimension
 apply(M, 1, sum)
 # Result is one-dimensional
 [1] 120 128 136 144

 # Apply sum across each M[*, *, ] - i.e Sum across 3rd dimension
 apply(M, c(1,2), sum)
 # Result is two-dimensional
      [,1] [,2] [,3] [,4]
 [1,]   18   26   34   42
 [2,]   20   28   36   44
 [3,]   22   30   38   46
 [4,]   24   32   40   48

If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.
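
As a small illustration of that advice (a sketch only; `row_totals` and `col_means` are names chosen here), the dedicated functions give the same numbers as the corresponding apply() calls:

```r
M <- matrix(seq(1, 16), 4, 4)

# Dedicated helpers for 2D sums and means
row_totals <- rowSums(M)
col_means  <- colMeans(M)

# Same values as the generic apply() versions
all.equal(row_totals, apply(M, 1, sum))
all.equal(col_means, apply(M, 2, mean))
```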

lapply - When you want to apply a function to each element of a list in turn and get a list back.

This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.
```
 x <- list(a = 1, b = 1:3, c = 10:100) 
 lapply(x, FUN = length) 
 $a 
 [1] 1
 $b 
 [1] 3
 $c 
 [1] 91
 lapply(x, FUN = sum) 
 $a 
 [1] 1
 $b 
 [1] 6
 $c 
 [1] 5005
```
sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.

If you find yourself typing unlist(lapply(...)), stop and consider sapply.
```
 x <- list(a = 1, b = 1:3, c = 10:100)
 # Compare with above; a named vector, not a list 
 sapply(x, FUN = length)  
 a  b  c   
 1  3 91

 sapply(x, FUN = sum)   
 a    b    c    
 1    6 5005 
```
In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:
```
 sapply(1:5,function(x) rnorm(3,x))
```
If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:
```
 sapply(1:5,function(x) matrix(x,2,2))
```
Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:
```
 sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
```
Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
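
To make that contingency concrete, here is a short sketch (variable names are illustrative) contrasting results of equal and unequal length:

```r
# Equal-length results: simplified to a 3 x 2 matrix
same_len <- sapply(1:2, function(i) rep(i, 3))
dim(same_len)

# Unequal-length results: no simplification, a plain list comes back
diff_len <- sapply(1:3, function(i) seq_len(i))
is.list(diff_len)
```

Because the return type depends on the data, sapply can be surprising inside functions; that is one reason to reach for vapply or lapply when the shape of the result matters.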

vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code or want more type safety.

For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.

 x <- list(a = 1, b = 1:3, c = 10:100)
 #Note that since the advantage here is mainly speed, this
 # example is only for illustration. We're telling R that
 # everything returned by length() should be an integer of 
 # length 1. 
 vapply(x, FUN = length, FUN.VALUE = 0L) 
 a  b  c  
 1  3 91
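
The type-safety side can be sketched as follows (the tryCatch wrapper and the names `lens` and `res` are just for illustration): when a function returns something that does not match FUN.VALUE, vapply stops with an error rather than quietly returning a different shape.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# length() always returns a single integer, so this succeeds
lens <- vapply(x, FUN = length, FUN.VALUE = integer(1))

# range() returns two values, which violates FUN.VALUE = integer(1);
# vapply() raises an error instead of changing the shape of the result
res <- tryCatch(
  vapply(x, FUN = range, FUN.VALUE = integer(1)),
  error = function(e) "error"
)
res
```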

mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.

This is multivariate in the sense that your function must accept multiple arguments.
```
 #Sums the 1st elements, the 2nd elements, etc. 
 mapply(sum, 1:5, 1:5, 1:5) 
 [1]  3  6  9 12 15
 #To do rep(1,4), rep(2,3), etc.
 mapply(rep, 1:4, 4:1)   
 [[1]]
 [1] 1 1 1 1

 [[2]]
 [1] 2 2 2

 [[3]]
 [1] 3 3

 [[4]]
 [1] 4
```

Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.

 Map(sum, 1:5, 1:5, 1:5)
 [[1]]
 [1] 3

 [[2]]
 [1] 6

 [[3]]
 [1] 9

 [[4]]
 [1] 12

 [[5]]
 [1] 15

rapply - For when you want to apply a function to each element of a nested list structure, recursively.

To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:

 # Append ! to string, otherwise increment
 myFun <- function(x){
     if(is.character(x)){
       return(paste(x,"!",sep=""))
     }
     else{
       return(x + 1)
     }
 }

 #A nested list structure
 l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"), 
           b = 3, c = "Yikes", 
           d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))


 # Result is named vector, coerced to character          
 rapply(l, myFun)

 # Result is a nested list like l, with values altered
 rapply(l, myFun, how="replace")
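
As an alternative sketch to the type check inside myFun, rapply's own classes argument can restrict the function to leaves of a given class; with how = "replace", the other leaves are left unchanged. The name `bumped` is introduced here for illustration:

```r
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

# Increment only the numeric leaves; character leaves are untouched
bumped <- rapply(l, function(x) x + 1, classes = "numeric", how = "replace")

bumped$a$b1     # 3
bumped$a$a1     # "Boo"
bumped$d$b2$b3  # 6
```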

tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.

The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.

A vector:
```
 x <- 1:20
```
A factor (of the same length!) defining groups:
```
 y <- factor(rep(letters[1:5], each = 4))
```
Add up the values in x within each subgroup defined by y:
```
 tapply(x, y, sum)  
  a  b  c  d  e  
 10 26 42 58 74 
```
More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.). Hence its black sheep status.
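
A sketch of the several-factors case (the second factor `z` is made up for this example): passing a list of factors yields one cell per combination of levels.

```r
x <- 1:20
y <- factor(rep(letters[1:5], each = 4))
z <- factor(rep(c("odd", "even"), times = 10))

# One cell per combination of the levels of y and z
sums <- tapply(x, list(y, z), sum)
sums
```

The result is a 5 x 2 matrix whose rows are the levels of y and whose columns are the levels of z.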

Remove rows with all or some NAs (missing values) in data.frame

Also check complete.cases:

> final[complete.cases(final), ]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2

na.omit is nicer for just removing all rows that contain NAs. complete.cases allows partial selection by including only certain columns of the data frame:

> final[complete.cases(final[ , 5:6]),]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

Your solution can't work. If you insist on using is.na, then you have to do something like:

> final[rowSums(is.na(final[ , 5:6])) == 0, ]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

but using complete.cases is much clearer, and faster.
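
Since the original `final` data frame is not shown in full, here is a self-contained sketch on hypothetical data of a similar shape (the gene names and values are invented for illustration) comparing the approaches:

```r
# Hypothetical stand-in for `final`; values are made up
final <- data.frame(
  gene = c("g1", "g2", "g3", "g4"),
  hsap = c(0, 0, NA, 0),
  mmul = c(NA, 2, 1, NA),
  rnor = c(1, 2, 3, 1),
  cfam = c(NA, 2, 2, 2)
)

# Keep rows with no NA in any column
all_complete <- final[complete.cases(final), ]

# Keep rows with no NA in the rnor and cfam columns only
partly_complete <- final[complete.cases(final[, c("rnor", "cfam")]), ]

# The is.na()/rowSums() version selects the same rows
via_is_na <- final[rowSums(is.na(final[, c("rnor", "cfam")])) == 0, ]

# na.omit() keeps the same rows as the first call
omitted <- na.omit(final)
```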
