R – Arrange a grouped_df by group variable not working

dplyr, grouped-table, r

I have a data.frame that contains client names, years, and several revenue numbers from each year.

df <- data.frame(client = rep(c("Client A","Client B", "Client C"),3), 
                 year = rep(c(2014,2013,2012), each=3), 
                 rev = rep(c(10,20,30),3)
                )

I want to end up with a data.frame that aggregates the revenue by client and year. I then want to sort the data.frame by year then by descending revenue.

library(dplyr)
df1 <- df %>% 
        group_by(client, year) %>%
        summarise(tot = sum(rev)) %>%
        arrange(year, desc(tot))

However, with the code above, arrange() doesn't change the order of the grouped data.frame at all. When I run the code below, which coerces the result to a plain data.frame first, it works.

library(dplyr)
df1 <- df %>% 
        group_by(client, year) %>%
        summarise(tot = sum(rev)) %>%
        data.frame() %>%
        arrange(year, desc(tot))

Am I missing something or will I need to do this every time when trying to arrange a grouped_df by a grouped variable?

R Version: 3.1.1
dplyr package version: 0.3.0.2

EDIT 11/13/2017:
As noted by lucacerone, beginning with dplyr 0.5, arrange once again ignores groups when sorting. So my original code now works in the way I initially expected it would.

arrange() once again ignores grouping, reverting back to the behaviour of dplyr 0.3 and earlier. This makes arrange() inconsistent with other dplyr verbs, but I think this behaviour is generally more useful. Regardless, it’s not going to change again, as more changes will just cause more confusion.

Best Answer

Try switching the order of your group_by statement:

df %>% 
  group_by(year, client) %>%
  summarise(tot = sum(rev)) %>%
  arrange(year, desc(tot))

I think arrange() is ordering within groups. After summarise(), the last grouping variable is dropped, so in your first example the result is still grouped by client and arrange() sorts rows within each client group. Switching the order to group_by(year, client) seems to fix it because client is then the variable that gets dropped, leaving the result grouped by year.
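One way to check this reasoning is to inspect which grouping survives summarise(). The sketch below assumes dplyr 1.0.0 or later, where group_vars() and the .groups argument are both available; .groups = "drop_last" only spells out the default behaviour and silences the informational message. The names by_client_year and by_year_client are just for illustration.

```r
library(dplyr)

df <- data.frame(client = rep(c("Client A", "Client B", "Client C"), 3),
                 year = rep(c(2014, 2013, 2012), each = 3),
                 rev = rep(c(10, 20, 30), 3),
                 stringsAsFactors = FALSE)

# Grouped by client, then year: summarise() drops year, client remains.
by_client_year <- df %>%
  group_by(client, year) %>%
  summarise(tot = sum(rev), .groups = "drop_last")

# Grouped by year, then client: summarise() drops client, year remains.
by_year_client <- df %>%
  group_by(year, client) %>%
  summarise(tot = sum(rev), .groups = "drop_last")

group_vars(by_client_year)  # "client"
group_vars(by_year_client)  # "year"
```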

Alternatively, call ungroup() before arranging:

df %>% 
  group_by(client, year) %>%
  summarise(tot = sum(rev)) %>%
  ungroup() %>%
  arrange(year, desc(tot))

Edit, @lucacerone: since dplyr 0.5, arrange() no longer sorts within groups, so the original code in the question works as expected. From the release notes, under "Breaking changes":

arrange() once again ignores grouping, reverting back to the behaviour of dplyr 0.3 and earlier. This makes arrange() inconsistent with other dplyr verbs, but I think this behaviour is generally more useful. Regardless, it’s not going to change again, as more changes will just cause more confusion.
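For code that should not depend on this history, the choice can be made explicit. A minimal sketch, assuming dplyr 0.7.0 or later, where arrange() accepts a .by_group argument (FALSE by default); the names summed, ignoring_groups, and within_groups are only for illustration.

```r
library(dplyr)

df <- data.frame(client = rep(c("Client A", "Client B", "Client C"), 3),
                 year = rep(c(2014, 2013, 2012), each = 3),
                 rev = rep(c(10, 20, 30), 3),
                 stringsAsFactors = FALSE)

# The result is still grouped by client after summarise().
# (Recent dplyr versions print an informational message about this.)
summed <- df %>%
  group_by(client, year) %>%
  summarise(tot = sum(rev))

# Default: grouping is ignored; rows are ordered by year, then descending tot.
ignoring_groups <- summed %>% arrange(year, desc(tot))

# Opt in: rows are ordered by the grouping variable (client) first,
# then by year and descending tot within each client.
within_groups <- summed %>% arrange(year, desc(tot), .by_group = TRUE)
```

With the sample data, ignoring_groups starts with the 2012 rows (Client C, B, A), while within_groups starts with all three Client A rows.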

Related Solutions

R – How to sum a variable by group

Using aggregate:

aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
  Category  x
1    First 30
2   Second  5
3    Third 34

In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be incorporated via cbind:

aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...

(Embedding @thelatemail's comment:) aggregate has a formula interface too:

aggregate(Frequency ~ Category, x, sum)

Or if you want to aggregate multiple columns, you could use the . notation (works for one column too)

aggregate(. ~ Category, x, sum)

or tapply:

tapply(x$Frequency, x$Category, FUN=sum)
 First Second  Third 
    30      5     34

Using this data:

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                      "Third", "Third", "Second")), 
                    Frequency=c(10,15,5,2,14,20,3))
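Because the data is defined after the calls that use it, here is the same material in runnable order, as a sketch. Metric2 is a made-up column (it is not in the original data) added only to illustrate the cbind form; the result names by_formula, by_tapply, and by_both are likewise just for illustration.

```r
x <- data.frame(Category = factor(c("First", "First", "First", "Second",
                                    "Third", "Third", "Second")),
                Frequency = c(10, 15, 5, 2, 14, 20, 3))

# Formula interface: one row per Category with the summed Frequency.
by_formula <- aggregate(Frequency ~ Category, data = x, FUN = sum)

# tapply returns a named array rather than a data.frame.
by_tapply <- tapply(x$Frequency, x$Category, FUN = sum)

# Hypothetical second metric, for illustration only.
x$Metric2 <- c(1, 2, 3, 4, 5, 6, 7)

# cbind on the left-hand side sums several columns in one call.
by_both <- aggregate(cbind(Frequency, Metric2) ~ Category, data = x, FUN = sum)
```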

R – the advantage of using DataContractAttribute over SerializableAttribute

DataContractAttribute gives you more control over what gets sent over the wire, so you can opt to only send the necessary fields of a given entity. Serializable uses platform serialization, which assumes .NET and the same (or similar) versions of the types on both ends of the wire; it (usually) serializes all the private members, state, etc. DCS is intended for a lightweight XML-ish representation that you can have some control over, and XmlSerializer is for an XML format that you can have very fine control over (attribute data, etc.).
