You can use df.loc[i]: assigning to df.loc[i] sets the row with index i to the values you specify, appending a new row if that index does not yet exist.
>>> import pandas as pd
>>> from numpy.random import randint
>>> df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
>>> for i in range(5):
...     df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))
>>> df
lib qty1 qty2
0 name0 3 3
1 name1 2 4
2 name2 2 8
3 name3 2 1
4 name4 9 6
The column names (which are strings) cannot be sliced in the manner you tried.
Here you have a couple of options. If you know from context which variables you want to slice out, you can return a frame of just those columns by passing a list into the __getitem__ syntax (the []'s).
df1 = df[['a', 'b']]
Alternatively, if you need to index them numerically rather than by name (say, your code should do this automatically without knowing the names of the first two columns), then you can do this instead:
df1 = df.iloc[:, 0:2] # Remember that Python does not slice inclusive of the ending index.
Additionally, you should familiarize yourself with the difference between a view into a Pandas object and a copy of that object. The first method above returns a new in-memory copy of the desired sub-object (the desired slices). Some Pandas indexing conventions, however, return a new variable that simply refers to the same chunk of memory as the slice of the original object; this can happen with the second way of indexing, in which case changing what you think is a separate sliced object can alter the original. To be safe, append the .copy() method to get a regular copy. Always good to be on the lookout for this.
df1 = df.iloc[:, 0:2].copy() # To avoid the case where changing df1 also changes df
To use iloc, you need to know the column positions (or indices). Since column positions may change, instead of hard-coding indices you can use iloc together with the get_loc method of the dataframe's columns Index to look them up:
col_idx = {c: df.columns.get_loc(c) for c in df.columns}
Now you can use this dictionary to access columns by name while indexing with iloc.
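As a hedged sketch of that lookup in use (the frame below reuses the lib/qty1/qty2 column names from the earlier example; the data itself is made up):

```python
import pandas as pd

# Small frame with the earlier example's columns (values are arbitrary).
df = pd.DataFrame({'lib': ['name0', 'name1'], 'qty1': [3, 2], 'qty2': [3, 4]})

# Map each column name to its integer position.
col_idx = {c: df.columns.get_loc(c) for c in df.columns}
# col_idx is {'lib': 0, 'qty1': 1, 'qty2': 2}

# Select columns by name while still indexing positionally with iloc.
df1 = df.iloc[:, [col_idx['qty1'], col_idx['qty2']]].copy()
# df1.columns.tolist() is ['qty1', 'qty2']
```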
Question 1
How can I perform aggregation with Pandas?
See the expanded Pandas aggregation documentation.
Aggregating functions are those that reduce the dimension of the returned object; the output Series/DataFrame has the same number of rows as the original or fewer. Some common aggregating functions are mean, sum, size, count, std, var, sem, first, last, min, and max.
Aggregation by filtered columns and Cython-implemented functions:

An aggregating function is applied to every column not named in the groupby call, here the A and B columns. You can also restrict the aggregation to certain columns by listing them after the groupby call; DataFrameGroupBy.agg gives the same results.

To apply multiple functions to one column, pass a list of tuples: the names of the new columns paired with the aggregating functions. Passing multiple functions for several columns works the same way; you then get a MultiIndex in the columns. To flatten the MultiIndex into ordinary column names, use map with join. Another solution is to pass a list of aggregating functions, flatten the MultiIndex, and rename the remaining columns with str.replace.

To specify a different aggregating function for each column, pass a dictionary. You can pass a custom function, too.
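The answer's original code samples did not survive; as a hedged sketch with an assumed frame (columns A, B and group column C are made up for illustration), the patterns above might look like:

```python
import pandas as pd

# Hypothetical data; the question's real frame is unknown.
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40],
    'C': ['a', 'a', 'b', 'b'],
})
g = df.groupby('C')

# 1. Aggregate every non-grouped column (here A and B).
all_sums = g.sum()

# 2. Aggregate only some columns, listed after groupby.
a_sums = df.groupby('C')['A'].sum()

# 3. Same results via DataFrameGroupBy.agg.
a_sums_agg = df.groupby('C')['A'].agg('sum')

# 4. Multiple functions for one column: (new_name, function) tuples.
named = df.groupby('C')['A'].agg([('total', 'sum'), ('average', 'mean')])

# 5. Multiple functions for several columns -> MultiIndex columns...
multi = g.agg(['sum', 'mean'])
# ...flattened into plain names with map + join.
multi.columns = multi.columns.map('_'.join)

# 6. A different function per column, via a dictionary.
per_col = g.agg({'A': 'sum', 'B': 'mean'})

# 7. A custom function.
spread = df.groupby('C')['A'].agg(lambda x: x.max() - x.min())
```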
Question 2
No DataFrame after aggregation! What happened?
Aggregation by two or more columns:

First check the Index and the type of the Pandas object: grouping by two or more columns returns a Series with a MultiIndex. There are two solutions for getting the MultiIndex Series into columns: pass as_index=False to groupby, or call Series.reset_index.

If you group by one column, you get a Series with an ordinary Index, and the solution is the same as for the MultiIndex Series.
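As a hedged sketch (the frame and column names below are assumptions), the two fixes might look like:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'C': ['a', 'a', 'b', 'b'],
    'D': ['x', 'y', 'x', 'y'],
})

# Grouping by two columns returns a Series with a MultiIndex.
s = df.groupby(['C', 'D'])['A'].sum()

# Solution 1: as_index=False keeps the group keys as regular columns.
out1 = df.groupby(['C', 'D'], as_index=False)['A'].sum()

# Solution 2: reset_index converts the MultiIndex back into columns.
out2 = s.reset_index()
```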
Question 3
How can I aggregate mainly string columns (to lists, tuples, or strings with a separator)?

Instead of an aggregating function, it is possible to pass list, tuple, or set to convert the column. An alternative is to use GroupBy.apply.

For converting to strings with a separator, use .join, but only if it is a string column. If it is a numeric column, use a lambda function with astype to convert the values to strings; another solution is to convert the column to strings before the groupby.

For converting all columns, don't pass a list of column(s) after groupby. Column D then disappears from the output because of the automatic exclusion of 'nuisance' columns: all numeric columns are dropped. So it is necessary to convert all columns to strings first, and then all columns are kept.
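A hedged sketch of these string aggregations (the frame and column names are assumptions):

```python
import pandas as pd

# Hypothetical data: one string column A, one numeric column D.
df = pd.DataFrame({
    'A': ['p', 'q', 'r', 's'],
    'D': [1, 2, 3, 4],
    'C': ['a', 'a', 'b', 'b'],
})

# Convert a column to lists (tuple or set work the same way).
lists = df.groupby('C')['A'].agg(list)

# Join a string column with a separator.
joined = df.groupby('C')['A'].agg(', '.join)

# For a numeric column, convert to strings inside a lambda first.
nums = df.groupby('C')['D'].agg(lambda x: ', '.join(x.astype(str)))
```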
Question 4
How can I aggregate counts?
The function GroupBy.size returns the size of each group, whereas GroupBy.count excludes missing values; use count over multiple columns to count their non-missing values.

A related function is Series.value_counts. It returns the counts of unique values in descending order, so that the first element is the most frequently occurring one, and it excludes NaN values by default. If you want the same output as groupby + size, add Series.sort_index.
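A hedged sketch of the counting functions (frame and names are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, np.nan, 3, 4],
    'C': ['a', 'a', 'b', 'b'],
})

# size counts all rows per group, including missing values.
sizes = df.groupby('C')['A'].size()

# count skips missing values.
counts = df.groupby('C')['A'].count()

# value_counts gives group sizes sorted by frequency, NaN excluded;
# sort_index makes the order match groupby + size.
vc = df['C'].value_counts().sort_index()
```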
Question 5
How can I create a new column filled by aggregated values?
The method GroupBy.transform returns an object that is indexed the same (same size) as the one being grouped, so an aggregated value can be assigned back to every row as a new column. See the Pandas documentation for more information.
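A hedged sketch of transform (the frame and the new column's name are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'C': ['a', 'a', 'b', 'b'],
})

# transform broadcasts each group's aggregate back to every row,
# so the result aligns with the original frame and can be assigned
# directly as a new column.
df['A_group_sum'] = df.groupby('C')['A'].transform('sum')
```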