The column names (which are strings) cannot be sliced in the manner you tried.
Here you have a couple of options. If you know from context which variables you want to slice out, you can just return a view of only those columns by passing a list into the __getitem__
syntax (the []'s).
df1 = df[['a', 'b']]
Alternatively, if it matters to index them numerically and not by their name (say your code should automatically do this without knowing the names of the first two columns) then you can do this instead:
df1 = df.iloc[:, 0:2] # Remember that Python does not slice inclusive of the ending index.
Additionally, you should familiarize yourself with the idea of a view into a Pandas object vs. a copy of that object. The first of the above methods will return a new copy in memory of the desired sub-object (the desired slices).
Sometimes, however, there are indexing conventions in Pandas that don't do this and instead give you a new variable that just refers to the same chunk of memory as the sub-object or slice in the original object. This will happen with the second way of indexing, so you can modify it with the .copy()
method to get a regular copy. When this happens, changing what you think is the sliced object can sometimes alter the original object. Always good to be on the look out for this.
df1 = df.iloc[0, 0:2].copy() # To avoid the case where changing df1 also changes df
To use iloc
, you need to know the column positions (or indices). As the column positions may change, instead of hard-coding indices, you can use iloc
along with get_loc
function of columns
method of dataframe object to obtain column indices.
{df.columns.get_loc(c): c for idx, c in enumerate(df.columns)}
Now you can use this dictionary to access columns through names and using iloc
.
RENAME SPECIFIC COLUMNS
Use the df.rename()
function and refer the columns to be renamed. Not all the columns have to be renamed:
df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
# Or rename the existing DataFrame (rather than creating a copy)
df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)
Minimal Code Example
df = pd.DataFrame('x', index=range(3), columns=list('abcde'))
df
a b c d e
0 x x x x x
1 x x x x x
2 x x x x x
The following methods all work and produce the same output:
df2 = df.rename({'a': 'X', 'b': 'Y'}, axis=1) # new method
df2 = df.rename({'a': 'X', 'b': 'Y'}, axis='columns')
df2 = df.rename(columns={'a': 'X', 'b': 'Y'}) # old method
df2
X Y c d e
0 x x x x x
1 x x x x x
2 x x x x x
Remember to assign the result back, as the modification is not-inplace. Alternatively, specify inplace=True
:
df.rename({'a': 'X', 'b': 'Y'}, axis=1, inplace=True)
df
X Y c d e
0 x x x x x
1 x x x x x
2 x x x x x
From v0.25, you can also specify errors='raise'
to raise errors if an invalid column-to-rename is specified. See v0.25 rename()
docs.
REASSIGN COLUMN HEADERS
Use df.set_axis()
with axis=1
and inplace=False
(to return a copy).
df2 = df.set_axis(['V', 'W', 'X', 'Y', 'Z'], axis=1, inplace=False)
df2
V W X Y Z
0 x x x x x
1 x x x x x
2 x x x x x
This returns a copy, but you can modify the DataFrame in-place by setting inplace=True
(this is the default behaviour for versions <=0.24 but is likely to change in the future).
You can also assign headers directly:
df.columns = ['V', 'W', 'X', 'Y', 'Z']
df
V W X Y Z
0 x x x x x
1 x x x x x
2 x x x x x
Best Answer
This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.
In particular, here's what this post will go through:
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
What this post (and other posts by me on this thread) will not go through:
Enough talk - just show me how to use
merge
!Setup & Basics
For the sake of simplicity, the key column has the same name (for now).
An INNER JOIN is represented by
To perform an INNER JOIN, call
merge
on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.This returns only rows from
left
andright
which share a common key (in this example, "B" and "D).A LEFT OUTER JOIN, or LEFT JOIN is represented by
This can be performed by specifying
how='left'
.Carefully note the placement of NaNs here. If you specify
how='left'
, then only keys fromleft
are used, and missing data fromright
is replaced by NaN.And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...
...specify
how='right'
:Here, keys from
right
are used, and missing data fromleft
is replaced by NaN.Finally, for the FULL OUTER JOIN, given by
specify
how='outer'
.This uses the keys from both frames, and NaNs are inserted for missing rows in both.
The documentation summarizes these various merges nicely:
Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs
If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps.
For LEFT-Excluding JOIN, represented as
Start by performing a LEFT OUTER JOIN and then filtering (excluding!) rows coming from
left
only,Where,
And similarly, for a RIGHT-Excluding JOIN,
Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),
You can do this in similar fashion—
Different names for key columns
If the key columns are named differently—for example,
left
haskeyLeft
, andright
haskeyRight
instead ofkey
—then you will have to specifyleft_on
andright_on
as arguments instead ofon
:Avoiding duplicate key column in output
When merging on
keyLeft
fromleft
andkeyRight
fromright
, if you only want either of thekeyLeft
orkeyRight
(but not both) in the output, you can start by setting the index as a preliminary step.Contrast this with the output of the command just before (that is, the output of
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
), you'll noticekeyLeft
is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.Merging only a single column from one of the
DataFrames
For example, consider
If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:
If you're doing a LEFT OUTER JOIN, a more performant solution would involve
map
:As mentioned, this is similar to, but faster than
Merging on multiple columns
To join on more than one column, specify a list for
on
(orleft_on
andright_on
, as appropriate).Or, in the event the names are different,
Other useful
merge*
operations and functionsMerging a DataFrame with Series on index: See this answer.
Besides
merge
,DataFrame.update
andDataFrame.combine_first
are also used in certain cases to update one DataFrame with another.pd.merge_ordered
is a useful function for ordered JOINs.pd.merge_asof
(read: merge_asOf) is useful for approximate joins.This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on
merge
,join
, andconcat
as well as the links to the function specifications.Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins *
Index-based joins
Generalizing to multiple DataFrames
Cross join
*You are here.