There are two built-in functions that help you identify the type of an object. You can use type() if you need the exact type of an object, and isinstance() to check an object's type against another type (or a tuple of types). Usually, you want to use isinstance(), since it is robust and also supports type inheritance.
To get the actual type of an object, you use the built-in type() function. Passing an object as the only argument returns the type object of that object:
>>> type([]) is list
True
>>> type({}) is dict
True
>>> type('') is str
True
>>> type(0) is int
True
This of course also works for custom types:
>>> class Test1(object):
...     pass
>>> class Test2(Test1):
...     pass
>>> a = Test1()
>>> b = Test2()
>>> type(a) is Test1
True
>>> type(b) is Test2
True
Note that type() will only return the immediate type of the object, but won't be able to tell you about type inheritance.
>>> type(b) is Test1
False
To cover that, you should use the isinstance() function. This of course also works for built-in types:
>>> isinstance(b, Test1)
True
>>> isinstance(b, Test2)
True
>>> isinstance(a, Test1)
True
>>> isinstance(a, Test2)
False
>>> isinstance([], list)
True
>>> isinstance({}, dict)
True
isinstance() is usually the preferred way to check the type of an object because it will also accept derived types. So unless you actually need the exact type object (for whatever reason), using isinstance() is preferred over type().
The second parameter of isinstance() also accepts a tuple of types, so it's possible to check for multiple types at once. isinstance() will then return True if the object is an instance of any of those types:
>>> isinstance([], (tuple, list, set))
True
The column names (which are strings) cannot be sliced in the manner you tried.
Here you have a couple of options. If you know from context which variables you want to slice out, you can just return a view of only those columns by passing a list into the __getitem__ syntax (the []'s).
df1 = df[['a', 'b']]
Alternatively, if it matters to index them numerically and not by their name (say your code should automatically do this without knowing the names of the first two columns) then you can do this instead:
df1 = df.iloc[:, 0:2] # Remember that Python does not slice inclusive of the ending index.
Additionally, you should familiarize yourself with the idea of a view into a pandas object vs. a copy of that object. The first of the above methods will return a new copy in memory of the desired sub-object (the desired slices).
Sometimes, however, there are indexing conventions in pandas that don't do this and instead give you a new variable that just refers to the same chunk of memory as the sub-object or slice in the original object. This can happen with the second way of indexing, in which case changing what you think is the sliced object can sometimes alter the original object. To avoid this, you can append the .copy() method to get a regular copy. It's always good to be on the lookout for this.
df1 = df.iloc[:, 0:2].copy()  # To avoid the case where changing df1 also changes df
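As a small sketch of the pitfall (the DataFrame and column names here are illustrative, and the exact copy-vs-view behaviour depends on the pandas version and dtype layout):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

maybe_view = df.iloc[:, 0:2]    # may share memory with df
safe = df.iloc[:, 0:2].copy()   # guaranteed independent copy

safe.iloc[0, 0] = 99            # never affects df
# Assigning through maybe_view may (or may not) alter df, and may raise
# a SettingWithCopyWarning, depending on the pandas version.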
To use iloc, you need to know the column positions (or indices). As the column positions may change, instead of hard-coding indices you can use iloc along with the get_loc method of the DataFrame's columns attribute to obtain column indices.
{df.columns.get_loc(c): c for c in df.columns}  # maps column position -> column name
Now you can use this dictionary to access columns through their names together with iloc.
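A minimal sketch of that pattern (the DataFrame and column names are illustrative):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Map each column's current position to its name:
pos_to_name = {df.columns.get_loc(c): c for c in df.columns}  # {0: 'a', 1: 'b', 2: 'c'}

# Slice the 'a' and 'b' columns by position without hard-coding 0 and 1:
df1 = df.iloc[:, [df.columns.get_loc("a"), df.columns.get_loc("b")]]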
You have four main options for converting types in pandas:
1. to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)
2. astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).
3. infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.
4. convert_dtypes() - convert DataFrame columns to the "best possible" dtype that supports pd.NA (pandas' object to indicate a missing value).
Read on for more detailed explanations and usage of each of these methods.
1. to_numeric()
The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric(). This function will try to change non-numeric objects (such as strings) into integers or floating-point numbers as appropriate.
Basic usage
The input to to_numeric() is a Series or a single column of a DataFrame. A new Series is returned, so remember to assign this output to a variable or column name to continue using it:
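For example (the sample values here are illustrative):

>>> import pandas as pd
>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"])  # mixed string and numeric values
>>> pd.to_numeric(s)  # everything is converted to float64
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64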
You can also use it to convert multiple columns of a DataFrame via the apply() method; as long as your values can all be converted, that's probably all you need:
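A sketch with hypothetical column names:

>>> df = pd.DataFrame({"a": ["1", "2", "3"], "b": ["4.5", "5.5", "6.0"]})
>>> df = df.apply(pd.to_numeric)  # convert every column of the DataFrame
>>> df.dtypes
a      int64
b    float64
dtype: object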
Error handling
But what if some values can't be converted to a numeric type? to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.
Here's an example using a Series of strings s which has the object dtype. The default behaviour is to raise if it can't convert a value. In this case, it can't cope with the string 'pandas':
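A sketch (the sample values are illustrative, and the exact traceback wording varies between pandas versions):

>>> s = pd.Series(["1", "2", "4.7", "pandas", "10"])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object
>>> pd.to_numeric(s)
Traceback (most recent call last):
 ...
ValueError: Unable to parse string "pandas" at position 3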
Rather than fail, we might want 'pandas' to be considered a missing/bad numeric value. We can coerce invalid values to NaN using the errors keyword argument:
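Continuing with the Series s above:

>>> pd.to_numeric(s, errors="coerce")  # invalid values become NaN
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64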
The third option for errors is just to ignore the operation if an invalid value is encountered, leaving your input unchanged. This last option is particularly useful when you want to convert your entire DataFrame but don't know which of your columns can be converted reliably to a numeric type. In that case, just write:
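A sketch of that whole-DataFrame pattern (note that errors='ignore' has been deprecated in recent pandas versions):

>>> df = df.apply(pd.to_numeric, errors="ignore")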
The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
Downcasting
By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).
That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32 or int8? to_numeric() gives you the option to downcast to 'integer', 'signed', 'unsigned', or 'float'. Here's an example for a simple series s of integer type:
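Using illustrative values:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64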
Downcasting to 'integer' uses the smallest possible integer that can hold the values:
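For instance, with the s above:

>>> pd.to_numeric(s, downcast="integer")  # int8 is the smallest type that holds 1, 2 and -7
0    1
1    2
2   -7
dtype: int8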
Downcasting to 'float' similarly picks a smaller than normal floating type:
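A sketch of the expected result:

>>> pd.to_numeric(s, downcast="float")  # float32 rather than the default float64
0    1.0
1    2.0
2   -7.0
dtype: float32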
2. astype()
The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try to go from one type to any other.
Basic usage
Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype). Call the method on the object you want to convert and astype() will try to convert it for you:
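A minimal sketch (the target dtype here is arbitrary):

>>> import numpy as np
>>> s = pd.Series([1, 2, 3])
>>> s.astype(np.int16)  # cast the default int64 down to int16
0    1
1    2
2    3
dtype: int16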
Notice I said "try" - if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error. For example, if you have a NaN or inf value, you'll get an error trying to convert it to an integer.
As of pandas 0.20.0, this error can be suppressed by passing errors='ignore'. Your original object will be returned untouched.
Be careful
astype() is powerful, but it will sometimes convert values "incorrectly". For example, suppose you have a Series of small integers and, to save memory, convert it to an unsigned 8-bit type:
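A sketch of the pitfall (sample values are illustrative; on most pandas/NumPy versions the cast wraps silently):

>>> s = pd.Series([1, 2, -7])
>>> s.astype(np.uint8)  # -7 cannot be represented as an unsigned 8-bit integer
0      1
1      2
2    249
dtype: uint8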
The conversion worked, but the -7 was wrapped round to become 249 (i.e. 2⁸ - 7)!
Trying to downcast using pd.to_numeric(s, downcast='unsigned') instead could help prevent this error.
3. infer_objects()
Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).
For example, here's a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:
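A sketch of such a DataFrame (the values are illustrative):

>>> df = pd.DataFrame({"a": [7, 1, 5], "b": ["3", "2", "1"]}, dtype="object")
>>> df.dtypes
a    object
b    object
dtype: object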
Using infer_objects(), you can change the type of column 'a' to int64:
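Continuing with the df above:

>>> df = df.infer_objects()
>>> df.dtypes
a     int64
b    object
dtype: object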
Column 'b' has been left alone since its values were strings, not integers. If you wanted to try to force the conversion of both columns to an integer type, you could use df.astype(int) instead.
4. convert_dtypes()
Version 1.0 and above includes a method convert_dtypes() to convert Series and DataFrame columns to the best possible dtype that supports the pd.NA missing value.
Here "best possible" means the type most suited to hold the values. For example, this is a pandas integer type if all of the values are integers (or missing values): an object column of Python integer objects is converted to Int64, and a column of NumPy int32 values will become the pandas dtype Int32.
With our object DataFrame df, we get the following result:
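Re-creating the object DataFrame from the infer_objects() example (dtype display names may vary slightly across pandas versions):

>>> df = pd.DataFrame({"a": [7, 1, 5], "b": ["3", "2", "1"]}, dtype="object")
>>> df.convert_dtypes().dtypes
a     Int64
b    string
dtype: object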
Since column 'a' held integer values, it was converted to the Int64 type (which is capable of holding missing values, unlike int64). Column 'b' contained string objects, so it was changed to pandas' string dtype.
By default, this method will infer the type from object values in each column. We can change this by passing infer_objects=False:
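Continuing with the same object DataFrame:

>>> df.convert_dtypes(infer_objects=False).dtypes
a    object
b    string
dtype: object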
Now column 'a' remained an object column: pandas knows it can be described as an 'integer' column (internally it ran infer_dtype) but didn't infer exactly which dtype of integer it should have, so it did not convert it. Column 'b' was again converted to 'string' dtype as it was recognised as holding 'string' values.