Maybe a bit of example code will help: Notice the difference in the call signatures of foo
, class_foo
and static_foo
:
class A(object):
def foo(self, x):
print(f"executing foo({self}, {x})")
@classmethod
def class_foo(cls, x):
print(f"executing class_foo({cls}, {x})")
@staticmethod
def static_foo(x):
print(f"executing static_foo({x})")
a = A()
Below is the usual way an object instance calls a method. The object instance, a
, is implicitly passed as the first argument.
a.foo(1)
# executing foo(<__main__.A object at 0xb7dbef0c>, 1)
With classmethods, the class of the object instance is implicitly passed as the first argument instead of self
.
a.class_foo(1)
# executing class_foo(<class '__main__.A'>, 1)
You can also call class_foo
using the class. In fact, if you define something to be
a classmethod, it is probably because you intend to call it from the class rather than from a class instance. A.foo(1)
would have raised a TypeError, but A.class_foo(1)
works just fine:
A.class_foo(1)
# executing class_foo(<class '__main__.A'>, 1)
One use people have found for class methods is to create inheritable alternative constructors.
With staticmethods, neither self
(the object instance) nor cls
(the class) is implicitly passed as the first argument. They behave like plain functions except that you can call them from an instance or the class:
a.static_foo(1)
# executing static_foo(1)
A.static_foo('hi')
# executing static_foo(hi)
Staticmethods are used to group functions which have some logical connection with a class to the class.
foo
is just a function, but when you call a.foo
you don't just get the function,
you get a "partially applied" version of the function with the object instance a
bound as the first argument to the function. foo
expects 2 arguments, while a.foo
only expects 1 argument.
a
is bound to foo
. That is what is meant by the term "bound" below:
print(a.foo)
# <bound method A.foo of <__main__.A object at 0xb7d52f0c>>
With a.class_foo
, a
is not bound to class_foo
, rather the class A
is bound to class_foo
.
print(a.class_foo)
# <bound method type.class_foo of <class '__main__.A'>>
Here, with a staticmethod, even though it is a method, a.static_foo
just returns
a good 'ole function with no arguments bound. static_foo
expects 1 argument, and
a.static_foo
expects 1 argument too.
print(a.static_foo)
# <function static_foo at 0xb7d479cc>
And of course the same thing happens when you call static_foo
with the class A
instead.
print(A.static_foo)
# <function static_foo at 0xb7d479cc>
Best Answer
The deprecated low_memory option
The
low_memory
option is not properly deprecated, but it should be, since it does not actually do anything differently[source]The reason you get this
low_memory
warning is because guessing dtypes for each column is very memory demanding. Pandas tries to determine what dtype to set by analyzing the data in each column.Dtype Guessing (very bad)
Pandas can only determine what dtype a column should have once the whole file is read. This means nothing can really be parsed before the whole file is read unless you risk having to change the dtype of that column when you read the last value.
Consider the example of one file which has a column called user_id. It contains 10 million rows where the user_id is always numbers. Since pandas cannot know it is only numbers, it will probably keep it as the original strings until it has read the whole file.
Specifying dtypes (should always be done)
adding
to the
pd.read_csv()
call will make pandas know when it starts reading the file, that this is only integers.Also worth noting is that if the last line in the file would have
"foobar"
written in theuser_id
column, the loading would crash if the above dtype was specified.Example of broken data that breaks when dtypes are defined
dtypes are typically a numpy thing, read more about them here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html
What dtypes exists?
We have access to numpy dtypes: float, int, bool, timedelta64[ns] and datetime64[ns]. Note that the numpy date/time dtypes are not time zone aware.
Pandas extends this set of dtypes with its own:
'datetime64[ns, <tz>]'
Which is a time zone aware timestamp.'category' which is essentially an enum (strings represented by integer keys to save
'period[]' Not to be confused with a timedelta, these objects are actually anchored to specific time periods
'Sparse', 'Sparse[int]', 'Sparse[float]' is for sparse data or 'Data that has a lot of holes in it' Instead of saving the NaN or None in the dataframe it omits the objects, saving space.
'Interval' is a topic of its own but its main use is for indexing. See more here
'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64' are all pandas specific integers that are nullable, unlike the numpy variant.
'string' is a specific dtype for working with string data and gives access to the
.str
attribute on the series.'boolean' is like the numpy 'bool' but it also supports missing data.
Read the complete reference here:
Pandas dtype reference
Gotchas, caveats, notes
Setting
dtype=object
will silence the above warning, but will not make it more memory efficient, only process efficient if anything.Setting
dtype=unicode
will not do anything, since to numpy, aunicode
is represented asobject
.Usage of converters
@sparrow correctly points out the usage of converters to avoid pandas blowing up when encountering
'foobar'
in a column specified asint
. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort. This is because the read_csv process is a single process.CSV files can be processed line by line and thus can be processed by multiple converters in parallel more efficiently by simply cutting the file into segments and running multiple processes, something that pandas does not support. But this is a different story.