I load some machine learning data from a CSV file. The first 2 columns are observations and the remaining columns are features.
Currently, I do the following:
data = pandas.read_csv('mydata.csv')
which gives something like:
data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))
I'd like to slice this dataframe into two dataframes: one containing the columns a and b, and one containing the columns c, d and e.
It is not possible to write something like
observations = data[:'c']
features = data['c':]
I'm not sure what the best method is. Do I need a pd.Panel?
By the way, I find dataframe indexing pretty inconsistent: data['a'] is permitted, but data[0] is not. On the other hand, data['a':] is not permitted but data[0:] is.
Is there a practical reason for this? This is really confusing if columns are indexed by integers, given that data[0] != data[0:1].
Best Answer
2017 Answer - pandas 0.20: .ix is deprecated. Use .loc
See the deprecation in the docs
.loc uses label-based indexing to select both rows and columns. The labels are the values of the index or the columns. Slicing with .loc includes the last element. .loc accepts the same slice notation that Python lists do for both rows and columns. The slice notation is start:stop:step.
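Applied to the dataframe from the question, the split could look like the sketch below. It assumes the columns are labelled a to e as in the question's example; the variable name every_other is only illustrative.

```python
import numpy as np
import pandas as pd

# Same shape as the example in the question: 10 rows, columns a..e.
data = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))

# The first ':' keeps all rows; the second slice selects columns by label.
# Unlike list slicing, the stop label is included.
observations = data.loc[:, :'b']   # columns a and b
features = data.loc[:, 'c':]       # columns c, d and e

# start:stop:step also works on labels (illustrative name).
every_other = data.loc[:, 'a':'e':2]   # columns a, c and e

print(list(observations.columns))
print(list(features.columns))
print(list(every_other.columns))
```

No pd.Panel is involved: each result is an ordinary DataFrame with the same row index as data.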
You can slice by rows and columns. For instance, if you have 5 rows with labels v, w, x, y, z