I have a very big pyspark.sql.dataframe.DataFrame named df.
I need some way of enumerating the records, so that I can access a record by a certain index (or select a group of records within an index range).
In pandas, I could simply do:
indexes=[2,3,6,7]
df[indexes]
Here I want something similar, without converting the dataframe to pandas.
The closest I can get is:

- Enumerating all the objects in the original dataframe by:

indexes = np.arange(df.count())
df_indexed = df.withColumn('index', indexes)

- Searching for the values I need using the where() function.
QUESTIONS:
- Why doesn't it work, and how can I make it work? How can I add a row to a dataframe?
- Would it work later to do something like this?

indexes = [2, 3, 6, 7]
df1.where("index in indexes").collect()
- Is there any faster and simpler way to deal with it?
Best Answer
It doesn't work because:

- the second argument for withColumn should be a Column, not a collection, so np.array won't work here
- when you pass "index in indexes" as a SQL expression to where, indexes is out of scope and is not resolved as a valid identifier

PySpark >= 1.4.0
You can add row numbers using the respective window function and query using the Column.isin method or a properly formatted query string:
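A minimal sketch of that approach. The column name some_col is only a placeholder for whatever column defines your row order, and row_number was named rowNumber in the 1.4/1.5 releases; note that row_number starts at 1.

from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

# "some_col" is a placeholder - order by whatever defines your row order
w = Window.orderBy("some_col")

# row_number() numbers rows starting at 1
indexed = df.withColumn("index", row_number().over(w))

indexes = [2, 3, 6, 7]

# DSL variant: Column.isin
indexed.where(col("index").isin(indexes)).show()

# SQL-expression variant: inline the values into the query string
indexed.where("index in ({0})".format(",".join(str(x) for x in indexes))).show()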
It looks like window functions called without a PARTITION BY clause move all the data to a single partition, so the above may not be the best solution after all.

Is there any faster and simpler way to deal with it? Not really. Spark DataFrames don't support random row access.
A paired RDD can be accessed using the lookup method, which is relatively fast if the data is partitioned using a HashPartitioner.
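A toy illustration of that RDD-side alternative (the data here is made up; in practice you would key your own records):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# A key-value RDD, hash-partitioned by key
pairs = sc.parallelize([(i, chr(97 + i)) for i in range(15)]).partitionBy(4)

# With a known partitioner, lookup() only scans the partition the key maps to
pairs.lookup(3)   # ['d']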
There is also the indexed-rdd project, which supports efficient lookups.

Edit:
Independent of the PySpark version, you can try something like this:
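For example, a rough sketch based on RDD.zipWithIndex; passing the extended schema explicitly is optional, but it avoids a second pass over the data to infer types:

from pyspark.sql.functions import col
from pyspark.sql.types import LongType, StructField, StructType

# Extend the existing schema with a non-nullable long "index" column
schema = StructType(df.schema.fields + [StructField("index", LongType(), False)])

indexed = (df.rdd
           .zipWithIndex()                           # -> (Row, index) pairs, index starts at 0
           .map(lambda ri: tuple(ri[0]) + (ri[1],))  # append the index to each row's values
           .toDF(schema))                            # explicit schema avoids re-inferring types

indexes = [2, 3, 6, 7]
indexed.where(col("index").isin(indexes)).collect()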