Question about Spark 1.6.1, pyspark
I have streaming data coming in that looks like:
{"event":4,"Userid":12345,"time":123456789,"device_model":"iPhone OS", "some_other_property": "value", "row_key": 555}
I have a function that writes to HBase called writeToHBase(rdd), expecting an rdd containing tuples in the following structure:
(rowkey, [rowkey, column-family, key, value])
As you can see from the input format, I have to take my original dataset and iterate over all keys, sending each key/value pair with a send function call.
From reading the Spark Streaming programming guide, section "Design Patterns for using foreachRDD" (http://spark.apache.org/docs/latest/streaming-programming-guide.html#tab_python_13), it seems that foreachRDD is recommended when doing something external to the dataset. In my case, I want to write data to HBase over the network, so I use foreachRDD on my streaming data and call the function that will handle sending the data:
stream.foreachRDD(lambda k: process(k))
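For reference, the design pattern that guide section recommends is to open one connection per partition rather than per record. A minimal sketch of that pattern, where HBaseConnection is a hypothetical stand-in for a real HBase client (e.g. happybase) and SENT just records what would have been written:

```python
SENT = []  # stands in for writes reaching HBase in this sketch

class HBaseConnection:
    """Hypothetical stand-in for a real HBase client (e.g. happybase)."""
    def send(self, record):
        SENT.append(record)
    def close(self):
        pass

def send_partition(records):
    # The guide's pattern: open one connection per partition,
    # not one per record, to amortize the connection cost.
    connection = HBaseConnection()
    for record in records:
        connection.send(record)
    connection.close()

# In the streaming job this would be wired up as:
# stream.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))
```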
My understanding of Spark functions is pretty limited right now, so I'm unable to figure out a way to iterate over my original dataset to use my write function. If it were a Python iterable, I'd be able to do this:
def process(rdd):
    for key, value in rdd.iteritems():
        writeToHBase(sc.parallelize([(rowkey, [rowkey, 'column-family', key, value])]))
where rowkey would be obtained by finding it in the rdd itself:
rdd.map(lambda x: x['rowkey'])
How do I accomplish what process() is meant to do in pyspark? I see some examples that use foreach, but I'm not quite able to get it to do what I want.
Best Answer
Why do you want to iterate over the rdd when your writeToHBase function expects an rdd as its argument? Simply call
writeToHBase(rdd)
in your process function; that's it. If you need to fetch every record from the rdd, you can call
rdd.foreach(processRecord)
where processRecord is a function that receives a single record to process.
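Putting the two pieces together, a minimal sketch of process(): flatMap expands each parsed record into the (rowkey, [rowkey, column-family, key, value]) tuples on the executors, so no driver-side iteration or extra sc.parallelize call is needed. This assumes writeToHBase accepts such an RDD, and "cf" is a placeholder column-family name:

```python
def to_hbase_tuples(record):
    # Flatten one parsed JSON record into the tuples writeToHBase expects.
    # "cf" is a placeholder column family -- substitute your own.
    rowkey = record["row_key"]
    return [(rowkey, [rowkey, "cf", key, value])
            for key, value in record.items()
            if key != "row_key"]

def process(rdd):
    # rdd is already an RDD, so the per-key expansion happens in
    # flatMap on the executors; the result is passed straight through.
    writeToHBase(rdd.flatMap(to_hbase_tuples))
```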