Apache Spark – How to find the Spark master URL on Amazon EMR

amazon-emr, apache-spark, spark-streaming

I am new to Spark and am trying to set up Spark 1.3.1 on an Amazon EMR cluster. When I do

SparkConf sparkConfig = new SparkConf().setAppName("SparkSQLTest").setMaster("local[2]");

it works for me. However, I have learned that local[2] is meant only for testing.

When I tried to use cluster mode, I changed it to

SparkConf sparkConfig = new SparkConf().setAppName("SparkSQLTest").setMaster("spark://localhost:7077");

With this I get the error below:

Tried to associate with unreachable remote address [akka.tcp://sparkMaster@localhost:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused
15/06/10 15:22:21 INFO client.AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@localhost:7077/user/Master..

Could someone please let me know how to set the master URL?

Best Answer

If you are using the bootstrap action from https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark, the configuration is set up for Spark on YARN, so just set the master to yarn-client or yarn-cluster. Be sure to define the number of executors along with their memory and cores. More details about Spark on YARN are at https://spark.apache.org/docs/latest/running-on-yarn.html
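As a rough illustration of the settings mentioned above, here is a minimal sketch that collects the master and executor properties in a plain map. The class name, method name, and the example values (4 executors, 2 cores, 2g) are made up for illustration and are not EMR recommendations. The property keys are standard Spark configuration keys. The sketch deliberately avoids a Spark dependency so it runs on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: gathers the properties needed to run on YARN.
class YarnMasterSettings {

    /** Builds Spark properties for a YARN deployment; all values are supplied by the caller. */
    static Map<String, String> yarnProperties(String master, int executors, int cores, String memory) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("spark.master", master);                                // "yarn-client" or "yarn-cluster"
        props.put("spark.executor.instances", String.valueOf(executors)); // number of executors
        props.put("spark.executor.cores", String.valueOf(cores));         // cores per executor
        props.put("spark.executor.memory", memory);                       // heap per executor, e.g. "2g"
        return props;
    }

    public static void main(String[] args) {
        // Example values are assumptions; size them for your own instance type.
        Map<String, String> props = yarnProperties("yarn-client", 4, 2, "2g");
        // With Spark on the classpath, each entry could be applied with
        // sparkConfig.set(key, value) instead of being printed.
        for (Map.Entry<String, String> e : props.entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

In the question's code, the equivalent change would be to replace setMaster("spark://localhost:7077") with setMaster("yarn-client") and add the executor properties through set(key, value) calls on the same SparkConf.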

Addition regarding executor settings for memory and core sizing:

Take a look at the default YARN NodeManager configs for each instance type at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html, specifically yarn.scheduler.maximum-allocation-mb. You can determine the number of cores from the basic EC2 info URL (http://aws.amazon.com/ec2/instance-types/). The executor memory plus Spark's overhead has to fit within the maximum allocation, and the executor memory should be set in increments of 256 MB. A good description of this calculation is at http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/. Don't forget that a little over half of the executor memory can be used for RDD cache.
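The sizing rule above can be sketched as a small calculation. This is only an illustration: the overhead rule used here, max(384 MB, 7% of executor memory), is an assumption about the Spark 1.3-era default for spark.yarn.executor.memoryOverhead, so check it against your Spark version. The 11520 MB maximum allocation is a hypothetical figure, not a value taken from the EMR tables. The class and method names are made up.

```java
// Hypothetical helper: finds the largest executor memory that fits in a YARN container.
class ExecutorSizing {

    static final int INCREMENT_MB = 256; // increment mentioned in the answer

    /** Assumed overhead rule: max(minOverheadMb, factor * executorMb), rounded up. */
    static int overheadMb(int executorMb, double factor, int minOverheadMb) {
        return Math.max(minOverheadMb, (int) Math.ceil(executorMb * factor));
    }

    /**
     * Largest multiple of 256 MB whose memory plus overhead fits within
     * maxAllocationMb (yarn.scheduler.maximum-allocation-mb); 0 if nothing fits.
     */
    static int maxExecutorMemoryMb(int maxAllocationMb, double factor, int minOverheadMb) {
        int start = (maxAllocationMb / INCREMENT_MB) * INCREMENT_MB;
        for (int mb = start; mb > 0; mb -= INCREMENT_MB) {
            if (mb + overheadMb(mb, factor, minOverheadMb) <= maxAllocationMb) {
                return mb;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        int maxAllocationMb = 11520; // hypothetical value; look up the real one for your instance type
        int executorMb = maxExecutorMemoryMb(maxAllocationMb, 0.07, 384);
        System.out.println("spark.executor.memory=" + executorMb + "m");
    }
}
```

Under these assumed numbers the result is 10752 MB: the assumed overhead is 753 MB, for a total of 11505 MB, which fits within 11520 MB, while the next increment up (11008 MB) does not.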
