I am trying to deploy the sample Hadoop app provided by Google at https://github.com/GoogleCloudPlatform/solutions-google-compute-engine-cluster-for-hadoop on Google Cloud Platform.
I followed all the setup instructions given there step by step. I was able to set up the environment and start the cluster successfully, but I am not able to run the MapReduce part.
I am executing this command on my terminal:
./compute_cluster_for_hadoop.py mapreduce <project ID> <bucket name> [--prefix <prefix>] \
--input gs://<input directory on Google Cloud Storage> \
--output gs://<output directory on Google Cloud Storage> \
--mapper sample/shortest-to-longest-mapper.pl \
--reducer sample/shortest-to-longest-reducer.pl \
--mapper-count 5 \
--reducer-count 1
And I am getting the following error:
sudo: unknown user: hadoop
sudo: unable to initialize policy plugin
Traceback (most recent call last):
  File "./compute_cluster_for_hadoop.py", line 230, in <module>
    main()
  File "./compute_cluster_for_hadoop.py", line 226, in main
    ComputeClusterForHadoop().ParseArgumentsAndExecute(sys.argv[1:])
  File "./compute_cluster_for_hadoop.py", line 222, in ParseArgumentsAndExecute
    params.handler(params)
  File "./compute_cluster_for_hadoop.py", line 51, in MapReduce
    gce_cluster.GceCluster(flags).StartMapReduce()
  File "/home/ubuntu-gnome/Hadoop-sample-app/solutions-google-compute-engine-cluster-for-hadoop-master/gce_cluster.py", line 545, in StartMapReduce
    input_dir, output_dir)
  File "/home/ubuntu-gnome/Hadoop-sample-app/solutions-google-compute-engine-cluster-for-hadoop-master/gce_cluster.py", line 462, in _StartScriptAtMaster
    raise RemoteExecutionError('Remote execution error')
gce_cluster.RemoteExecutionError: Remote execution error
Since I followed all the steps exactly as given, I don't understand why this issue is arising.
Was the 'hadoop' user not actually created by the scripts executed earlier, or is there a problem with user permissions? Or is the problem somewhere else?
Please help me with this error. I am stuck here and can't proceed further.
Best Answer
The setup process is normally expected to create the user 'hadoop' automatically; this is done inside startup-script.sh on lines 75-76.
It's possible that some portion of the setup actually failed.
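One way to confirm whether the account was created is to check for it directly on the master instance. The following is only a minimal sketch, assuming Python is available on that machine; the helper name user_exists is illustrative and is not part of the sample app.

```python
import pwd


def user_exists(name):
    """Return True if a local account called `name` exists on this machine."""
    try:
        pwd.getpwnam(name)  # raises KeyError when no such account exists
        return True
    except KeyError:
        return False


if __name__ == "__main__":
    # 'hadoop' is the account that the sudo error reported as unknown.
    if user_exists("hadoop"):
        print("user 'hadoop' exists")
    else:
        print("user 'hadoop' is missing; the startup script may not have completed")
```

If the account turns out to be missing, that would suggest the startup script did not run to completion on that instance, which is consistent with the "sudo: unknown user: hadoop" message.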
That said, the sample you're referencing, while still useful as a starting point if you're writing your own Python application which interacts with the GCE API directly, is deprecated as a way to deploy Hadoop on Google Compute Engine. If you actually want to use Hadoop, you should use the Google-supported deployment tool bdutil and its associated quickstart. There are some similarities in the cluster which gets deployed, including the setup of a user 'hadoop'. A key difference, however, is that bdutil will also include and configure the GCS connector for Hadoop so that your MapReduce can operate directly against the data in GCS rather than needing to copy it into HDFS first.