Apache – Hadoop: wrong classpath in map reduce job

apache, cloudera, hadoop, hbase, mapreduce

I'm running a Cloudera cluster in 3 virtual machines and am trying to execute an HBase bulk load via a MapReduce job, but I always get the error:

error: Class org.apache.hadoop.hbase.mapreduce.HFileOutputFormat not found

So it seems that the map process doesn't find the class. I tried the following:

1) adding the hbase.jar to the HADOOP_CLASSPATH on every node

2) adding TableMapReduceUtil.addDependencyJars(job) / TableMapReduceUtil.addDependencyJars(myConf, HFileOutputFormat.class) to my source code

Nothing worked. I have absolutely no idea why the class is not found, because the jar/class is definitely available on the classpath.

If I take a look at the job.xml, I see the following entry:

name=tmpjars    value=file:/C:/Users/Thomas/.m2/repository/org/apache/zookeeper/zookeeper/3.4.5-cdh4.3.0/zookeeper-3.4.5-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/org/apache/hbase/hbase/0.94.6-cdh4.3.0/hbase-0.94.6-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/org/apache/hadoop/hadoop-core/2.0.0-mr1-cdh4.3.0/hadoop-core-2.0.0-mr1-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/com/google/guava/guava/11.0.2/guava-11.0.2.jar,file:/C:/Users/Thomas/.m2/repository/com/google/protobuf/protobuf-java/2.4.0a/protobuf-java-2.4.0a.jar

This seems a little odd to me: these are my local jars on the Windows system. Maybe these should be the HDFS jars? If so, how can I change the value of "tmpjars"?
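From what I can tell, tmpjars is just a comma-separated list of jar URIs that get shipped with the job, so presumably I could also set it by hand to jars that already sit on HDFS. A minimal sketch of what I mean, using the same Configuration object as in the code below (the HDFS paths are only placeholders, not jars that actually exist on my cluster):

    // Assumption: "tmpjars" takes a comma-separated list of jar URIs that are
    // put on the distributed cache for the job. The paths are placeholders.
    configuration.set("tmpjars",
            "hdfs://192.168.2.41:8020/libs/hbase-0.94.6-cdh4.3.0.jar,"
          + "hdfs://192.168.2.41:8020/libs/zookeeper-3.4.5-cdh4.3.0.jar");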

Here is the Java code I am trying to execute:

        configuration = new Configuration(false);
        configuration.set("mapred.job.tracker", "192.168.2.41:8021");
        configuration.set("fs.defaultFS", "hdfs://192.168.2.41:8020/");
        configuration.set("hbase.zookeeper.quorum", "192.168.2.41");
        configuration.set("hbase.zookeeper.property.clientPort", "2181");

        Job job = new Job(configuration, "HBase Bulk Import for "
                + tablename);
        job.setJarByClass(HBaseKVMapper.class);

        job.setMapperClass(HBaseKVMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);
        job.setOutputFormatClass(HFileOutputFormat.class);
        job.setPartitionerClass(TotalOrderPartitioner.class);
        job.setInputFormatClass(TextInputFormat.class);
        HFileOutputFormat.configureIncrementalLoad(job, hTable);

        FileInputFormat.addInputPath(job, new Path("myfile1"));
        FileOutputFormat.setOutputPath(job, new Path("myfile2"));

        job.waitForCompletion(true);

        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(
                configuration);
        loader.doBulkLoad(new Path("myFile3"), hTable);

EDIT:

I tried a little bit more and it's totally strange. I added the following line to the Java code:

job.setJarByClass(HFileOutputFormat.class);

After I executed this, the error was gone, but another class not found exception appeared:

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class mypackage.bulkLoad.HBaseKVMapper not found

HBaseKVMapper is the custom Mapper class I want to execute. I tried to add it with "job.setJarByClass(HBaseKVMapper.class)", but that doesn't work, since it is only a class file and not a jar. So I generated a jar file including HBaseKVMapper.class. After that, I executed it again and got the HFileOutputFormat.class not found exception again.

After debugging a little bit, I found out that the setJarByClass() method only copies the local jar file to .staging/job_#number/job.jar on HDFS. So setJarByClass() will only work for one jar file, because calling setJarByClass() again with another jar overwrites job.jar.
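As far as I can tell, setJarByClass() just fills in the mapred.jar property with the path of the jar that contains the given class, so I could probably point the job directly at the one jar that contains my own classes instead of calling setJarByClass() twice. A sketch of what I mean (the path is only a placeholder for the jar I built around HBaseKVMapper):

    // Assumption: equivalent to setJarByClass(HBaseKVMapper.class), i.e. name
    // the single jar that becomes .staging/job_#number/job.jar explicitly.
    configuration.set("mapred.jar", "C:/path/to/my-bulkload-job.jar");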

While searching for the error, I saw the following structure in the job staging directory:

[screenshot: job staging directory]

and inside the libjars directory I saw the relevant jar files:

[screenshot: libjars directory]

So the hbase jar is inside the libjars directory, but the jobtracker doesn't use it for executing the job. Why?

Best Answer

I would try using Cloudera Manager (free version) as it takes care of these issues for you. Otherwise note the following:

Both your own classes and the HBase class HFileOutputFormat need to be available on the classpath locally and remotely.

Submitting the job

That means getting the classpath right locally for when your driver runs:

$ env HADOOP_CLASSPATH=$(hbase classpath) hadoop jar path/to/jar class....

On the server

In your hadoop-env.sh

export HADOOP_CLASSPATH=$(hbase classpath)

or use

TableMapReduceUtil.addDependencyJars
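called from your driver after the job is created, roughly like this (a minimal sketch; HBaseKVMapper stands for your own mapper class):

    // Ship your own classes as the job jar...
    job.setJarByClass(HBaseKVMapper.class);
    // ...and ship the HBase/ZooKeeper/Guava dependency jars via the
    // distributed cache, so the task JVMs can load HFileOutputFormat etc.
    TableMapReduceUtil.addDependencyJars(job);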