NameNodes fail to start on HA cluster – FATAL errors exist in JournalNode logs


I am having a problem with my Hadoop cluster.

CentOS 7.3, Hortonworks Ambari 2.4.2, Hortonworks HDP 2.5.3

Ambari stderr:

2017-04-06 10:49:49,039 - Getting jmx metrics from NN failed. URL: http://master02.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/", line 38, in get_value_from_jmx
    _, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/", line 61, in get_user_call_output
    raise ExecutionFailed(err_msg, code, files_output[0], files_output[1])
ExecutionFailed: Execution of 'curl -s 'http://master02.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmp0CNZmD 2>/tmp/tmpRAZgwz' returned 7. 

2017-04-06 10:49:51,041 - Getting jmx metrics from NN failed. URL: http://master03.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/", line 38, in get_value_from_jmx
    _, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/", line 61, in get_user_call_output
    raise ExecutionFailed(err_msg, code, files_output[0], files_output[1])
ExecutionFailed: Execution of 'curl -s 'http://master03.mydomain.local:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmp_hLNY7 2>/tmp/tmpoCOTt8' returned 7. 
(tries several times and then)
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/HDFS/", line 420, in <module>
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/", line 280, in execute
  File "/var/lib/ambari-agent/cache/common-services/HDFS/", line 101, in start
    upgrade_suspended=params.upgrade_suspended, env=env)
  File "/usr/lib/python2.6/site-packages/ambari_commons/", line 89, in thunk
    return fn(*args, **kwargs)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/", line 184, in namenode
    if is_this_namenode_active() is False:
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/", line 55, in wrapper
    return function(*args, **kwargs)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/", line 562, in is_this_namenode_active
    raise Fail(format("The NameNode {namenode_id} is not listed as Active or Standby, waiting..."))
resource_management.core.exceptions.Fail: The NameNode nn1 is not listed as Active or Standby, waiting...

Ambari stdout:

2017-04-06 10:53:20,521 - call returned (255, '17/04/06 10:53:20 INFO ipc.Client: Retrying connect to server: master03.mydomain.local/ Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)\n17/04/06 10:53:20 WARN ipc.Client: Failed to connect to server: master03.mydomain.local/ retries get failed due to exceeded maximum allowed retries number: 1
2017-04-06 10:53:20,522 - No active NameNode was found after 5 retries. Will return current NameNode HA states

Namenode log:

2017-04-06 10:11:43,561 FATAL Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [,,], stream=null)) java.lang.AssertionError: Decided to synchronize log to startTxId: 1 endTxId: 1 isInProgress: true but logger had seen txid 1865764 committed at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnclosedSegment( at (some class at some other class at ...)

Some more logs from Namenode:

2017-04-06 10:11:42,380 INFO  ipc.Server ( - IPC Server handler 72 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from Call#2322 Retry#0
org.apache.hadoop.ipc.RetriableException: NameNode still not started
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendHeartbeat(
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.sendHeartbeat(
        at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$
        at org.apache.hadoop.ipc.RPC$
        at org.apache.hadoop.ipc.Server$Handler$
        at org.apache.hadoop.ipc.Server$Handler$
        at Method)
        at org.apache.hadoop.ipc.Server$
2017-04-06 10:11:42,390 INFO  namenode.NameNode ( - NameNode RPC up at: bigm02.etstur.local/
2017-04-06 10:11:42,391 INFO  namenode.FSNamesystem ( - Starting services required for standby state
2017-04-06 10:11:42,393 INFO  ha.EditLogTailer (<init>(117)) - Will roll logs on active node at bigm03.etstur.local/ every 120 seconds.
2017-04-06 10:11:42,397 INFO  ha.StandbyCheckpointer ( - Starting standby checkpoint thread...
Checkpointing active NN at http://bigm03.etstur.local:50070
Serving checkpoints at http://bigm02.etstur.local:50070
2017-04-06 10:11:43,371 INFO  namenode.FSNamesystem ( - Stopping services started for standby state
2017-04-06 10:11:43,372 WARN  ha.EditLogTailer ( - Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$
2017-04-06 10:11:43,475 INFO  namenode.FSNamesystem ( - Starting services required for active state
2017-04-06 10:11:43,485 INFO  client.QuorumJournalManager ( - Starting recovery process for unclosed journal segments...
2017-04-06 10:11:43,534 INFO  client.QuorumJournalManager ( - Successfully started new epoch 17
2017-04-06 10:11:43,535 INFO  client.QuorumJournalManager ( - Beginning recovery of unclosed segment starting at txid 1
2017-04-06 10:11:43,557 INFO  client.QuorumJournalManager ( - Recovery prepare phase complete. Responses: segmentState { startTxId: 1 endTxId: 1 isInProgress: true } lastWriterEpoch: 14 lastCommittedTxId: 1865764 segmentState { startTxId: 1 endTxId: 1 isInProgress: true } lastWriterEpoch: 14 lastCommittedTxId: 1865764
2017-04-06 10:11:43,560 INFO  client.QuorumJournalManager ( - Using longest log: {
  startTxId: 1
  endTxId: 1
  isInProgress: true
lastWriterEpoch: 14
lastCommittedTxId: 1865764

2017-04-06 10:11:43,561 FATAL namenode.FSEditLog ( - Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [,,], stream=null))
java.lang.AssertionError: Decided to synchronize log to startTxId: 1
endTxId: 1
isInProgress: true
 but logger had seen txid 1865764 committed
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnclosedSegment(
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$8.apply(
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(
        at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(
        at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(
        at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(
        at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(
        at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(
        at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$
        at org.apache.hadoop.ipc.RPC$
        at org.apache.hadoop.ipc.Server$Handler$
        at org.apache.hadoop.ipc.Server$Handler$
        at Method)
        at org.apache.hadoop.ipc.Server$
2017-04-06 10:11:43,562 INFO  util.ExitUtil ( - Exiting with status 1
2017-04-06 10:11:43,563 INFO  namenode.NameNode ( - SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at bigm02.etstur.local/
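
The AssertionError in the log above can be read as a simple consistency check: the segment chosen for recovery ends at txid 1, while the journal reports that txid 1865764 was already committed. The sketch below only illustrates that comparison with the values from the log. The names `PrepareResponse` and `segment_covers_committed` are hypothetical, and the exact condition is an assumption inferred from the message text, not Hadoop's actual code.

```python
from typing import NamedTuple


class PrepareResponse(NamedTuple):
    """Fields copied from the 'Recovery prepare phase complete' log line."""
    start_txid: int
    end_txid: int
    in_progress: bool
    last_writer_epoch: int
    last_committed_txid: int


def segment_covers_committed(resp: PrepareResponse) -> bool:
    # Assumed check: the segment picked for recovery must reach at least
    # the highest transaction id the journal has seen committed.
    return resp.end_txid >= resp.last_committed_txid


# Values taken from the NameNode log above.
logged = PrepareResponse(
    start_txid=1,
    end_txid=1,
    in_progress=True,
    last_writer_epoch=14,
    last_committed_txid=1865764,
)

if __name__ == "__main__":
    if not segment_covers_committed(logged):
        print(
            "segment ends at txid", logged.end_txid,
            "but txid", logged.last_committed_txid, "was already committed",
        )
```

Read this way, the journal still remembers the commit history from before the format, while the edit segments it holds start again from txid 1.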

And although the JournalNodes started successfully, they have the following error, which also looks suspicious:

2017-04-05 17:15:05,653 ERROR RECEIVED SIGNAL 15: SIGTERM

The background of this error is as follows.

Yesterday I noticed that one of the DataNodes had failed and stopped. There were the following errors in the logs:

2017-04-05 15:50:11,168 ERROR datanode.DataNode ( - Initialization failed for Block pool <registering> (Datanode Uuid be2286f5-00d7-4758-b89a-45e2304cabe3) service to master02.mydomain.local/ Exiting.
All specified directories are failed to load.
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(
        at
        at
2017-04-05 15:50:11,168 ERROR datanode.DataNode ( - Initialization failed for Block pool <registering> (Datanode Uuid be2286f5-00d7-4758-b89a-45e2304cabe3) service to master02.mydomain.local/ Exiting.
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 13, volumes configured: 14, volumes failed: 1, volume failures tolerated: 0

2017-04-05 17:15:36,968 INFO  common.Storage ( - Lock on /grid/13/hadoop/hdfs/data/in_use.lock
 acquired by nodename 31353@data02.mydomain.local

Although the logs showed volume errors, I was able to browse /grid/13/.

So I wanted to try the answers to this Stack Overflow question:
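
The numbers in the DiskErrorException line can be checked by hand: 14 volumes configured, 13 valid, so 1 failed, and 0 failures tolerated. A minimal sketch of that arithmetic follows. The function name `too_many_failed_volumes` is hypothetical and the comparison is an assumption based on the wording of the log message; the tolerated count normally comes from the `dfs.datanode.failed.volumes.tolerated` setting.

```python
def too_many_failed_volumes(configured: int, valid: int, tolerated: int) -> bool:
    """Return True when more volumes have failed than the tolerated count."""
    failed = configured - valid
    # Assumed rule: the DataNode refuses to start once the number of
    # failed volumes exceeds the tolerated number.
    return failed > tolerated


if __name__ == "__main__":
    # Values from the log: 14 configured, 13 valid, 0 tolerated.
    print(too_many_failed_volumes(configured=14, valid=13, tolerated=0))
    # The same disk layout with one tolerated failure.
    print(too_many_failed_volumes(configured=14, valid=13, tolerated=1))
```

Under this reading, a single unreadable volume is enough to stop the DataNode while the tolerated count is 0, even if the mount point can still be browsed.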

Datanode not starts correctly

First I deleted the data folder under /grid/13/hadoop/hdfs (/grid/13/hadoop/hdfs/data) and tried to start the DataNode.

It failed again with the same errors, so I went with a NameNode format. My cluster was new and empty, so I am fine with any solution, including formats:

(On the first try I gave the block pool ID instead of the cluster ID, and the command failed.)

./hdfs namenode -format -clusterId <myClusterId>

After this format, one of the namenodes failed. When I tried to restart all HDFS components, both namenodes failed.

Any comments are appreciated.

Best Answer

"Hdfs HA formatting

In HA, things get a little more complicated. The Standby and Active NameNodes have shared storage managed by the JournalNode service. HA relies on a failover scenario to swap from the Standby to the Active NameNode and, like many other systems in Hadoop, this uses ZooKeeper. As you can see, a couple more pieces need to be made aware of a formatting action.

The initial steps are very close

1. Stop the HDFS service.
2. Start only the journal nodes (as they will need to be made aware of the formatting).
3. On the first namenode (as user hdfs):
   - hadoop namenode -format
   - hdfs namenode -initializeSharedEdits -force (for the journal nodes)
   - hdfs zkfc -formatZK -force (to force zookeeper to reinitialise)
   - restart that first namenode
4. On the second namenode:
   - hdfs namenode -bootstrapStandby -force (force synch with first namenode)
5. On every datanode, clear the data directory.
6. Restart the HDFS service.

This was a very simple step-by-step guide to formatting. In a later article we will cover actually repairing common errors in HDFS."
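
The order of the quoted steps matters, so it may help to write them down as data before touching the cluster. The sketch below is a dry run only: it prints the plan and executes nothing. The `Step` type and the `ha_format_plan` and `render` helpers are hypothetical names introduced for illustration; the commands themselves are copied from the answer, and the steps marked "manual" are assumed to be done through Ambari or by hand.

```python
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    where: str               # which host(s) the step applies to
    action: str              # short description of the step
    command: Optional[str]   # CLI command from the answer, or None if manual


def ha_format_plan() -> List[Step]:
    """Return the answer's steps in the order they are listed."""
    return [
        Step("cluster", "stop the HDFS service", None),
        Step("journal nodes", "start only the journal nodes", None),
        Step("first namenode", "format the namenode",
             "hadoop namenode -format"),
        Step("first namenode", "re-initialise the shared edits",
             "hdfs namenode -initializeSharedEdits -force"),
        Step("first namenode", "re-initialise the zookeeper state",
             "hdfs zkfc -formatZK -force"),
        Step("first namenode", "restart this namenode", None),
        Step("second namenode", "sync with the first namenode",
             "hdfs namenode -bootstrapStandby -force"),
        Step("every datanode", "clear the data directory", None),
        Step("cluster", "restart the HDFS service", None),
    ]


def render(plan: List[Step]) -> str:
    """Format the plan as numbered lines; nothing is executed."""
    lines = []
    for number, step in enumerate(plan, start=1):
        suffix = f": {step.command}" if step.command else " (manual)"
        lines.append(f"{number}. [{step.where}] {step.action}{suffix}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(ha_format_plan()))
```

Compared with the question, where only `hdfs namenode -format` was run, the plan also re-initialises the shared edits and the zookeeper state and re-syncs the second namenode.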
