Hadoop – Which HDFS operations are atomic

Tags: atomic, hadoop, hdfs, mv

I am trying to write code to import files into HDFS for use as a Hive external table. I have found that using something like:

foo | ssh hostname "hdfs dfs -put - /destination/$FILENAME"

can cause errors, because the file is first written under a temporary name and renamed only when the upload is complete. This can produce a race condition in Hive between the directory listing and query execution.

One workaround is to copy to a temporary directory and then "hdfs dfs -mv" the file into position.
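A minimal sketch of that workaround, assuming it runs on a host where the hdfs client is available. The function name stage_and_publish, the STAGING_DIR and DEST_DIR variables, and the default staging path are hypothetical; the staging directory should live on the same filesystem as the destination and outside any directory Hive reads.

```shell
# Sketch only: upload stdin to a staging directory, then rename into place.
# STAGING_DIR and DEST_DIR are hypothetical settings, not Hadoop options.
stage_and_publish() {
    filename="$1"
    staging_dir="${STAGING_DIR:-/tmp/staging}"
    dest_dir="${DEST_DIR:-/destination}"

    # Write stdin to the staging path; stop if the upload fails.
    hdfs dfs -put - "${staging_dir}/${filename}" || return 1

    # Move the finished file into the table directory with a single rename.
    hdfs dfs -mv "${staging_dir}/${filename}" "${dest_dir}/${filename}"
}
```

It could be used as foo | stage_and_publish "$FILENAME". To run it through ssh as in the original command, the function would have to be defined on the remote host.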

The specific and general/academic questions are:

  1. The "hdfs dfs -mv" command is atomic, right?
  2. What other HDFS commands or operations are atomic?
  3. Can two "hdfs dfs -mkdir" commands issued at approximately the same time believe they both succeeded?
  4. Is there a better way to avoid race conditions with Hive when moving files into position?

Best Answer

In the Hadoop FS introduction you can find the requirements for atomicity:

Here are the core expectations of a Hadoop-compatible FileSystem. Some FileSystems do not meet all these expectations; as a result, some programs may not work as expected.

Atomicity

There are some operations that MUST be atomic. This is because they are often used to implement locking/exclusive access between processes in a cluster.

  1. Creating a file. If the overwrite parameter is false, the check and creation MUST be atomic.
  2. Deleting a file.
  3. Renaming a file.
  4. Renaming a directory.
  5. Creating a single directory with mkdir().

...

Most other operations come with no requirements or guarantees of atomicity.

So, to be sure, you must check the underlying filesystem. Based on those requirements, though, the answers are:

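  1. Yes: renaming a file or a directory is on the list.
  2. Those listed above.
  3. No, going by the requirement that creating a single directory with mkdir() is atomic.
  4. IMHO, renaming a file into place is a good choice for the job.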
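
As a rough way to check the underlying filesystem, the default filesystem URI can be read with hdfs getconf. The helper name default_fs_scheme is hypothetical, and the sketch assumes the table lives on the cluster's default filesystem. A scheme other than hdfs (an object store, for example) is a signal to verify the rename guarantees before relying on them.

```shell
# Sketch only: print the scheme of the configured default filesystem,
# e.g. "hdfs" for hdfs://namenode:8020.
default_fs_scheme() {
    uri=$(hdfs getconf -confKey fs.defaultFS) || return 1
    # Strip everything from the first ':' onward, leaving only the scheme.
    printf '%s\n' "${uri%%:*}"
}
```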