Hadoop HDFS: set file block size from the command line

Tags: block, hadoop, hdfs

I need to set the block size of a file when I load it into HDFS, to some value lower than the cluster block size. For example, if HDFS is using 64 MB blocks, I may want a large file to be copied in with 32 MB blocks.

I've done this before from within a Hadoop job using the org.apache.hadoop.fs.FileSystem.create() function, but is there a way to do it from the command line?
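
For reference, here is a minimal sketch of that programmatic approach, using the FileSystem.create() overload that takes an explicit block size. The class name, destination path, and payload are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallBlockPut {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 32L * 1024 * 1024;    // 32 MB, below a 64 MB cluster default
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        short replication = fs.getDefaultReplication();  // cluster default replication

        // Hypothetical destination path
        Path dst = new Path("/home/hcoyote/example.dat");
        FSDataOutputStream out = fs.create(dst, true, bufferSize, replication, blockSize);
        try {
            out.writeBytes("example payload\n");
        } finally {
            out.close();
        }
    }
}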

Best Answer

You can do this by passing -Ddfs.block.size=<size in bytes> to your hadoop fs command. For example:

hadoop fs -Ddfs.block.size=1048576  -put ganglia-3.2.0-1.src.rpm /home/hcoyote
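
Note: on Hadoop 2.x and later, the same setting is named dfs.blocksize (dfs.block.size still works, but is deprecated), and it accepts size suffixes, so the equivalent command should be:

hadoop fs -Ddfs.blocksize=1m -put ganglia-3.2.0-1.src.rpm /home/hcoyote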

As you can see in the fsck output below, the block size changes to whatever you define on the command line (in my case the cluster default is 64 MB, but I'm taking it down to 1 MB here).

$ hadoop fsck -blocks -files -locations /home/hcoyote/ganglia-3.2.0-1.src.rpm
FSCK started by hcoyote from /10.1.1.111 for path /home/hcoyote/ganglia-3.2.0-1.src.rpm at Mon Aug 15 14:34:14 CDT 2011
/home/hcoyote/ganglia-3.2.0-1.src.rpm 1376561 bytes, 2 block(s):  OK
0. blk_5365260307246279706_901858 len=1048576 repl=3 [10.1.1.115:50010, 10.1.1.105:50010, 10.1.1.119:50010]
1. blk_-6347324528974215118_901858 len=327985 repl=3 [10.1.1.106:50010, 10.1.1.105:50010, 10.1.1.104:50010]

Status: HEALTHY
 Total size:    1376561 B
 Total dirs:    0
 Total files:   1
 Total blocks (validated):  2 (avg. block size 688280 B)
 Minimally replicated blocks:   2 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:   0 (0.0 %)
 Mis-replicated blocks:     0 (0.0 %)
 Default replication factor:    3
 Average block replication: 3.0
 Corrupt blocks:        0
 Missing replicas:      0 (0.0 %)
 Number of data-nodes:      12
 Number of racks:       1
FSCK ended at Mon Aug 15 14:34:14 CDT 2011 in 0 milliseconds


The filesystem under path '/home/hcoyote/ganglia-3.2.0-1.src.rpm' is HEALTHY
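
To spot-check a single file without running a full fsck, hadoop fs -stat with the %o format specifier prints that file's block size in bytes:

hadoop fs -stat %o /home/hcoyote/ganglia-3.2.0-1.src.rpm

For the upload above, this should print 1048576.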