Hadoop | Hadoop, Data Warehousing and Big Data

Let’s assume you copy a large file to HDFS, and you need to launch a MapReduce job when the copy finishes. How can you make sure that the file has been fully copied?

Fortunately, while a copy made with hadoop fs -put is still in progress, the data is written under a temporary name with the ._COPYING_ suffix, and the file is renamed to its final name when the operation completes.
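This check can be scripted. Below is a minimal sketch, assuming the copy is made with hadoop fs -put, so the data sits under <path>._COPYING_ until the final rename. The function name wait_for_copy and its interval and attempt parameters are hypothetical, chosen only for illustration. The sketch relies on hadoop fs -test -e, which exits with 0 when the path exists.

```shell
#!/bin/sh
# Sketch: wait until a file copied with "hadoop fs -put" is complete.
# Assumption: while the copy runs, the data sits under "<path>._COPYING_".

# wait_for_copy PATH [INTERVAL_SECONDS] [MAX_ATTEMPTS]   (hypothetical helper)
# Returns 0 once PATH exists and PATH._COPYING_ does not; returns 1 on timeout.
wait_for_copy() {
  local target="$1"
  local interval="${2:-10}"
  local max_attempts="${3:-360}"
  local attempt=1

  while [ "$attempt" -le "$max_attempts" ]; do
    # The copy is done when the final name exists and the temporary name is gone.
    if hadoop fs -test -e "$target" && ! hadoop fs -test -e "${target}._COPYING_"; then
      return 0
    fi
    sleep "$interval"
    attempt=$((attempt + 1))
  done

  # Gave up after MAX_ATTEMPTS checks.
  return 1
}
```

For example, wait_for_copy /user/v-dtolpeko/copy.txt 5 120 checks every 5 seconds for up to 10 minutes; chain it with && in front of whatever command launches your MapReduce job.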

Let’s emulate a long copy process and write its output to a file in HDFS:

echo `sleep 60` | hadoop fs -put - /user/v-dtolpeko/copy.txt

This command writes to the HDFS file /user/v-dtolpeko/copy.txt from STDIN. Just one byte (0x0A, a newline) is written to the file, but the sleep command holds STDIN open for 60 seconds, so we can see what happens in HDFS in the meantime.

Open another session to the Hadoop cluster and run:

hadoop fs -ls /user/v-dtolpeko | tail -n+2 | awk '{print $8}'

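This command lists the contents of the /user/v-dtolpeko directory in HDFS and outputs only the file names (tail skips the header line of the listing, and awk prints the eighth column, which holds the path):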

[dtolpeko ~]$ hadoop fs -ls /user/v-dtolpeko | tail -n+2 | awk '{print $8}'
/user/v-dtolpeko/.Trash
/user/v-dtolpeko/.staging
/user/v-dtolpeko/copy.txt._COPYING_
/user/v-dtolpeko/identity.pl

You can see that the ._COPYING_ suffix was added to copy.txt. When the first session completes the copy and you rerun the ls command, you can see that the suffix has been removed:

[dtolpeko ~]$ hadoop fs -ls /user/v-dtolpeko | tail -n+2 | awk '{print $8}'
/user/v-dtolpeko/.Trash
/user/v-dtolpeko/.staging
/user/v-dtolpeko/copy.txt
/user/v-dtolpeko/identity.pl
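Building on the listings above, the same pipeline can be narrowed to show only the files that are still being copied. This is a minimal sketch; the function name in_progress_copies is hypothetical, and it assumes the eight-column hadoop fs -ls output shown above and paths without spaces.

```shell
#!/bin/sh
# Sketch: print only the paths that still carry the ._COPYING_ suffix.
# Assumptions: input is "hadoop fs -ls" output with the path in column 8,
# and paths contain no spaces.

# in_progress_copies   (hypothetical helper; reads the listing from STDIN)
in_progress_copies() {
  # Lines with fewer than 8 fields (such as the "Found N items" header) are skipped.
  awk 'NF >= 8 && $8 ~ /\._COPYING_$/ { print $8 }'
}
```

A typical call would be hadoop fs -ls /user/v-dtolpeko | in_progress_copies; empty output means that no copy into the directory is currently in progress.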

Hadoop, Data Warehousing and Big Data

by Dmitry Tolpeko


HDFS – File Copy in Progress – Suffix _COPYING_ Added to Filename