Kafka is a real-time, fault-tolerant, scalable messaging system.
It is a publish-subscribe system that connects various applications with the help of messages – producers and consumers of information.
Kafka is a real-time, fault-tolerant, scalable messaging system.
It is a publish-subscribe system that connects various applications with the help of messages – producers and consumers of information.
Let’s assume you copy a large file to HDFS, and you need to launch a MapReduce job when the copy finishes. How to make sure that the file is fully copied?
Fortunately, when the copy is still in progress, HDFS adds _COPYING_ suffix to the file name, and removes it when the operation is complete.
Let’s emulate a long copy process and put the output file to HDFS:
echo `sleep 60` | hadoop fs -put - /user/v-dtolpeko/copy.txt
This command writes to HDFS file /user/v-dtolpeko/copy.txt from STDIN. Just one byte (0x0A – end of line) is written to the file in HDFS but sleep command holds STDIN open for 60 seconds so we can see what happens in HDFS:
Open another session to the Hadoop cluster and run:
hadoop fs -ls /user/v-dtolpeko | tail -n+2 | awk '{print $8}'
This command lists the directory content in HDFS /user/v-dtolpeko/ directory, and just outputs file names:
[dtolpeko ~]$hadoop fs -ls /user/v-dtolpeko | tail -n+2 | awk '{print $8}' /user/v-dtolpeko/.Trash /user/v-dtolpeko/.staging /user/v-dtolpeko/copy.txt._COPYING_ /user/v-dtolpeko/identity.pl
You can see that _COPYING_ suffix was added to copy.txt. When the first session completes the copy, and you rerun ls command, you can see that the suffix removed:
[dtolpeko ~]$hadoop fs -ls /user/v-dtolpeko | tail -n+2 | awk '{print $8}' /user/v-dtolpeko/.Trash /user/v-dtolpeko/.staging /user/v-dtolpeko/copy.txt /user/v-dtolpeko/identity.pl