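Hadoop Streaming is the multi-language programming tool that ships with Hadoop. With it, users can write MapReduce programs in any language that can read standard input and write standard output. This article walks through several Hadoop Streaming programming examples; readers can focus on the following questions:
(1) For a given language, how should the Mapper and Reducer be written, and which programming conventions must they follow?
(2) How do you define a custom Hadoop Counter in Hadoop Streaming?
(3) How do you set custom status messages in Hadoop Streaming to report the job's progress to the user?
(4) How do you print debug logs in Hadoop Streaming, and where can you read them?
(5) How do you use Hadoop Streaming to process binary files rather than only text files?
This article focuses on the first four questions and provides WordCount examples written in C++ and Shell for reference.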
1. C++ WordCount
(1) Mapper implementation (mapper.cpp)

```cpp
#include <iostream>
#include <string>
using namespace std;

int main() {
  string key;
  while (cin >> key) {
    // Emit one "word<TAB>1" record per input word
    cout << key << "\t" << "1" << endl;
    // Define counter named counter_no in group counter_group
    cerr << "reporter:counter:counter_group,counter_no,1\n";
    // Display status
    cerr << "reporter:status:processing......\n";
    // Print logs for testing (on the cluster this line lands in the
    // task's stderr log file, despite what the message text says)
    cerr << "This is log, will be printed in stdout file\n";
  }
  return 0;
}
```
(2) Reducer implementation (reducer.cpp)

```cpp
#include <iostream>
#include <string>
using namespace std;

int main() {  // the reducer runs as a standalone process, so it needs main()
  string cur_key, last_key, value;
  // Read the first record; emit nothing if the input is empty
  if (!(cin >> cur_key >> value)) {
    return 0;
  }
  last_key = cur_key;
  int n = 1;
  while (cin >> cur_key >> value) {  // read the map task output
    if (last_key != cur_key) {  // a new key starts
      cout << last_key << "\t" << n << endl;
      last_key = cur_key;
      n = 1;
    } else {  // same key: count one more value (every value is 1)
      n++;
    }
  }
  cout << last_key << "\t" << n << endl;
  return 0;
}
```
(3) Compile and run
Compile the two programs:
g++ -o mapper mapper.cpp
g++ -o reducer reducer.cpp
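Test them locally:
echo "dong xicheng is here now, talk to dong xicheng now" | ./mapper | sort | ./reducer
Note: when tested this way, the mapper repeatedly prints the strings below to the terminal. You can comment out the statements that print them for the local test; on the cluster, Hadoop recognizes the reporter: lines.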
reporter:counter:counter_group,counter_no,1
reporter:status:processing......
This is log, will be printed in stdout file
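As an alternative to commenting the statements out, standard error can simply be discarded during the local test. The following is a minimal sketch, assuming a POSIX shell; `noisy_mapper` and `local_test` are hypothetical stand-ins introduced here for illustration and are not part of the original programs.

```shell
#!/bin/sh
# Hypothetical stand-in for ./mapper: records go to stdout,
# reporter lines go to stderr.
noisy_mapper() {
  while read -r line; do
    for word in $line; do
      printf '%s\t1\n' "$word"
      echo "reporter:counter:counter_group,counter_no,1" >&2
    done
  done
}

# Discard stderr so that only the key/value records are shown.
local_test() {
  echo "$1" | noisy_mapper 2>/dev/null | sort
}

local_test "now dong now"
```

With the compiled programs, the same idea would read `echo "..." | ./mapper 2>/dev/null | sort | ./reducer`.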
Once the test passes, submit the job to the cluster with the following script (run_cpp_mr.sh):

```bash
#!/bin/bash
HADOOP_HOME=/opt/yarn-client
INPUT_PATH=/test/input
OUTPUT_PATH=/test/output
echo "Clearing output path: $OUTPUT_PATH"
# -rmr is deprecated in Hadoop 2.x; "hadoop fs -rm -r" is the current form
$HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT_PATH

${HADOOP_HOME}/bin/hadoop jar \
    ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -files mapper,reducer \
    -input $INPUT_PATH \
    -output $OUTPUT_PATH \
    -mapper mapper \
    -reducer reducer
```
2. Shell WordCount
(1) Mapper implementation (mapper.sh)

```bash
#!/bin/bash
while read -r LINE; do
  for word in $LINE; do
    # emit key and value separated by a tab, Hadoop's default separator
    printf '%s\t1\n' "$word"
    # in streaming, we define counter by
    # [reporter:counter:<group>,<counter>,<amount>]
    # define a counter named counter_no, in group counter_group
    # increase this counter by 1
    # counter should be output through stderr
    echo "reporter:counter:counter_group,counter_no,1" >&2
    echo "reporter:status:processing......" >&2
    echo "This is log for testing, will be printed in stdout file" >&2
  done
done
```

(2) Reducer implementation (reducer.sh)

```bash
#!/bin/bash
count=0
started=0
word=""
while read -r LINE; do
  # the key is the first tab-separated field (tab is cut's default delimiter)
  newword=$(printf '%s\n' "$LINE" | cut -f 1)
  if [ "$word" != "$newword" ]; then
    # a new key starts: emit the finished one
    [ $started -ne 0 ] && printf '%s\t%s\n' "$word" "$count"
    word=$newword
    count=1
    started=1
  else
    count=$(( count + 1 ))
  fi
done
# emit the last key, unless the input was empty
if [ $started -ne 0 ]; then
  printf '%s\t%s\n' "$word" "$count"
fi
```
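The reducers in this article count the records seen for each key, which equals the word count only because every mapper value is 1. If the values could be larger than 1, the reducer would need to add them up instead. The following is a minimal sketch of that variant, assuming sorted tab-separated input and a POSIX shell; `sum_by_key` is a hypothetical name used only for illustration.

```shell
#!/bin/sh
TAB=$(printf '\t')

# Read sorted "key<TAB>value" lines on stdin and print "key<TAB>sum" per key.
sum_by_key() {
  cur=""
  total=0
  started=0
  while IFS="$TAB" read -r key value; do
    if [ "$started" -eq 1 ] && [ "$key" = "$cur" ]; then
      # same key as the previous record: accumulate its value
      total=$(( ${total} + ${value:-0} ))
    else
      # a new key starts: emit the finished one, if any
      if [ "$started" -eq 1 ]; then
        printf '%s\t%s\n' "$cur" "$total"
      fi
      cur=$key
      total=${value:-0}
      started=1
    fi
  done
  # emit the last key, unless the input was empty
  if [ "$started" -eq 1 ]; then
    printf '%s\t%s\n' "$cur" "$total"
  fi
}

printf 'dong\t2\ndong\t1\nnow\t3\n' | sum_by_key
```

Like the reducers above, this relies on equal keys arriving on adjacent lines, which the sort step guarantees.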
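(3) Test run
Test the two scripts locally:
echo "dong xicheng is here now, talk to dong xicheng now" | sh mapper.sh | sort | sh reducer.sh
Note: when tested this way, the mapper repeatedly prints the strings below to the terminal. You can comment out the statements that print them for the local test; on the cluster, Hadoop recognizes the reporter: lines.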
reporter:counter:counter_group,counter_no,1
reporter:status:processing......
This is log for testing, will be printed in stdout file
Once the test passes, submit the job to the cluster with the following script (run_shell_mr.sh):

```bash
#!/bin/bash
HADOOP_HOME=/opt/yarn-client
INPUT_PATH=/test/input
OUTPUT_PATH=/test/output
echo "Clearing output path: $OUTPUT_PATH"
# -rmr is deprecated in Hadoop 2.x; "hadoop fs -rm -r" is the current form
$HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT_PATH

${HADOOP_HOME}/bin/hadoop jar \
    ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -files mapper.sh,reducer.sh \
    -input $INPUT_PATH \
    -output $OUTPUT_PATH \
    -mapper "sh mapper.sh" \
    -reducer "sh reducer.sh"
```
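3. Program notes
In Hadoop Streaming, standard input, standard output, and standard error each have a role. Standard input and standard output carry the input data and the processing results, while the meaning of a line on standard error depends on its content:
(1) If the line has the form reporter:counter:group,counter,amount, Hadoop increases the Hadoop counter named counter in the group named group by amount. The first time Hadoop reads a counter it creates it; afterwards it looks the counter up in its counter table and increases the value.
(2) If the line has the form reporter:status:message, Hadoop shows message in the web UI or on the terminal. This is useful for status hints.
(3) Any other line on standard error is treated as a debug log, and Hadoop redirects it to the stderr file. Note: each task has three log files, stdout, stderr, and syslog, all of them plain text. You can view them in the web UI, or log in to the node that ran the task and read them in the corresponding directory.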
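The three cases above can be wrapped in small helper functions so that a script does not repeat the format strings. This is a minimal sketch, assuming a POSIX shell; the function names `report_counter`, `report_status`, and `log_debug` are hypothetical and only the `reporter:` line formats come from the text.

```shell
#!/bin/sh
# Hypothetical helper names; only the reporter: formats are Hadoop's.

# Increase counter $2 in group $1 by amount $3 (default 1).
report_counter() {
  echo "reporter:counter:$1,$2,${3:-1}" >&2
}

# Update the status message of the current task.
report_status() {
  echo "reporter:status:$1" >&2
}

# Any other stderr line is an ordinary debug log line.
log_debug() {
  echo "$*" >&2
}

report_counter counter_group counter_no 1
report_status "processing......"
log_debug "This is log for testing"
```

All three helpers write to standard error only, so they never mix with the key/value records on standard output.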
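One more point deserves attention: by default, the separator between key and value in Map Task output is \t. Hadoop splits each record into key and value at \t in both the Map and Reduce phases and sorts the records by key, so getting the separator right is very important. You can specify a different separator with stream.map.output.field.separator.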
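The splitting rule can be illustrated outside Hadoop. The sketch below mimics it in a POSIX shell: the text before the first separator is the key, the rest is the value, and a line without a separator is treated as a key with an empty value. `split_record` is a hypothetical function written for this illustration; it is not Hadoop code.

```shell
#!/bin/sh
TAB=$(printf '\t')

# Print "key=<k> value=<v>" for one record, splitting at the first separator.
# $1 is the record, $2 the separator (defaults to a tab).
split_record() {
  line=$1
  sep=${2:-$TAB}
  case $line in
    *"$sep"*)
      key=${line%%"$sep"*}   # everything before the first separator
      value=${line#*"$sep"}  # everything after the first separator
      ;;
    *)
      key=$line              # no separator: the whole line is the key
      value=""
      ;;
  esac
  printf 'key=%s value=%s\n' "$key" "$value"
}

split_record "$(printf 'dong\t1')"
split_record "a.b.c" "."
```

On the command line, the property would typically be passed as a generic option before the streaming options, for example `-D stream.map.output.field.separator=.`.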
 
			
 
	