mapreduce实现一个简单的单词计数的功能。

一,准备工作:eclipse 安装hadoop 插件:

下载相关版本的hadoop-eclipse-plugin-2.2.0.jar到eclipse/plugins下。

二,实现:

新建mapreduce project

map 用于分词,reduce计数。

package tank.demo;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; /**
* @author tank
* @date:2015年1月5日 上午10:03:43
* @description:记词器
* @version :0.1
*/ public class WordCount {
public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
} public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
} public static void main(String[] args) throws Exception { Configuration conf = new Configuration();
if (args.length != 2) {
System.err.println("Usage: wordcount ");
System.exit(2);
}
Job job = new Job(conf, "word count");
//主类
job.setJarByClass(WordCount.class); job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(IntSumReducer.class);
//map输出格式
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
//输出格式
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1);
} }

打包world-count.jar

三,准备输入数据

hadoop fs -mkdir /user/hadoop/input//建好输入目录

//随便写点数据文件

echo hello my hadoop this is my first application>file1

echo hello world my deer my applicaiton >file2

//拷贝到hdfs中

hadoop fs -put file* /user/hadoop/input

hadoop fs -ls /user/hadoop/input //查看

四,运行

上传到集群环境中:

hadoop jar world-count.jar  WordCount input output

截取一段输出如:

15/01/05 11:14:36 INFO mapred.Task: Task:attempt_local1938802295_0001_r_000000_0 is done. And is in the process of committing
15/01/05 11:14:36 INFO mapred.LocalJobRunner:
15/01/05 11:14:36 INFO mapred.Task: Task attempt_local1938802295_0001_r_000000_0 is allowed to commit now
15/01/05 11:14:36 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1938802295_0001_r_000000_0' to hdfs://192.168.183.130:9000/user/hadoop/output/_temporary/0/task_local1938802295_0001_r_000000
15/01/05 11:14:36 INFO mapred.LocalJobRunner: reduce > reduce
15/01/05 11:14:36 INFO mapred.Task: Task 'attempt_local1938802295_0001_r_000000_0' done.
15/01/05 11:14:36 INFO mapreduce.Job: Job job_local1938802295_0001 running in uber mode : false
15/01/05 11:14:36 INFO mapreduce.Job:  map 100% reduce 100%
15/01/05 11:14:36 INFO mapreduce.Job: Job job_local1938802295_0001 completed successfully
15/01/05 11:14:36 INFO mapreduce.Job: Counters: 32
        File System Counters
                FILE: Number of bytes read=17706
                FILE: Number of bytes written=597506
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=205
                HDFS: Number of bytes written=85
                HDFS: Number of read operations=25
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=5
        Map-Reduce Framework
                Map input records=2
                Map output records=14
                Map output bytes=136
                Map output materialized bytes=176
                Input split bytes=232
                Combine input records=0
                Combine output records=0
                Reduce input groups=10
                Reduce shuffle bytes=0
                Reduce input records=14
                Reduce output records=10
                Spilled Records=28
                Shuffled Maps =0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=67
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=456536064
        File Input Format Counters
                Bytes Read=80
        File Output Format Counters
                Bytes Written=85

查看输出目录下的文件

[hadoop@tank1 ~]$ hadoop fs -cat /user/hadoop/output/part-r-00000
applicaiton     1
application     1
deer    1
first   1
hadoop  1
hello   2
is      1
my      4
this    1
world   1

已经正确统计出单词数量!

最新文章

  1. nodejs进阶(2)—函数模块调用
  2. 好用的內存鏡像工具Belkasoft RAM Capture
  3. HDU 1010 Tempter of the Bone
  4. [OpenCV] Image Processing - Grayscale Transform
  5. Android logcat
  6. Flash Builder 4.6 界面显示一半中文一半英文?
  7. Git的安装以及一些操作
  8. 命令删除visualstudio.com云端项目(TFS)
  9. Oracle运维必修内功:前瞻性运维理念
  10. Java中Iterator(迭代器)的用法及其背后机制的探究
  11. centos7.2 linux 64位系统上安装mysql
  12. PyQt5安装目录中找不到designer.exe与pyrcc5.exe
  13. DB2还原数据库备份
  14. Swagger2 header 添加token
  15. 从mysql主从复制到微信开源的phxsql
  16. Could not update the distribution database subscription table. The subscription status could not be changed.
  17. django中怎么使用自定义管理后台xadmin
  18. CentOS6.9安装HDFS
  19. 潭州课堂25班:Ph201805201 django 项目 第二十五课 文章多级评论前后台实现 (课堂笔记)
  20. Mybatis 不同使用方式

热门文章

  1. 向继电器发送socket请求(python+java)
  2. Luogu5289 十二省联考2019字符串问题(后缀数组+拓扑排序+线段树/主席树/KDTree)
  3. Java json转model
  4. Python中xlwt解析
  5. BZOJ2554 color 【概率DP】【期望DP】
  6. 搭建本地yum源并定时同步
  7. 「HNOI2016」最小公倍数 解题报告
  8. HR_Two Strings
  9. 由AC自动机引发的灵感
  10. A1040. Longest Symmetric String