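I originally just wanted to practice on Sogou's data, but stumbled into the top-K problem in MapReduce (MR). After a few setbacks it looks simple in hindsight, but I learned quite a bit along the way.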


The data is the search log from Sogou Labs. The format looks roughly like this:

00:00:00    2982199073774412    [360安全卫士]    8 3    download.it.com.cn/softweb/software/firewall/antivirus/20067/17938.html
00:00:00    07594220010824798    [哄抢救灾物资]    1 1    news.21cn.com/social/daqian/2008/05/29/4777194_1.shtml
00:00:00    5228056822071097    [75810部队]    14 5    www.greatoo.com/greatoo_cn/list.asp?link_id=276&title=%BE%DE%C2%D6%D0%C2%CE%C5
00:00:00    6140463203615646    [绳艺]    62 36    www.jd-cd.com/jd_opus/xx/200607/706.html
00:00:00    8561366108033201    [汶川地震原因]    3 2    www.big38.net/
00:00:00    23908140386148713    [莫衷一是的意思]    1 2    www.chinabaike.com/article/81/82/110/2007/2007020724490.html
00:00:00    1797943298449139    [星梦缘全集在线观看]    8 5    www.6wei.net/dianshiju/????\xa1\xe9|????do=index
00:00:00    00717725924582846    [闪字吧]    1 2    www.shanziba.com/

I only need the search term and ignore the other fields. MR then computes the N most-searched terms (N is configurable).

[Figure: overall project structure]


First, a class that pulls the search term out of a log line:

SEA.java

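package org.admln.topK;

/**
 * Holds the search term parsed from one line of the Sogou log.
 *
 * @author admln
 */
public class SEA {
    private String seaWord;
    private boolean isValid;

    /** Parses one tab-separated log line; the third field is "[search term]". */
    public static SEA parser(String line) {
        SEA sea = new SEA();
        String[] fields = line.split("\t");
        // Treat short or malformed lines as invalid instead of letting
        // fields[2] throw ArrayIndexOutOfBoundsException.
        if (fields.length < 3 || fields[2].length() < 3) {
            sea.setValid(false);
        } else {
            sea.setValid(true);
            // Strip the surrounding square brackets.
            sea.setSeaWord(fields[2].substring(1, fields[2].length() - 1));
        }
        return sea;
    }

    public String getSeaWord() {
        return seaWord;
    }

    public void setSeaWord(String seaWord) {
        this.seaWord = seaWord;
    }

    public boolean isValid() {
        return isValid;
    }

    public void setValid(boolean isValid) {
        this.isValid = isValid;
    }
}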

Then comes the MR job itself:

package org.admln.topK;

import java.io.IOException;
import java.util.Collections;
import java.util.Map.Entry;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Counts search terms and emits the K most frequent ones.
 *
 * @author admln
 */
public class TopK {

    /** Emits (search term, 1) for every valid log line. */
    public static class topKMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final Text word = new Text();
        private final IntWritable ONE = new IntWritable(1);

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            SEA sea = SEA.parser(value.toString());
            if (sea.isValid()) {
                word.set(sea.getSeaWord());
                context.write(word, ONE);
            }
        }
    }

    /** Sums the counts per term and keeps only the K largest in memory. */
    public static class topKReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private int max;
        // Sorted by count, descending. Because the count is the key, two terms
        // with the same count overwrite each other and only one of them is kept.
        private final TreeMap<Integer, String> tree =
                new TreeMap<Integer, String>(Collections.reverseOrder());

        // Runs once before the first reduce() call: read K from the job configuration.
        @Override
        protected void setup(Context context) {
            max = context.getConfiguration().getInt("topK", 10);
        }

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            tree.put(Integer.valueOf(sum), key.toString());
            if (tree.size() > max) {
                // With reverse ordering, lastKey() is the smallest count.
                tree.remove(tree.lastKey());
            }
        }

        // Runs once after the last reduce() call: write out the retained terms.
        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            for (Entry<Integer, String> entry : tree.entrySet()) {
                context.write(new Text(entry.getValue()), new IntWritable(entry.getKey()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path input = new Path("hdfs://hadoop:8020/input/topK/");
        Path output = new Path("hdfs://hadoop:8020/output/topK/");

        Configuration conf = new Configuration();
        // K is read from the second command-line argument.
        conf.setInt("topK", Integer.parseInt(args[1]));

        // Job.getInstance replaces the deprecated Job constructor.
        Job job = Job.getInstance(conf, "topK");
        job.setJarByClass(TopK.class);
        job.setMapperClass(topKMapper.class);
        job.setReducerClass(topKReducer.class);
        // A global top K relies on a single reducer seeing every term.
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
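
One limitation of keying the TreeMap by count is that two search terms with the same count collide, so one of them is silently lost. Below is a minimal sketch of one possible alternative, in plain Java with no Hadoop dependency: a bounded min-heap of (term, count) pairs. The class name `BoundedTopK` and its methods `offer` and `descending` are hypothetical and not part of the original source. In a reducer, `offer` would presumably be called from `reduce` and `descending` from `cleanup`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper (not from the original post): keeps the k entries with
// the largest counts and does not drop terms that share the same count.
class BoundedTopK {
    // Orders entries by count, ascending.
    private static final Comparator<Map.Entry<String, Integer>> BY_COUNT =
            new Comparator<Map.Entry<String, Integer>>() {
                @Override
                public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                    return Integer.compare(a.getValue(), b.getValue());
                }
            };

    private final int k;
    // Min-heap: the retained entry with the smallest count sits at the head.
    private final PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<Map.Entry<String, Integer>>(11, BY_COUNT);

    BoundedTopK(int k) {
        this.k = k;
    }

    // Call once per term with its total count.
    void offer(String word, int count) {
        heap.add(new SimpleEntry<String, Integer>(word, count));
        if (heap.size() > k) {
            heap.poll(); // evict the entry with the smallest count
        }
    }

    // Returns the retained entries, largest count first.
    List<Map.Entry<String, Integer>> descending() {
        List<Map.Entry<String, Integer>> out =
                new ArrayList<Map.Entry<String, Integer>>(heap);
        Collections.sort(out, Collections.reverseOrder(BY_COUNT));
        return out;
    }

    public static void main(String[] args) {
        BoundedTopK top = new BoundedTopK(2);
        top.offer("a", 5);
        top.offer("b", 5); // same count as "a": both are kept
        top.offer("c", 1);
        for (Map.Entry<String, Integer> e : top.descending()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The heap never holds more than k + 1 entries, so memory stays bounded in the same way as the TreeMap version. The order among terms with equal counts is not defined in this sketch.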

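Then upload the data. Note that the file must first be converted from GB2312 to UTF-8: Hadoop's Text type works with UTF-8, so without the conversion the Chinese text in the final result comes out garbled.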
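
The conversion can be done with any tool. As one possible approach, here is a minimal Java sketch that re-encodes a file from GB2312 to UTF-8 before it is uploaded. The class name `Gb2312ToUtf8` and the `convert` method are hypothetical, and the sketch assumes the JDK in use ships the GB2312 charset and that the input really is valid GB2312. If the log contains bytes outside GB2312, the strict decoder throws an exception, and a superset charset such as GBK or GB18030 may be needed instead.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not from the original post): re-encodes a text file
// from GB2312 to UTF-8.
class Gb2312ToUtf8 {
    static void convert(Path in, Path out) throws IOException {
        // Assumption: the running JDK provides the GB2312 charset.
        Charset source = Charset.forName("GB2312");
        try (BufferedReader reader = Files.newBufferedReader(in, source);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            char[] buf = new char[8192];
            int n;
            // Copy decoded characters; the writer encodes them as UTF-8.
            while ((n = reader.read(buf)) != -1) {
                writer.write(buf, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Gb2312ToUtf8 <input file> <output file>
        convert(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```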

Debug it locally, or upload the job to Hadoop and run it there.

Environment: CentOS 6.4, Hadoop 2.2.0, JDK 1.7.

[Figure: run result]


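Key takeaways:

  1. TreeMap: this is plain Java knowledge, but it was a useful refresher;

  2. cleanup: you need to know when this overridable API runs (once per task, after the last reduce call).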
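
To make both points concrete, here is a small plain-Java sketch with no Hadoop dependency. It mirrors how the reducer uses a TreeMap with `Collections.reverseOrder()`: the loop stands in for the per-key `reduce` calls, and the printing in `main` stands in for what `cleanup` does once at the end. The class name `TreeMapTopKDemo`, the method `topCounts`, and the sample data are made up for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo (not from the original post) of the TreeMap usage in the reducer.
class TreeMapTopKDemo {
    // Keeps the `max` largest counts, sorted from largest to smallest.
    static TreeMap<Integer, String> topCounts(Map<String, Integer> totals, int max) {
        TreeMap<Integer, String> tree =
                new TreeMap<Integer, String>(Collections.<Integer>reverseOrder());
        // Each iteration plays the role of one reduce() call.
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            tree.put(e.getValue(), e.getKey());
            if (tree.size() > max) {
                // With reverse ordering, lastKey() is the smallest count.
                tree.remove(tree.lastKey());
            }
        }
        return tree;
    }

    public static void main(String[] args) {
        Map<String, Integer> totals = new LinkedHashMap<String, Integer>();
        totals.put("x", 3);
        totals.put("y", 7);
        totals.put("z", 1);
        totals.put("w", 5);

        // This final pass is what cleanup() does after the last key.
        for (Map.Entry<Integer, String> e : topCounts(totals, 2).entrySet()) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
        // prints "y	7" then "w	5"
    }
}
```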


Source code: http://pan.baidu.com/s/1i3y0rwL

