I came across a blog post -- by a fellow coder whose handle is either 好好玩 or 好玩玩, I couldn't tell -- with a fun MapReduce problem: http://www.cnblogs.com/songhaowan/p/7239578.html

[Figure: screenshot of the problem from the original post]

A big pile of Java code makes my head spin, so I worked through the problem in Python instead.


The problem: find common friends. For example, if C appears in A's friend list and C also appears in B's friend list, then C is a common friend of A and B. Each input line below reads person:friend,friend,...; note the relation is not necessarily symmetric (I lists A, but A does not list I).

A:B,C,D,F,E,O

B:A,C,E,K

C:F,A,D,I

D:A,E,F,L

E:B,C,D,M,L

F:A,B,C,D,E,O,M

G:A,C,D,E,F

H:A,C,D,E,O

I:A,O

J:B,O

K:A,C,D

L:D,E,F

M:E,F,G

O:A,H,I,J
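Before going anywhere near Hadoop, a brute-force pass over this dataset in plain Python gives a reference answer to check the job's output against (just a sketch; the dict below hard-codes the input lines above):

```python
from itertools import combinations

# Friend lists transcribed from the input above (single-letter names).
friends = {
    'A': set('BCDFEO'), 'B': set('ACEK'),  'C': set('FADI'),
    'D': set('AEFL'),   'E': set('BCDML'), 'F': set('ABCDEOM'),
    'G': set('ACDEF'),  'H': set('ACDEO'), 'I': set('AO'),
    'J': set('BO'),     'K': set('ACD'),   'L': set('DEF'),
    'M': set('EFG'),    'O': set('AHIJ'),
}

# Intersect the friend sets of every unordered pair of people.
common = {}
for x, y in combinations(sorted(friends), 2):
    shared = friends[x] & friends[y]
    if shared:
        common[x + y] = sorted(shared)

print(common['AB'])  # ['C', 'E']
print(common['AI'])  # ['O']
```

Any pair the MapReduce job emits should match an entry here.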

m.py (the mapper):

#!/home/hadoop/anaconda2/bin/python
# -*- encoding: utf-8 -*-
import sys

# Friend lists of every person seen so far, keyed by person.
result = {}
for line in sys.stdin:
    line = line.strip()
    if len(line) == 0:
        continue
    key, vals = line.split(':')
    result[key] = vals.split(',')
    if len(result) == 1:
        # First person in this split: nobody to pair with yet.
        continue
    # For each friend of the new person, every previously seen person
    # whose list also contains that friend shares him/her with key.
    for i in result[key]:
        for j in result:
            if i in result[j]:
                # Emit the pair name in sorted order; j == key is skipped.
                if j < key:
                    print j + key, i
                elif j > key:
                    print key + j, i
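To see what m.py actually emits, its loop body can be wrapped in a function and fed the first couple of input lines by hand (a local replay of the same logic, not the streaming job itself):

```python
def map_lines(lines):
    """Replay the m.py logic, collecting the lines it would print."""
    result, out = {}, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, vals = line.split(':')
        result[key] = vals.split(',')
        if len(result) == 1:
            continue
        for i in result[key]:
            for j in result:
                if i in result[j]:
                    if j < key:
                        out.append(j + key + ' ' + i)
                    elif j > key:
                        out.append(key + j + ' ' + i)
    return out

print(map_lines(['A:B,C,D,F,E,O', 'B:A,C,E,K']))
# -> ['AB C', 'AB E']: C and E sit in both A's and B's lists
```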

r.py (the reducer):

# -*- encoding: utf-8 -*-
import sys

# Collect the mapper's "pair friend" lines into one list per pair.
result = {}
for line in sys.stdin:
    line = line.strip()
    k, v = line.split(' ')
    if k in result:
        result[k].append(v)
    else:
        result[k] = [v]
# One line per pair: the pair name and its list of common friends.
for key, val in result.items():
    print key, val
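r.py can be checked the same way, by feeding it a few mapper-style "pair friend" lines (again just a local replay; the real reducer reads sys.stdin):

```python
def reduce_lines(lines):
    """Replay the r.py logic: group friend names under each pair key."""
    result = {}
    for line in lines:
        k, v = line.strip().split(' ')
        result.setdefault(k, []).append(v)
    return result

grouped = reduce_lines(['AB C', 'AB E', 'AC D'])
print(grouped['AB'])  # ['C', 'E']
print(grouped['AC'])  # ['D']
```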

The command used to run the job:

hadoop jar /home/hadoop/hadoop-2.7.2/hadoop-streaming-2.7.2.jar \
-files /home/hadoop/test/m.py,/home/hadoop/test/r.py \
-input GTHY -output GTHYout \
-mapper 'python m.py' -reducer 'python r.py'

The run log:

packageJobJar: [/tmp/hadoop-unjar2310332345933071298/] [] /tmp/streamjob8006362102585628853.jar tmpDir=null
17/08/31 14:47:59 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.228.200:18040
17/08/31 14:48:00 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.228.200:18040
17/08/31 14:48:00 INFO mapred.FileInputFormat: Total input paths to process : 1
17/08/31 14:48:00 INFO mapreduce.JobSubmitter: number of splits:2
17/08/31 14:48:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1504148710826_0003
17/08/31 14:48:01 INFO impl.YarnClientImpl: Submitted application application_1504148710826_0003
17/08/31 14:48:01 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1504148710826_0003/
17/08/31 14:48:01 INFO mapreduce.Job: Running job: job_1504148710826_0003
17/08/31 14:48:08 INFO mapreduce.Job: Job job_1504148710826_0003 running in uber mode : false
17/08/31 14:48:08 INFO mapreduce.Job: map 0% reduce 0%
17/08/31 14:48:16 INFO mapreduce.Job: map 100% reduce 0%
17/08/31 14:48:21 INFO mapreduce.Job: map 100% reduce 100%
17/08/31 14:48:21 INFO mapreduce.Job: Job job_1504148710826_0003 completed successfully
17/08/31 14:48:21 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=558
        FILE: Number of bytes written=362357
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=462
        HDFS: Number of bytes written=510
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=11376
        Total time spent by all reduces in occupied slots (ms)=2888
        Total time spent by all map tasks (ms)=11376
        Total time spent by all reduce tasks (ms)=2888
        Total vcore-milliseconds taken by all map tasks=11376
        Total vcore-milliseconds taken by all reduce tasks=2888
        Total megabyte-milliseconds taken by all map tasks=11649024
        Total megabyte-milliseconds taken by all reduce tasks=2957312
    Map-Reduce Framework
        Map input records=27
        Map output records=69
        Map output bytes=414
        Map output materialized bytes=564
        Input split bytes=192
        Combine input records=0
        Combine output records=0
        Reduce input groups=69
        Reduce shuffle bytes=564
        Reduce input records=69
        Reduce output records=33
        Spilled Records=138
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=421
        CPU time spent (ms)=2890
        Physical memory (bytes) snapshot=709611520
        Virtual memory (bytes) snapshot=5725220864
        Total committed heap usage (bytes)=487063552
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=270
    File Output Format Counters
        Bytes Written=510
17/08/31 14:48:21 INFO streaming.StreamJob: Output directory: GTHYout

The final result:

hadoop@master:~/test$ hadoop fs -text GTHYout/part-00000
BD ['A', 'E']
BE ['C']
BF ['A', 'C', 'E']
BG ['A', 'C', 'E']
BC ['A']
DF ['A', 'E']
DG ['A', 'E', 'F']
DE ['L']
HJ ['O']
HK ['A', 'C', 'D']
HI ['A', 'O']
HO ['A']
HL ['D', 'E']
FG ['A', 'C', 'D', 'E']
LM ['E', 'F']
KO ['A']
AC ['D', 'F']
AB ['C', 'E']
AE ['B', 'C', 'D']
AD ['E', 'F']
AG ['C', 'D', 'E', 'F']
AF ['B', 'C', 'D', 'E', 'O']
EG ['C', 'D']
EF ['B', 'C', 'D', 'M']
CG ['A', 'D', 'F']
CF ['A', 'D']
CE ['D']
CD ['A', 'F']
IK ['A']
IJ ['O']
IO ['A']
HM ['E']
KL ['D']

I suddenly noticed that I'd originally written this code without a single comment. Clearly I'm still a noob and haven't built good habits yet.

I'm new to big data and not familiar with Java, so I'm fumbling along slowly. Hopefully Python's lightness will help me explore more of the big-data world.

If anything here is wrong, corrections from the pros are very welcome~
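One bug worth flagging in my own code: m.py only pairs a person against people the same mapper has already read, so with number of splits:2 any pair whose two lines land in different splits is silently dropped -- for example, A and I share the friend O, yet AI never appears in the output above. The usual streaming fix is to invert the mapper (emit "friend person") and form the pairs in the reducer, since the shuffle then brings everyone who lists a given friend into one group regardless of splits. A local sketch of that variant (the function names here are my own, not from the original post):

```python
from itertools import combinations

def invert_map(lines):
    """Mapper: for each 'P:F1,F2,...' line, emit (friend, person)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        person, vals = line.split(':')
        for friend in vals.split(','):
            yield friend, person

def pair_reduce(pairs):
    """Reducer: everyone who lists the same friend forms pairs that
    share that friend -- independent of which split each line was in."""
    by_friend = {}
    for friend, person in pairs:
        by_friend.setdefault(friend, []).append(person)
    common = {}
    for friend, people in by_friend.items():
        for x, y in combinations(sorted(people), 2):
            common.setdefault(x + y, []).append(friend)
    return common

# A and I share friend O; the inverted scheme recovers the AI pair.
data = ['A:B,C,D,F,E,O', 'I:A,O', 'O:A,H,I,J']
print(pair_reduce(invert_map(data))['AI'])  # ['O']
```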
