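First, the requirement: a fairly large nginx access log has to be split by access date, with the resulting files saved under /tmp.

The test machine is a Tencent Cloud instance with a single core and 1 GB of memory. The test log is 80 MB.

Version without multithreading: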

#!/usr/bin/env python
# coding=utf-8
import re
import datetime

if __name__ == '__main__':
    # Matches the timestamp prefix, e.g. "[27/Dec/2016:".
    date_pattern = re.compile(r'\[(\d+)\/(\w+)\/(\d+):')
    with open('./access_all.log-20161227') as f:
        for line in f:
            day, mon, year = re.search(date_pattern, line).groups()
            # Convert the month abbreviation ("Dec") to its number (12).
            mon = datetime.datetime.strptime(mon, '%b').month
            log_file = '/tmp/%s-%s-%s' % (year, mon, day)
            # Use a separate name so the input handle f is not shadowed.
            with open(log_file, 'a+') as out:
                out.write(line)

Elapsed time:

[root@VM_255_164_centos data_parse]# time python3 log_cut.py 
real 0m41.152s
user 0m32.578s
sys 0m6.046s
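
Part of the time in the script above probably goes to reopening the output file and calling strptime for every line. The following is a minimal sketch of one way to avoid that, assuming the same log format and English month abbreviations: it keeps one open handle per date and maps month names through a small table. The names split_by_date, MONTHS, and out_dir are illustrative and do not appear in the original script, and this variant was not timed on the test machine.

```python
import os
import re

DATE_PATTERN = re.compile(r'\[(\d+)/(\w+)/(\d+):')
# Hypothetical lookup table that replaces the per-line strptime call.
MONTHS = {name: i for i, name in enumerate(
    ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
     'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], start=1)}


def split_by_date(lines, out_dir):
    """Append each line to out_dir/<year>-<month>-<day>, opening each file once."""
    handles = {}
    try:
        for line in lines:
            match = DATE_PATTERN.search(line)
            if match is None:
                continue  # skip lines without a timestamp
            day, mon, year = match.groups()
            name = '%s-%s-%s' % (year, MONTHS[mon], day)
            if name not in handles:
                handles[name] = open(os.path.join(out_dir, name), 'a')
            handles[name].write(line)
    finally:
        # Close every output file, even if a line failed to parse.
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

It could be called as split_by_date(f, '/tmp') inside the same `with open('./access_all.log-20161227') as f:` block used above.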

Multithreaded version:

#!/usr/bin/env python
# coding=utf-8
import re
import datetime
import threading

date_pattern = re.compile(r'\[(\d+)\/(\w+)\/(\d+):')


def log_cut(line):
    # Append one log line to the file for its date.
    day, mon, year = re.search(date_pattern, line).groups()
    mon = datetime.datetime.strptime(mon, '%b').month
    log_file = '/tmp/%s-%s-%s' % (year, mon, day)
    with open(log_file, 'a+') as f:
        f.write(line)


if __name__ == '__main__':
    with open('./access_all.log-20161227') as f:
        for line in f:
            # One thread per line. Daemon threads that are still running
            # when the main thread exits are stopped abruptly.
            t = threading.Thread(target=log_cut, args=(line,))
            t.daemon = True
            t.start()

Elapsed time:

# time python3 log_cut.py 

real    1m35.905s
user 1m10.292s
sys 0m19.666s

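The multithreaded version turned out to be much slower than the version without threads. For a CPU-bound task on a single core, creating a thread for every line and switching between threads is clearly expensive.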
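
To look at that overhead in isolation, here is a rough sketch that runs the same tiny task once as plain function calls and once with one thread per item, and times both. The function work is a hypothetical stand-in for handling one log line; nothing here comes from the original scripts, and the measured times will vary by machine.

```python
import threading
import time


def work(n, out):
    """A tiny CPU-bound task standing in for parsing one log line."""
    out.append(n * n)


def run_direct(count):
    """Call work() count times in the current thread."""
    out = []
    for n in range(count):
        work(n, out)
    return out


def run_thread_per_item(count):
    """Start one thread per item, as the multithreaded script does per line."""
    out = []
    threads = []
    for n in range(count):
        t = threading.Thread(target=work, args=(n, out))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()  # wait so that every result is recorded
    return out


def timed(func, count):
    """Return (elapsed_seconds, result) for func(count)."""
    start = time.perf_counter()
    result = func(count)
    return time.perf_counter() - start, result
```

Comparing timed(run_direct, 10000)[0] with timed(run_thread_per_item, 10000)[0] gives a feel for how much of the cost is thread management rather than the work itself.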

Thread pool version:

Thread pool class:

#!/usr/bin/env python
# coding=utf-8
import queue
import threading
import contextlib

# Sentinel placed on the queue to tell a worker thread to stop.
StopEvent = object()


class ThreadPool(object):

    def __init__(self, max_num, max_task_num=None):
        if max_task_num:
            self.q = queue.Queue(max_task_num)
        else:
            self.q = queue.Queue()
        self.max_num = max_num      # maximum number of worker threads
        self.cancel = False
        self.terminal = False
        self.generate_list = []     # threads that have been created
        self.free_list = []         # threads currently waiting for a task

    def run(self, func, args, callback=None):
        # Queue a task; create a new worker only if none is idle
        # and the limit has not been reached.
        if self.cancel:
            return
        if len(self.free_list) == 0 and len(self.generate_list) < self.max_num:
            self.generate_thread()
        w = (func, args, callback,)
        self.q.put(w)

    def generate_thread(self):
        t = threading.Thread(target=self.call)
        t.start()

    def call(self):
        # Worker loop: take tasks from the queue until StopEvent arrives.
        current_thread = threading.current_thread()
        self.generate_list.append(current_thread)

        event = self.q.get()
        while event != StopEvent:
            func, arguments, callback = event
            try:
                result = func(*arguments)
                success = True
            except Exception as e:
                success = False
                result = None

            if callback is not None:
                try:
                    callback(success, result)
                except Exception as e:
                    pass

            # Mark this thread as idle while it waits for the next task.
            with self.worker_state(self.free_list, current_thread):
                if self.terminal:
                    event = StopEvent
                else:
                    event = self.q.get()
        else:
            self.generate_list.remove(current_thread)

    def close(self):
        # Accept no new tasks; stop each worker once the queue drains.
        self.cancel = True
        full_size = len(self.generate_list)
        while full_size:
            self.q.put(StopEvent)
            full_size -= 1

    def terminate(self):
        # Stop workers after their current task and discard queued tasks.
        self.terminal = True
        while self.generate_list:
            self.q.put(StopEvent)
        self.q.queue.clear()

    @contextlib.contextmanager
    def worker_state(self, state_list, worker_thread):
        state_list.append(worker_thread)
        try:
            yield
        finally:
            state_list.remove(worker_thread)

threadingPool.py

Code:

#!/usr/bin/env python
# coding=utf-8
import re
import datetime
from threadingPool import ThreadPool

date_pattern = re.compile(r'\[(\d+)\/(\w+)\/(\d+)\:')


def log_cut(line):
    day, mon, year = date_pattern.search(line).groups()
    mon = datetime.datetime.strptime(mon, '%b').month
    log_file = '/tmp/%s-%s-%s' % (year, mon, day)
    with open(log_file, 'a+') as f:
        f.write(line)


def callback(status, result):
    pass


pool = ThreadPool(1)

with open('./access_all.log-20161227') as f:
    for line in f:
        pool.run(log_cut, (line,), callback)

pool.close()

Elapsed time:

# time python3 log_cut2.py 

real    0m53.371s
user 0m44.761s
sys 0m5.600s

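The thread pool version is faster than the one-thread-per-line version (about 53 s versus 1 m 36 s), so the hand-written thread pool class is useful after all: it cuts down the time spent on context switching. It is still slower than the version without threads (about 41 s).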
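
The standard library also provides a pool built on the same idea of reusing a fixed set of worker threads: concurrent.futures.ThreadPoolExecutor. Below is a minimal sketch of it applied to the date-extraction step only, assuming the same timestamp format. The names date_key and keys_with_pool are illustrative; this variant was not part of the test above and its timing on this workload is not known.

```python
import re
from concurrent.futures import ThreadPoolExecutor

DATE_PATTERN = re.compile(r'\[(\d+)/(\w+)/(\d+):')


def date_key(line):
    """Return (year, month_abbreviation, day) for a log line, or None."""
    match = DATE_PATTERN.search(line)
    if match is None:
        return None
    day, mon, year = match.groups()
    return year, mon, day


def keys_with_pool(lines, workers=1):
    """Run date_key over lines on a fixed number of reusable threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the input order; leaving the with-block waits
        # for all submitted work to finish.
        return list(pool.map(date_key, lines))
```

Note that Executor.map submits every item up front, so for a very large log it may be worth feeding it in slices rather than passing the whole file at once.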

Process pool version:

#!/usr/bin/env python
# coding=utf-8
import re
import datetime
from multiprocessing import Pool

date_pattern = re.compile(r'\[(\d+)\/(\w+)\/(\d+):')


def log_cut(line):
    # Append one log line to the file for its date.
    day, mon, year = re.search(date_pattern, line).groups()
    mon = datetime.datetime.strptime(mon, '%b').month
    log_file = '/tmp/%s-%s-%s' % (year, mon, day)
    with open(log_file, 'a+') as f:
        f.write(line)


if __name__ == '__main__':
    pool = Pool(1)  # number of worker processes
    with open('./access_all.log-20161227') as f:
        for line in f:
            pool.apply_async(func=log_cut, args=(line,))
    pool.close()
    # Wait for the queued tasks; without join() the main process can
    # exit while tasks are still pending.
    pool.join()

Elapsed time with a single process:

# time python3 log_cut.py 

real    0m28.392s
user 0m23.451s
sys 0m1.888s

Elapsed time with 2 processes:

# time python3 log_cut.py 

real    0m40.920s
user 0m33.690s
sys 0m3.206s

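So with multiprocessing, a single-core CPU should run only one worker process: running several processes on one core is noticeably slower. On a multi-core CPU, using several processes should be faster.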
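
Following that observation, a script can choose the number of worker processes from the core count instead of hard-coding it. The sketch below does that for the date-extraction step and also sends lines to the workers in batches, which usually reduces per-task communication overhead. The names date_of, worker_count, and dates_in_parallel and the chunk size of 1000 are assumptions for illustration, not measured settings.

```python
import os
import re
from multiprocessing import Pool

DATE_PATTERN = re.compile(r'\[(\d+)/(\w+)/(\d+):')


def date_of(line):
    """Return 'day/Mon/year' for a log line, or None when no timestamp is found."""
    match = DATE_PATTERN.search(line)
    return '/'.join(match.groups()) if match else None


def worker_count():
    """One worker per CPU core; os.cpu_count() may return None, so fall back to 1."""
    return os.cpu_count() or 1


def dates_in_parallel(lines, chunksize=1000):
    """Extract dates in worker processes, sending lines in batches of chunksize."""
    with Pool(worker_count()) as pool:
        # map() blocks until every batch has been processed.
        return pool.map(date_of, lines, chunksize=chunksize)
```

As with the script above, dates_in_parallel should be called from under `if __name__ == '__main__':` so that it also works on platforms that start worker processes by re-importing the main module.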

Shell version:

#!/bin/bash

Usage(){
    echo "Usage: $0 Logfile"
}

if [ $# -eq 0 ]; then
    Usage
    exit
else
    Log=$1
fi

date_log=$(mktemp)

# Collect the distinct dates that appear in the log.
cat $Log | awk -F'[ :]' '{print $5}' | awk -F'[' '{print $2}' | uniq > $date_log

# Write the lines for each date to their own file.
for i in `cat $date_log`
do
    grep $i $Log > /tmp/log/${i::}-${i::}-${i::}.access
done
Elapsed time:

# time sh log_cut.sh access_all.log- 

real    0m2.435s
user 0m2.042s
sys 0m0.304s

The shell version performs remarkably well: it finished in just over 2 seconds.
