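The Producer-Consumer Pattern

Understanding the producer-consumer pattern

The producer-consumer pattern comes up often in asynchronous crawlers. The module that generates data is called the producer, and the module that processes that data is called the consumer.

For example:

When crawling image data, parsing the image links out of a page is producing data.

Sending a request to each image link and downloading the image is consuming data.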

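Why use the producer-consumer pattern

In a concurrent program, the producer is the thread that generates data and the consumer is the thread that processes it. In multithreaded development, if the two are coupled directly and the producer is fast while the consumer is slow, the producer has to wait for the consumer to finish before it can produce more data. Likewise, if the consumer is faster than the producer, the consumer has to wait for the producer. The producer-consumer pattern addresses this by placing a buffer, typically a queue, between them: producers put data into the queue and consumers take data out of it, so neither side calls the other directly and the two can run at different speeds.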
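The buffering idea can be tried in isolation before any crawling is involved. The following is a minimal sketch, not part of the crawler itself: the function name demo_bounded_queue, the item counts, and the 0.01 second delay are all assumptions chosen for illustration. It runs one fast producer and one slow consumer over a bounded queue.Queue, where put() blocks while the queue is full and get() blocks while it is empty.

```python
import threading
import time
from queue import Queue


def demo_bounded_queue(n_items=6, capacity=2):
    """Run one fast producer and one slow consumer over a bounded queue."""
    buffer = Queue(maxsize=capacity)  # shared buffer between the two threads
    consumed = []

    def producer():
        for i in range(n_items):
            buffer.put(i)      # blocks while the buffer is full
        buffer.put(None)       # None marks the end of the data (illustrative choice)

    def consumer():
        while True:
            item = buffer.get()  # blocks while the buffer is empty
            if item is None:
                break
            time.sleep(0.01)     # simulate slow processing
            consumed.append(item)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed


if __name__ == '__main__':
    print(demo_bounded_queue())
```

With a single producer and a single consumer the items come out in the order they went in, because Queue is first-in, first-out. The small capacity only limits how far the producer can run ahead.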

import os
import threading
from queue import Empty, Queue
from urllib.request import urlretrieve

import requests
from lxml import etree

# Directory that downloaded images are saved to (created in main() if missing)
IMG_DIR = 'imgs'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
}

# Target site: https://pic.netbian.com/4kmeinv/


# 1. Create the two thread classes
# 1.1 Producing data: parse each list page and extract the image URLs
class Producer(threading.Thread):  # producer thread
    # 6. Constructor of the producer
    def __init__(self, page_queue, img_queue):
        # 6.1 Call the parent class constructor
        super().__init__()
        self.page_queue = page_queue
        self.img_queue = img_queue

    # 7. The producer's task: keep producing data
    def run(self):
        while True:
            # 8-9. Take one page URL from page_queue. get_nowait() raises
            # Empty instead of blocking, so a thread cannot hang when another
            # producer takes the last URL between an empty() check and get().
            try:
                url = self.page_queue.get_nowait()
            except Empty:
                # Every page URL has been handed out, so stop producing
                break
            # Parse the image URLs out of the page behind this URL
            self.parse_detail(url)

    # 10. Parse one list page
    def parse_detail(self, url):
        response = requests.get(url=url, headers=headers, timeout=10)
        response.encoding = 'gbk'
        page_text = response.text
        tree = etree.HTML(page_text)
        li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
        for li in li_list:
            img_src = 'https://pic.netbian.com' + li.xpath('./a/img/@src')[0]
            img_title = li.xpath('./a/b/text()')[0] + '.jpg'
            # 11. Wrap src and title in a dict
            dic = {
                'src': img_src,
                'title': img_title
            }
            # 12. Pass the dict to the consumers through img_queue
            self.img_queue.put(dic)


# 1.2 Consuming data: request each image URL and save the image
class Consumer(threading.Thread):  # consumer thread
    # 13. The consumer requests and stores every image it receives
    def __init__(self, img_queue):
        super().__init__()
        self.img_queue = img_queue

    # 14. The consumer's task: keep consuming data
    def run(self):
        while True:
            # 15. Block until an item arrives. Checking empty() on both queues
            # is not a reliable stop condition, because both can be empty
            # while a producer is still parsing a page.
            dic = self.img_queue.get()
            # 16. None is the stop signal that main() sends once every
            # producer has finished
            if dic is None:
                break
            # 17. Otherwise the item is a dict; read the file name and URL
            title = dic['title']
            src = dic['src']
            # 18. urlretrieve requests the image URL and writes it to disk
            try:
                urlretrieve(src, os.path.join(IMG_DIR, title))
            except OSError as exc:
                # Network or file error: report it and move on to the next image
                print(title, 'download failed:', exc)
                continue
            print(title, 'downloaded')


def main():
    os.makedirs(IMG_DIR, exist_ok=True)
    # 2. Create the queues
    # 2.1 This queue holds the URLs of the list pages to crawl
    page_queue = Queue(30)  # holds at most 30 page URLs
    # 2.2 This queue holds the image records produced by the producers
    img_queue = Queue(80)  # holds at most 80 image records
    # 3. Build the page URLs
    # This loop puts the URLs of pages 2 through 14 into page_queue
    for x in range(2, 15):
        url = 'https://pic.netbian.com/4kmeinv/index_%d.html' % x
        page_queue.put(url)
    # 4. Producer threads: create three and start them
    producers = [Producer(page_queue, img_queue) for _ in range(3)]
    for t in producers:
        t.start()
    # 5. Consumer threads: create three and start them
    consumers = [Consumer(img_queue) for _ in range(3)]
    for t in consumers:
        t.start()
    # Wait for the producers, then send one stop signal per consumer
    for t in producers:
        t.join()
    for _ in consumers:
        img_queue.put(None)
    for t in consumers:
        t.join()


if __name__ == '__main__':
    main()
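An empty queue alone cannot tell a consumer whether the producers are finished or merely busy, so a consumer that stops on empty() may quit early, and one that calls a blocking get() on a drained queue may wait forever. One common remedy is a sentinel value. The sketch below is a network-free illustration under stated assumptions: fake_parse is a hypothetical stand-in for page parsing, run_pipeline and STOP are names invented here, and the results are collected in a list instead of being downloaded.

```python
import threading
from queue import Empty, Queue

STOP = None  # sentinel (assumption: real items are never None)


def fake_parse(page):
    """Hypothetical stand-in for parsing a page: returns two image names."""
    return [f"page{page}-img{i}" for i in range(2)]


def producer(page_queue, img_queue):
    while True:
        try:
            page = page_queue.get_nowait()  # raises Empty once all pages are taken
        except Empty:
            return
        for img in fake_parse(page):
            img_queue.put(img)


def consumer(img_queue, results, lock):
    while True:
        img = img_queue.get()  # blocks until an item or a sentinel arrives
        if img is STOP:
            return
        with lock:
            results.append(img)


def run_pipeline(pages, n_producers=3, n_consumers=3):
    page_queue, img_queue = Queue(), Queue(10)
    for p in pages:
        page_queue.put(p)
    results, lock = [], threading.Lock()
    producers = [threading.Thread(target=producer, args=(page_queue, img_queue))
                 for _ in range(n_producers)]
    consumers = [threading.Thread(target=consumer, args=(img_queue, results, lock))
                 for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()              # all data has been produced after this loop
    for _ in consumers:
        img_queue.put(STOP)   # one sentinel per consumer
    for t in consumers:
        t.join()
    return sorted(results)


if __name__ == '__main__':
    print(run_pipeline([2, 3, 4]))
```

Because the sentinels are queued only after every producer has been joined, they sit behind all real items, so each consumer finishes the remaining work before it stops.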
