Multiprocessing doesn't save data

Time: 2015-09-06 17:56:18

Tags: python python-3.x beautifulsoup multiprocessing

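I wrote a program that scrapes images from websites. I have 3 functions, and each one takes a parameter and scrapes images from a specific website. My program contains the following code.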

import requests
from bs4 import BeautifulSoup
from multiprocessing import Process

img1 = []
img2 = []
img3 = []

def my_func1(img_search):
    del img1[:]

    url1 = "http://www.somewebsite.com/" + str(img_search)
    r1 = requests.get(url1)
    soup1 = BeautifulSoup(r1.content)
    data1 = soup1.find_all("div",{"class":"img"})

    for item in data1:
        try:
            img1.append(item.contents[0].find('img')['src'])
        except:
            img1.append("img Unknown")
    return

def my_func2(img_search):
    del img2[:]

    url2 = "http://www.somewebsite2.com/" + str(img_search)
    r2 = requests.get(url2)
    soup2 = BeautifulSoup(r2.content)
    data2 = soup2.find_all("div",{"class":"img"})

    for item in data2:
        try:
            img2.append(item.contents[0].find('img')['src'])
        except:
            img2.append("img Unknown")
    return

def my_func3(img_search):
    del img3[:]

    url3 = "http://www.somewebsite3.com/" + str(img_search)
    r3 = requests.get(url3)
    soup3 = BeautifulSoup(r3.content)
    data3 = soup3.find_all("div",{"class":"img"})

    for item in data3:
        try:
            img3.append(item.contents[0].find('img')['src'])
        except:
            img3.append("img Unknown")
    return

my_func1("orange cat")
my_func2("blue cat")
my_func3("green cat")

print(*img1, sep='\n')
print(*img2, sep='\n')
print(*img3, sep='\n')

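The scraping works fine, but it was slow, so I decided to use multiprocessing to speed it up, and it did. I basically replaced the function calls with this:
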
p = Process(target=my_func1, args=("orange cat",))
p.start()
p2 = Process(target=my_func2, args=("blue cat",))
p2.start()
p3 = Process(target=my_func3, args=("green cat",))
p3.start()

p.join()
p2.join()
p3.join()

However, when I print the img1, img2 and img3 lists, they are empty. How can I fix this?

1 Answer:

Answer 0: (score: 1)

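When you use multiprocessing to distribute work across several processes, each process runs in its own namespace (a copy of the main process's namespace). Changes you make in a child process's namespace are not reflected in the parent's namespace, which is why img1, img2 and img3 are still empty in the parent after the workers finish. You need to use a multiprocessing.Queue or some other synchronization mechanism to pass data back from the worker processes.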
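As a rough sketch of the Queue approach, the example below has each worker put its list on a shared multiprocessing.Queue instead of appending to a global. To keep it runnable without network access, fake_scrape and the site names are hypothetical stand-ins for your real scraping functions, not part of your code.

```python
import multiprocessing


def fake_scrape(site, img_search, out_queue):
    """Hypothetical stand-in for a scraping function; sends its list to the parent."""
    # A real worker would fetch and parse a page here (assumption for illustration).
    images = ["{}/{}/{}.jpg".format(site, img_search.replace(" ", "_"), i)
              for i in range(2)]
    # Put the data on the queue instead of appending to a global list.
    out_queue.put((site, images))


if __name__ == "__main__":
    jobs = [("site1", "orange cat"), ("site2", "blue cat"), ("site3", "green cat")]
    out_queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=fake_scrape, args=(site, term, out_queue))
             for site, term in jobs]
    for p in procs:
        p.start()

    # Read one result per worker *before* joining, so that no child is left
    # blocked while flushing its data to the queue.
    results = dict(out_queue.get() for _ in procs)
    for p in procs:
        p.join()

    for site, _ in jobs:
        print(site, *results[site], sep="\n")
```

Each worker tags its result with the site name because results arrive on the queue in whatever order the workers finish, not in the order they were started.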

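In your example code, your three functions are almost identical; only the website's domain and the variable names differ. If that is what your real functions look like, I suggest using multiprocessing.Pool.map and passing the whole URL to a single function, rather than just the search term: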

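import multiprocessing

import requests
from bs4 import BeautifulSoup

def my_func(search_url):
    r = requests.get(search_url)
    soup = BeautifulSoup(r.content, "html.parser") # name the parser explicitly
    data = soup.find_all("div",{"class":"img"})
    images = []
    for item in data:
        try:
            images.append(item.contents[0].find('img')['src'])
        except Exception: # this div has no <img> or the <img> has no src
            images.append("img Unknown")
    return images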

if __name__ == "__main__":
    searches = ['http://www.somewebsite1.com/?orange+cat', # or whatever
                'http://www.somewebsite2.com/?blue+cat',
                'http://www.somewebsite3.com/?green+cat']
    pool = multiprocessing.Pool() # will create as many processes as you have CPU cores
    results = pool.map(my_func, searches)
    pool.close()
    # do something with results, which will be a list of the function return values
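Pool.map returns the results in the same order as the inputs, so you can pair each URL with its list of images. Here is a minimal sketch of consuming the results; fake_my_func is a hypothetical stand-in for my_func that does no network access, so the example can be run as is.

```python
import multiprocessing


def fake_my_func(search_url):
    """Hypothetical stand-in for my_func: derives a list from the URL, no network."""
    return [search_url + "#img" + str(i) for i in range(2)]


if __name__ == "__main__":
    searches = ['http://www.somewebsite1.com/?orange+cat',
                'http://www.somewebsite2.com/?blue+cat',
                'http://www.somewebsite3.com/?green+cat']
    # The with-block cleans up the pool once map() has returned.
    with multiprocessing.Pool() as pool:
        results = pool.map(fake_my_func, searches)

    # results[i] is the return value for searches[i].
    for url, images in zip(searches, results):
        print(url)
        print(*images, sep='\n')
```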