Question

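I am new to Python programming.

I am trying to parse the pages returned by HTTP requests to Instagram, using regular expressions to find a specific word.

I have used multiprocessing, but it is still slow. I know my code may look silly, but it is the best I could do.

What am I doing wrong that makes it slow? I need it to send multiple HTTP requests faster.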

import requests
import re 
import time
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  
from multiprocessing import cpu_count


Nthreads = cpu_count()*2
pool = Pool(Nthreads)


f = open('full.txt','r')
fw = open('out.txt', 'w')


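def findSnap(bio):
    # Raw strings keep '\s' and '\w' as regex escapes rather than string escapes.
    # Look for a content="..." attribute that mentions "snap", "snapchat", etc.
    regex = r'content=".*sn[a]*p[a-z]*\s*[^a-z0-9].*'
    snap = re.findall(regex, bio)
    if not snap:
        return None
    # Remove everything up to and including the keyword and its separator.
    afterSnap = re.sub(r'content=".*sn[a]*p[a-z]*\s*[^a-z0-9]*\s*', '', snap[0])
    if afterSnap:
        # Keep the leading run of username characters (possibly empty).
        afterSnap = re.findall(r'[\w_\.-]*', afterSnap)[0]
        sftS = afterSnap.split()
        if sftS:
            return sftS[0]
    return None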

def loadInfo(url):
    # print('Loading data..')
    try:
        page = requests.get(url).text.lower()
    except Exception:
        print('Something is wrong!')
        return None


    snap = findSnap(page)
    if snap:
        fw.write(snap + '\n')
        fw.flush()
        print(snap)
    else:
        return None
    return snap

start = time.time()
names = f.read().splitlines()
baseUrl = 'https://instagram.com/'
urls = map(lambda x: baseUrl + x, names)

pool.map(loadInfo, urls)
finish = time.time()

print((finish- start)/60)
fw.close()

Answer 1

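As some people have already said, we would need more detail about what timings you are getting, what you expect, and why you expect it. Because your application depends on a third-party resource, its running time can be affected by many factors, not only by your code.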

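That said, I see that you are using multiprocessing.dummy, which is just a wrapper around the threading module [1]. According to its documentation, threading does not seem to be the best module for running regular Python code in parallel [2]: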

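CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.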
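The claim that multiprocessing.dummy is backed by threads can be checked directly. The following is a minimal sketch; the helper name where_am_i and the pool size of 4 are chosen here only for illustration. It shows that the pool object is a ThreadPool and that every worker reports the process ID of the main process.

```python
import os
import threading
from multiprocessing.dummy import Pool        # same import as in the question
from multiprocessing.pool import ThreadPool


def where_am_i(_):
    """Return the process ID and thread name of the worker handling this call."""
    return os.getpid(), threading.current_thread().name


pool = Pool(4)  # illustrative size, not a recommendation
try:
    results = pool.map(where_am_i, range(8))
finally:
    pool.close()
    pool.join()

# multiprocessing.dummy.Pool builds a thread pool, not a process pool.
is_thread_pool = isinstance(pool, ThreadPool)

# Every task ran inside the main process, only on different threads.
worker_pids = {pid for pid, _ in results}
same_process = worker_pids == {os.getpid()}

print("thread-backed pool:", is_thread_pool)
print("all work done in the main process:", same_process)
```

Because all workers share one interpreter, any pure-Python work they do (such as regular-expression post-processing) is subject to the GIL described in the quote.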

It is true that you are doing I/O, but processing the regular expressions is also a heavy task.
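To illustrate the I/O-bound half of that distinction, here is a small sketch in which time.sleep stands in for waiting on a network response (an assumption made so the example runs without network access; fake_request and timed are hypothetical helpers). While a thread is sleeping it releases the GIL, so the waits overlap and the threaded run finishes in roughly the time of one request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(delay):
    """Stand-in for an HTTP request: sleeping releases the GIL, like waiting on a socket."""
    time.sleep(delay)
    return delay


def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


delays = [0.1] * 8  # eight simulated requests of 0.1 s each


def run_sequential():
    return [fake_request(d) for d in delays]


def run_threaded():
    with ThreadPoolExecutor(max_workers=len(delays)) as executor:
        return list(executor.map(fake_request, delays))


seq_result, sequential = timed(run_sequential)
thr_result, threaded = timed(run_threaded)

print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

The same overlap does not happen for CPU-heavy Python code such as the regex matching, which is why the processing step may still be worth moving to processes.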

As the quoted text says, you can try a different pool implementation: the Pool from multiprocessing itself rather than from its dummy submodule, or concurrent.futures.ProcessPoolExecutor.
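As a rough sketch of that suggestion, the regex step could be handed to a process pool once the pages have been downloaded. Everything named below is an assumption made for illustration: HANDLE_RE is a simplified pattern, not the regex from the question, and extract_handle and extract_all are hypothetical helpers. The executor class is a parameter so the same code can be tried with threads or processes.

```python
import re
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Hypothetical, simplified pattern (NOT the question's regex):
# "snap" or "snapchat", a ':' or '-' separator, then the handle.
HANDLE_RE = re.compile(r"snap(?:chat)?\s*[:\-]\s*([\w.-]+)")


def extract_handle(page):
    """CPU-bound step: return the first handle found in the page text, or None."""
    match = HANDLE_RE.search(page.lower())
    return match.group(1) if match else None


def extract_all(pages, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Apply extract_handle to every page using the given executor class.

    Pass executor_cls=ThreadPoolExecutor to compare against a thread pool;
    both classes share the same Executor interface.
    """
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(extract_handle, pages))
```

When using the default ProcessPoolExecutor, call extract_all from under an `if __name__ == "__main__":` guard and keep extract_handle at module level, because worker processes need to import the function. Whether this is actually faster depends on how much time the regex work takes compared with the cost of sending page text between processes, so it is worth measuring both variants.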

Python - Multiple HTTP requests are too slow
