Question

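I'm new to Python and BeautifulSoup, and I wrote this code (only the relevant part is attached). However, it is very slow: it takes about 8 seconds to execute, and I need to loop it a few thousand times.

Could you give me some pointers on how to make it faster? Every criticism is welcome.

PS. This may be relevant: there are 20 rows per page; columns 0..5 are short strings of at most 100 characters; column 6 is larger, a string of up to 2000 characters; requests.get(...) takes about 0.2 seconds.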

    ReqHTMLContent = bs4.BeautifulSoup(ReqResult.text)


    ###############################################
    #print('Adding report ...', flush=True) 

    for TableRow in ReqHTMLContent.select('table#msgTable tr'):  
        #print (TableRow)

        RpName = TableRow.find_all('td')[0].get_text(strip=True)            
        RpArray[row][0] = RpName
        #print(RpName)

        RpCategory = TableRow.find_all('td')[1].get_text(strip=True)
        RpArray[row][1] = RpCategory
        #print(RpCategory)

        RpType = TableRow.find_all('td')[2].get_text(strip=True)
        RpArray[row][2] = RpType
        #print(RpType)

        RpTime = TableRow.find_all('td')[3].get_text(strip=True)
        RpArray[row][3] = RpTime
        #print(RpTime)

        RpTitle = TableRow.find_all('td')[4].get_text(strip=True)
        RpArray[row][4] = RpTitle
        #print(RpTitle)

        #links and report content
        for link in TableRow.find_all("a", attrs={"class": "evLK"}):
            RpLink = domain_url + link.get('href')  
            RpArray[row][5] = RpLink
            #print(RpLink) 

            #report content
            RpHtml = requests.get(RpLink)   
            RpRaw = bs4.BeautifulSoup(RpHtml.text)          

            #<div id="ctl00_Body_msgDetails1_eventReport" class="ItemA">     
            RpTable = RpRaw.find("div", attrs={"id": "ctl00_Body_msgDetails1_eventReport", "class": "ItemA"})                    
            RpText = RpTable.get_text("|", strip=True)
            RpArray[row][6]=RpText            
            #print(RpText)


        row += 1               
    ### for TableRow in ReqHTMLContent.select('table#msgTable tr'):
    ###############################################

Answer 1

In addition to the other suggestions, use SoupStrainer to parse only part of the document.

Here is the modified code, with a few other minor fixes:

from bs4 import SoupStrainer, BeautifulSoup
import requests

# we'll use "div" strainer later
div = SoupStrainer("div", attrs={"id": "ctl00_Body_msgDetails1_eventReport", "class": "ItemA"})

rows = SoupStrainer("table", id="msgTable")
soup = BeautifulSoup(ReqResult.content, parse_only=rows)

results = []
for row in soup.select('table#msgTable tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]

    for link in row.select("a.evLK"):
        url = domain_url + link.get('href')
        cells.append(url)

        inner_soup = BeautifulSoup(requests.get(url).content, parse_only=div)

        table = inner_soup.find("div", attrs={"id": "ctl00_Body_msgDetails1_eventReport", "class": "ItemA"})
        cells.append(table.get_text("|", strip=True))

    results.append(cells)

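As a side note, as others have already mentioned, the key problem is following the links inside the loop. Since this is a synchronous operation, it blocks program execution and makes it slow: you cannot follow the next link until you are done with the previous one. Switching to an asynchronous approach can dramatically improve performance. Here are a few options:

Scrapy (a web-scraping framework based on Twisted)
grequests (requests + gevent)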
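If pulling in Scrapy or grequests is more than you want, a thread pool from the standard library gives a similar effect for I/O-bound work. This is a different technique from the two options above, shown only as a minimal sketch. The helper name `fetch_all`, the `fetch` parameter and the default of 8 workers are assumptions for illustration, not part of the original code.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Fetch every URL concurrently and return the results in input order.

    `fetch` is any callable that takes a URL and returns its content,
    for example `lambda u: requests.get(u).text` (hypothetical usage).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `urls`, even though
        # the individual requests may finish in a different order.
        return list(pool.map(fetch, urls))
```

In the scraper you would first collect all the report URLs for a page, call `fetch_all` once, and then parse the returned pages, instead of calling `requests.get` once per row inside the loop.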

Answer 2

Instead of calling BS's find_all function multiple times, try calling it once and for all:

RpList = TableRow.find_all('td')

RpName = RpList[0].get_text(strip=True)
RpCategory = RpList[1].get_text(strip=True)
RpType = RpList[2].get_text(strip=True)
RpTime = RpList[3].get_text(strip=True)
RpTitle = RpList[4].get_text(strip=True)

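This is not limited to that example. As suggested, you could use a list comprehension to reduce the number of lines. However, the cost of creating the RpName, RpType, ... variables is negligible compared to the cost of calling BS functions, so you can keep them if they help keep the code clear.

Basically, the idea is to use BS minimally and Python maximally.

Beyond that, I think the most costly part of your code is this line: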

RpHtml = requests.get(RpLink)

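which is inside your nested loop. But if you have to follow that many links because you need something you cannot find anywhere else, I can't see how you could cut that down.

Try to determine how many times this line is executed, since you said it takes about 0.2 seconds. If it is called, say, 40 times, then you have your answer.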
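The arithmetic can be checked with the figures from the question. This is only a rough sketch: it assumes one detail request per table row plus one request for the listing page, and the function name `estimated_seconds` is made up for illustration.

```python
def estimated_seconds(rows_per_page, seconds_per_get, listing_requests=1):
    """Rough time spent waiting on requests.get() for one listing page."""
    # One request per row for the report body, plus the request(s)
    # for the listing page itself (an assumption about the workflow).
    return (rows_per_page + listing_requests) * seconds_per_get


per_page = estimated_seconds(rows_per_page=20, seconds_per_get=0.2)
print(f"{per_page:.1f} s per page spent in requests.get()")
```

With 20 rows and 0.2 seconds per request, that is roughly 4 of the reported 8 seconds, so network waits would explain about half of the run time and parsing the rest. Forty calls would account for the whole 8 seconds.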

If you want to measure the net cost of the requests.get() calls, do this:

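from time import time
start = time()
calls = 0
for link in TableRow.find_all("a", attrs={"class": "evLK"}):
    RpLink = domain_url + link.get('href')
    RpArray[row][5] = RpLink
    RpHtml = requests.get(RpLink)  # the call being measured
    calls += 1
# Python 3 print; %.2f keeps fractions of a second instead of truncating them
print("get() was called %d times and took %.2f seconds" % (calls, time() - start))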
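To see whether fetching or parsing dominates, the same idea can be extended to accumulate time per stage. The sketch below uses only the standard library; `timed`, `totals`, `fake_fetch` and `fake_parse` are hypothetical names, and the two fake functions merely stand in for `requests.get(...)` and `bs4.BeautifulSoup(...)`.

```python
from collections import defaultdict
from contextlib import contextmanager
from time import perf_counter, sleep

# Accumulated wall-clock seconds per label, filled in by timed().
totals = defaultdict(float)


@contextmanager
def timed(label):
    """Add the time spent inside the `with` block to totals[label]."""
    start = perf_counter()
    try:
        yield
    finally:
        totals[label] += perf_counter() - start


# Stand-ins for the real network and parsing calls (assumptions).
def fake_fetch(url):
    sleep(0.01)  # simulate network latency
    return "<div>" + url + "</div>"


def fake_parse(html):
    return html.upper()


for url in ["report-1", "report-2", "report-3"]:
    with timed("fetch"):
        html = fake_fetch(url)
    with timed("parse"):
        fake_parse(html)

for label, seconds in totals.items():
    print(f"{label}: {seconds:.3f} s")
```

If "fetch" dominates, the links are the place to optimize. If "parse" is comparable, reducing the BeautifulSoup work matters as well.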

Answer 3

You can change this:

RpName = TableRow.find_all('td')[0].get_text(strip=True)            
RpArray[row][0] = RpName
#print(RpName)

RpCategory = TableRow.find_all('td')[1].get_text(strip=True)
RpArray[row][1] = RpCategory
#print(RpCategory)

RpType = TableRow.find_all('td')[2].get_text(strip=True)
RpArray[row][2] = RpType
#print(RpType)

RpTime = TableRow.find_all('td')[3].get_text(strip=True)
RpArray[row][3] = RpTime
#print(RpTime)

RpTitle = TableRow.find_all('td')[4].get_text(strip=True)
RpArray[row][4] = RpTitle
#print(RpTitle)

to this:

RpArray[row] = [td.get_text(strip=True) for td in TableRow.find_all('td')]

If you want to use one of the values, you can do this:

RpName = RpArray[row][0]

Answer 4

Combining the answers from Vincent Beltman and Jivan:

RpList = TableRow.find_all('td')
RpArray[row] = [td.get_text(strip=True) for td in RpList]

This looks up all the 'td' elements only once and loops over them in a single expression.

Speeding up code (BeautifulSoup, Python)
