BeautifulSoup takes forever; can this be done quicker?

Date: 2015-05-16 14:54:00

Tags: python beautifulsoup

I'm using a Raspberry Pi 1B+ running Debian Linux:

Linux rbian 3.18.0-trunk-rpi #1 PREEMPT Debian 3.18.5-1~exp1+rpi16 (2015-03-28) armv6l GNU/Linux

As part of a larger Python program, I'm running this code:

#!/usr/bin/env python

import time
from urllib2 import Request, urlopen
from bs4 import BeautifulSoup

_url="http://xml.buienradar.nl/"

s1  = time.time()
req = Request(_url)
print "Request         = {0}".format(time.time() - s1)
s2 = time.time()
response = urlopen(req)
print "URLopen         = {0}".format(time.time() - s2)
s3 = time.time()
output = response.read()
print "Read            = {0}".format(time.time() - s3)
s4 = time.time()
soup = BeautifulSoup(output)
print "Soup (1)        = {0}".format(time.time() - s4)

s5 = time.time()
MSwind = str(soup.buienradarnl.weergegevens.actueel_weer.weerstations.find(id=6350).windsnelheidms)
GRwind = str(soup.buienradarnl.weergegevens.actueel_weer.weerstations.find(id=6350).windrichtinggr)
ms = MSwind.replace("<"," ").replace(">"," ").split()[1]
gr = GRwind.replace("<"," ").replace(">"," ").split()[1]
print "Extracting info = {0}".format(time.time() - s5)

s6 = time.time()
soup = BeautifulSoup(urlopen(_url))
print "Soup (2)         = {0}".format(time.time() - s6)

s5 = time.time()
MSwind = str(soup.buienradarnl.weergegevens.actueel_weer.weerstations.find(id=6350).windsnelheidms)
GRwind = str(soup.buienradarnl.weergegevens.actueel_weer.weerstations.find(id=6350).windrichtinggr)
ms = MSwind.replace("<"," ").replace(">"," ").split()[1]
gr = GRwind.replace("<"," ").replace(">"," ").split()[1]
print "Extracting info = {0}".format(time.time() - s5)

When I run it, I get this output:

Request         = 0.00394511222839
URLopen         = 0.0579500198364
Read            = 0.0346400737762
Soup (1)        = 23.6777830124
Extracting info = 0.183892965317
Soup (2)         = 36.6107468605
Extracting info = 0.382317781448

So each BeautifulSoup call takes around half a minute to process _url. I'd really love it if that could be done within 10 seconds.

Any suggestions that would speed the code up significantly (a reduction of at least 60%) are most welcome.

2 answers:

Answer 0 (score: 4):

Install the lxml library; once it is installed, BeautifulSoup will use it as the default parser.

lxml parses the page with the libxml2 C library, which is much faster than the default html.parser backend implemented in pure Python.
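
For example (a minimal sketch; on a Debian-based Pi the packaged build of lxml may be easier to install than compiling it via pip):

# Install lxml first, e.g.:
#   pip install lxml
# or on Debian/Raspbian:
#   sudo apt-get install python-lxml
from urllib2 import urlopen
from bs4 import BeautifulSoup

output = urlopen('http://xml.buienradar.nl/').read()
soup = BeautifulSoup(output, 'lxml')  # explicitly request lxml's HTML parser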

You can then also parse the page as XML instead of as HTML:

soup = BeautifulSoup(output, 'xml')

Parsing your given page with lxml should be a lot faster; I can parse the page almost 50 times per second:

>>> timeit("BeautifulSoup(output, 'xml')", 'from __main__ import BeautifulSoup, output', number=50)
1.1700470447540283

That said, I wonder whether you're missing some other Python acceleration libraries, because I can't reproduce your results even with the built-in parser:

>>> timeit("BeautifulSoup(output, 'html.parser')", 'from __main__ import BeautifulSoup, output', number=50)
1.7218239307403564

Perhaps you're memory-constrained and the large document makes your OS swap a lot? Memory swapping (writing pages out to disk and loading other pages back in) can bring even the fastest program to a standstill.
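
A rough way to check (a sketch, assuming a Linux /proc filesystem as on Raspbian) is to print the free RAM and swap figures around the slow step:

from urllib2 import urlopen
from bs4 import BeautifulSoup

def print_meminfo(label):
    # Print free RAM and swap from /proc/meminfo (Linux-only).
    print(label)
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith(('MemFree:', 'SwapFree:')):
                print('  ' + line.strip())

output = urlopen('http://xml.buienradar.nl/').read()
print_meminfo('before parse')
soup = BeautifulSoup(output)       # the slow step under investigation
print_meminfo('after parse')

If SwapFree drops noticeably during the parse, the Pi is swapping, and no parser change will fully fix that.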

Note that instead of calling str() on the tag elements and splitting the result, you can simply use the .string attribute to get a tag's value:

station_6350 = soup.buienradarnl.weergegevens.actueel_weer.weerstations.find(id=6350)
ml = station_6350.windsnelheidMS.string
gr = station_6350.windrichtingGR.string

If you use the XML parser, keep in mind that tag names must match case exactly (HTML is a case-insensitive markup language).
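
For example (a sketch, using the mixed-case tag names from the feed above):

from urllib2 import urlopen
from bs4 import BeautifulSoup

output = urlopen('http://xml.buienradar.nl/').read()
soup = BeautifulSoup(output, 'xml')            # the XML parser preserves case
station = soup.find('weerstation', id='6350')  # attribute values are strings
ms = station.windsnelheidMS.string             # exact case required here
gr = station.windrichtingGR.string

With the default HTML parser the same tags would have to be spelled windsnelheidms and windrichtinggr, as in the question's code.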

Since this is an XML document, another option is to use the lxml ElementTree model; you can use XPath expressions to extract the data:

from lxml import etree 

response = urlopen(_url)
for event, elem in etree.iterparse(response, tag='weerstation'):
    if elem.get('id') == '6350':
        ml = elem.find('windsnelheidMS').text
        gr = elem.find('windrichtingGR').text
        break
    # clear elements we are not interested in, adapted from
    # http://stackoverflow.com/questions/12160418/why-is-lxml-etree-iterparse-eating-up-all-my-memory
    elem.clear()
    for ancestor in elem.xpath('ancestor-or-self::*'):
        while ancestor.getprevious() is not None:
            del ancestor.getparent()[0]

This should build only the minimal object tree needed, clearing out the weather stations you're not interested in as the document is parsed.

Demo:

>>> from lxml import etree
>>> from urllib2 import urlopen
>>> _url = "http://xml.buienradar.nl/"
>>> response = urlopen(_url)
>>> for event, elem in etree.iterparse(response, tag='weerstation'):
...     if elem.get('id') == '6350':
...         ml = elem.find('windsnelheidMS').text
...         gr = elem.find('windrichtingGR').text
...         break
...     # clear elements we are not interested in
...     elem.clear()
...     for ancestor in elem.xpath('ancestor-or-self::*'):
...         while ancestor.getprevious() is not None:
...             del ancestor.getparent()[0]
... 
>>> ml
'4.64'
>>> gr
'337.8'

Answer 1 (score: 0):

Using requests and a regular expression can be both shorter and faster. For relatively simple data gathering like this, a regex works fine.

#!/usr/bin/env python

from __future__ import print_function
import re
import requests
import time

_url = "http://xml.buienradar.nl/"
_regex = '<weerstation id="6391">.*?'\
         '<windsnelheidMS>(.*?)</windsnelheidMS>.*?'\
         '<windrichtingGR>(.*?)</windrichtingGR>'

s1 = time.time()
br = requests.get(_url)
print("Request         = {0}".format(time.time() - s1))
s5 = time.time()
MSwind, GRwind = re.findall(_regex, br.text)[0]
print("Extracting info = {0}".format(time.time() - s5))
print('wind speed', MSwind, 'm/s')
print('wind direction', GRwind, 'degrees')

On my desktop (admittedly not a Raspberry Pi :-)) this runs very fast:

Request         = 0.0723416805267334
Extracting info = 0.0009412765502929688
wind speed 2.35 m/s
wind direction 232.6 degrees

Of course, this particular regex will fail if the windsnelheidMS and windrichtingGR tags are ever reversed. But given that the XML is most likely machine-generated, that seems unlikely. And there is a solution for it: first use one regex to capture the text between <weerstation id="6391"> and </weerstation>, and then use two more regexes to find the wind speed and direction.
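
A sketch of that two-step approach (re.S is added on the assumption that the station's child tags may span multiple lines):

#!/usr/bin/env python

from __future__ import print_function
import re
import requests

_url = "http://xml.buienradar.nl/"

# Step 1: capture everything between the opening and closing weerstation tags.
_station = re.compile(r'<weerstation id="6391">(.*?)</weerstation>', re.S)
# Step 2: pull the two values out of that block, in either order.
_ms = re.compile(r'<windsnelheidMS>(.*?)</windsnelheidMS>')
_gr = re.compile(r'<windrichtingGR>(.*?)</windrichtingGR>')

block = _station.search(requests.get(_url).text).group(1)
MSwind = _ms.search(block).group(1)
GRwind = _gr.search(block).group(1)
print('wind speed', MSwind, 'm/s')
print('wind direction', GRwind, 'degrees')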