Slow HTML parser. How to increase the speed?

Time: 2014-04-01 14:52:18

Tags: python html performance parsing html-parsing

I want to estimate the impact of news on Dow Jones quotes. For that I wrote a Python HTML parser using the BeautifulSoup library. I extract each article and store it in an XML file for further analysis with the NLTK library. How can I increase the parsing speed? The code below does the required task, but slowly.

Here is the code of the HTML parser:

import urllib2
import re
import xml.etree.cElementTree as ET
import nltk
from bs4 import BeautifulSoup
from datetime import date
from dateutil.rrule import rrule, DAILY
from nltk.corpus import stopwords
from collections import defaultdict

def main_parser():
    #starting date
    a = date(2014, 3, 27)
    #ending date
    b = date(2014, 3, 27)
    articles = ET.Element("articles")
    f = open('~/Documents/test.xml', 'w')
    #loop through the links and per each link extract the text of the article, store the latter at xml file
    for dt in rrule(DAILY, dtstart=a, until=b):
        url = "http://www.reuters.com/resources/archive/us/" + dt.strftime("%Y") + dt.strftime("%m") + dt.strftime("%d") + ".html"
        page = urllib2.urlopen(url)
        #use html5lib ??? possibility to use another parser
        soup = BeautifulSoup(page.read(), "html5lib")
        article_date = ET.SubElement(articles, "article_date")
        article_date.text = str(dt)
        for links in soup.find_all("div", "headlineMed"):
            anchor_tag = links.a
            if not 'video' in anchor_tag['href']:
                try:
                    article_time = ET.SubElement(article_date, "article_time")
                    article_time.text = str(links.text[-11:])

                    article_header = ET.SubElement(article_time, "article_name")
                    article_header.text = str(anchor_tag.text)

                    article_link = ET.SubElement(article_time, "article_link")
                    article_link.text = str(anchor_tag['href']).encode('utf-8')

                    try:
                        article_text = ET.SubElement(article_time, "article_text")
                        #get text and remove all stop words
                        article_text.text = str(remove_stop_words(extract_article(anchor_tag['href']))).encode('ascii','ignore')
                    except Exception:
                        pass
                except Exception:
                    pass

    tree = ET.ElementTree(articles)
    tree.write("~/Documents/test.xml","utf-8")

#getting the article text from the specific url
def extract_article(url):
    plain_text = ""
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, "html5lib")
    tag = soup.find_all("p")
    #replace all html tags
    plain_text = re.sub(r'<p>|</p>|[|]|<span class=.*</span>|<a href=.*</a>', "", str(tag))
    plain_text = plain_text.replace(", ,", "")
    return str(plain_text)

def remove_stop_words(text):
    text=nltk.word_tokenize(text)
    filtered_words = [w for w in text if not w in stopwords.words('english')]
    return ' '.join(filtered_words)

2 Answers:

Answer 0 (score: 1)

Several fixes can be applied here (without changing the modules you currently use):

  • Use the lxml parser instead of html5lib - it is much, much faster.
  • Parse only the relevant part of the document with SoupStrainer (note that html5lib does not support SoupStrainer - it will always parse the whole document, slowly).

Here is how the code looks after the changes. A quick performance test shows at least a 3x improvement:

import urllib2
import xml.etree.cElementTree as ET
from datetime import date

from bs4 import SoupStrainer, BeautifulSoup
import nltk
from dateutil.rrule import rrule, DAILY
from nltk.corpus import stopwords


def main_parser():
    a = b = date(2014, 3, 27)
    articles = ET.Element("articles")
    for dt in rrule(DAILY, dtstart=a, until=b):
        url = "http://www.reuters.com/resources/archive/us/" + dt.strftime("%Y") + dt.strftime("%m") + dt.strftime(
            "%d") + ".html"

        # parse only the headline <div> blocks, using the faster lxml backend
        links = SoupStrainer("div", "headlineMed")
        soup = BeautifulSoup(urllib2.urlopen(url), "lxml", parse_only=links)

        article_date = ET.SubElement(articles, "article_date")
        article_date.text = str(dt)
        for link in soup.find_all('a'):
            if not 'video' in link['href']:
                try:
                    article_time = ET.SubElement(article_date, "article_time")
                    article_time.text = str(link.text[-11:])

                    article_header = ET.SubElement(article_time, "article_name")
                    article_header.text = str(link.text)

                    article_link = ET.SubElement(article_time, "article_link")
                    article_link.text = str(link['href']).encode('utf-8')

                    try:
                        article_text = ET.SubElement(article_time, "article_text")
                        article_text.text = str(remove_stop_words(extract_article(link['href']))).encode('ascii', 'ignore')
                    except Exception:
                        pass
                except Exception:
                    pass

    tree = ET.ElementTree(articles)
    tree.write("~/Documents/test.xml", "utf-8")


def extract_article(url):
    # parse only the <p> tags and return their combined text (no regex needed)
    paragraphs = SoupStrainer('p')
    soup = BeautifulSoup(urllib2.urlopen(url), "lxml", parse_only=paragraphs)
    return soup.text


def remove_stop_words(text):
    text = nltk.word_tokenize(text)
    filtered_words = [w for w in text if not w in stopwords.words('english')]
    return ' '.join(filtered_words)

Note that I removed the regex processing from extract_article() - it looks like you can get the complete text directly from the p tags.

I may have introduced some issues - please check that everything still works correctly.


Another solution would be to use lxml for everything, from parsing the HTML (replacing BeautifulSoup) to creating the XML (replacing xml.etree.ElementTree).
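A rough sketch of that direction (not the original poster's code; it assumes the same Reuters archive URL and the element names used above):

import urllib2
from lxml import etree, html

def parse_archive(url):
    # lxml parses the HTML straight from the open connection
    page = html.parse(urllib2.urlopen(url))
    articles = etree.Element("articles")
    for link in page.xpath('//div[@class="headlineMed"]/a'):
        href = link.get("href", "")
        if "video" not in href:
            item = etree.SubElement(articles, "article")
            etree.SubElement(item, "article_name").text = link.text_content()
            etree.SubElement(item, "article_link").text = href
    # serialize the whole tree in one call
    return etree.tostring(articles, encoding="utf-8", xml_declaration=True)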


Another solution (definitely the fastest) would be to switch to the Scrapy web-scraping framework. It is simple and very fast, and all sorts of batteries are included: link extractors, XML exporters, database pipelines and so on. It is worth looking into.
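To give a feel for it, here is a minimal spider sketch (the selectors and field names are assumptions based on the code above, and it targets a reasonably recent Scrapy version):

import scrapy

class ReutersArchiveSpider(scrapy.Spider):
    name = "reuters_archive"
    # a single archive day, matching the date used in the question
    start_urls = ["http://www.reuters.com/resources/archive/us/20140327.html"]

    def parse(self, response):
        # yield one item per non-video headline link on the archive page
        for link in response.xpath('//div[@class="headlineMed"]/a'):
            href = link.xpath('@href').extract_first()
            if href and 'video' not in href:
                yield {
                    'headline': link.xpath('text()').extract_first(),
                    'link': href,
                }

Running it with something like "scrapy runspider reuters_spider.py -o headlines.xml" lets Scrapy's built-in feed exporter write the items out as XML for you.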

Hope that helps.

Answer 1 (score: 0)

You want to pick the best parser available: python parser benchmark result

We benchmarked most of the parsers/platforms while building http://serpapi.com

Here is the full article on Medium: https://medium.com/@vikoky/fastest-html-parser-available-now-f677a68b81dd
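If you would rather measure this on your own pages than rely on any single benchmark, a quick timing sketch along these lines works (it reuses the archive URL from the question; the lxml and html5lib backends must be installed):

import timeit
import urllib2
from bs4 import BeautifulSoup

# fetch one archive page once, then time each parser backend on the same bytes
html_doc = urllib2.urlopen(
    "http://www.reuters.com/resources/archive/us/20140327.html").read()

for backend in ("html.parser", "lxml", "html5lib"):
    seconds = timeit.timeit(lambda: BeautifulSoup(html_doc, backend), number=10)
    print("%-12s %.2f s for 10 parses" % (backend, seconds))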