Scraping all headings and plain text from a Wikipedia article

Time: 2016-11-02 16:48:21

Tags: python python-2.7 beautifulsoup

In Python, how would I scrape all the headings and the plain text from a Wikipedia article, for example https://en.wikipedia.org/wiki/Amadeus_(film)? My current code is:

    from bs4 import BeautifulSoup

    # ---- Definitions ---- #
    # Number of documents
    amount_of_documents = 1

    # Directory of raw HTML documents
    directory_of_raw_documents = "raw_documents/"

    # Directory of parsed documents
    directory_of_parsed_documents = "parsed_documents/"

    # ---- Code ---- #


    def open_document():
        # Raw documents are named 1, 2, ..., amount_of_documents
        for i in range(1, amount_of_documents + 1):
            with open(directory_of_raw_documents + str(i), "r") as document:
                html = document.read()
            soup = BeautifulSoup(html, "html.parser")
            body = soup.find('div', id='bodyContent')
            # Print the text of every paragraph in the article body
            for element in body.find_all('p'):
                print(element.text)

    open_document()

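I load the downloaded HTML file and then use BeautifulSoup to get everything between the <p> tags. My goal is to get all the headings and the plain-text content of the article. How can I do that?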

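For the example posted above, the output I want would contain:

  1. All headings (Amadeus (film), Plot, Cast, Reception, etc.)
  2. All the text on the page (between the <p> tags)
  3. Nothing from the references (they should be ignored)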

1 Answer:

Answer 0: (score: 1)

You may be interested in using a dedicated Wikipedia page parser, such as the wikipedia package. With it you can get the content easily:

In [1]: import wikipedia

In [2]: page = wikipedia.page("Amadeus (film)")

In [3]: page.summary
Out[3]: u"Amadeus is a 1984 American period drama film directed by Milo\u0161 Forman, written by Peter Shaffer, and adapted from Shaffer's stage play Amadeus (1979). The story, set in Vienna, Austria, during the latter half of the 18th century, is a fictionalized biography of Wolfgang Amadeus Mozart. Mozart's music is heard extensively in the soundtrack of the movie. Its central thesis is that Antonio Salieri, an Italian contemporary of Mozart is so driven by jealousy of the latter and his success as a composer that he plans to kill him and to pass off a Requiem, which he secretly commissioned from Mozart as his own, to be premiered at Mozart's funeral. Historically, the Requiem which was never finished was commissioned by Count von Walsegg and Salieri, far from being jealous of Mozart, was on good terms with him and even tutored his son after Mozart's death.\nThe film was nominated for 53 awards and received 40, which included eight Academy Awards (including Best Picture), four BAFTA Awards, four Golden Globes, and a Directors Guild of America (DGA) award. As of 2016, it is the most recent film to have more than one nomination in the Academy Award for Best Actor category. In 1998, the American Film Institute ranked Amadeus 53rd on its 100 Years... 100 Movies list."

In [4]: page.content
Out[4]: u'Amadeus is a 1984 American period drama film directed by Milo\u0161 Forman, written by Peter Shaffer, and adapted from Shaffer\'s s
...
Amadeus Filming locations at Movieloci.com'
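
Since page.content is plain text, one way to recover the headings and drop the references is to split that string on its section markers. The sketch below is only an illustration: it assumes section titles appear on their own line wrapped in "==" markers (for example "== Plot =="), which should be checked against the real output, and both split_sections and the sample string are hypothetical stand-ins rather than part of the wikipedia package.

```python
import re

# Assumption: each section title sits on its own line, wrapped in "==" markers.
SECTION_RE = re.compile(r"^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def split_sections(content, skip=("References", "External links")):
    """Return (heading, body) pairs; the lead section gets heading None."""
    sections = []
    heading = None
    position = 0
    for match in SECTION_RE.finditer(content):
        # Everything before this marker belongs to the previous heading.
        sections.append((heading, content[position:match.start()].strip()))
        heading = match.group(2)
        position = match.end()
    sections.append((heading, content[position:].strip()))
    return [(h, body) for h, body in sections if h not in skip]


# Hypothetical stand-in for page.content, so the sketch runs offline.
sample = (
    "Amadeus is a 1984 film.\n\n\n"
    "== Plot ==\nAn elderly Salieri tells his story.\n\n\n"
    "== References ==\n\n\n"
    "== External links ==\nSome link text"
)

for heading, body in split_sections(sample):
    print("%s -> %s" % (heading, body))
```

With the real package, the call would presumably be split_sections(page.content); the skip tuple can be extended with any other section titles that should be left out.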

As for the headings, here is sample code that gets them with BeautifulSoup:

In [1]: import requests

In [2]: from bs4 import BeautifulSoup

In [3]: url = "https://en.wikipedia.org/wiki/Amadeus_(film)"

In [4]: response = requests.get(url)

In [5]: soup = BeautifulSoup(response.content, "html.parser")

In [6]: [item.get_text() for item in soup.select("h2 .mw-headline")]
Out[6]: 
[u'Plot',
 u'Cast',
 u'Production',
 u'Reception',
 u'Alternative versions',
 u'Music',
 u'Awards and nominations',
 u'References',
 u'External links']

h2 .mw-headline is a CSS selector that matches elements with the mw-headline class that are descendants of an h2 element.
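
To get the headings and the paragraph text together, in page order and without the references, one option is to walk the heading and <p> tags in a single pass. This is a minimal sketch, not a definitive implementation: SAMPLE_HTML is a hypothetical, trimmed-down stand-in for a saved article, its class names mirror the markup targeted above (live pages may differ), and headings_and_text and SKIPPED_SECTIONS are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for a saved article page.
SAMPLE_HTML = """
<h1 id="firstHeading">Amadeus (film)</h1>
<div id="bodyContent">
  <p>Amadeus is a 1984 American period drama film.</p>
  <h2><span class="mw-headline">Plot</span><span class="mw-editsection">[edit]</span></h2>
  <p>An elderly Salieri tells his story.<sup class="reference">[1]</sup></p>
  <h2><span class="mw-headline">References</span></h2>
  <p>This paragraph belongs to a skipped section.</p>
</div>
"""

# Assumption: these top-level sections are the ones to leave out.
SKIPPED_SECTIONS = {"References", "External links"}


def headings_and_text(html):
    """Return (kind, text) pairs for headings and paragraphs, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop footnote markers such as [1] and the [edit] links next to headings.
    for node in soup.select("sup.reference, .mw-editsection"):
        node.decompose()
    results = []
    skipping = False
    for element in soup.find_all(["h1", "h2", "h3", "p"]):
        text = element.get_text().strip()
        if not text:
            continue
        # Only top-level headings switch skipping on or off.
        if element.name in ("h1", "h2"):
            skipping = text in SKIPPED_SECTIONS
        if skipping:
            continue
        kind = "text" if element.name == "p" else "heading"
        results.append((kind, text))
    return results


for kind, text in headings_and_text(SAMPLE_HTML):
    print(kind + ": " + text)
```

The same function could be given the html string read from a downloaded file. Subsections (h3) inherit the skip state of their parent h2, so everything nested under References is dropped as well.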