Question

I have this script written in Python 3:

url = "https://en.wikipedia.org/wiki/Mathematics"
response = simple_get(url)
result = {}
result["url"] = url
if response is not None:
    html = BeautifulSoup(response, 'html.parser')
    title = html.select("#firstHeading")[0].text

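As you can see, I can get the title of the article, but I can't figure out how to get the text from "Mathematics (from Greek μά..." up to the table of contents.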

Answer 1

There is a much easier way to get information from Wikipedia: the Wikipedia API.

There is this Python wrapper, which lets you do it in a few lines with zero HTML parsing:

import wikipediaapi

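# Releases 0.6.0 and later of Wikipedia-API expect a user agent as well;
# the value below is a placeholder, replace it with your own.
wiki_wiki = wikipediaapi.Wikipedia(user_agent='MyScript/1.0 (me@example.com)', language='en')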

page = wiki_wiki.page('Mathematics')
print(page.summary)

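Prints:

Mathematics (from Greek μάθημα máthēma, "knowledge, study, learning") includes the study of such topics as quantity, structure, space, and change...(omitted intentionally)

And, in general, try to avoid screen scraping if a direct API is available.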

Answer 2

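Select the <p> tags. There are 52 elements. I'm not sure whether you want all of them, but you can iterate over those tags and store the text however you like. I just chose to print each of them to show the output.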

import bs4
import requests


response = requests.get("https://en.wikipedia.org/wiki/Mathematics")

# requests.get never returns None, so check the HTTP status instead
if response.ok:
    html = bs4.BeautifulSoup(response.text, 'html.parser')

    title = html.select("#firstHeading")[0].text
    paragraphs = html.select("p")
    for para in paragraphs:
        print(para.text)

    # just grab the text up to contents as stated in question
    intro = '\n'.join(para.text for para in paragraphs[0:5])
    print(intro)
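
The fixed slice paragraphs[0:5] depends on how many paragraphs the lead section happens to contain. Below is a minimal sketch of one alternative, assuming the lead section ends at the first <h2> heading (an assumption about the page layout, which may not hold for every page or skin). The function name lead_paragraphs and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def lead_paragraphs(html_text):
    """Return the text of every <p> that appears before the first <h2>.

    Assumption: the lead section ends at the first <h2> heading.
    """
    soup = BeautifulSoup(html_text, "html.parser")
    paragraphs = []
    # find_all with a list of names returns matches in document order.
    for element in soup.find_all(["p", "h2"]):
        if element.name == "h2":
            break
        text = element.get_text().strip()
        if text:  # skip empty placeholder paragraphs
            paragraphs.append(text)
    return paragraphs


# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<h1 id="firstHeading">Mathematics</h1>
<p></p>
<p>First lead paragraph.</p>
<p>Second lead paragraph.</p>
<h2>History</h2>
<p>The history of mathematics...</p>
"""

print("\n".join(lead_paragraphs(SAMPLE_HTML)))
```

On a live page you could pass response.text instead of the sample string. If headings appear outside the article body, it may help to narrow the search to the content container first.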

Answer 3

Use the wikipedia library:

import wikipedia
#print(wikipedia.summary("Mathematics"))
#wikipedia.search("Mathematics")
print(wikipedia.page("Mathematics").content)

Answer 4

You can get the desired output with the lxml library, as shown below.

import requests
from lxml.html import fromstring

url = "https://en.wikipedia.org/wiki/Mathematics"

res = requests.get(url)
source = fromstring(res.content)
paragraph = '\n'.join([item.text_content() for item in source.xpath('//p[following::h2[2][span="History"]]')])
print(paragraph)

Using BeautifulSoup:

from bs4 import BeautifulSoup
import requests

res = requests.get("https://en.wikipedia.org/wiki/Mathematics")
soup = BeautifulSoup(res.text, 'html.parser')
for item in soup.find_all("p"):
    if item.text.startswith("The history"):
        break
    print(item.text)

Answer 5

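What you seem to want is the (HTML) page content without the surrounding navigation elements. As I described in this earlier answer from 2013, there are (at least) two ways to get it: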

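Probably the simplest way in this case is to include the parameter action=render in the URL, as in https://en.wikipedia.org/wiki/Mathematics?action=render. This gives you just the content HTML and nothing else.
Alternatively, you can get the page content via the MediaWiki API, as in https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Mathematics.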

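The advantage of using the API is that it can also give you a lot of other information about the page that may be useful. For example, if you want a list of the interlanguage links normally shown in the page's sidebar, or the categories normally shown below the content area, you can get those from the API like this: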

https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Mathematics&prop=langlinks|categories

(To also get the page content with the same request, use prop=langlinks|categories|text.)

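There are several Python libraries for using the MediaWiki API that can automate some of the details of using it, although the feature sets they support may vary. That said, using the API directly from your code, without a library in between, is perfectly possible.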
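
As a rough illustration of calling the API directly, the sketch below builds the request URL with the standard library and pulls the language codes and category names out of a response. It asks for format=json rather than the format=xml used in the URLs above. The response layout it expects (a "parse" object holding "langlinks" and "categories" lists, with the category name under "*") is an assumption about the API's default JSON format, and the sample response is invented, abbreviated data rather than real output. The helper names build_parse_url and summarize are hypothetical.

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def build_parse_url(page, props=("langlinks", "categories")):
    """Build an action=parse URL; multiple prop values are joined with '|'."""
    params = {
        "format": "json",
        "action": "parse",
        "page": page,
        "prop": "|".join(props),
    }
    # urlencode percent-encodes the '|' separator for us.
    return API + "?" + urlencode(params)


def summarize(response_text):
    """Extract language codes and category names from a parse response.

    Assumption: the response uses the layout described in the lead-in.
    """
    parsed = json.loads(response_text)["parse"]
    languages = [link["lang"] for link in parsed.get("langlinks", [])]
    categories = [cat["*"] for cat in parsed.get("categories", [])]
    return languages, categories


# Invented, abbreviated response used only to exercise summarize().
SAMPLE_RESPONSE = json.dumps({
    "parse": {
        "title": "Mathematics",
        "langlinks": [{"lang": "de", "*": "Mathematik"}],
        "categories": [{"sortkey": "", "*": "Mathematics"}],
    }
})

print(build_parse_url("Mathematics"))
print(summarize(SAMPLE_RESPONSE))
```

To use it against the live API, you could fetch build_parse_url("Mathematics") with urllib.request.urlopen or requests.get and pass the decoded body to summarize.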

Answer 6

A proper way to do this is to use the JSON API that Wikipedia offers:

from urllib.request import urlopen
from urllib.parse import urlencode
from json import loads


def getJSON(page):
    params = urlencode({
        'format': 'json',
        'action': 'parse',
        'prop': 'text',
        'redirects' : 'true',
        'page': page})
    API = "https://en.wikipedia.org/w/api.php"
    response = urlopen(API + "?" + params)
    return response.read().decode('utf-8')


def getRawPage(page):
    parsed = loads(getJSON(page))
    try:
        title = parsed['parse']['title']
        content = parsed['parse']['text']['*']
        return title, content
    except KeyError:
        # The page doesn't exist
        return None, None

title, content = getRawPage("Mathematics")

You can then parse it with any library you like to extract the content you need :)
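
For instance, here is a minimal sketch that uses only the standard library's html.parser to pull the paragraph text out of an HTML string such as the content value returned above. The class name ParagraphText, the helper paragraphs_from_html, and the sample markup are made up for illustration. It assumes <p> elements are not nested and are explicitly closed.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of each <p> element, ignoring all other markup."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            text = "".join(self._chunks).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)

    def handle_data(self, data):
        # Text inside inline tags such as <b> or <a> is still collected.
        if self._in_p:
            self._chunks.append(data)


def paragraphs_from_html(content):
    """Return a list with the plain text of every non-empty <p>."""
    parser = ParagraphText()
    parser.feed(content)
    parser.close()
    return parser.paragraphs


# Hypothetical sample standing in for the HTML returned by the API.
SAMPLE_CONTENT = (
    '<div class="mw-parser-output">'
    '<p><b>Mathematics</b> is a <a href="/wiki/X">field</a> of study.</p>'
    '<h2>History</h2>'
    '<p>Second paragraph.</p>'
    '</div>'
)

print(paragraphs_from_html(SAMPLE_CONTENT))
```

With the code above, paragraphs_from_html(content) would give the article's paragraphs in document order, provided getRawPage returned a page.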

Answer 7

I use this: with "idx" I can determine which paragraph I want to read.

from bs4 import BeautifulSoup
import requests

res = requests.get("https://de.wikipedia.org/wiki/Pferde")
soup = BeautifulSoup(res.text, 'html.parser')
for idx, item in enumerate(soup.find_all("p")):
    if idx == 1:
        break
print(item.text)

How can I get the text of a Wikipedia article using Python 3 and Beautiful Soup?
