Question

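I'm trying to scrape Wikipedia infoboxes and get the information for certain keywords. For example: https://en.wikipedia.org/wiki/A%26W_Root_Beer

Say I'm looking for the value of Manufacturer. I want the values in a list, and I only want their text. So in this case the desired output would be ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']. No matter what I try, I can't produce this list. Here is part of my code: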

url = "https://en.wikipedia.org/wiki/ABC_Studios"
soup = BeautifulSoup(requests.get(url), "lxml")
tbl = soup.find("table", {"class": "infobox vcard"})
list_of_table_rows = tbl.findAll('tr')
for tr in list_of_table_rows:
    th = tr.find("th")
    td = tr.find("td")
    # take th.text and td.text

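I want an approach that works in all cases: when a value contains line breaks, when some values are links, when some are paragraphs, and so on. In every case I only want the text we see on screen: not links, not paragraphs, just plain text. I also don't want the output to be Keurig Dr Pepper (United States, Worldwide)A&W Canada (Canada), because later I want to parse the result and do something with each entity.

I'm going through many Wikipedia pages and can't find an approach that works for a large share of them. Could you help me with working code? I'm not experienced with scraping.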

Answer 1

OK, here is my attempt (the json library is only used to pretty-print the dictionary):

import json
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/ABC_Studios"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
tbl = soup.find("table", {"class": "infobox vcard"})
list_of_table_rows = tbl.find_all('tr')
info = {}
for tr in list_of_table_rows:
    th = tr.find("th")
    td = tr.find("td")
    # skip rows without a label or without a value cell
    if th is not None and td is not None:
        innerText = ''
        # walk every node inside the cell, keeping text and marking <br/>
        for elem in td.descendants:
            if isinstance(elem, str):
                innerText += elem.strip()
            elif elem.name == 'br':
                innerText += '\n'
        info[th.text] = innerText
print(json.dumps(info, indent=1))

The code replaces each <br/> tag with \n, which gives:

{
 "Trading name": "ABC Studios",
 "Type": "Subsidiary\nLimited liability company",
 "Industry": "Television production",
 "Predecessor": "Touchstone Television",
 "Founded": "March\u00a021, 1985; 33 years ago(1985-03-21)",
 "Headquarters": "Burbank, California,U.S.",
 "Area served": "Worldwide",
 "Key people": "Patrick Moran (President)",
 "Parent": "ABC Entertainment Group\n(Disney\u2013ABC Television Group)",
 "Website": "abcstudios.go.com"
}

If you want lists instead of strings containing \n, you can adjust it by replacing the line info[th.text] = innerText with:

    innerTextList = innerText.split("\n")
    if len(innerTextList) < 2:
        info[th.text] = innerTextList[0]
    else:
        info[th.text] = innerTextList

With this change, a multi-line value such as "Type" becomes ["Subsidiary", "Limited liability company"], while single-line values stay plain strings.
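The output above shows one side effect of calling strip() on every text node: the space between a link and the text after it is lost ("Burbank, California,U.S."). A minimal sketch of one way around that is below. It assumes that only <br>, <li> and <p> should start a new entry, which will not hold for every infobox. The HTML fragment is made up for illustration, and cell_values and LINE_BREAK_TAGS are hypothetical names, not part of BeautifulSoup.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Made-up fragment imitating one infobox row (not copied from Wikipedia).
html = """
<table class="infobox">
  <tr>
    <th>Manufacturer</th>
    <td><a href="/wiki/X">Keurig Dr Pepper</a> (United States, Worldwide)<br/>
        <a href="/wiki/Y">A&amp;W Canada</a> (Canada)</td>
  </tr>
</table>
"""

# Assumption: these tags are the ones that visually start a new entry.
LINE_BREAK_TAGS = {"br", "li", "p"}

def cell_values(td):
    """Return the visible text of a cell as a list, one item per line."""
    chunks = []
    for node in td.descendants:
        if isinstance(node, NavigableString):
            # Collapse source whitespace (including newlines and non-breaking
            # spaces) to a single space so only real breaks produce "\n".
            chunks.append(re.sub(r"\s+", " ", str(node)))
        elif node.name in LINE_BREAK_TAGS:
            chunks.append("\n")
    lines = "".join(chunks).split("\n")
    return [line.strip() for line in lines if line.strip()]

td = BeautifulSoup(html, "html.parser").find("td")
print(cell_values(td))
```

Because the helper always returns a list, callers do not need to distinguish single-value cells from multi-value ones.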

Answer 2

This line does not work:

soup = BeautifulSoup(requests.get(url), "lxml")

BeautifulSoup needs the content of the response, not the response object itself; append .text or .content.
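As a small illustration of that point, the sketch below parses the same markup once as a str and once as bytes, which correspond to what response.text and response.content hold. The markup is a made-up one-liner so the example runs without network access, and "html.parser" is used here only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

markup_str = "<p>Hello, infobox</p>"        # plays the role of response.text
markup_bytes = markup_str.encode("utf-8")   # plays the role of response.content

soup_from_str = BeautifulSoup(markup_str, "html.parser")
soup_from_bytes = BeautifulSoup(markup_bytes, "html.parser")

# Both inputs parse to the same tree.
print(soup_from_str.p.get_text())
print(soup_from_bytes.p.get_text())

# With requests, the equivalent would look like this (not executed here):
#   response = requests.get(url)
#   response.raise_for_status()
#   soup = BeautifulSoup(response.text, "html.parser")
```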

To get the expected Manufacturer result, select the a elements inside td[class="brand"], then append the text that follows each link with .next_sibling.string:

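import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/A%26W_Root_Beer"
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
# every link inside the cell whose class attribute is exactly "brand"
result = soup.select('td[class="brand"] a')
# link text plus the plain text that directly follows the link
manufacturer = [a.text + a.next_sibling.string for a in result]
print(manufacturer)
# ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']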
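The selector above depends on the cell having class="brand" and on each link being followed by plain text, which may not hold on other pages. A hedged sketch of a more general lookup by row label follows. The HTML is a made-up stand-in for a downloaded page, infobox_values is a hypothetical helper name, and the sketch assumes that <br> is the only separator between entries.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (not real Wikipedia markup).
html = """
<table class="infobox hproduct">
  <tr><th>Type</th><td><a href="/wiki/Root_beer">Root beer</a></td></tr>
  <tr><th>Manufacturer</th>
      <td class="brand"><a href="/wiki/KDP">Keurig Dr Pepper</a> (United States, Worldwide)<br>
          <a href="/wiki/AW">A&amp;W Canada</a> (Canada)</td></tr>
</table>
"""

def infobox_values(soup, label):
    """Return the values of the infobox row whose header matches label."""
    # class_="infobox" matches any table that has "infobox" among its classes.
    table = soup.find("table", class_="infobox")
    if table is None:
        return []
    for tr in table.find_all("tr"):
        th, td = tr.find("th"), tr.find("td")
        if th is None or td is None:
            continue  # section headers and image rows have no label/value pair
        if th.get_text(" ", strip=True).lower() != label.lower():
            continue
        # Turn each <br> into a newline, then split the visible text on it.
        for br in td.find_all("br"):
            br.replace_with("\n")
        lines = td.get_text().split("\n")
        return [" ".join(line.split()) for line in lines if line.strip()]
    return []

soup = BeautifulSoup(html, "html.parser")
print(infobox_values(soup, "Manufacturer"))
```

Matching on the header text rather than on a CSS class keeps the same function usable for other rows, for example infobox_values(soup, "Type").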

Scraping Wikipedia infoboxes when table cells have mixed formats
