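Scraping Wikipedia infoboxes when table cells have mixed formatting

Time: 2019-01-10 01:39:45

Tags: python web-scraping beautifulsoup wikipedia

I am trying to scrape Wikipedia infoboxes and get the information for certain keywords. For example: https://en.wikipedia.org/wiki/A%26W_Root_Beer

Say I am looking for the value of Manufacturer. I want the values in a list, and I only want their text. So in this case the desired output would be ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']. No matter what I try, I cannot produce this list. Here is part of my code: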

url = "https://en.wikipedia.org/wiki/ABC_Studios"
soup = BeautifulSoup(requests.get(url), "lxml")
tbl = soup.find("table", {"class": "infobox vcard"})
list_of_table_rows = tbl.findAll('tr')
for tr in list_of_table_rows:
    th = tr.find("th")
    td = tr.find("td")
    # take th.text and td.text

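I want an approach that works in all the different cases: when the cell contains line breaks, when some values are links, when some are paragraphs, and so on. In every case I only want the text as it appears on screen: not links, not paragraphs, just plain text. I also do not want the output to be Keurig Dr Pepper (United States, Worldwide)A&W Canada (Canada), because later I want to be able to parse the result and do something with each entity.

I am going through many Wikipedia pages and could not find an approach that works for a large portion of them. Could you help me with working code? I am not proficient at scraping.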

2 Answers:

Answer 0: (score: 1)

OK, here is my attempt (the json library is only used to pretty-print the dictionary):

import json

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/ABC_Studios"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
tbl = soup.find("table", {"class": "infobox vcard"})

list_of_table_rows = tbl.find_all("tr")
info = {}
for tr in list_of_table_rows:
    th = tr.find("th")
    td = tr.find("td")
    # Skip rows that lack a label or a value cell (e.g. header-only rows)
    if th is not None and td is not None:
        innerText = ''
        # Walk every node in the cell: keep text, turn <br> into "\n"
        for elem in td.descendants:
            if isinstance(elem, str):
                innerText += elem.strip()
            elif elem.name == 'br':
                innerText += '\n'
        info[th.text] = innerText

print(json.dumps(info, indent=1))

The code replaces <br/> tags with \n, which gives:

{
 "Trading name": "ABC Studios",
 "Type": "Subsidiary\nLimited liability company",
 "Industry": "Television production",
 "Predecessor": "Touchstone Television",
 "Founded": "March\u00a021, 1985; 33 years ago(1985-03-21)",
 "Headquarters": "Burbank, California,U.S.",
 "Area served": "Worldwide",
 "Key people": "Patrick Moran (President)",
 "Parent": "ABC Entertainment Group\n(Disney\u2013ABC Television Group)",
 "Website": "abcstudios.go.com"
}

If you want to return a list instead of a string containing \n, replace the final assignment inside the loop (info[th.text] = innerText) with:

innerTextList = innerText.split("\n")
if len(innerTextList) < 2:
    info[th.text] = innerTextList[0]
else:
    info[th.text] = innerTextList

This gives a list for multi-line values (for example, "Type" becomes ["Subsidiary", "Limited liability company"]) and keeps a plain string for single-line values.
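
As a variation on the loop above, here is a minimal sketch that swaps each <br> for a newline and then lets get_text() collect the visible text. It is not the answer's original method. It keeps the spaces between inline pieces, which the per-node strip() drops (see "Burbank, California,U.S." in the output), and it always returns a list so every value has the same type. SAMPLE_HTML and infobox_to_dict are hypothetical names made up for illustration, the markup is a simplified stand-in for a real infobox, and "html.parser" is used only so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified infobox markup used only for illustration.
SAMPLE_HTML = """
<table class="infobox vcard">
  <tr><th>Type</th><td>Subsidiary<br/>Limited liability company</td></tr>
  <tr><th>Industry</th><td><a href="/wiki/Television">Television production</a></td></tr>
  <tr><th colspan="2">Header row without a value cell</th></tr>
</table>
"""


def infobox_to_dict(html):
    """Map each infobox label to the list of visible text lines in its cell."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="infobox")
    info = {}
    if table is None:
        return info
    for tr in table.find_all("tr"):
        th, td = tr.find("th"), tr.find("td")
        if th is None or td is None:
            continue  # header-only or image-only rows carry no label/value pair
        # Replace every <br> with a newline so it survives text extraction.
        for br in td.find_all("br"):
            br.replace_with("\n")
        # Split on those newlines and drop empty or whitespace-only pieces.
        lines = [line.strip() for line in td.get_text().split("\n")]
        info[th.get_text(strip=True)] = [line for line in lines if line]
    return info


info = infobox_to_dict(SAMPLE_HTML)
print(info)
```

Real infoboxes also use list items and paragraphs to separate entries, so this sketch would likely need extra handling for those tags before it covers a large share of pages.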

Answer 1: (score: 1)

This line does not work:

soup = BeautifulSoup(requests.get(url), "lxml")

BeautifulSoup needs the content of the response, not the response object itself, so append .text or .content.
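
To illustrate the point without a network call, here is a small sketch in which html_text stands in for what response.text would hold (a str) and html_bytes for what response.content would hold (bytes). Both variable names and the one-line markup are made up for the example, and "html.parser" is used only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# Stand-ins for a real response body (hypothetical sample markup).
html_text = "<p>Hello</p>"               # like response.text (str)
html_bytes = html_text.encode("utf-8")   # like response.content (bytes)

# BeautifulSoup accepts either a str or bytes as its markup argument.
from_text = BeautifulSoup(html_text, "html.parser").p.get_text()
from_bytes = BeautifulSoup(html_bytes, "html.parser").p.get_text()

print(from_text, from_bytes)
```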

To get the expected Manufacturer result, select the a elements inside td[class="brand"], then use .next_sibling.string to pick up the text that follows each link:

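import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/A%26W_Root_Beer"
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
# Every link inside the Manufacturer cell, which carries class="brand"
result = soup.select('td[class="brand"] a')
# Link text plus the plain text that directly follows the link
manufacturer = [a.text + a.next_sibling.string for a in result]
print(manufacturer)
# ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']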
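
One caveat with a.next_sibling.string: it raises an error when a link is the last node in the cell (next_sibling is None) or is followed directly by a tag such as <br> (.string is None). Below is a hedged sketch of a guarded variant, run against hypothetical markup modelled on the Manufacturer cell. SAMPLE_HTML, the third "Standalone link" entry, and the helper link_with_trailing_text are all invented for illustration, and the selector td.brand is used instead of td[class="brand"] so that cells with additional classes would also match.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup modelled on the Manufacturer cell; the third link is
# invented to show a link with no trailing text.
SAMPLE_HTML = (
    '<table class="infobox hproduct"><tr><th>Manufacturer</th>'
    '<td class="brand">'
    '<a href="/wiki/Keurig_Dr_Pepper">Keurig Dr Pepper</a> (United States, Worldwide)<br/>'
    '<a href="/wiki/A%26W_Canada">A&amp;W Canada</a> (Canada)<br/>'
    '<a href="/wiki/Example">Standalone link</a>'
    '</td></tr></table>'
)


def link_with_trailing_text(a):
    """Return the link text plus any plain text that directly follows it."""
    text = a.get_text()
    sibling = a.next_sibling
    # Only append the sibling when it is a text node, not None or a tag.
    if isinstance(sibling, NavigableString):
        text += str(sibling)
    return text.strip()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
manufacturers = [link_with_trailing_text(a) for a in soup.select("td.brand a")]
print(manufacturers)
```

This still assumes each entity starts with a link; entries that are plain text with no link at all would be missed, so splitting the cell on <br> may be a better fit for those pages.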