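I am trying to scrape Wikipedia infoboxes and get the information for certain keywords. For example: https://en.wikipedia.org/wiki/A%26W_Root_Beer
Say I am looking for the values of Manufacturer. I want them in a list, and I want only their text. So in this case the desired output would be ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)'].
No matter what I try, I cannot generate this list. Here is part of my code: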
url = "https://en.wikipedia.org/wiki/ABC_Studios"
soup = BeautifulSoup(requests.get(url), "lxml")
tbl = soup.find("table", {"class": "infobox vcard"})
list_of_table_rows = tbl.findAll('tr')
for tr in list_of_table_rows:
    th = tr.find("th")
    td = tr.find("td")
    # take th.text and td.text
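I want a method that works in all cases: when there are line breaks inside a value, when some values are links, when some values are paragraphs, and so on. In every case I want only the text we see on screen, not the link, not the paragraph, just plain text. I also do not want the output to be Keurig Dr Pepper (United States, Worldwide)A&W Canada (Canada), because later I want to be able to parse the result and do something with each entity.
I am going through many Wikipedia pages and cannot find an approach that works for a large share of them. Can you help me with working code? I am not proficient at scraping.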
Answer 0 (score: 1)
OK, here is my attempt (the json library is only used to pretty-print the dictionary):
import json
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/ABC_Studios"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
tbl = soup.find("table", {"class": "infobox vcard"})
list_of_table_rows = tbl.find_all("tr")
info = {}
for tr in list_of_table_rows:
    th = tr.find("th")
    td = tr.find("td")
    # Skip rows that lack a header or a value cell (e.g. title and image rows)
    if th is not None and td is not None:
        innerText = ''
        # Walk every node inside the cell, keeping text and marking <br/> tags
        for elem in td.descendants:
            if isinstance(elem, str):
                innerText += elem.strip()
            elif elem.name == 'br':
                innerText += '\n'
        info[th.text] = innerText
print(json.dumps(info, indent=1))
The code replaces <br/> tags with \n, which gives:
{
 "Trading name": "ABC Studios",
 "Type": "Subsidiary\nLimited liability company",
 "Industry": "Television production",
 "Predecessor": "Touchstone Television",
 "Founded": "March\u00a021, 1985; 33 years ago(1985-03-21)",
 "Headquarters": "Burbank, California,U.S.",
 "Area served": "Worldwide",
 "Key people": "Patrick Moran (President)",
 "Parent": "ABC Entertainment Group\n(Disney\u2013ABC Television Group)",
 "Website": "abcstudios.go.com"
}
If you would rather get a list than a string containing \n, you can replace the final assignment inside the loop with:
innerTextList = innerText.split("\n")
if len(innerTextList) < 2:
    info[th.text] = innerTextList[0]
else:
    info[th.text] = innerTextList
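If it helps, the <br/> handling and the list conversion can be folded into one function that always returns a list per field. This is only a sketch under a few assumptions of mine: the function name infobox_to_dict, the separator constant SEP, the hand-written SAMPLE_HTML (a stand-in for the real page, so nothing is fetched), the built-in html.parser instead of lxml, and matching any table whose class includes infobox rather than exactly "infobox vcard".

```python
from bs4 import BeautifulSoup

SEP = "\x1f"  # assumed sentinel; unlikely to occur in page text


def infobox_to_dict(html):
    """Map each infobox header to a list of visible text values."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="infobox")
    info = {}
    if table is None:
        return info
    for tr in table.find_all("tr"):
        th, td = tr.find("th"), tr.find("td")
        if th is None or td is None:
            continue  # title rows and image rows have no header/value pair
        # Mark explicit line breaks so the cell can be split into entities.
        for br in td.find_all("br"):
            br.replace_with(SEP)
        # Collapse runs of whitespace inside each entity and drop empty ones.
        parts = [" ".join(p.split()) for p in td.get_text().split(SEP)]
        info[th.get_text(strip=True)] = [p for p in parts if p]
    return info


# Hand-written sample that imitates the structure of the A&W infobox.
SAMPLE_HTML = """
<table class="infobox hproduct">
  <tr><th colspan="2">A&amp;W Root Beer</th></tr>
  <tr><th>Type</th><td>Root beer</td></tr>
  <tr><th>Manufacturer</th><td><a href="#">Keurig Dr Pepper</a> (United States, Worldwide)<br/><a href="#">A&amp;W Canada</a> (Canada)</td></tr>
</table>
"""

result = infobox_to_dict(SAMPLE_HTML)
print(result["Manufacturer"])
# ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']
```

Returning a list even for single values means the caller never has to check whether a field is a string or a list.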
Answer 1 (score: 1)
This code does not work:
soup = BeautifulSoup(requests.get(url), "lxml")
BeautifulSoup needs the content of the requests response, not the response object itself, so append .text or .content.
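To make the fix concrete, here is a small sketch of a fetch helper that passes response.text to BeautifulSoup. The name fetch_soup, the timeout value, the choice of html.parser, and the injectable get parameter (there only so the helper can be exercised without a network call) are my own assumptions, not part of the original code.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, get=requests.get):
    """Download a page and parse its decoded text, not the Response object."""
    response = get(url, timeout=10)
    response.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error page
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup(url) returns a soup on which soup.find(...) can be used as in the question.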
To get the expected Manufacturer result, select the a elements inside td[class="brand"], then use .next_sibling.string to pick up the text that follows each link:
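import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/A%26W_Root_Beer"  # page from the question
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
# Each manufacturer is a link followed by a plain-text qualifier such as " (Canada)"
result = soup.select('td[class="brand"] a')
manufacturer = [a.text + a.next_sibling.string for a in result]
print(manufacturer)
# ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']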
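One caveat with .next_sibling.string: it fails when a link is the last node in the cell or is followed directly by another tag. Below is a slightly more defensive sketch, run against a hand-written snippet rather than the live page. The helper name link_with_trailing_text and the third entry "Example Bottler" are hypothetical, added only to show the case where a link has no trailing text.

```python
from bs4 import BeautifulSoup, NavigableString


def link_with_trailing_text(a):
    """Return the link text plus the plain text that directly follows it, if any."""
    sibling = a.next_sibling
    # Only a text node counts as a qualifier; None or a tag such as <br/> is ignored.
    trailing = str(sibling) if isinstance(sibling, NavigableString) else ""
    return (a.get_text() + trailing).strip()


# Hand-written sample; "Example Bottler" is a made-up entry without a qualifier.
SAMPLE = (
    '<table><tr><th>Manufacturer</th><td class="brand">'
    '<a href="#">Keurig Dr Pepper</a> (United States, Worldwide)<br/>'
    '<a href="#">A&amp;W Canada</a> (Canada)<br/>'
    '<a href="#">Example Bottler</a>'
    '</td></tr></table>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")
manufacturers = [link_with_trailing_text(a) for a in soup.select("td.brand a")]
print(manufacturers)
# ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)', 'Example Bottler']
```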