I'm new to programming and web scraping, and I'm having some trouble getting BeautifulSoup to extract only the text from a given page.
This is what I'm working with right now:
import requests
from bs4 import BeautifulSoup
url = 'https://www.tsn.ca/panarin-tops-2019-free-agent-frenzy-class-1.1303592'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
players = soup.find_all('td').text
print(players)
which returns the following:
Traceback (most recent call last):
File "tsn.py", line 10, in <module>
players = soup.find_all('td').text
File "/home/debian1/.local/lib/python3.5/site-packages/bs4/element.py", line 1620, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I have also seen .get_text() used in the BS documentation, but it returns the same error.
Answer 0 (score: 3)
Your overall approach is right. find_all() returns a list of matching elements, so all you have to do is iterate over it and pull the text out of each one. I've corrected the code and put it below.
import requests
from bs4 import BeautifulSoup
url = 'https://www.tsn.ca/panarin-tops-2019-free-agent-frenzy-class-1.1303592'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
page = requests.get(url, headers=headers)  # pass the headers, otherwise they are defined but unused
soup = BeautifulSoup(page.text, 'html.parser')
# This is how you should have extracted the text from the ResultSet
players = [elem.text for elem in soup.find_all('td')]
print(players)
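The same pattern works on any HTML, not just the live TSN page. A minimal self-contained sketch using an inline snippet (the table markup here is invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>Artemi Panarin</td><td>RW</td></tr>
  <tr><td>Erik Karlsson</td><td>D</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a ResultSet (a list subclass), so .text must be
# read from each element individually, not from the ResultSet itself.
players = [cell.text for cell in soup.find_all('td')]
print(players)  # ['Artemi Panarin', 'RW', 'Erik Karlsson', 'D']
```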
Answer 1 (score: 1)
find_all() returns a list of all the elements that match your query. Even if there is only a single match, or no matches at all, it will still return [item] or [] respectively. To get the text, you need to iterate over the items:
players_list = soup.find_all('td')
for player in players_list:
    print(player.text)
I used .getText() in my script; I'm not sure whether .text works the same way!
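For what it's worth, in bs4 the .text attribute is a property that simply calls get_text(), so the two should behave identically. A quick check on a throwaway snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<td>  Joe Pavelski  </td>', 'html.parser')
cell = soup.find('td')

# .text and .get_text() return the same string...
assert cell.text == cell.get_text()
# ...but get_text() also accepts separator/strip arguments.
print(cell.get_text(strip=True))  # Joe Pavelski
```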
Answer 2 (score: 1)
The error means you should iterate over each item, like this:
players = [item.text for item in soup.find_all('td')] # Iterate over every item and extract the text
print(players)
print("".join(players)) # If you want all the text in one string
Hope this helps!
Answer 3 (score: 0)
Here is a working script:
import requests
from bs4 import BeautifulSoup
url = 'https://www.tsn.ca/panarin-tops-2019-free-agent-frenzy-class-1.1303592'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
page = requests.get(url, headers=headers)  # pass the headers, otherwise they are defined but unused
soup = BeautifulSoup(page.text, 'html.parser')
players = []
tbl = soup.find('table', attrs={'class':'stats-table-scrollable article-table'})
tbl_body = tbl.find('tbody')
rows = tbl_body.find_all('tr')
for row in rows:
    columns = row.find_all('td')
    columns = [c.text for c in columns]
    players.append(columns[1])
print(players)
Result:
['Artemi Panarin', 'Erik Karlsson', 'Sergei Bobrovsky', 'Matt Duchene', 'Jeff Skinner', 'Anders Lee', 'Joe Pavelski', 'Brock Nelson', 'Tyler Myers', 'Mats Zuccarello', 'Alex Edler', 'Gustav Nyquist', 'Jordan Eberle', 'Micheal Ferland', 'Jake Gardiner', 'Ryan Dzingel', 'Kevin Hayes', 'Brett Connolly', 'Marcus Johansson', 'Braydon Coburn', 'Wayne Simmonds', 'Brandon Tanev', 'Joonas Donskoi', 'Colin Wilson', 'Ron Hainsey']
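One caveat with the script above: columns[1] will raise an IndexError on any row with fewer than two cells. A defensive variant of the same table-walking pattern, sketched against invented inline markup (the live page's actual rows may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the live TSN table, including one short row.
html = """
<table class="stats-table-scrollable article-table">
  <tbody>
    <tr><td>1</td><td>Artemi Panarin</td></tr>
    <tr><td>2</td><td>Erik Karlsson</td></tr>
    <tr><td>totals</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

players = []
tbl = soup.find('table', attrs={'class': 'stats-table-scrollable article-table'})
for row in tbl.find('tbody').find_all('tr'):
    cells = [c.text for c in row.find_all('td')]
    if len(cells) > 1:  # skip rows that have no name column
        players.append(cells[1])
print(players)  # ['Artemi Panarin', 'Erik Karlsson']
```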