<td class="left " data-append-csv="adamja01" data-stat="player" csk="Adam,Jason0.01"><a href="/players/a/adamja01.shtml">Jason Adam</a></td>
This is the code I have so far... I want to import "Adam,Jason" into Excel. His name seems to live in the "csk" attribute. Any suggestions would be very helpful. Thanks!
from urllib.request import urlopen
from bs4 import BeautifulSoup
content = urlopen("https://www.baseball-reference.com/leagues/MLB/2018-standard-pitching.shtml")
soup = BeautifulSoup(content.read(),"lxml")
tags = soup.findAll('div')
for t in tags:
    print(t)
Answer 0 (score: 0)
Try the script below to fetch them. The data you want to grab lives inside HTML comments, which is why the usual approach doesn't let you collect it:
from urllib.request import urlopen
from bs4 import BeautifulSoup, Comment
content = urlopen("https://www.baseball-reference.com/leagues/MLB/2018-standard-pitching.shtml")
soup = BeautifulSoup(content.read(),"lxml")
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    sauce = BeautifulSoup(comment, "lxml")
    for tags in sauce.find_all('tr'):
        name = [item.get("csk") for item in tags.find_all("td")[:1]]
        print(name)
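To actually get those names into Excel, one option is to write them to a CSV file, which Excel opens directly. Here is a minimal sketch building on the script above; the output filename pitchers.csv is just an example:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup, Comment

content = urlopen("https://www.baseball-reference.com/leagues/MLB/2018-standard-pitching.shtml")
soup = BeautifulSoup(content.read(), "lxml")

# Collect every non-empty csk value from the commented-out tables
names = []
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    sauce = BeautifulSoup(comment, "lxml")
    for row in sauce.find_all('tr'):
        for cell in row.find_all("td")[:1]:
            if cell.get("csk"):
                names.append(cell.get("csk"))

# Write to a CSV file that Excel can open directly
# ("pitchers.csv" is an example name, not from the original post)
with open("pitchers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name"])
    for name in names:
        writer.writerow([name])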
Answer 1 (score: 0)
Using lxml will be faster:
from urllib.request import urlopen
#from bs4 import BeautifulSoup, Comment
from lxml import html
response = urlopen("https://www.baseball-reference.com/leagues/MLB/2018-standard-pitching.shtml")
content = response.read()
tree = html.fromstring(content)
#Now we need to find our target table (comment text)
comment_html = tree.xpath('//comment()[contains(., "players_standard_pitching")]')[0]
#removing HTML comment markup
comment_html = str(comment_html).replace("-->", "")
comment_html = comment_html.replace("<!--", "")
#parsing our target HTML again
tree = html.fromstring(comment_html)
for pitcher_row in tree.xpath('//table[@id="players_standard_pitching"]/tbody/tr[contains(@class, "full_table")]'):
    csk = pitcher_row.xpath('./td[@data-stat="player"]/@csk')[0]
    print(csk)
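If you would rather land the whole table in Excel in one step, pandas can parse the de-commented HTML and write an .xlsx file directly. This is a sketch, assuming pandas and openpyxl are installed; the output filename pitching_2018.xlsx is just an example:
from io import StringIO
from urllib.request import urlopen
from lxml import html
import pandas as pd

response = urlopen("https://www.baseball-reference.com/leagues/MLB/2018-standard-pitching.shtml")
tree = html.fromstring(response.read())

# Pull the commented-out table markup and strip the comment delimiters
comment_html = str(tree.xpath('//comment()[contains(., "players_standard_pitching")]')[0])
comment_html = comment_html.replace("<!--", "").replace("-->", "")

# Let pandas parse the target table and write it straight to an Excel workbook
df = pd.read_html(StringIO(comment_html), attrs={"id": "players_standard_pitching"})[0]
df.to_excel("pitching_2018.xlsx", index=False)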