I'm collecting baseball game data for multiple seasons. Here's an example of the data:
https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml
For this question I'm specifically looking for a way to extract the comment that contains the umpire and game data. Note that the HTML files are now stored locally, so I'm iterating over a folder. In the source it looks like this:
<div class="section_wrapper setup_commented commented" id="all_342042674">
<div class="section_heading">
<span class="section_anchor" id="342042674_link" data-label="Other Info"></span>
<h2>Other Info</h2> <div class="section_heading_text">
<ul>
</ul>
</div>
</div><div class="placeholder"></div>
<!--
<div class="section_content" id="div_342042674">
<div><strong>Umpires:</strong> HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.</div><div><strong>Time of Game:</strong> 3:21.</div>
<div><strong>Attendance:</strong> 33,809.</div>
<div><strong>Start Time Weather:</strong> 70° F, Wind 6mph out to Centerfield, Night, No Precipitation.</div>
</div>
-->
</div>
As you can see, it sits inside an HTML comment. The real challenge is that the ID values change between ballparks and seasons, and I'm parsing ten years of data. Can someone show me how to extract the comment text when the ID keeps changing?
Here is my code:
# import libraries and files
from bs4 import BeautifulSoup, Comment
import csv
import os

# Setup Games list for append
games = []
path = r"D:\My Web Sites\baseball 2\www.baseball-reference.com\boxes\ANA"

for filename in os.listdir(path):
    if filename.endswith(".html"):
        fullpath = os.path.join(path, filename)
        print('Processing {}...'.format(fullpath))

        # Get Page, Make Soup
        soup = BeautifulSoup(open(fullpath), 'lxml')

        # Setting up game object to append to list
        game = {}

        # Get Description
        # Note: Skip every other child because of 'Navigable Strings' from BS.
        divs = soup.find_all('div', class_='scorebox_meta')
        for div in divs:
            for idx, child in enumerate(div.children):
                if idx == 1:
                    game['date'] = child.text
                elif idx == 3:
                    game['start_time'] = child.text.split(':', 1)[1].strip()
                elif idx == 7:
                    game['venue'] = child.text.split(':', 1)[1].strip()
                elif idx == 9:
                    game['duration'] = child.text.split(':', 1)[1].strip()

        # Get Player Data from tables
        for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
            data = BeautifulSoup(comment, "lxml")
            for items in data.select("table tr"):
                player_data = [' '.join(item.text.split()) for item in items.select("th,td")]
                print(player_data)

        print('=======================================================')

        # Get Umpire Data

        # Append game data to full list
        games.append(game)

print()
print('Results')
print('*' * 80)

# Print the games harvested to the console
for idx, game in enumerate(games):
    print(str(idx) + ': ' + str(game))

# Write to CSV (the game records are dicts, so use DictWriter)
csvfile = "C:/Users/Benny/Desktop/anatest.csv"
with open(csvfile, "w") as output:
    writer = csv.DictWriter(output, fieldnames=['date', 'start_time', 'venue', 'duration'], lineterminator='\n')
    writer.writeheader()
    writer.writerows(games)
Thanks very much, Benny
Answer 0 (score: 0)
I used the re module to extract the commented section:
from bs4 import BeautifulSoup
import re
data = """<div class="section_wrapper setup_commented commented" id="all_342042674">
<div class="section_heading">
<span class="section_anchor" id="342042674_link" data-label="Other Info"></span>
<h2>Other Info</h2> <div class="section_heading_text">
<ul>
</ul>
</div>
</div><div class="placeholder"></div>
<!--
<div class="section_content" id="div_342042674">
<div><strong>Umpires:</strong> HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.</div>
<div><strong>Time of Game:</strong> 3:21.</div>
<div><strong>Attendance:</strong> 33,809.</div>
<div><strong>Start Time Weather:</strong> 70° F, Wind 6mph out to Centerfield, Night, No Precipitation.</div>
</div>
-->
</div>"""
soup = BeautifulSoup(re.search(r'(?<=<!--)(.*?)(?=-->)', data, flags=re.DOTALL)[0], 'lxml')
umpires, time_of_game, attendance, start_time_weather = soup.select('div.section_content > div')
print('ID: ', soup.find('div', class_="section_content")['id'])
print('umpires: ', umpires.text)
print('time of game: ', time_of_game.text)
print('attendance: ', attendance.text)
print('start_time_weather: ', start_time_weather.text)
Output:
ID: div_342042674
umpires: Umpires: HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.
time of game: Time of Game: 3:21.
attendance: Attendance: 33,809.
start_time_weather: Start Time Weather: 70° F, Wind 6mph out to Centerfield, Night, No Precipitation.
Answer 1 (score: 0)
If you strip the pesky <!-- and --> markers out of the HTML, you can access the content easily. Here is one way to go about it:
import requests
from bs4 import BeautifulSoup
url = "https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml"
res = requests.get(url)
content = res.text.replace("<!--","").replace("-->","")
soup = BeautifulSoup(content,"lxml")
umpire, gametime, attendance, weather = soup.find_all(class_="section_content")[2]("strong")
print(f'{umpire.next_sibling}\n{gametime.next_sibling}\n{attendance.next_sibling}\n{weather.next_sibling}\n')
Output:
HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.
3:21.
33,809.
70° F, Wind 6mph out to Centerfield, Night, No Precipitation.
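To run this over the local copies instead of the live page, the same marker-stripping trick applies per file. A possible sketch (the folder path comes from the question; the UTF-8 encoding and the exact field labels are assumptions based on the sample above):

import os
from bs4 import BeautifulSoup

path = r"D:\My Web Sites\baseball 2\www.baseball-reference.com\boxes\ANA"

def other_info(fullpath):
    """Return the 'Other Info' fields of one box score as a dict."""
    with open(fullpath, encoding="utf-8") as f:
        # Drop the comment markers so the hidden section becomes ordinary markup.
        content = f.read().replace("<!--", "").replace("-->", "")
    soup = BeautifulSoup(content, "lxml")
    fields = {}
    for section in soup.find_all("div", class_="section_content"):
        for strong in section.find_all("strong"):
            label = strong.get_text(strip=True).rstrip(":")
            if label in ("Umpires", "Time of Game", "Attendance", "Start Time Weather"):
                # The value is the text node that follows the <strong> label.
                fields[label] = str(strong.next_sibling).strip(" .")
    return fields

for filename in os.listdir(path):
    if filename.endswith(".html"):
        print(other_info(os.path.join(path, filename)))

Matching on the label text rather than the section's position avoids depending on the "Other Info" block always being the third section_content on every page.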