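BeautifulSoup - Scraping comments when the ID field changes

Posted: 2018-07-21 20:56:01

Tags: python web-scraping beautifulsoup comments

I am collecting baseball game data for multiple seasons. Here is an example of the data: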

https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml

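For this question, I am specifically looking for a way to extract the HTML comment that contains the umpire and game data. Note that the HTML files are now stored locally, so I am iterating over a folder. In the page source, it looks like this: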

           <div class="section_wrapper setup_commented commented" id="all_342042674">
<div class="section_heading">
  <span class="section_anchor" id="342042674_link" data-label="Other Info"></span>
    <h2>Other Info</h2>    <div class="section_heading_text">
      <ul>
      </ul>
    </div>      
</div><div class="placeholder"></div>
<!--  
    <div class="section_content" id="div_342042674">
<div><strong>Umpires:</strong> HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.</div><div><strong>Time of Game:</strong> 3:21.</div>
<div><strong>Attendance:</strong> 33,809.</div>
<div><strong>Start Time Weather:</strong> 70&deg; F, Wind 6mph out to Centerfield, Night, No Precipitation.</div>

    </div>

-->  
</div>

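As you can see, the data sits inside an HTML comment. The real challenge is that the ID value changes between venues and seasons, and I am parsing 10 years of data. Can someone tell me how to extract the comment text when the ID changes?
Here is my code: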

# import libraries and files
from bs4 import BeautifulSoup, Comment
import csv
import os

print

# Setup Games list for append
games = []

path = r"D:\My Web Sites\baseball 2\www.baseball-reference.com\boxes\ANA"

for filename in os.listdir(path):
    # Skip anything that is not a saved box score page
    if not filename.endswith(".html"):
        continue
    fullpath = os.path.join(path, filename)

    print 'Processing {:}...'.format(fullpath)

# Get Page, Make Soup
    soup = BeautifulSoup(open(fullpath), 'lxml')

# Setting up game object to append to list
    game = {}

# Get Description
    # Note:  Skip every other child because of 'Navigable Strings' from BS.  
    divs = soup.findAll('div', {'scorebox_meta'})
    for div in divs:
        for idx, child in enumerate(div.children):
            if idx == 1:
                game['date'] = child.text
            elif idx == 3:
                game['start_time'] = child.text.split(':', 1)[1].strip()
            elif idx == 7:
                game['venue'] = child.text.split(':', 1)[1].strip()
            elif idx == 9:
                game['duration'] = child.text.split(':', 1)[1].strip()


# Get Player Data from tables
    for comment in soup.find_all(string=lambda text:isinstance(text,Comment)):
         data = BeautifulSoup(comment,"lxml")
         for items in data.select("table tr"):
             player_data = [' '.join(item.text.split()) for item in items.select("th,td")]
             print(player_data)
             print '======================================================='

# Get Umpire Data        



# Append game data to full list        
    games.append(game)

    print

print 'Results'
print '*' * 80

# Print the games harvested to the console

for idx, game in enumerate(games):
    print str(idx) + ':  ' + str(game)

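# Write to CSV: one row per game, with the dictionary keys as columns
csvfile = "C:/Users/Benny/Desktop/anatest.csv"
fieldnames = ['date', 'start_time', 'venue', 'duration']

with open(csvfile, "w") as output:
    writer = csv.DictWriter(output, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()
    writer.writerows(games)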

Many thanks, Benny

2 Answers:

Answer 0: (score: 0)

I used the re module to extract the commented section:

from bs4 import BeautifulSoup
import re

data = """<div class="section_wrapper setup_commented commented" id="all_342042674">
<div class="section_heading">
  <span class="section_anchor" id="342042674_link" data-label="Other Info"></span>
    <h2>Other Info</h2>    <div class="section_heading_text">
      <ul>
      </ul>
    </div>
</div><div class="placeholder"></div>
<!--
    <div class="section_content" id="div_342042674">
<div><strong>Umpires:</strong> HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.</div>
<div><strong>Time of Game:</strong> 3:21.</div>
<div><strong>Attendance:</strong> 33,809.</div>
<div><strong>Start Time Weather:</strong> 70&deg; F, Wind 6mph out to Centerfield, Night, No Precipitation.</div>

    </div>

-->
</div>"""

soup = BeautifulSoup(re.search(r'(?<=<!--)(.*?)(?=-->)', data, flags=re.DOTALL)[0], 'lxml')

umpires, time_of_game, attendance, start_time_weather = soup.select('div.section_content > div')

print('ID: ', soup.find('div', class_="section_content")['id'])
print('umpires: ', umpires.text)
print('time of game: ', time_of_game.text)
print('attendance: ', attendance.text)
print('start_time_weather: ', start_time_weather.text)

Output:

ID:  div_342042674
umpires:  Umpires: HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.
time of game:  Time of Game: 3:21.
attendance:  Attendance: 33,809.
start_time_weather:  Start Time Weather: 70° F, Wind 6mph out to Centerfield, Night, No Precipitation.
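The regular expression above grabs the first comment in the string, which works for the isolated snippet but may pick up a different comment on a full page. As a hedged alternative, the sketch below walks BeautifulSoup's Comment nodes and keeps the one that contains the "Umpires:" label, so it never looks at the changing ID. The function name extract_other_info and the SAMPLE string are made up for illustration, and html.parser is used only to avoid the lxml dependency; 'lxml' should work the same way.

```python
from bs4 import BeautifulSoup, Comment

# Trimmed-down copy of the markup from the question (illustrative only).
SAMPLE = """<div class="section_wrapper setup_commented commented" id="all_342042674">
<h2>Other Info</h2>
<!--
    <div class="section_content" id="div_342042674">
<div><strong>Umpires:</strong> HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.</div>
<div><strong>Time of Game:</strong> 3:21.</div>
<div><strong>Attendance:</strong> 33,809.</div>
    </div>
-->
</div>"""


def extract_other_info(html):
    """Return {label: value} from the commented-out block that mentions umpires.

    Hypothetical helper: it assumes every value is the text node that
    directly follows a <strong> label, as in the question's markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Comment nodes are strings, so they can be filtered by type.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if "Umpires:" not in comment:
            continue  # some other commented-out section
        inner = BeautifulSoup(comment, "html.parser")
        info = {}
        for strong in inner.find_all("strong"):
            label = strong.get_text(strip=True).rstrip(":")
            value = strong.next_sibling
            info[label] = str(value).strip() if value is not None else ""
        return info
    return {}  # no matching comment found


info = extract_other_info(SAMPLE)
print(info["Umpires"])
```

In the question's loop, the returned dictionary could be merged into the per-game record, for example with game.update(extract_other_info(...)), under the "# Get Umpire Data" step.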

Answer 1: (score: 0)

If you strip the troublesome <!-- and --> markers out of the HTML, the content becomes easy to access. Here is one way to do it:

import requests
from bs4 import BeautifulSoup

url = "https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml"

res = requests.get(url)
content = res.text.replace("<!--","").replace("-->","")
soup = BeautifulSoup(content,"lxml")
umpire, gametime, attendance, weather = soup.find_all(class_="section_content")[2]("strong")
print(f'{umpire.next_sibling}\n{gametime.next_sibling}\n{attendance.next_sibling}\n{weather.next_sibling}\n')

Output:

 HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.
 3:21.
 33,809.
 70° F, Wind 6mph out to Centerfield, Night, No Precipitation.
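The umpire line printed above is still one string. If a separate value per position is wanted (for instance, one CSV column each), a small helper along these lines might do it. The name parse_umpires is hypothetical, and the sketch assumes the "POSITION - Name" pairs are comma-separated and end with a single period, as in the sample output; a name that itself ends in a period or contains a comma would need extra handling.

```python
def parse_umpires(text):
    """Split 'HP - Name, 1B - Name, ...' into a {position: name} dict.

    Assumes the format shown in the sample output; not a general parser.
    """
    crew = {}
    # Drop surrounding whitespace and the trailing period, then split pairs.
    for part in text.strip().rstrip(".").split(","):
        position, _, name = part.strip().partition(" - ")
        if name:  # ignore empty or malformed pieces
            crew[position] = name
    return crew


crew = parse_umpires(
    " HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza."
)
print(crew)
```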