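I am trying to create a delimited text file containing the data from the "Actions" table on web pages such as this one: http://stats.swehockey.se/Game/Events/300978
I want each line to contain the game # (taken from the end of the URL), followed by the text of one row of the table. For example: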
300972 | 60:00 | GK Out | OHK | 33. Hudacek, Julius
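I can't get the rows to actually come out as separate lines. I have tried parsing each row and column, using a list of stripped strings, and searching by different tags, classes, and styles.
Here is what I have so far: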
from bs4 import BeautifulSoup
import urllib.request

def createtext():
    gamestr = urlstr + "|"
    #Find all table lines. Create one pipe-delimited line for each.
    aptext = gamestr
    for el in soup.find_all('tr'):
        playrow = el.find_all('td', 'tdOdd')
        for td in playrow:
            if(td.find(text=True)) not in ("", None, "\n"):
                aptext = aptext + ''.join(td.text) + "|"
        aptext = aptext + "\n" + gamestr
    #Creates file with Game # as filename and writes the data to the file
    currentfile = urlstr + ".txt"
    with open(currentfile, "w") as f:
        f.write(str(aptext))

#Grabs the HTML file and creates the soup
urlno = 300978
urlstr = str(urlno)
url = ("http://stats.swehockey.se/Game/Events/" + urlstr)
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
pbpdoc = response.read().decode('utf-8')
soup = BeautifulSoup(pbpdoc)
createtext()
Thanks for any help or guidance!
Answer 0 (score: 0)
First, you don't have to build the CSV data by hand; Python provides the built-in csv module for that.
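As a minimal sketch of that point, assuming the pipe-delimited layout from the question is what you are after, csv.writer accepts a delimiter argument and writes one line per list it is given. The io.StringIO buffer only stands in for a real file here, and the second row is made up for illustration:

```python
import csv
import io

game_id = 300978

# One list per table row. The first row reuses the example line from the
# question; the second row is a made-up placeholder.
rows = [
    ["60:00", "GK Out", "OHK", "33. Hudacek, Julius"],
    ["12:34", "Example event", "ABC", "9. Player, Example"],
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
writer = csv.writer(buffer, delimiter="|")
for cells in rows:
    # Each writerow() call emits exactly one delimited line.
    writer.writerow([game_id] + cells)

pipe_text = buffer.getvalue()
print(pipe_text)
```

Because the writer ends every row itself, there is no need to append "\n" or the game number by hand between rows.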
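Then, since you only need the "Actions" data, I would locate the "Actions" table and pick out only the rows that hold events. A filter function that checks that the first cell is not empty helps with this: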
import csv

from bs4 import BeautifulSoup
import requests


def only_action_rows(tag):
    # Keep only <tr> rows whose first "tdOdd" cell has non-empty text.
    if tag.name == 'tr':
        first_cell = tag.find('td', class_='tdOdd')
        return first_cell and first_cell.get_text(strip=True)


event_id = 300978
url = "http://stats.swehockey.se/Game/Events/{event_id}".format(event_id=event_id)
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Locate the table that contains the "Actions" heading.
actions_table = soup.find("h2", text="Actions").find_parent("table")

# One list per action row, prefixed with the game id.
data = [[event_id] + [td.get_text(strip=True) for td in row.find_all('td', class_='tdOdd')]
        for row in actions_table.find_all(only_action_rows)]

# newline="" keeps the csv module from emitting blank lines between rows on Windows.
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(data)
Note that I am using requests here.
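The row-filtering idea (keep a row only when its first cell is non-empty) can also be tried without touching the network or the parser. The sketch below is only an illustration under assumptions: is_action_row and build_rows are hypothetical helper names, and raw_rows is made-up sample data standing in for cell text that has already been extracted from each table row:

```python
def is_action_row(cells):
    """True when the row has at least one cell and the first one is not blank."""
    return bool(cells) and bool(cells[0].strip())


def build_rows(game_id, raw_rows):
    """Prefix each kept row with the game id and strip whitespace from every cell."""
    return [
        [game_id] + [cell.strip() for cell in cells]
        for cells in raw_rows
        if is_action_row(cells)
    ]


# Hypothetical extracted cells, one list per table row.
raw_rows = [
    [],                                                  # row with no matching cells
    ["60:00", "GK Out", "OHK", "33. Hudacek, Julius"],   # example from the question
    ["", "", "", ""],                                    # spacer row, first cell blank
    [" 12:34 ", "Example event", "ABC", "9. Player, Example"],  # made-up row
]

action_rows = build_rows(300978, raw_rows)
for row in action_rows:
    print(row)
```

Keeping each row as its own list, rather than appending everything to one long string, is what makes the rows come out on separate lines once they are handed to csv.writer.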