Here is my code so far:
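from bs4 import BeautifulSoup
import csv
html = open("Greyhound Race and Breeding1.html").read()
# Name the parser explicitly so bs4 does not warn or pick a different one per machine
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")  # only the first <table> in the document
output_rows = []
for table_row in table.findAll('tr'):
    columns = table_row.findAll('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)
# newline='' keeps the csv module from writing blank lines between rows on Windows
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(output_rows)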
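This returns far more than I want: I only need the part of the document starting at the td titled "order in which the dogs arrived at the finish". How can I modify my code to do that?
My guess is that I should change table = soup.find("table") so that it finds
<td title="order in which the dogs arrived at the finish">.
But I don't know how. Maybe I should somehow set table to the parent of that td. A simplified sample of the HTML: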
<table>
  <tr>
    <td>I don't want this</td>
    <td>Or this</td>
  </tr>
</table>
<table>
  <tr>
    <td>I don't want this</td>
    <td>Or this</td>
  </tr>
</table>
<table>
  <tr>
    <td title="order in which the dogs arrived at the finish"> I want this and the rest of the document</td>
    <td> More things I want</td>
  </tr>
</table>
I almost have Jack Fleeting's solution working:
from bs4 import BeautifulSoup
import csv
html = open("Greyhound Race and Breeding1.html").read()
soup = BeautifulSoup(html, "html.parser")
# Earlier attempts that did not work (the title attribute is on a td, not on the table):
#table = soup.find("table")["title": "order in which the dogs arrived at the finish"]
#table = str(soup.find("table",{"title": "order in which the dogs arrived at the finish"}))
# Keep the last table that contains the target td
for table in soup.select('table'):
    if table.select_one('td[title="order in which the dogs arrived at the finish"]') is not None:
        newTable = table
output_rows = []
for table_row in newTable.findAll("tr"):
    columns = table_row.findAll("td")
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)
with open("output8.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(output_rows)
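The problem is that it repeats the same row several times, although it does pick the correct table. I tried several times to fix this, without luck, so I decided to switch to pandas: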
import pandas as pd
# read_html returns a list of DataFrames, one per table found in the file
df = pd.read_html("Greyhound Race and Breeding1.html")
# This shows how many tables there are
print(len(df))
# To find the right table, I brute-forced it by printing df[i] for each table.
# It turns out the table I was looking for was df[8].
print(df[8])
# Finally, write the table to a CSV file
df[8].to_csv("Table.csv")
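As for the repeated rows in the BeautifulSoup attempt, one possible cause is nested tables. This is only an assumption, since the real page is not shown. `find_all("tr")` searches recursively, so an outer table also reports the rows of any table nested inside it. The sketch below uses made-up markup (`SAMPLE_HTML`) and a hypothetical helper `direct_rows` to show how `recursive=False` limits the search to a table's own rows:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an outer layout table wrapping the table we want.
SAMPLE_HTML = """
<table id="outer"><tr><td>
  <table id="inner">
    <tr><td title="order in which the dogs arrived at the finish">1</td><td>Dog A</td></tr>
    <tr><td>2</td><td>Dog B</td></tr>
  </table>
</td></tr></table>
"""

def direct_rows(table):
    """Return the cell texts of the rows that belong to this table only."""
    rows = []
    # Rows may sit directly under <table> or inside <thead>/<tbody>/<tfoot>.
    containers = [table] + table.find_all(["thead", "tbody", "tfoot"], recursive=False)
    for container in containers:
        for tr in container.find_all("tr", recursive=False):
            cells = tr.find_all(["td", "th"], recursive=False)
            rows.append([cell.get_text(strip=True) for cell in cells])
    return rows

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
outer = soup.find("table", id="outer")
inner = soup.find("table", id="inner")

# The recursive search on the outer table also counts the inner table's rows.
print(len(outer.find_all("tr")))
# Only the inner table's own rows.
print(direct_rows(inner))
```

If the real page has no nested tables, the duplication has a different cause, for example an `append` call that ended up inside the inner loop.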
Answer 0 (score: 1)
If I understand you correctly, you can do this with CSS selectors:
for table in soup.select('table'):
    target = table.select('td[title="order in which the dogs arrived at the finish"]')
    if len(target) > 0:
        print(table)
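If you know that only one table meets the requirement, you can use:
target = soup.select_one('td[title="order in which the dogs arrived at the finish"]')
# find_parent() with no argument returns the enclosing <tr>; name the tag to get the table
print(target.find_parent('table'))
Output: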
<table>
  <tr>
    <td title="order in which the dogs arrived at the finish"> I want this and the rest of the document</td>
    <td> More things I want</td>
  </tr>
</table>
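To connect this back to the CSV goal in the question, here is a minimal sketch that finds the table through the titled td and serialises its rows as CSV. The function names `table_rows_by_cell_title` and `rows_to_csv` and the `SAMPLE_HTML` string are illustrative assumptions, not part of the original code, and the sketch assumes the target td sits inside exactly one table of interest.

```python
import csv
import io

from bs4 import BeautifulSoup

TITLE = "order in which the dogs arrived at the finish"

# Stand-in for the real page, modelled on the sample in the question.
SAMPLE_HTML = """
<table><tr><td>I don't want this</td><td>Or this</td></tr></table>
<table><tr>
<td title="order in which the dogs arrived at the finish"> I want this and the rest of the document</td>
<td> More things I want</td>
</tr></table>
"""

def table_rows_by_cell_title(html, title):
    """Return the rows (lists of cell text) of the table holding the titled <td>."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td", attrs={"title": title})
    if cell is None:
        return []  # no such cell in this document
    table = cell.find_parent("table")
    if table is None:
        return []  # the cell is not inside a table
    return [
        [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
        for tr in table.find_all("tr")
    ]

def rows_to_csv(rows):
    """Serialise rows to CSV text; the csv module handles quoting."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

rows = table_rows_by_cell_title(SAMPLE_HTML, TITLE)
print(rows_to_csv(rows), end="")
```

To write a file instead of a string, pass a handle opened with `open("output.csv", "w", newline="")` to `csv.writer`.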