Scraping part of a table with BeautifulSoup

Time: 2020-06-08 19:25:01

Tags: python beautifulsoup

So far, my code looks like this:

from bs4 import BeautifulSoup
import csv

html = open("Greyhound Race and Breeding1.html").read()
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

output_rows = []
for table_row in table.findAll('tr'):
    columns = table_row.findAll('td')
    output_row = []
    # collect the text of every cell in this row
    for column in columns:
        output_row.append(column.text)
    # append the finished row once, after the cell loop
    output_rows.append(output_row)

# newline="" stops the csv module from writing blank lines on Windows
with open('output.csv', 'w', newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(output_rows)

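This returns far more than I want. I only need the part of the table that starts at the td titled "order in which the dogs arrived at the finish". How can I modify my code to do that?

My guess is that I should change table = soup.find("table") so that it finds

    <td title="order in which the dogs arrived at the finish">

but I don't know how. Maybe I should set table to the parent of that td:

    <td title="order in which the dogs arrived at the finish">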

<table> 
<tr>
  <td>I don't want this</td>
  <td>Or this</td>
</tr>
</table>

<table> 
<tr>
  <td>I don't want this</td>
  <td>Or this</td>
</tr>
</table>
<table>
<tr>
<td title="order in which the dogs arrived at the finish"> I want this and the rest of  the document</td>
<td> More things I want</td>
</tr> 
</table>

I almost have Jack Fleeting's solution working:

html = open("Greyhound Race and Breeding1.html").read()
soup = BeautifulSoup(html)
#table = soup.find("table")["title": "order in which the dogs arrived at the finish"]

#table = str(soup.find("table",{"title": "order in which the dogs arrived at the finish"}))
table = soup.find("table")
for table in soup.select('table'):
    if table.select_one('td[title="order in which the dogs arrived at the finish"]') is not None:
        newTable = table
output_rows = []
for table_row in newTable.findAll("tr"):
    columns = table_row.findAll("td")
    output_row = []
    for column in columns:
        output_row.append(column.text)
        output_rows.append(output_row)

with open("output8.csv", "w") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(output_rows)


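The problem is that it repeats the same row several times, although it is the correct table. I tried several times to fix this, without luck. So I decided to switch to pandas: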


import pandas as pd

# read_html returns a list with one DataFrame per <table> in the file
df = pd.read_html("Greyhound Race and Breeding1.html")

# This shows how many tables there are
print(len(df))

# To find the right table, I brute-forced it by printing each df[i] in turn.
# It turns out the table I was looking for was df[8]
print(df[8])

# Finally, write the table to a CSV file
df[8].to_csv("Table.csv")


1 answer:

Answer 0 (score: 1)

If I understand you correctly, you can do this with CSS selectors:

for table in soup.select('table'):
    target = table.select('td[title="order in which the dogs arrived at the finish"]')
    if len(target)>0:
        print(table)
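
As a self-contained check of the loop above, here is a minimal sketch that runs the same selector against markup modeled on the sample in the question. The inline `html` string and the name `matching_tables` are assumptions made for illustration; the real page will have more tables and different content.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question; the real file will differ.
html = """
<table><tr><td>I don't want this</td><td>Or this</td></tr></table>
<table><tr><td>I don't want this</td><td>Or this</td></tr></table>
<table><tr>
<td title="order in which the dogs arrived at the finish">I want this</td>
<td>More things I want</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
selector = 'td[title="order in which the dogs arrived at the finish"]'

# Keep only the tables that contain at least one cell matching the selector.
matching_tables = [
    table for table in soup.select("table")
    if table.select_one(selector) is not None
]

print(len(matching_tables))
```

Collecting the matches in a list instead of printing them makes it easy to see whether the title is unique on the page before relying on a single result.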

If you know that only one table meets this condition, you can use:

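target = soup.select_one('td[title="order in which the dogs arrived at the finish"]')
# find_parent("table") climbs to the enclosing table; with no argument it would stop at the <tr>
print(target.find_parent("table"))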

Output:

<table>
<tr>
<td title="order in which the dogs arrived at the finish"> I want this and the rest of  the document</td>
<td> More things I want</td>
</tr>
</table>
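
To connect this back to the CSV export in the question, the following is a rough sketch rather than a definitive implementation. It locates the titled cell, climbs to its table, and builds one list per row. The helper name `table_rows` and the inline `sample` markup are hypothetical. Each row is appended once per `<tr>`, which avoids the repeated rows that appear when the append sits inside the per-cell loop.

```python
import csv
import io

from bs4 import BeautifulSoup


def table_rows(html, title):
    """Return the cell text of every row in the table that holds the td with `title`."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td", attrs={"title": title})
    if cell is None:
        return []  # no such cell on this page
    table = cell.find_parent("table")
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append(cells)  # one append per <tr>, not one per cell
    return rows


# Hypothetical sample modeled on the question's markup.
sample = """
<table><tr><td>I don't want this</td><td>Or this</td></tr></table>
<table><tr>
<td title="order in which the dogs arrived at the finish">I want this</td>
<td>More things I want</td>
</tr></table>
"""

rows = table_rows(sample, "order in which the dogs arrived at the finish")

# An in-memory buffer keeps the sketch free of side effects; replace it with
# open("output.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Returning an empty list when the cell is missing is one possible design choice; raising an error may suit a script that should fail loudly when the page layout changes.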