Question

This is the source of a web page that I am scraping with Beautiful Soup.

<tr>
  <td>
    1
  </td>
  <td style="cipher1">
    <img class="cipher2" src="http://cipher3.png" alt="cipher4" title="cipher5" />
    <a href="/cipher6" title="cipher6" class="cipher7"><span class="cipher8">t</span>cipher9</a> 
  </td>
  <td>
    112
  </td>
  <td>
    3510
  </td>

// the pattern repeats

<tr >
 <td>
        2
 </td>
 <td style="cipher1">

I wrote some code with BeautifulSoup, but because this pattern occurs many times, I get more results than I want.

I used

row1 = soup.find_all('a', class_="cipher7")
for row in row1:
    f.write(row['title'] + "\n")

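But with this I get every 'cipher7' link, because that class appears many times in the page, not only in the rows I care about.

So I could use this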

 <td style="cipher1">...

because it is unique to the content I want.

So, how should I modify my code?

Answer 1

You can find the td tags first (since you say that style is unique to them), and then look for the a tags you want inside each one.

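all_as = []
# Restrict the search to the uniquely styled cells first.
rows = soup.find_all('td', {'style': 'cipher1'})
for row in rows:
    # extend() keeps all_as a flat list of <a> tags;
    # append() would nest one list per cell.
    all_as.extend(row.find_all('a', class_="cipher7"))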
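
For reference, here is a minimal self-contained sketch of the same idea that can be run without the real page. The sample markup, the titles "first", "second" and "unwanted", and the helper name titles_in_styled_cells are all made up for illustration; only the td style and the a class come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the question's snippet.
# The last link has the same class but sits outside the styled cell.
SAMPLE_HTML = """
<table>
  <tr>
    <td>1</td>
    <td style="cipher1">
      <a href="/first" title="first" class="cipher7">first</a>
    </td>
  </tr>
  <tr>
    <td>2</td>
    <td style="cipher1">
      <a href="/second" title="second" class="cipher7">second</a>
    </td>
  </tr>
</table>
<div><a href="/other" title="unwanted" class="cipher7">other</a></div>
"""


def titles_in_styled_cells(html):
    """Return the title of every a.cipher7 found inside td[style=cipher1]."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    # Narrow down to the uniquely styled cells, then search within each.
    for cell in soup.find_all("td", attrs={"style": "cipher1"}):
        for link in cell.find_all("a", class_="cipher7"):
            titles.append(link["title"])
    return titles


titles = titles_in_styled_cells(SAMPLE_HTML)
print("\n".join(titles))
```

The link inside the div is skipped because the search never leaves the styled cells. Note that the style value has to match the attribute string exactly for this lookup to find the cell.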

Answer 2

You can use the convenient select method, which takes a CSS selector as its argument:

row = soup.select("td[style=cipher1] > a.cipher7")
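
select returns a list of matching tags, so the titles still have to be read from each element. A small sketch of how that might look, again on made-up sample markup (the titles and the SAMPLE_HTML name are assumptions, not from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the style value and class name
# come from the question.
SAMPLE_HTML = """
<table>
  <tr>
    <td>1</td>
    <td style="cipher1">
      <a href="/first" title="first" class="cipher7">first</a>
    </td>
  </tr>
  <tr>
    <td>2</td>
    <td style="cipher1">
      <a href="/second" title="second" class="cipher7">second</a>
    </td>
  </tr>
</table>
<div><a href="/other" title="unwanted" class="cipher7">other</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "td > a" matches only links that are direct children of the cell.
# Quoting the attribute value is safer if the real style contains
# spaces, colons or semicolons.
links = soup.select('td[style="cipher1"] > a.cipher7')
selected_titles = [link["title"] for link in links]
print("\n".join(selected_titles))
```

If the links on the real page are wrapped in another element inside the cell, the child combinator (>) would miss them; a descendant selector such as 'td[style="cipher1"] a.cipher7' would be the more forgiving choice.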
