Question

I am new to this web scraping world, and so far I have been amazed by BeautifulSoup. However, there is one thing I have not been able to do.

What I want to do is delete certain tags that are followed by specific tags with specific attributes.

Let me show you:

#Import modules
from bs4 import BeautifulSoup
import requests

#Parse URL
url = "http://www.soccervista.com/Italy-Serie_A-2016_2017-845699.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

#This is the table which I want to extract
table = soup.find_all('table')[4]

After getting the right table to work with, there are some 'tr' tags that are followed by a 'td' with the attribute 'colspan'.

What I ultimately want is to delete those specific 'tr' tags, because I need the remaining 'tr' tags.

There are 3 'td' tags with the 'colspan' attribute in total:

#Output for 'td' with 'colspan'

print(table.select('td[colspan]'))

[<td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td>,
 <td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td>,
 <td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td>]

Here is an excerpt of the HTML and one example of the specific 'tr' I want to delete (marked below with a comment that says "#THIS ONE!"):

 <td align="center">
    2:1
   </td>
   <td class="one">
    AC Milan
   </td>
   <td>
    <a href="/Cagliari-AC_Milan-2320071-2320071.html">
     <img alt="More details about  -  soccer game" border="0" height="14" src="/imgs/detail3.gif" width="14"/>
    </a>
   </td>
  </tr>
  <tr class="predict">   <-- #THIS ONE!
   <td colspan="13">
    <img height="10" src="/imgs/line.png" width="100%"/>
   </td>
   <tr class="predict">
    <td>
     27 May
    </td>
    <td>
     38
    </td>
    <td>
     FT
    </td>
    <td align="right" class="one">

By the way, I also want to delete the 'td colspan' and the 'img'.

Any ideas?

* Latest version of Python installed

* Latest version of the BeautifulSoup module installed

Answer 1

Find the specific tags you want to delete, then use decompose() or extract().

for tag in tags_to_delete:
    tag.decompose()

or

for tag in tags_to_delete:
    tag.extract()
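
The difference between the two calls can be seen on a small example. This is a minimal sketch on a made-up HTML fragment (the `id` values are hypothetical, not from the page in the question): decompose() removes the tag and destroys it, while extract() removes the tag and hands it back so it can still be used.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, only for illustration
html = "<div><p id='a'>keep</p><p id='b'>drop</p><p id='c'>detach</p></div>"
soup = BeautifulSoup(html, "html.parser")

# decompose() removes the tag from the tree and destroys it; it returns None
result = soup.find("p", id="b").decompose()

# extract() removes the tag from the tree and returns it
detached = soup.find("p", id="c").extract()

remaining_ids = [p["id"] for p in soup.find_all("p")]
print(remaining_ids)
print(detached.get_text())
```

If the removed tags are not needed afterwards, as in the question, decompose() is the simpler choice.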

Edit

To find the specific tags, you can first find all the tr tags, then check whether each one contains a td with the attribute colspan="13"; if it does, decompose() it.

import requests
from bs4 import BeautifulSoup

url = "http://www.soccervista.com/Italy-Serie_A-2016_2017-845699.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')

table = soup.find_all('table')[4]
for t in table.find_all("tr", class_="predict"):
    # Separator rows contain a td with colspan="13"
    check = t.find("td", colspan="13")
    if check is not None:
        t.decompose()
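
The same loop can be tried without hitting the network. The sketch below uses a small, well-formed stand-in for the real table (the rows and dates are invented for illustration) and the built-in html.parser, so it only shows the filtering logic, not the behaviour on the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for the table on the real page
html = """
<table>
  <tr class="predict"><td>27 May</td><td>38</td><td>FT</td></tr>
  <tr class="predict"><td colspan="13"><img src="/imgs/line.png"/></td></tr>
  <tr class="predict"><td>28 May</td><td>38</td><td>FT</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# find_all() returns a list, so removing rows while looping over it is safe
for row in table.find_all("tr", class_="predict"):
    if row.find("td", colspan="13") is not None:
        row.decompose()  # removes the tr together with its td and img

rows_left = table.find_all("tr")
print(len(rows_left))
```

Because decompose() removes the whole tr, the 'td colspan' and the 'img' inside it go away with it.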

Answer 2

You have already got the table and td[colspan], so you can get the parent element of each td and decompose it from the table. Also change the parser from html.parser to lxml, as follows:

from bs4 import BeautifulSoup
import requests

#Parse URL
url = "http://www.soccervista.com/Italy-Serie_A-2016_2017-845699.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml') #change the parser from html.parser to lxml

#This is the table which I want to extract
table = soup.find_all('table')[4]
for tdcol in table.select('td[colspan]'):
    tdcol.parent.decompose()
print(table.prettify())

The following items are then removed from table:

<tr class="predict"><td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td></tr>
<tr class="predict"><td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td></tr>
<tr class="predict"><td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td></tr>
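
A likely reason for the parser change, judging from the excerpt in the question where the separator tr has no closing tag: html.parser does not repair the missing </tr>, so the following row ends up nested inside the separator row. The sketch below reproduces this on a made-up fragment (an assumption about the page's markup, not a copy of it) and shows that decomposing the parent would then take the data row with it.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the excerpt: the separator <tr> is never closed
html = (
    '<table>'
    '<tr class="predict"><td colspan="13"><img src="/imgs/line.png"/></td>'
    '<tr class="predict"><td>27 May</td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

separator_row = table.select_one("td[colspan]").parent
data_row = table.find_all("tr")[1]  # second tr in document order

# With html.parser the data row is a child of the separator row
nested = data_row.parent is separator_row
print(nested)

# Removing the separator row therefore removes the nested data row too
separator_row.decompose()
rows_left = table.find_all("tr")
print(len(rows_left))
```

A parser that repairs the markup, such as lxml in the answer above, should produce sibling rows instead, so that only the separator rows are removed.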

BeautifulSoup: delete tags followed by specific tags and specific attributes
