Ignoring N/A values in td elements with BeautifulSoup

Time: 2018-05-03 08:42:14

Tags: python web-scraping beautifulsoup

I want to drop the N/A values that appear in td elements sharing the same class.

  <td align="left" class="category"> N/A</td>
<td align="left" class="title"> <a href="article-feb-0243.html">Wall Street cool to eBay's profit</a></td>
<td align="left" class="category"> technology</td>
<td align="left" class="title"> <a href="article-feb-2017.html">Warnings about junk mail deluge</a></td>
<td align="left" class="category"> technology</td>
<td align="left" class="title"> <a href="article-feb-2660.html">Web radio takes Spanish rap global</a></td>
<td align="left" class="category"> sport</td>

I want to scrape the category and the title, but ignore the N/A values in the category cells.

for td in parsed_html.body.findAll('td',{"class":lambda class_: class_ in ("category","title")}):
    print(td)
    category=td.parent.find("td",attrs={"class":"category"}).text

    if(not td.parent.find("i")):
        url=td.parent.find("a")["href"]

I have tried matching the string against N/A, but it is not working.

1 Answer:

Answer 0: (score: 1)

First, you don't need a custom function to match multiple classes. You can pass the different classes as a list.
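
To illustrate the point about lists, here is a minimal sketch comparing the two filters. The small HTML fragment, the variable names, and the choice of `html.parser` are assumptions made for this example, not part of the original question.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two cells we want and one we do not.
html = '''
<table><tr>
<td class="category"> technology</td>
<td class="title"> <a href="article-feb-2017.html">Warnings about junk mail deluge</a></td>
<td class="other"> ignored</td>
</tr></table>'''

soup = BeautifulSoup(html, 'html.parser')

# A list passed to class_ matches tags carrying any of the listed classes.
by_list = soup.find_all('td', class_=['category', 'title'])

# The lambda-based filter from the question, shown for comparison.
by_lambda = soup.find_all('td', {'class': lambda c: c in ('category', 'title')})

for td in by_list:
    print(td['class'], td.get_text(strip=True))
```

Both calls should select the same two cells here; the list form is simply shorter to read.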

Second, there are two ways to get what you want. You can check whether the text contains N/A while iterating over all the <td> tags, and skip the tag if it does.

from bs4 import BeautifulSoup

html = '''
<td align="left" class="category"> N/A</td>
<td align="left" class="title"> <a href="article-feb-0243.html">Wall Street cool to eBay's profit</a></td>
<td align="left" class="category"> technology</td>
<td align="left" class="title"> <a href="article-feb-2017.html">Warnings about junk mail deluge</a></td>
<td align="left" class="category"> technology</td>
<td align="left" class="title"> <a href="article-feb-2660.html">Web radio takes Spanish rap global</a></td>
<td align="left" class="category"> sport</td>'''

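soup = BeautifulSoup(html, 'lxml')
# Passing a list to class_ matches <td> tags with either class.
for td in soup.find_all('td', class_=['category', 'title']):
    # Skip any cell whose text contains N/A.
    if 'N/A' in td.text:
        continue
    print(td)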

Output:

<td align="left" class="title"> <a href="article-feb-0243.html">Wall Street cool to eBay's profit</a></td>
<td align="left" class="category"> technology</td>
<td align="left" class="title"> <a href="article-feb-2017.html">Warnings about junk mail deluge</a></td>
<td align="left" class="category"> technology</td>
<td align="left" class="title"> <a href="article-feb-2660.html">Web radio takes Spanish rap global</a></td>
<td align="left" class="category"> sport</td>

You can also do this with a custom function.
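
A custom function passed to find_all can apply the same filter in a single step. The following is a minimal sketch rather than the answer's original code: the helper name `wanted_td`, the shortened HTML sample, and the use of `html.parser` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<td align="left" class="category"> N/A</td>
<td align="left" class="title"> <a href="article-feb-0243.html">Wall Street cool to eBay's profit</a></td>
<td align="left" class="category"> technology</td>
<td align="left" class="title"> <a href="article-feb-2017.html">Warnings about junk mail deluge</a></td>
<td align="left" class="category"> sport</td>'''


def wanted_td(tag):
    """Return True for category/title <td> tags whose text is not N/A."""
    if tag.name != 'td':
        return False
    # class is a multi-valued attribute, so BeautifulSoup stores it as a list.
    classes = tag.get('class') or []
    if not any(c in ('category', 'title') for c in classes):
        return False
    return 'N/A' not in tag.get_text()


soup = BeautifulSoup(html, 'html.parser')
matches = soup.find_all(wanted_td)
for td in matches:
    print(td)
```

find_all calls the function once per tag and keeps the tags for which it returns True, so the N/A cell is never yielded and the loop body needs no extra check.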