Question

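I wrote a scraper in Python, and it runs fine. Now I want to keep or discard specific links from the page depending on whether they contain "mobiles", but even after trying a few conditional statements I can't get it to work. I'd appreciate any help in correcting my mistake.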

import requests
from bs4 import BeautifulSoup
def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    soup = BeautifulSoup(Process.text, "lxml")
    for link in soup.findAll('div',class_='')[0].findAll('a'):
        if "mobiles" not in link:
            print(link.get('href'))
SpecificItem()

On the other hand, if I do the same thing with the lxml library and XPath, it works.

import requests
from lxml import html
def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    tree = html.fromstring(Process.text)
    links = tree.xpath('//div[@class=""]//a/@href')
    for link in links:
        if "mobiles" not in link:
            print(link)

SpecificItem()

So at this point I assume the code has to be written differently with BeautifulSoup to achieve the same result.

Answer 1

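The root of the problem is that your if condition behaves differently in BeautifulSoup and in lxml. With lxml, the XPath ends in /@href, so each link is already a plain string. With BeautifulSoup, link is a Tag object, so if "mobiles" not in link: does not check whether "mobiles" appears in the href. I didn't dig very deep, but as far as I can tell it tests the tag's child nodes (roughly, its text content) rather than the href attribute. Using the href explicitly fixes it: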

import requests
from bs4 import BeautifulSoup

def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    soup = BeautifulSoup(Process.text, "lxml")
    for link in soup.findAll('div',class_='')[0].findAll('a'):
        # Compare against the href string, not the Tag object itself
        href = link.get('href')
        # href is None when an <a> tag has no href attribute, so guard first
        if href and "mobiles" not in href:
            print(href)

SpecificItem()

This prints a bunch of links, none of which contain "mobiles".
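
To make the difference easier to see without hitting the live site, here is a small offline sketch. The SAMPLE_HTML snippet and the filter_links helper are made up for illustration (they are not from the question), and it uses the built-in "html.parser" instead of "lxml" so that nothing beyond bs4 is needed. It shows that the in operator on a Tag looks at the tag's children, while filtering on the href string gives the intended result.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<div class="nav">
  <a href="/mobiles/pr?sid=tyy">mobiles</a>
  <a href="/mobiles-accessories/cases">Cases</a>
  <a href="/laptops/pr?sid=6bo">Laptops</a>
  <a name="top">No href here</a>
</div>
"""

def filter_links(html, keyword="mobiles"):
    """Return hrefs inside the first <div> that do not contain keyword."""
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for link in soup.find_all("div")[0].find_all("a"):
        href = link.get("href")  # None when the <a> has no href attribute
        if href and keyword not in href:
            kept.append(href)
    return kept

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first, second = soup.find_all("a")[:2]

# `in` on a Tag tests membership among its children, not a substring of href.
print("mobiles" in first)    # True: the only child is the string "mobiles"
print("mobiles" in second)   # False: href contains it, but the child is "Cases"
print(filter_links(SAMPLE_HTML))  # ['/laptops/pr?sid=6bo']
```

The second link is the interesting case: its href contains "mobiles", yet the Tag-level check does not see it, which is why the original condition let every link through.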

How to filter specific items from a web page using conditional statements
