Question

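I've written a script in Python to get a certain link from a webpage using a specific search. The problem is that I get four links. However, I want only the first link that matches the search criterion, no matter how many links of the same kind there are.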

My attempt so far:

import requests
from lxml.html import fromstring

main_url = "http://www.excel-easy.com/vba.html"

def search_item(url):
    response = requests.get(url)
    tree = fromstring(response.text)
    for item in tree.cssselect("a"):
        try:
            if "excel" in item.text.lower():
                url_link = item.attrib['href']
                print(url_link)
        except: pass    

search_item(main_url)

The result I get:

http://www.excel-easy.com
http://www.excel-easy.com
http://www.excel-easy.com
http://www.excel-easy.com/introduction/formulas-functions.html

The result I'm after (only the first one):

http://www.excel-easy.com

I tried using item[0].attrib['href'], but that is apparently not a valid expression. Any help on this would be much appreciated.

Answer 1

You can use an XPath expression.

>>> import requests
>>> from lxml import html
>>> url = "http://www.excel-easy.com/vba.html"
>>> response = requests.get(url).content
>>> tree = html.fromstring(response)

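Once the HTML is parsed, get a list of the hrefs of all links in the page and loop through them. As soon as one of them, converted to lowercase, contains 'excel', display that href and break out of the loop.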

>>> for item in tree.xpath('.//a/@href'):
...     if 'excel' in item.lower():
...         item
...         break
...     
'http://www.excel-easy.com'
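
The same "stop at the first hit" idea can also be written without an explicit loop, using next() on a generator expression. The sketch below is only an illustration: first_matching is a hypothetical helper name, and the sample list is copied from the output shown in the question so that it runs without a network request. In practice the list could come from tree.xpath('.//a/@href').

```python
def first_matching(hrefs, keyword):
    """Return the first href containing keyword (case-insensitive), or None."""
    keyword = keyword.lower()
    # next() pulls a single item from the generator, so scanning stops
    # at the first match; the second argument is the fallback value.
    return next((href for href in hrefs if keyword in href.lower()), None)

# Sample data copied from the output shown in the question.
sample_hrefs = [
    "http://www.excel-easy.com",
    "http://www.excel-easy.com",
    "http://www.excel-easy.com",
    "http://www.excel-easy.com/introduction/formulas-functions.html",
]

print(first_matching(sample_hrefs, "excel"))  # http://www.excel-easy.com
print(first_matching(sample_hrefs, "word"))   # None
```

Returning None as the fallback avoids a StopIteration when nothing matches.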

Answer 2

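I originally used a list comprehension, but I think this reads more easily as a for loop. The filter was a bit too cumbersome in the list comprehension. I also don't think you need the try/except for this: if "href" is not among the attributes, the element simply fails the if statement.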

import requests
from lxml.html import fromstring

main_url = "http://www.excel-easy.com/vba.html"

def search_item(url):

    response = requests.get(url)
    tree = fromstring(response.text)
    matched = []

    for element in tree.cssselect("a"):
        if "href" in element.attrib and "excel" in element.attrib['href'].lower():
            matched.append(element)

    if matched:
        return matched[0].attrib['href']
    else:
        return None

print(search_item(main_url))
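
Note that the code above matches on the href, whereas the question's code matched on the link text. If the text is what matters, a variant along the following lines might work. It is a sketch under assumptions: SAMPLE_HTML and first_link_by_text are hypothetical names, and the inline markup stands in for the downloaded page so the example runs offline. It returns as soon as a match is found and uses text_content(), which yields an empty string for anchors without text, so no try/except is needed.

```python
from lxml.html import fromstring

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<div>
  <a href="/index.html"><img src="logo.png"></a>
  <a href="http://www.example.com/word">Word Easy</a>
  <a href="http://www.excel-easy.com">Excel Easy</a>
  <a href="http://www.excel-easy.com/introduction/formulas-functions.html">Excel functions</a>
</div>
"""

def first_link_by_text(html_text, keyword):
    """Return the href of the first <a> whose visible text contains keyword."""
    tree = fromstring(html_text)
    keyword = keyword.lower()
    for anchor in tree.iter("a"):
        # text_content() returns '' (not None) for anchors with no text,
        # e.g. a link that only wraps an image.
        text = anchor.text_content().lower()
        if keyword in text and "href" in anchor.attrib:
            return anchor.attrib["href"]  # stop at the first match
    return None

print(first_link_by_text(SAMPLE_HTML, "excel"))  # http://www.excel-easy.com
```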
