Question

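I am scraping links from a Christmas tree farm website. First, I used the method from this tutorial to get all of the links. Then I noticed that the links I want do not include the protocol and host (they are relative URLs), so I created a variable to concatenate onto them. Now I am trying to write an if statement that takes each link and looks for any two characters followed by "xmastrees.php". If the link matches, my concatenate variable is prepended to it. If the link does not contain that text, it is deleted. For example, NYxmastrees.php would become http://www.pickyourownchristmastree.org/NYxmastrees.php, and ../disclaimer.htm would be removed. I have tried several approaches but cannot seem to find the right one.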

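Here is what I currently have. I keep getting a syntax error on del. When I comment that line out, I get another error saying that my string object has no attribute 're'. This confuses me, because I thought I could use regular expressions with strings?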

source = requests.get('http://www.pickyourownchristmastree.org/').text
soup = BeautifulSoup(source, 'lxml')
concatenate = 'http://www.pickyourownchristmastree.org/'

find_state_group = soup.find('div', class_ = 'alert')
for link in find_state_group.find_all('a', href=True):
    if link['href'].re.search('^.\B.\$xmastrees'):
        states = concatenate + link
    else del link['href']
    print(link['href']

Error with else del link['href']:

    else del link['href']
           ^
SyntaxError: invalid syntax

Error without else del link['href']:

    if link['href'].re.search('^.\B.\$xmastrees'):
AttributeError: 'str' object has no attribute 're'

Answer 1

You can try using:

import requests
from bs4 import BeautifulSoup as bs

u = "http://www.pickyourownchristmastree.org/"
soup = bs(requests.get(u).text, 'html5lib')

find_state_group = soup.find('div', {"class": 'alert'})
for link in find_state_group.find_all('a', href=True):
    if "mastrees" in link['href']:
        states = u + link['href']
        print(states)

http://www.pickyourownchristmastree.org/ALxmastrees.php
http://www.pickyourownchristmastree.org/AZxmastrees.php
http://www.pickyourownchristmastree.org/AKxmastrees.php
...
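
If you want the stricter rule from the question (exactly two characters followed by xmastrees.php), a regex version might look like the sketch below. Two things to note: re is a module, so the call is re.search(pattern, string) or a method on a compiled pattern, not an attribute of the string; and else needs a colon with its body on its own line. The helper name filter_state_links and the sample hrefs are made up for illustration, and the sketch works on a plain list of href strings instead of fetching the page.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://www.pickyourownchristmastree.org/"

# Exactly two characters followed by "xmastrees.php" (the rule from the question).
STATE_PAGE = re.compile(r".{2}xmastrees\.php")


def filter_state_links(hrefs, base=BASE_URL):
    """Return absolute URLs for the hrefs that match the state-page pattern.

    Hypothetical helper for illustration; non-matching hrefs are simply skipped
    instead of being deleted from the parsed document.
    """
    return [urljoin(base, href) for href in hrefs if STATE_PAGE.fullmatch(href)]


# Sample hrefs standing in for link['href'] values from find_all('a', href=True).
sample_hrefs = ["NYxmastrees.php", "../disclaimer.htm", "ALxmastrees.php"]
state_links = filter_state_links(sample_hrefs)
for url in state_links:
    print(url)
```

Building a new list avoids having to del anything while iterating, and urljoin takes care of joining the base URL with the relative href.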
