I'm working on a scraping assignment in Python using BeautifulSoup and I've hit a strange error. It mentions strip, which I'm not using, so I'm guessing it's something to do with BeautifulSoup's processing?
The assignment asks me to go to a starting URL, find the 18th link, follow that link 7 times, and then report the name from the 18th link on the 7th page. I'm trying to use a function to get the href from the 18th link and then update a global variable so the recursion uses a different URL each time. Any suggestions on what I'm missing would be very helpful. Here are the code and the error:
from bs4 import BeautifulSoup
import urllib
import re

nameList = []
urlToUse = "http://python-data.dr-chuck.net/known_by_Basile.html"

def linkOpen():
    global urlToUse
    html = urllib.urlopen(urlToUse)
    soup = BeautifulSoup(html, "lxml")
    tags = soup("li")
    count = 0
    for tag in tags:
        if count == 17:
            tagUrl = re.findall('href="([^ ]+)"', str(tag))
            nameList.append(tagUrl)
            urlToUse = tagUrl
            count = count + 1
        else:
            count = count + 1
            continue

bigCount = 0
while bigCount < 9:
    linkOpen()
    bigCount = bigCount + 1

print nameList[8]
Error:
Traceback (most recent call last):
  File "assignmentLinkScrape.py", line 26, in <module>
    linkOpen()
  File "assignmentLinkScrape.py", line 10, in linkOpen
    html = urllib.urlopen(urlToUse)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 185, in open
    fullurl = unwrap(toBytes(fullurl))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1075, in unwrap
    url = url.strip()
AttributeError: 'list' object has no attribute 'strip'
Answer 0 (score: 3)
re.findall() returns a list of matches. So urlToUse is a list, and you are trying to pass it to urlopen(), which expects a URL string.
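A minimal sketch of the fix: index into the list returned by re.findall() so the global stays a string (the tag text below is a hypothetical stand-in for str(tag) in the question):

```python
import re

# Hypothetical <li> markup standing in for str(tag) from the question.
tag = '<li><a href="http://python-data.dr-chuck.net/known_by_Basile.html">Basile</a></li>'

matches = re.findall('href="([^ ]+)"', tag)
print(matches)  # a list of matches, not a string

# Take the first match so urlToUse can be passed to urlopen():
urlToUse = matches[0]
print(urlToUse)  # http://python-data.dr-chuck.net/known_by_Basile.html
```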
Answer 1 (score: 2)
Alexce has already explained your error, but you don't need a regex at all. You just want to get the 18th li tag and extract the href from the anchor inside it, which you can do with find and find_all:
from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get("http://python-data.dr-chuck.net/known_by_Basile.html").content, "lxml")
url = soup.find("ul").find_all("li", limit=18)[-1].a["href"]
Or use a CSS selector:
url = soup.select_one("ul li:nth-of-type(18) a")["href"]
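For illustration, here is a self-contained sketch of both approaches on an inline HTML snippet (hypothetical markup, no network; the CSS selector assumes bs4 4.7+ for nth-of-type support):

```python
from bs4 import BeautifulSoup

# Hypothetical list with three items, standing in for the real page.
html = """
<ul>
  <li><a href="/a1">one</a></li>
  <li><a href="/a2">two</a></li>
  <li><a href="/a3">three</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find/find_all: take the last of the first 3 li tags.
url = soup.find("ul").find_all("li", limit=3)[-1].a["href"]
print(url)  # /a3

# CSS selector equivalent: the anchor in the 3rd li.
url2 = soup.select_one("ul li:nth-of-type(3) a")["href"]
print(url2)  # /a3
```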
So to get the name after following the link seven times, put the logic in a function: visit the initial url, then fetch and extract the anchor seven times, and finally pull the text from the anchor on the last visit:
from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get("http://python-data.dr-chuck.net/known_by_Basile.html").content, "lxml")

def get_nth(n, soup):
    return soup.select_one("ul li:nth-of-type({}) a".format(n))

start = get_nth(18, soup)
for _ in range(7):
    soup = BeautifulSoup(requests.get(start["href"]).content, "html.parser")
    start = get_nth(18, soup)
print(start.text)