Question

Consider the URL:

https://en.wikipedia.org/wiki/NGC_2808

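When I use it directly as the url in temp = requests.get(url).text, everything works.

Now consider the string name = "NGC2808". When I run s = name[:3] + '_' + name[3:] and then url = 'https://en.wikipedia.org/wiki/' + s, the program no longer works.

Here is the code snippet: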

s = name[:3] + '_' + name[3:]
url0 = 'https://en.wikipedia.org/wiki/' + s

url = requests.get(url0).text
soup = BeautifulSoup(url,"lxml")
soup.prettify()

table = soup.find('table',{'class':'infobox'})
tags = table.find_all('tr')

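This is the error: AttributeError: 'NoneType' object has no attribute 'find_all'

Edit: the name is not actually defined explicitly as "NGC2808"; it comes from scanning a .txt file. Still, print(name) outputs NGC2808. When I provide the name directly instead of reading it from the file, there is no error. Why is that?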


Answer 1

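Providing a minimal reproducible example and a copy of the error message here would help you a lot, and may give you deeper insight into the problem.

Nevertheless, the following works for me: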

import requests

name = "NGC2808"
s = name[:3] + '_' + name[3:]
url = 'https://en.wikipedia.org/wiki/' + s
temp = requests.get(url).text
print(temp)

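Edited after the question changed:

The error you posted means that soup.find returned None: Beautiful Soup could not find a table with class infobox in the document returned by your get request. Have you checked the URL you pass to the request and what comes back?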
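To make the failure mode concrete, here is a minimal sketch of a lookup that tolerates a missing table. The helper name infobox_rows and the two HTML strings are made up for illustration, and it uses the built-in "html.parser" instead of lxml only to keep the example dependency-light:

```python
from bs4 import BeautifulSoup


def infobox_rows(html):
    """Return the <tr> tags of the first infobox table, or [] if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "infobox"})
    if table is None:
        # soup.find() returns None when nothing matches; calling
        # .find_all() on that None is what raises the AttributeError.
        return []
    return table.find_all("tr")


# Hypothetical stand-ins for two responses: one page with an infobox, one without.
good_html = '<table class="infobox"><tr><td>NGC 2808</td></tr></table>'
bad_html = "<p>Wikipedia does not have an article with this exact name.</p>"

print(len(infobox_rows(good_html)))  # 1
print(len(infobox_rows(bad_html)))   # 0
```

If the second case is what you get for the name read from the file, the URL you built is not the one you think it is.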

As it stands, I can get the list of tags (as you want) with the following:

import requests
from bs4 import BeautifulSoup
import lxml

name = "NGC2808"
s = name[:3] + '_' + name[3:]
url = 'https://en.wikipedia.org/wiki/' + s
temp = requests.get(url).text
soup = BeautifulSoup(temp,"lxml")
soup.prettify()

table = soup.find('table',{'class':'infobox'})
tags = table.find_all('tr')
print(tags)

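The line s = name[:3] + '_' + name[3:] is oddly indented, which suggests that details are missing from the top of the example. That context could be useful, since any logic there may be what causes you to pass a malformed url to the get request.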

Answer 2

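If this only happens when reading from a file, then the name string must contain some special (Unicode) or whitespace characters. If you are using PyCharm you can step through it in the debugger, or you can simply print the name string (right after reading it from the file) with pprint() or repr() to see which character is causing the problem. In the example below, the normal print function does not show the special characters, whereas pprint does: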

from bs4 import BeautifulSoup
from pprint import pprint
import requests

# Suppose this is an article id read from the file
article_id = "NGC2808   "

# print() does not make the extra characters visible
print(article_id)

# repr() shows them: whitespace sits inside the quotes, special characters appear as escape codes
print(repr(article_id))

# pprint() also prints the repr of a string
pprint(article_id)

# Built from the unclean id, this URL may not point at the article, so the infobox lookup can return None
article_id_mod = article_id[:3] + '_' + article_id[3:]
url = 'https://en.wikipedia.org/wiki/' + article_id_mod

response = requests.get(url)
soup = BeautifulSoup(response.text,"lxml")

table = soup.find('table',{'class':'infobox'})
if table:
    tags = table.find_all('tr')
    print(tags)
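If repr() alone is hard to read, a small sketch like the following lists every character with its code point. The helper describe_chars and the sample value are assumptions for illustration; I do not know which character your file actually contains:

```python
import unicodedata


def describe_chars(text):
    """Return (repr of char, code point, Unicode name) for every character."""
    return [
        (repr(ch), f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Hypothetical value as it might come out of a file: a byte order mark
# in front and a newline at the end, neither visible with plain print().
raw_name = "\ufeffNGC2808\n"

print(len(raw_name))  # 9, although only 7 characters are visible
for row in describe_chars(raw_name):
    print(row)
```

Any row that is not one of the seven expected letters and digits is a candidate for the culprit.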

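To fix it, you can do the following:

If there is extra whitespace at the start/end of the string: use the strip() method

article_id = article_id.strip()
If there are special characters: use a suitable regex, or simply open the file in an editor such as VS Code / Sublime / Notepad++ and use the find/replace option.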
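As a rough sketch of the regex route, assuming the names in your file consist only of ASCII letters and digits with a three-letter prefix (as the slicing in the question implies), something like this would work. The function names clean_name and wiki_url are made up for the example:

```python
import re


def clean_name(raw):
    """Keep only ASCII letters and digits, dropping whitespace and hidden characters."""
    return re.sub(r"[^A-Za-z0-9]", "", raw)


def wiki_url(raw):
    """Build the article URL from a raw catalogue name such as 'NGC2808'."""
    name = clean_name(raw)
    return "https://en.wikipedia.org/wiki/" + name[:3] + "_" + name[3:]


# Hypothetical raw value with a byte order mark and trailing whitespace.
print(wiki_url("\ufeffNGC2808 \r\n"))  # https://en.wikipedia.org/wiki/NGC_2808
```

This is stricter than strip(), which only removes whitespace at the two ends; if your names can legitimately contain other characters, adjust the pattern accordingly.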
