I'm having trouble scraping data from the seekingalpha website. I know this question has been asked several times already, but the solutions that were offered haven't helped.
I have the following code block:
import urllib.request
from bs4 import BeautifulSoup

class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

def scrape_news(url, source):
    opener = AppURLopener()
    if source == 'SeekingAlpha':
        print(url)
        with opener.open(url) as response:
            s = response.read()
            data = BeautifulSoup(s, "lxml")
            print(data)

scrape_news('https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer', 'SeekingAlpha')
Any idea what might be going wrong here?
Edit: full traceback:
Traceback (most recent call last):
  File ".\news.py", line 107, in <module>
    scrape_news('https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer','SeekingAlpha')
  File ".\news.py", line 83, in scrape_news
    with opener.open(url) as response:
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\urllib\response.py", line 30, in __enter__
    raise ValueError("I/O operation on closed file")
ValueError: I/O operation on closed file
Answer 0 (score: 2)
Your URL returns a 403. You can confirm this in a terminal:
curl -s -o /dev/null -w "%{http_code}" https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer
Alternatively, try this in a Python REPL:
import urllib.request
url = 'https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer'
opener = urllib.request.FancyURLopener()
response = opener.open(url)
print(response.getcode())
FancyURLopener swallows any error about the failing response code, which is why your code continues on to response.read() instead of exiting, even though it never received a valid response. The standard urllib.request.urlopen should handle this for you by raising an exception on the 403, or you can handle the error yourself.
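As a rough sketch of that approach (note that FancyURLopener has been deprecated since Python 3.3): use urllib.request.urlopen with a Request object carrying a browser-style User-Agent header, and catch HTTPError explicitly. Whether the header is enough to avoid the 403 depends on the site; it may block automated requests regardless.

```python
import urllib.request
import urllib.error

def fetch(url):
    # urlopen, unlike the deprecated FancyURLopener, raises HTTPError
    # for 4xx/5xx status codes instead of silently handing back a
    # closed response object.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        print("Request failed with HTTP", e.code)
        return None

# html = fetch('https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer')
```

If fetch returns None you know the request failed and can skip the BeautifulSoup step, instead of crashing inside the with block as the original code does.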