I am using this script to retrieve author information from ScienceDirect articles, but nothing is printed when I try to output the value.
import requests
from bs4 import BeautifulSoup
from urllib import urlopen
import csv
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urlopen(url)
        soup = BeautifulSoup(site, "lxml")
        for item in soup.find_all("div", {"class": "AuthorGroups"}):
            final = item.text, url
            print final
In urls.txt I used these two URLs (https://www.sciencedirect.com/science/article/pii/009286749290520M, https://www.sciencedirect.com/science/article/pii/0092867495903682).
Answer 0 (score: 1)
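If BeautifulSoup does not return the expected value, inspect the HTML response that the server actually sent.
Your request is being blocked because it needs an appropriate User-Agent header.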
.....
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0'}
for url in urls:
    print url
    site = requests.get(url, headers=headers).text
.....
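
For reference, here is a minimal end-to-end sketch of the same idea, written for Python 3 rather than the Python 2 used in the question. It assumes the author block is still a div with class "AuthorGroups" (taken from the question, not verified against the current site) and uses the built-in "html.parser" instead of lxml. The names HEADERS, extract_authors, fetch_authors and main are hypothetical helpers introduced only for this illustration.

```python
import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent, copied from the answer above.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) "
                  "Gecko/20100101 Firefox/56.0"
}


def extract_authors(html):
    """Return the text of every div with class "AuthorGroups" in html.

    The class name is an assumption carried over from the question.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(" ", strip=True)
            for div in soup.find_all("div", class_="AuthorGroups")]


def fetch_authors(url, timeout=10):
    """Fetch url with the browser-like header; return (status_code, authors)."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    return response.status_code, extract_authors(response.text)


def main(path="urls.txt"):
    """Print the status code and author text for each URL listed in path."""
    with open(path) as inf:
        for url in (line.strip() for line in inf):
            if not url:
                continue  # skip blank lines
            status, authors = fetch_authors(url)
            # An empty list with a non-200 status suggests the request
            # was rejected rather than that the page has no authors.
            print(url, status, authors)
```

Calling main() processes urls.txt as in the question. Printing the status code alongside the result makes it easier to tell a blocked request from a page whose markup simply does not contain the expected div.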