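I am trying to extract articles from multiple URLs collected in a csv file. However, when I print the output, I get this error: InvalidSchema: No connection adapters were found for '['http://www.nytimes.com/2016/10/06/world/europe/police-brussels-knife-terrorism.html']'
import csv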
import requests
from bs4 import BeautifulSoup
with open('Training_news.csv', newline='') as file:
    reader = csv.reader(file, delimiter=' ')
    for row in reader:
        r = requests.get(row)
        r.encoding = "ISO-8859-1"
        soup = BeautifulSoup(r.content, 'lxml')
        text = soup.find_all(("p",{"class": "story-body-text story-content"}))
I think the problem is in "row": when I print it, I do not get a single list containing all the URLs in the csv file, but a separate list for each individual value in the file: ['http://www.nytimes.com/2016/10/06/world/europe/police-brussels-knife-terrorism.html'] ['http://www.nytimes.com/2016/06/29/world/europe/turkey-istanbul-airport-explosions.html']
Answer 0 (score: 0)
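row is a list, because csv.reader yields each line of the file as a list of its fields, while requests.get expects a string. You can do this instead, iterating over the items in each row: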
with open('Training_news.csv', newline='') as file:
    reader = csv.reader(file, delimiter=' ')
    for row in reader:
        # Each row is a list of fields, so pass the URLs on one at a time.
        for url in row:
            r = requests.get(url)
            r.encoding = "ISO-8859-1"
            soup = BeautifulSoup(r.content, 'lxml')
            # Tag name and attribute filter are separate arguments to find_all.
            text = soup.find_all("p", {"class": "story-body-text story-content"})
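
To see why the original call fails without touching the network, here is a minimal sketch using only the standard library. The helper names iter_urls and has_http_scheme are hypothetical, and the in-memory sample merely stands in for Training_news.csv. The prefix check is an assumption about how the error arises: a string that does not begin with http:// or https:// has no matching connection adapter.

```python
import csv
import io


def iter_urls(lines, delimiter=" "):
    """Yield every non-empty cell of the CSV input as a plain string."""
    for row in csv.reader(lines, delimiter=delimiter):
        for cell in row:
            url = cell.strip()
            if url:
                yield url


def has_http_scheme(url):
    """Return True when the string begins with http:// or https://."""
    return url.lower().startswith(("http://", "https://"))


# Hypothetical stand-in for the contents of Training_news.csv.
sample = io.StringIO(
    "http://www.nytimes.com/2016/10/06/world/europe/police-brussels-knife-terrorism.html\n"
    "http://www.nytimes.com/2016/06/29/world/europe/turkey-istanbul-airport-explosions.html\n"
)

urls = list(iter_urls(sample))
print(urls)

# Each flattened item is a usable URL string.
print(all(has_http_scheme(u) for u in urls))

# The text form of a whole row begins with "['", so it is not a usable URL.
print(has_http_scheme(str([urls[0]])))
```

Each string produced by iter_urls could then be passed to requests.get in place of the nested loop above.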