I am trying to build a database of multiple articles for text-mining purposes. I extract the body of each article by web scraping and then save the bodies in a csv file. However, I can't save all of them: the code I came up with only saves the text of the last URL (article), even though when I print what I am scraping (and what I should be saving) I get the bodies of all the articles.
I have included only a few URLs from the list (which contains many more), just to give you an idea:
import requests
from bs4 import BeautifulSoup
import csv
r=["http://www.nytimes.com/2016/10/12/world/europe/germany-arrest-syrian-refugee.html",
"http://www.nytimes.com/2013/06/16/magazine/the-effort-to-stop-the- attack.html",
"http://www.nytimes.com/2016/10/06/world/europe/police-brussels-knife-terrorism.html",
"http://www.nytimes.com/2016/08/23/world/europe/france-terrorist-attacks.html",
"http://www.nytimes.com/interactive/2016/09/09/us/document-Review-of-the-San-Bernardino-Terrorist-Shooting.html",
]
for url in r:
    t = requests.get(url)
    t.encoding = "ISO-8859-1"
    soup = BeautifulSoup(t.content, 'lxml')
    text = soup.find_all("p", {"class": "story-body-text story-content"})
    print(text)

with open('newdb30.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    spamwriter.writerow(text)
Answer 0 (score: 0)
Try declaring a variable such as all_text = "" before the for loop, and at the end of each pass through the loop append text to all_text with all_text += text + "\n" (the \n starts a new line). Then, on the last line, instead of writing text, write all_text.
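Here is a minimal sketch of that accumulation approach, adapted from the question's code. Two details are my own assumptions rather than part of the answer as written: find_all returns Tag objects, so their text is joined into a plain string before concatenation, and writerow expects a sequence of fields, so the accumulated string is wrapped in a list.

import csv

import requests
from bs4 import BeautifulSoup

r = ["http://www.nytimes.com/2016/10/12/world/europe/germany-arrest-syrian-refugee.html",
     "http://www.nytimes.com/2016/08/23/world/europe/france-terrorist-attacks.html"]

all_text = ""  # accumulator declared before the loop, as suggested

for url in r:
    t = requests.get(url)
    t.encoding = "ISO-8859-1"
    soup = BeautifulSoup(t.content, 'lxml')
    # tag name and attribute filter are separate arguments to find_all
    paragraphs = soup.find_all("p", {"class": "story-body-text story-content"})
    # find_all returns Tag objects; join their text into one string
    body = " ".join(p.get_text() for p in paragraphs)
    all_text += body + "\n"  # "\n" separates one article from the next

# open the file once, after the loop, and write everything accumulated
with open('newdb30.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=' ', quotechar='|',
                            quoting=csv.QUOTE_MINIMAL)
    spamwriter.writerow([all_text])  # writerow expects a sequence of fields

If you would rather keep one article per row of the csv (often easier for text mining), open the file before the loop instead and call spamwriter.writerow([body]) inside it; that also avoids building one large string in memory.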