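I have a list of URLs in a text file, and I want to fetch the article text, author, and title for each one. Once I have these three elements, I want to write them to a file. So far I can read the URLs from the text file, but Python only prints the URLs and a single (final) article. How can I rewrite my script so that Python reads and writes every URL and its content?
I have the following Python script (version 2.7, Mac OS X Yosemite):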
from newspaper import Article

f = open('text.txt', 'r')  # text file containing the URLs
for line in f:
    print line

url = line
first_article = Article(url)

first_article.download()
first_article.parse()

# write/append to file
with open('anothertest.txt', 'a') as f:
    f.write(first_article.title)
    f.write(first_article.text)

print str(first_article.title)

for authors in first_article.authors:
    print authors

if not authors:
    print 'No author'

print str(first_article.text)
Answer 0 (score: 0)
You are getting only the last article because the loop over the file's lines does nothing except print each line:
for line in f:
    print line
Once the loop ends, line holds only the last value, so the following assignment always picks up the final URL:
url = line
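To see why only one article is processed, here is a minimal sketch of how a loop variable behaves after the loop finishes. The list `lines` and the example.com URLs are hypothetical stand-ins for the contents of text.txt; no network access is involved.

```python
# Hypothetical stand-in for the lines read from text.txt
lines = [
    "http://example.com/a\n",
    "http://example.com/b\n",
    "http://example.com/c\n",
]

seen = []
for line in lines:
    # Work placed here runs once per line
    seen.append(line.strip())

# Work placed here runs once in total; `line` is still bound
# to the final element of the iteration
url_after_loop = line.strip()
```

Anything that should happen for every URL therefore has to sit inside the loop body.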
If you move the rest of the code inside the loop, it will run for every URL:
from newspaper import Article

with open('text.txt', 'r') as f:  # text file containing the URLs
    with open('anothertest.txt', 'a') as fout:
        for url in f:
            # strip() removes the trailing newline and any whitespace
            # around the URL
            url = url.strip()
            print("URL Line: {}".format(url))
            article = Article(url)
            article.download()
            article.parse()
            # write/append to file; title and text are unicode, so encode
            # them explicitly before writing bytes in Python 2
            fout.write(article.title.encode('utf-8'))
            fout.write(article.text.encode('utf-8'))
            # format first, then encode, to avoid an implicit ASCII decode
            print(u"Title: {}".format(article.title).encode('utf-8'))
            # print authors only if there are authors to show
            if len(article.authors) == 0:
                print('No author!')
            else:
                for author in article.authors:
                    print(u"Author: {}".format(author).encode('utf-8'))
            print("Text of the article:")
            print(article.text.encode('utf-8'))
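The question also asks for the author to be written to the file, and the loop above writes the title and text back to back with no separator. One possible way to handle both is sketched below. The helper name `write_record` and the record layout are assumptions made for illustration, not part of the newspaper library; the file handle is assumed to be opened in text mode with an explicit encoding.

```python
import io


def write_record(fout, title, authors, text):
    """Append one article to an open text-mode file as a small record.

    Hypothetical helper: the "Title:" / "Authors:" layout is just one
    possible format.
    """
    author_line = u", ".join(authors) if authors else u"No author"
    fout.write(u"Title: {}\n".format(title))
    fout.write(u"Authors: {}\n".format(author_line))
    # Blank line after the body separates consecutive records
    fout.write(text + u"\n\n")
```

Inside the loop it could be called as `write_record(fout, article.title, article.authors, article.text)`, with the output file opened via `io.open('anothertest.txt', 'a', encoding='utf-8')` so that unicode titles and text are encoded on write.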
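I also made a few changes to improve your code:
- named the output file fout, to avoid shadowing the first file f;
- moved the call that opens fout to before the loop, to avoid opening and closing the file at each iteration;
- checked the length of article.authors instead of checking whether authors exists, because authors is never defined when article.authors is empty: the loop body is never entered.
HTH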
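The last point can be illustrated with a small sketch that needs no network access. Both function names are hypothetical and the author lists are made-up data; the first function reproduces the failure mode of the original script, and the second follows the length check used in the answer.

```python
def last_author_or_error(author_list):
    """Mimic the original pattern: use the loop variable after the loop."""
    try:
        for authors in author_list:
            pass
        # If author_list is empty, `authors` was never bound
        return authors
    except NameError:  # UnboundLocalError is a subclass of NameError
        return "NameError"


def author_lines(author_list):
    """Check the list itself instead of the loop variable."""
    if len(author_list) == 0:
        return ["No author!"]
    return ["Author: {}".format(a) for a in author_list]
```

With an empty list the first function reports the error, while the second returns the "No author!" message as intended.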