Question

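I have a list of URLs in a text file, and for each one I want to get the article text, the author, and the article title. Once I have these three elements, I want to write them to a file. So far I can read the URLs from the text file, but Python prints every URL and only one article (the last one). How can I rewrite my script so that Python reads each URL and writes out its content?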

I have the following Python script (version 2.7, Mac OS X Yosemite):

from newspaper import Article

f = open('text.txt', 'r') #text file containing the URLS
for line in f:
    print line

url = line
first_article = Article(url)
first_article.download()

first_article.parse()

# write/append to file 
with open('anothertest.txt', 'a') as f:
    f.write(first_article.title)
    f.write(first_article.text)

print str(first_article.title)

for authors in first_article.authors:
    print authors
if not authors:
    print 'No author'

print str(first_article.text)

Answer 1

You are getting only the last article because the loop over the lines of the file does nothing but print each line:

for line in f:
    print line

Once the loop has finished, line holds only the last value, and that is the value used here:

url = line
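
To make the point concrete, here is a minimal sketch of the same behavior without any network access. The list `lines` stands in for the contents of text.txt, and the URLs in it are made up for illustration.

```python
# Stand-in for the lines of text.txt; these URLs are hypothetical.
lines = [
    "http://example.com/a\n",
    "http://example.com/b\n",
    "http://example.com/c\n",
]

seen_inside = []
for line in lines:
    # Anything placed in the loop body runs once per URL.
    seen_inside.append(line.strip())

# After the loop, `line` still refers to the last item only.
last_only = line.strip()

print(seen_inside)
print(last_only)
```

Code that sits after the loop, like `url = line` in the question, therefore only ever sees the final URL.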

If you move the rest of the code inside the loop, it runs for every URL:

from newspaper import Article

with open('text.txt', 'r') as f:  # text file containing the URLs
    with open('anothertest.txt', 'a') as fout:
        for line in f:
            # strip() removes the trailing newline and any whitespace
            # around the URL
            url = line.strip()
            if not url:
                continue  # skip blank lines
            print("URL Line: {}".format(url))

            article = Article(url)
            article.download()
            article.parse()

            # write/append to file; on Python 2.7 the title and text are
            # unicode objects, so encode them before writing
            fout.write(article.title.encode('utf-8') + '\n')
            fout.write(article.text.encode('utf-8') + '\n')

            print(u"Title: {}".format(article.title).encode('utf-8'))

            # print authors only if there are authors to show
            if len(article.authors) == 0:
                print('No author!')
            else:
                for author in article.authors:
                    print(u"Author: {}".format(author).encode('utf-8'))

            print("Text of the article:")
            print(article.text.encode('utf-8'))

I also made a few other improvements to your code:

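used with open() for reading the input file as well, so that the file descriptor is released properly once you no longer need it;
named the output file fout to avoid shadowing the first file;
made the open call for fout before entering the loop, to avoid opening and closing the file on every iteration;
checked the length of article.authors instead of checking whether authors exists, because authors is the loop variable and is never assigned if the loop is not entered, which is what happens when article.authors is empty.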
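
The last point can be sketched in isolation. The helper names `original_check` and `describe_authors` below are hypothetical and exist only for this illustration; the first mirrors the check from the question, the second the length check used above.

```python
def original_check(author_list):
    """Mirror of the question's logic: test the loop variable after the loop."""
    for authors in author_list:
        pass
    # If author_list is empty, `authors` was never assigned, so this
    # line raises NameError (UnboundLocalError inside a function).
    if not authors:
        return "No author"
    return "has authors"


def describe_authors(author_list):
    """Length check on the list itself: works for empty lists too."""
    if len(author_list) == 0:
        return ["No author!"]
    return ["Author: {}".format(name) for name in author_list]


try:
    result = original_check([])
except NameError:
    result = "NameError"

print(result)
print(describe_authors([]))
print(describe_authors(["Ann Example"]))
```

Checking the list rather than the loop variable avoids the error and also reports a missing author correctly.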

HTH
