How to use newspaper to extract news articles from a list of URLs in a text file

Time: 2019-01-20 14:20:41

Tags: python data-extraction python-newspaper

I am trying to download/extract articles from multiple URLs listed in a text file, and then write the extracted articles to a CSV file.

I am building a blog that collects news on a specific topic, and I want to use Python to extract the news articles from the URLs in a text file.

from newspaper import Article
with open("untitled.txt") as url_file:
    lines = url_file.readlines()
    url = lines
for line in lines:
    article = Article(url)

# I get the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-47-ac8a2b1aab1a> in <module>
      1 for line in lines:
----> 2     article = Article(url)

~\Anaconda3\lib\site-packages\newspaper\article.py in __init__(self, url, title, source_url, config, **kwargs)
     58 
     59         if source_url == '':
---> 60             scheme = urls.get_scheme(url)
     61             if scheme is None:
     62                 scheme = 'http'

~\Anaconda3\lib\site-packages\newspaper\urls.py in get_scheme(abs_url, **kwargs)
    277     if abs_url is None:
    278         return None
--> 279     return urlparse(abs_url, **kwargs).scheme
    280 
    281 

~\Anaconda3\lib\urllib\parse.py in urlparse(url, scheme, allow_fragments)
    365     Note that we don't break the components up in smaller bits
    366     (e.g. netloc is a single string) and we don't expand % escapes."""
--> 367     url, scheme, _coerce_result = _coerce_args(url, scheme)
    368     splitresult = urlsplit(url, scheme, allow_fragments)
    369     scheme, netloc, url, query, fragment = splitresult

~\Anaconda3\lib\urllib\parse.py in _coerce_args(*args)
    121     if str_input:
    122         return args + (_noop,)
--> 123     return _decode_args(args) + (_encode_result,)
    124 
    125 # Result objects are more helpful than simple tuples

~\Anaconda3\lib\urllib\parse.py in _decode_args(args, encoding, errors)
    105 def _decode_args(args, encoding=_implicit_encoding,
    106                        errors=_implicit_errors):
--> 107     return tuple(x.decode(encoding, errors) if x else '' for x in args)
    108 
    109 def _coerce_args(*args):

~\Anaconda3\lib\urllib\parse.py in <genexpr>(.0)
    105 def _decode_args(args, encoding=_implicit_encoding,
    106                        errors=_implicit_errors):
--> 107     return tuple(x.decode(encoding, errors) if x else '' for x in args)
    108 
    109 def _coerce_args(*args):

AttributeError: 'list' object has no attribute 'decode'
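
Reading the traceback, the AttributeError happens because Article() is given the whole lines list (via url = lines) rather than a single URL string, so urllib's parser ends up calling .decode on a list. A minimal sketch of the corrected loop, keeping the variable names from the post and stripping the trailing newline that readlines() leaves on each URL:

from newspaper import Article

with open("untitled.txt") as url_file:
    lines = url_file.readlines()

for line in lines:
    # pass each URL string on its own, not the whole list
    article = Article(line.strip())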

I want to replicate this process so that I can extract text from hundreds of URLs. Is there a way to set this up so that I can keep the article URLs in a text file and extract the articles from them?

Update 1: Based on the suggestions, I updated the code, but I still cannot extract all the articles from the URLs.

from newspaper import Article
with open("untitled.txt") as url_file:
    lines = url_file.readlines()
for line in lines:
    article = Article(line)
article.download()
article.text


I want to extract all the articles from the list of URLs.
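
For the hundreds-of-URLs case, a minimal end-to-end sketch, assuming newspaper3k's Article.download()/Article.parse()/Article.text API and Python's standard csv module; the output filename and columns ("articles.csv" with url/title/text) are assumptions for illustration:

from newspaper import Article
import csv

# read one URL per line, skipping blank lines and trailing newlines
with open("untitled.txt") as url_file:
    urls = [line.strip() for line in url_file if line.strip()]

with open("articles.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["url", "title", "text"])
    for url in urls:
        article = Article(url)
        article.download()   # fetch the HTML
        article.parse()      # parse() must run before .title / .text are populated
        writer.writerow([url, article.title, article.text])

Note that download() and parse() need to run inside the loop for every URL; in the updated code above they sit outside the loop, so only the last Article object is ever downloaded, and .text stays empty because parse() is never called.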

0 answers:

No answers yet.