On this page (https://edition.cnn.com/search/?q=%20news&size=10&from=5540&page=555),
my goal is to get the full list of news articles.
The page's HTML code contains the article URLs, for example:
<div class="cnn-search__result-thumbnail">
  <a href="https://www.cnn.com/2018/03/27/asia/north-korea-kim-jong-un-china-visit/index.html">
    <img src="./Search CNN - Videos, Pictures, and News - CNN.com_files/180328104116china-xi-kim-story-body.jpg">
  </a>
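For reference, here is a minimal sketch of how the article link could be pulled out of markup like the excerpt above, using only the standard library. It assumes the HTML is already fully rendered (for example a page saved from the browser, which the `./Search CNN ..._files/` image path suggests) and that the `div` is properly closed. The names `ThumbnailLinkParser` and `extract_thumbnail_links` are hypothetical helpers, not part of newspaper.

```python
from html.parser import HTMLParser


class ThumbnailLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside result-thumbnail divs."""

    TARGET_CLASS = "cnn-search__result-thumbnail"  # class name taken from the excerpt

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # greater than 0 while inside a thumbnail div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Enter a thumbnail div, or track a div nested inside one.
            if self.div_depth or self.TARGET_CLASS in classes:
                self.div_depth += 1
        elif tag == "a" and self.div_depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1


def extract_thumbnail_links(html):
    """Return article URLs found in thumbnail divs, in document order."""
    parser = ThumbnailLinkParser()
    parser.feed(html)
    return parser.links


# Assumed sample: the excerpt from the question with a closing </div> added.
SAMPLE_HTML = """
<div class="cnn-search__result-thumbnail">
  <a href="https://www.cnn.com/2018/03/27/asia/north-korea-kim-jong-un-china-visit/index.html">
    <img src="thumbnail.jpg">
  </a>
</div>
<a href="https://example.com/not-a-result">outside the thumbnail div</a>
"""

if __name__ == "__main__":
    print(extract_thumbnail_links(SAMPLE_HTML))
```

Links outside a thumbnail `div` are ignored, so navigation and footer links on the page would not end up in the list.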
However, I cannot get the news list from these URLs. The links returned for
https://edition.cnn.com/search/?q=%20news&size=10&from=5550&page=556 and for
https://edition.cnn.com/search/?q=%20news&size=10&from=5560&page=557 are the same.
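As a sanity check that the two request URLs really differ, the following sketch rebuilds them from a page index and compares their query parameters. It assumes the pattern visible in the URLs above (`from` is the index times `size`, `page` is the index plus one). `search_url` and `differing_params` are hypothetical helper names introduced only for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def search_url(x, size=10):
    """Build the search URL for index x, following the pattern in the question."""
    return (
        "https://edition.cnn.com/search/?q=%20news"
        f"&size={size}&from={x * size}&page={x + 1}"
    )


def differing_params(url_a, url_b):
    """Return {name: (value_a, value_b)} for query parameters that differ."""
    params_a = parse_qs(urlsplit(url_a).query)
    params_b = parse_qs(urlsplit(url_b).query)
    return {
        name: (params_a.get(name, [None])[0], params_b.get(name, [None])[0])
        for name in sorted(set(params_a) | set(params_b))
        if params_a.get(name) != params_b.get(name)
    }


if __name__ == "__main__":
    first, second = search_url(555), search_url(556)
    print(first)
    print(second)
    print(differing_params(first, second))
```

If the URLs differ only in `from` and `page`, as expected, then identical results for both would point at how the page content is fetched rather than at the URL construction.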
My source code:
import sys

import newspaper


def freeze_support():
    '''
    Check whether this is a fake forked process in a frozen executable.
    If so then run code specified by commandline and exit.
    '''
    if sys.platform == 'win32' and getattr(sys, 'frozen', False):
        # multiprocessing.forking was removed in Python 3.4; import from the package instead.
        from multiprocessing import freeze_support
        freeze_support()


if __name__ == '__main__':
    freeze_support()
    url_list = []  # article URLs collected so far, across all result pages
    for x in range(1, 6000):
        # Result page x + 1 starts at offset x * 10 (10 results per page).
        url = "https://edition.cnn.com/search/?q=%20news&size=10&from=" + str(x * 10) + "&page=" + str(x + 1)
        cnn_paper = newspaper.build(url, memoize_articles=False)  # ~15 seconds
        print(len(cnn_paper.articles))
        for article in cnn_paper.articles:
            # Skip URLs that were already seen on an earlier page.
            if article.url not in url_list:
                url_list.append(article.url)