Question

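I am trying to fetch the content of news articles with the Python newspaper module. With the code below I can get the body of a news item. The code uses feedparser to parse the feed URL stored in the feed_url variable, then tries to extract the article body and publish date with the newspaper module.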

import newspaper
from newspaper import Article
import feedparser
import urllib.parse

count = 0
feed_url="https://www.extremetech.com/feed"
#feed_url="http://www.prothomalo.com/feed/"
d = feedparser.parse(feed_url)
for post in d.entries:
    count+=1
    if count == 2:
        break

    #post_link = post.link
    # Added later: decode the percent-encoded URL back into the
    # original Bengali text.
    post_link = urllib.parse.unquote(post.link)
    print("count= ",count," url = ",post_link,end="\n ")

    try:

        content = Article(post_link)
        content.download()
        content.parse()
        print(" content = ", end=" ")
        print(content.text[0:50])
        print(" content.publish_date = {}".format(content.publish_date))


    except Exception as e:
        print(e)

The code mentions two different values for the feed_url variable: one from the extremetech website and the other from the prothomalo website.

Suppose, for example, that extremetech has a news item whose URL (obtained through feedparser.parse) is https://www.extremetech.com/computing/263951-mit-announces-new-neural-network-processor-cuts-power-consumption-95 For this URL I can easily get the article body and publish date.

But prothomalo, for example, has a news item whose URL (as obtained from feedparser.parse) is http://www.prothomalo.com/sports/article/1432086/%E0%A6%B8%E0%A6%B0%E0%A7%8D%E0%A6%AC%E0%A7%8B%E0%A6%9A%E0%A7%8D%E0%A6%9A-%E0%A6%B8%E0%A7%8D%E0%A6%95%E0%A7%8B%E0%A6%B0-%E0%A6%97%E0%A7%9C%E0%A7%87%E0%A6%93-%E0%A6%B9%E0%A6%BE%E0%A6%B0

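But that is not what the actual URL on the prothomalo website looks like. If you visit the URL, you will see that it changes to Bengali. I think the reason for this encoded (encrypted?) URL is that parts of the URL are in Bengali. The content of the page is in Bengali as well.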
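The form of the feed URL matches ordinary percent-encoding, where each non-ASCII character is written as its UTF-8 bytes in `%XX` form. The following is a minimal sketch, using only the standard library, that decodes the last path segment of the feed URL back into Bengali text. The variable names are illustrative and not part of the original code.

```python
from urllib.parse import unquote, urlsplit

# The URL exactly as feedparser returned it (split across lines for readability).
encoded_url = (
    "http://www.prothomalo.com/sports/article/1432086/"
    "%E0%A6%B8%E0%A6%B0%E0%A7%8D%E0%A6%AC%E0%A7%8B%E0%A6%9A%E0%A7%8D%E0%A6%9A-"
    "%E0%A6%B8%E0%A7%8D%E0%A6%95%E0%A7%8B%E0%A6%B0-"
    "%E0%A6%97%E0%A7%9C%E0%A7%87%E0%A6%93-"
    "%E0%A6%B9%E0%A6%BE%E0%A6%B0"
)

# Take only the last path segment (the article slug).
slug = urlsplit(encoded_url).path.rsplit("/", 1)[-1]

# unquote() interprets the %XX bytes as UTF-8 by default.
decoded_slug = unquote(slug)

print(decoded_slug)
print(len(slug), "encoded characters ->", len(decoded_slug), "decoded characters")
```

Both forms name the same resource; the percent-encoded form is simply the ASCII-only spelling of the Bengali one.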

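The Python newspaper module can extract the content and publish date from the extremetech website, but not from prothomalo. Is the failure caused by the non-English characters in the prothomalo URL?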

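How can I get the news content, publish date, etc. from the prothomalo website (and perhaps from other websites whose URLs contain non-English characters)?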

Edit 1: I can decode prothomalo's encoded URL into the original Bengali with the following line: post_link =urllib.parse.unquote(post.link). I still cannot get the content or publish date.
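Since neither the encoded nor the decoded URL works, one way to rule out the URL form as a variable is to normalize every link to the ASCII, percent-encoded form before passing it on. The sketch below is only an assumption about what such a normalizer could look like. `to_ascii_url` is a hypothetical helper, not part of newspaper or feedparser, and it does not handle non-ASCII host names.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def to_ascii_url(url: str) -> str:
    """Hypothetical helper: percent-encode the path and query of a URL.

    "%" is kept in the safe set so that an already-encoded URL is
    returned unchanged instead of being encoded a second time.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


# A shortened, made-up example URL in both forms.
encoded = "http://www.prothomalo.com/sports/article/1432086/%E0%A6%B9%E0%A6%BE%E0%A6%B0"
decoded = unquote(encoded)

print(to_ascii_url(encoded) == encoded)  # the already-encoded form is left alone
print(to_ascii_url(decoded) == encoded)  # the Bengali form is re-encoded
```

If extraction still fails once every link is in this form, the URL encoding is probably not the cause, and the downloaded HTML itself would be the next thing to inspect.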

python newspaper - cannot extract the article if the URL is not in English

0 answers: