I am using the Python module newspaper3k to extract an article summary from the article's web URL.
from newspaper import Article
article = Article('https://www.abcd....vnn.com/dhdhd')
article.download()
article.parse()
article.nlp()
text = article.summary
print (text)
This gives:
Often hailed as Hollywood\xe2\x80\x99s long standing, commercially successful filmmaker, Spielberg\xe2\x80\x99s lifetime gross, if you include his productions, reaches a mammoth\xc2\xa0$17.2 billion\xc2\xa0\xc2\xad\xe2\x80\x93 unadjusted for inflation.
\r\rThe original\xc2\xa0Jurassic Park\xc2\xa0($983.8 million worldwide), which released in 1993, remains Spielberg\xe2\x80\x99s highest grossing film.
Ready Player One,\xc2\xa0currently advancing at a running total of $476.1 million, has become Spielberg\xe2\x80\x99s seventh highest grossing film of his career.It will eventually supplant Aamir\xe2\x80\x99s 2017 blockbuster\xc2\xa0Dangal\xc2\xa0(1.29 billion yuan) if it achieves the Maoyan\xe2\x80\x99s lifetime forecast of 1.31 billion yuan ($208 million) in the PRC.
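I want to remove all the unwanted characters, such as \xe2\x80\x99s, and I would rather avoid chaining multiple replace calls. What I want is something like this: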
Often hailed as Hollywood long standing, commercially successful filmmaker,
Spielberg lifetime gross, if you include his productions, reaches a
mammoth $17.2 billion unadjusted for inflation.
The original Jurassic Park ($983.8 million worldwide),
which released in 1993, remains Spielberg highest grossing film.
Ready Player One,currently advancing at a running total of $476.1 million,
has become Spielberg seventh highest grossing film of his career.
It will eventually supplant Aamir 2017 blockbuster Dangal (1.29 billion yuan)
if it achieves the Maoyan lifetime forecast of 1.31 billion yuan ($208 million) in the PRC
Answer 0 (score: 0)
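Try a regular expression:
import re
# No square brackets here: a class like [\xe2\x80\x99s] would also strip every plain "s"
clear_str = re.sub(r'\xe2\x80\x99s', '', your_input)
re.sub replaces every occurrence of a pattern in your_input with the second argument. A pattern like [abc] is a character class that matches any single a, b, or c character, whereas a pattern written without brackets matches the whole sequence.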
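To make the difference between a bracketed character class and a plain sequence concrete, here is a minimal sketch. The sample string is an assumption modeled on the output in the question, and the variable names are purely illustrative.

```python
import re

# Hypothetical sample containing the garbled sequence from the question
sample = "Spielberg\xe2\x80\x99s lifetime gross"

# Character class: removes each listed character on its own, including every plain "s"
as_class = re.sub(r'[\xe2\x80\x99s]', '', sample)

# Plain sequence: removes only the exact four-character run \xe2 \x80 \x99 s
as_sequence = re.sub(r'\xe2\x80\x99s', '', sample)

print(as_class)     # Spielberg lifetime gro
print(as_sequence)  # Spielberg lifetime gross
```

The class version damages ordinary words ("gross" becomes "gro"), which is why the sequence form is usually what is wanted here.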
Answer 1 (score: 0)
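You can use Python's encode / decode to get rid of all characters that are not in Latin-1:
data = text.decode('utf-8')  # assumes text is a UTF-8 byte string (e.g. a Python 2 str)
text = data.encode('latin-1', 'ignore')  # drops characters that Latin-1 cannot represent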
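As a rough illustration of this decode/encode round trip in Python 3, the sketch below assumes the summary is available as UTF-8 bytes. The names raw, nbsp_raw, and the helper strip_non_latin1 are hypothetical and not part of newspaper3k.

```python
def strip_non_latin1(raw_bytes):
    """Decode UTF-8 bytes, then drop every character outside Latin-1."""
    data = raw_bytes.decode('utf-8')
    # 'ignore' silently discards characters that Latin-1 cannot encode
    return data.encode('latin-1', 'ignore').decode('latin-1')

# Hypothetical input: the curly apostrophe is UTF-8 encoded as \xe2\x80\x99
raw = b"Spielberg\xe2\x80\x99s highest grossing film"
print(strip_non_latin1(raw))  # Spielbergs highest grossing film

# Characters that do exist in Latin-1, such as the non-breaking space, survive
nbsp_raw = b"mammoth\xc2\xa0$17.2 billion"
print(repr(strip_non_latin1(nbsp_raw)))  # 'mammoth\xa0$17.2 billion'
```

Note that only the apostrophe is dropped, not the "s" after it, and that non-breaking spaces stay in the text because Latin-1 can represent them.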
Answer 2 (score: 0)
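First, use .encode('ascii', errors='ignore') to drop all non-ASCII characters.
If you need this text for some kind of sentiment analysis, you may also want to remove special characters such as \n, \r, and so on. This can be done by first escaping the escape characters and then replacing them with a regular expression.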
from newspaper import Article
import re
article = Article('https://www.abcd....vnn.com/dhdhd')
article.download()
article.parse()
article.nlp()
text = article.summary
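text = text.encode('ascii', errors='ignore')  # drop non-ASCII characters; returns bytes
text = str(text)[2:-1]  # repr turns a real newline into the two characters \n; [2:-1] strips the b'...' wrapper Python 3 adds
text = re.sub(r'\\.', '', text)  # remove each backslash together with the character that follows it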
print (text)
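A variant of the same idea that skips the str() round trip is sketched below: it decodes the ASCII bytes back to a string and collapses whitespace runs directly. This is a swapped-in technique rather than the exact method above, and the helper name to_plain_ascii and the sample string are assumptions.

```python
import re

def to_plain_ascii(text):
    """Drop non-ASCII characters, then collapse whitespace runs to one space."""
    ascii_only = text.encode('ascii', errors='ignore').decode('ascii')
    # \s+ matches runs of spaces, \r, \n and tabs
    return re.sub(r'\s+', ' ', ascii_only).strip()

# Hypothetical sample modeled on the garbled summary in the question
sample = ("billion\xc2\xa0\xc2\xad\xe2\x80\x93 unadjusted for inflation."
          "\r\rThe original")
print(to_plain_ascii(sample))
# billion unadjusted for inflation. The original
```

Replacing control characters with a single space, rather than deleting them, keeps the sentences on either side of a line break from being fused together.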
Answer 3 (score: 0)
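The article was decoded incorrectly. The website may have declared the wrong encoding, but that is hard to prove because the question has no working URL to reproduce the output.
The escape codes indicate that utf8 is the correct encoding, so use the following to encode straight back to bytes (latin1 is a 1:1 mapping from the first 256 Unicode code points to bytes) and then decode as utf8: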
text = text.encode('latin1').decode('utf8')
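Result:
Often hailed as Hollywood’s long standing, commercially successful filmmaker, Spielberg’s lifetime gross, if you include his productions, reaches a mammoth $17.2 billion – unadjusted for inflation.
The original Jurassic Park ($983.8 million worldwide), which released in 1993, remains Spielberg’s highest grossing film. Ready Player One, currently advancing at a running total of $476.1 million, has become Spielberg’s seventh highest grossing film of his career.It will eventually supplant Aamir’s 2017 blockbuster Dangal (1.29 billion yuan) if it achieves the Maoyan’s lifetime forecast of 1.31 billion yuan ($208 million) in the PRC.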
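The latin1/utf8 round trip can be tried on a short string without downloading anything. In this sketch the sample text is an assumption modeled on the question, and repair_mojibake is a hypothetical helper that leaves the text untouched when the round trip does not apply.

```python
def repair_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as latin1."""
    try:
        # latin1 maps code points 0-255 straight back to the original bytes
        return text.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not latin1-decoded UTF-8 after all; return the input unchanged
        return text

# Hypothetical garbled sample, as newspaper3k returned it in the question
garbled = "Spielberg\xe2\x80\x99s lifetime gross reaches a mammoth\xc2\xa0$17.2 billion"
fixed = repair_mojibake(garbled)
print(fixed)
```

Unlike the stripping approaches, this keeps the apostrophes and dashes the author wrote; the try/except is only there so that already-correct text passes through unchanged.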