Getting rid of backslashes in text extracted by Goose

Time: 2015-11-06 07:34:43

Tags: python regex goose

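I have a small regex problem with text extracted by Goose.

I used Goose to extract clean text from an HTML page. The output Goose gives is good, but there is one small problem: I get the following string.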

    My name is Sam\'s, I like to play \'football\'

The actual text looks like this:

    My name is Sam's, I like to play 'football'

I am trying to get rid of the backslashes. When I run the code below on the text extracted by Goose, it has no effect; however, if I type the same text in myself, the code works perfectly.

I tried the code below:

    re.sub(r"\\", "", text)   # or
    text.replace("\\", "")
    text.decode()
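
One thing worth ruling out first: `re.sub` and `str.replace` both return a new string and never modify `text` in place, so a call whose result is not assigned has no visible effect. The sketch below illustrates this on a made-up `sample` string (an assumption standing in for the Goose output, not the real extracted text) that is known to contain literal backslashes.

```python
import re

# Hypothetical sample, NOT real Goose output: a string that really
# contains a backslash character before each apostrophe.
sample = "My name is Sam\\'s, I like to play \\'football\\'"

# Strings are immutable: the result of this call is discarded,
# so `sample` still contains its backslashes afterwards.
sample.replace("\\", "")
unchanged = sample

# Assigning the result keeps the cleaned text.
cleaned_replace = sample.replace("\\", "")
cleaned_regex = re.sub(r"\\", "", sample)

print(cleaned_replace)  # My name is Sam's, I like to play 'football'
```

If both forms clean a hand-typed string like this but not the extracted text, the extracted text probably does not contain the characters it appears to contain.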

Please find the full code below:

    import re
    from goose import Goose

    url = 'http://economictimes.indiatimes.com/news/politics-and-nation/swach-bharat-drives-draws-inspiration-from-mahatma-gandhi/articleshow/49203355.cms'
    g = Goose()
    article = g.extract(url=url)
    text = article.cleaned_text

    print text
    # Output:
    # .....International School here on Friday, Gandhi\'s 146th birth anniversary.Gurjit Singh said that apart from Gandhi\'s birth anniversary,....

    # Try to strip every backslash, then print again
    text = re.sub(r"\\", "", text)
    print text
    # Output (unchanged):
    # .....International School here on Friday, Gandhi\'s 146th birth anniversary.Gurjit Singh said that apart from Gandhi\'s birth anniversary,....

How can I get rid of the backslashes?
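
A possible way to narrow this down is to inspect which characters the extracted text actually contains, since what prints as a backslash or an apostrophe may be a different code point. The following is only a diagnostic sketch: `describe_non_alnum` is a hypothetical helper, and `sample` is a hand-written stand-in for `article.cleaned_text`.

```python
import unicodedata


def describe_non_alnum(text):
    """Return (char, code point, Unicode name) for each distinct
    character that is neither alphanumeric nor whitespace."""
    seen = []
    for ch in text:
        if ch.isalnum() or ch.isspace():
            continue
        entry = (ch, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
        if entry not in seen:
            seen.append(entry)
    return seen


# Hypothetical stand-in for the extracted text.
sample = u"Gandhi\\'s 146th birth anniversary."

for entry in describe_non_alnum(sample):
    print(entry)
```

A genuine backslash is reported as `U+005C` (`REVERSE SOLIDUS`) and a plain apostrophe as `U+0027` (`APOSTROPHE`). Any other code point in the real text would explain why a pattern that matches only `\` leaves the output looking unchanged.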

0 answers:

No answers