Remove backslashes from a string

Time: 2015-10-16 11:48:59

Tags: python nltk

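I have a string containing a sentence like I don't want it, there'll be others

but the text ends up looking like I don\'t want it, there\'ll be other

For some reason a \ appears in the text next to each '. The text is read in from another source. I want to remove the backslashes but haven't been able to. I have tried:

sentence.replace("\'","'")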

sentence.replace(r"\'","'")

sentence.replace("\\","")

sentence.replace(r"\\","")

sentence.replace(r"\\\\","")

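I know that \ is used for escaping, so I'm not sure how to handle it together with the quotes.

4 answers:

Answer 0: (score: 8)

The \ is only there to escape the ' character. It is visible only in the string's representation (repr); it is not actually a character in the string. See the following demonstration: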

>>> repr("I don't want it, there'll be others")
'"I don\'t want it, there\'ll be others"'

>>> print("I don't want it, there'll be others")
I don't want it, there'll be others
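
One way to check whether the backslash is really part of the data, rather than just an artifact of how the string is displayed, is to test for the character directly and compare lengths. This is a minimal sketch; the variable names `displayed_only` and `really_escaped` are illustrative and do not come from the question.

```python
# Ordinary literal: \' is an escape sequence for a single quote,
# so no backslash character ends up in the string.
displayed_only = 'I don\'t want it'

# Raw literal: the backslash is kept as a real character.
really_escaped = r"I don\'t want it"

print("\\" in displayed_only)   # False
print("\\" in really_escaped)   # True
print(len(displayed_only), len(really_escaped))  # 15 16
```

If the first check prints False for your own data, there is nothing to remove and the backslash only shows up in the repr.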

Answer 1: (score: 0)

Try using:

sentence.replace("\\", "")

You need two backslashes because the first acts as the escape character and the second is the character you actually want to replace.
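
Note that `str.replace` returns a new string and leaves the original unchanged, so the result has to be assigned back. A short sketch, assuming the text really does contain backslash characters (simulated here with a raw string literal; `sentence` is the name used in the question):

```python
# Raw literal so the backslashes are actually present in the data.
sentence = r"I don\'t want it, there\'ll be others"

# Strings are immutable: replace() returns a new string, so rebind the name.
sentence = sentence.replace("\\", "")

print(sentence)  # I don't want it, there'll be others
```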

Answer 2: (score: 0)

It is better to use a regular expression to remove the backslashes:

>>> import re
>>> re.sub(r"\\'", "'", r"I don\'t want it, there\'ll be other")
"I don't want it, there'll be other"

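If only the backslashes sitting directly in front of a quote character should go, a capture group can keep the quote while dropping the backslash. The following is a sketch under that assumption; the sample text and the names `text` and `cleaned` are made up for illustration.

```python
import re

# Hypothetical input with backslashes before both kinds of quote.
text = r'She said \"I don\'t want it\"'

# Drop a backslash only when it directly precedes ' or ", keeping the quote.
cleaned = re.sub(r"\\(['\"])", r"\1", text)

print(cleaned)  # She said "I don't want it"
```

Backslashes elsewhere in the text, for example in a Windows path, are left untouched by this pattern.
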
Answer 3: (score: 0)

If your text was scraped and was not cleaned by unescaping before being processed with NLP tools, you can easily unescape the HTML markup, for example:

Python 2.x:

>>> import sys; sys.version
'2.7.6 (default, Jun 22 2015, 17:58:13) \n[GCC 4.8.2]'
>>> import HTMLParser
>>> txt = """I don\'t want it, there\'ll be other"""
>>> HTMLParser.HTMLParser().unescape(txt)
"I don't want it, there'll be other"

Python 3:

>>> import sys; sys.version
'3.4.0 (default, Jun 19 2015, 14:20:21) \n[GCC 4.8.2]'
>>> import html
>>> txt = """I don\'t want it, there\'ll be other"""
>>> html.unescape(txt)
"I don't want it, there'll be other"

See also: How do I unescape HTML entities in a string in Python 3.1?
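
For comparison, here is a small Python 3 sketch of what `html.unescape` does on text that actually contains HTML entities. The sample strings are made up for illustration. The function leaves literal backslashes alone, so it only helps when the stray characters come from HTML escaping.

```python
import html

# Hypothetical scraped text where the apostrophe arrived as an HTML entity.
scraped = "I don&#39;t want it, there&#39;ll be other"
print(html.unescape(scraped))  # I don't want it, there'll be other

# A literal backslash is not an HTML entity, so it passes through unchanged.
backslashed = r"I don\'t want it"
print(html.unescape(backslashed) == backslashed)  # True
```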