I want to remove the spaces in "can ' t" and "won ' t" that neither my regex nor detokenizing removes
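import re
from nltk.tokenize import WordPunctTokenizer
# Assumption: MosesDetokenizer is imported from sacremoses here;
# older NLTK releases exposed it as nltk.tokenize.moses.MosesDetokenizer.
from sacremoses import MosesDetokenizer

tok = WordPunctTokenizer()
detok = MosesDetokenizer()
pattern = r"[^\w ]+ "
text = "i can ' t use this cause they won ' t fit"
string = re.sub(pattern, '', text)
tk = tok.tokenize(string)
output = detok.detokenize(tk, return_str=True)
print(output)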
"i can 't use this cause they won' t fit"
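Any ideas on how to remove the space after 'can' and 'won' so that I get can't and won't? When I detokenize with output = (' '.join(tk)).strip()
instead, I get two spaces, one before and one after the apostrophe. Example: i can ' t use this cause they won ' t fit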
Answer 0 (score: 0)
@BenT I can't speak to the regex, but you can apply the following operations to your output:
output = "i can 't use this cause they won' t fit"
output = "'".join(output.split(" '"))
output = "'".join(output.split("' "))
print(output)
"i can't use this cause they won't fit"
There is also a one-line solution:
output = output.replace("' ", "'").replace(" '", "'")
print(output)
"i can't use this cause they won't fit"
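For the regex side of the question, here is a minimal sketch of the same idea with re.sub. It assumes every apostrophe in the string belongs to a contraction, so any whitespace on either side of it can be dropped. The function name join_apostrophes is made up for illustration.

```python
import re

def join_apostrophes(s):
    # Hypothetical helper: collapse any whitespace on either side
    # of an apostrophe into the bare apostrophe.
    return re.sub(r"\s*'\s*", "'", s)

output = "i can 't use this cause they won' t fit"
print(join_apostrophes(output))
# i can't use this cause they won't fit
```

Because the pattern allows zero or more spaces on each side, it should also work directly on the original text "i can ' t use this cause they won ' t fit", without going through the detokenizer first.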
Answer 1 (score: 0)
I think you can do something simple:
output = "i can 't use this cause they won' t fit"
output = output.replace(" '", "'").replace("' ", "'")
print(output)
"i can't use this cause they won't fit"
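If the text may also contain single-quoted phrases, a plain replace would glue those quotes to their neighbours too. A more cautious sketch, assuming only the common English contraction endings (t, s, re, ve, ll, d, m) need fixing, could look like this. The function name fix_contractions and the suffix list are assumptions for illustration, not something from the question.

```python
import re

# Hypothetical pattern: an apostrophe with optional spaces around it,
# preceded by a word character and followed by a whole-word contraction suffix.
CONTRACTION = re.compile(r"(?<=\w)\s*'\s*(?=(?:t|s|re|ve|ll|d|m)\b)")

def fix_contractions(s):
    # Rejoin split contractions; other apostrophes are left alone.
    return CONTRACTION.sub("'", s)

print(fix_contractions("i can ' t use this cause they won ' t fit"))
# i can't use this cause they won't fit
print(fix_contractions("she said ' hello ' to them"))
# she said ' hello ' to them
```

The lookahead only accepts a suffix that stands as a whole token, so a quoted word such as ' to ' is not touched, at the cost of missing contractions outside the list.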