Tokenizing text by separating punctuation from words, except for abbreviations and apostrophes

Asked: 2017-09-18 16:02:56

Tags: python nltk tokenize

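I want to tokenize the text below by separating punctuation from words while keeping abbreviations and apostrophes intact. I'm using Python and the NLTK library, but I think my regular expression is not right, because I'm still getting the wrong output.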

# coding: utf-8
from nltk.tokenize import regexp_tokenize

# Adjacent string literals are joined, so the long text can span several lines.
text = (
    "\"Predictions suggesting that large changes in weight will "
    "accumulate indefinitely in response to small sustained lifestyle "
    "modifications rely on the half-century-old 3,500 calorie rule, which "
    "equates a weight alteration of 2.2 lb to a 3,500 calories cumulative "
    "deficit or increment,\" write the study authors Dr. Jampolis, Dr. "
    "Chaudry, and Prof. Harlen, from N.P.C Clinic in OH. The 3,500- calorie "
    "rule \"predicts that a person who increases daily energy expenditure by "
    "100 calories by walking 1 mile per day\" will lose 50 pounds over five "
    "years, the authors say. But the true weight loss is only about 10 "
    "pounds if calorie intake doesn't increase, \"because changes in mass "
    "... alter the energy requirements of the body’s make-up.\" \"This is a "
    "myth, strictly speaking, but the smaller amount of weight loss achieved "
    "with small changes is clinically significant and should not be "
    "discounted,\" says Dr. Melina Jampolis, CNN diet and fitness expert."
)

# Current attempt: runs of non-digit word characters, else any run of non-whitespace.
print(regexp_tokenize(text, pattern=r'(?:(?!\d)\w)+|\S+'))

Any help is much appreciated.

1 answer:

Answer 0 (score: 0)

This should solve the problem. It makes sense here to use re.sub to replace any unwanted punctuation with the empty string (i.e. '').

import re

s = 'Insert your text here'

# Delete escaped quotes, ellipses and commas by replacing them with ''.
new = re.sub(r'(\"\\\")|(\\\")|[.]{3}|,', '', s)

print(new)
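To see what this substitution actually touches, here is a small check on two made-up sample strings (they are assumptions for illustration, not the text from the question). It suggests that the quote alternatives only match when the text contains literal backslashes, and that commas inside numbers are removed as well.

```python
import re

# Same pattern as in the answer above.
PATTERN = r'(\"\\\")|(\\\")|[.]{3}|,'

# In an ordinary Python literal, \" is just a quote character,
# so this sample contains no backslashes at all.
plain = "\"a weight of 2.2 lb, or 3,500 calories ... \" say the authors"

# A raw literal keeps the backslashes, as text read from a file might.
escaped = r'"\"Predictions\" say'

cleaned_plain = re.sub(PATTERN, '', plain)
cleaned_escaped = re.sub(PATTERN, '', escaped)

print(cleaned_plain)    # quotes survive; commas and the ellipsis are gone
print(cleaned_escaped)  # the backslash-quote sequences are gone
```

Note that "3,500" becomes "3500" in the first sample, which may or may not be what is wanted.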

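The hard part of this regex is escaping all the backslashes. Breaking it down:

(\"\\\")

matches a quote, a literal backslash and another quote in sequence, i.e. "\"

(\\\")

matches a literal backslash followed by a quote, i.e. \"

[.]{3}

matches any ...

,

matches any ,

The first two alternatives only match when the text itself contains backslash characters. In a Python string literal such as the one in the question, \" is simply a quote, so those quotes are left in place.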

The pipe acts as the 'or' operator. Hope this meets all your requirements.
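The substitution above deletes punctuation rather than splitting it off into tokens. If separate punctuation tokens are the goal, one possible starting point is a single alternation that tries abbreviations, numbers and words with apostrophes before falling back to single punctuation marks. This is only a sketch: the list of titles, the sample sentence and the names TOKEN_RE and tokenize are made up for illustration, and it uses re.findall directly rather than NLTK. All groups are non-capturing, so the same pattern could probably be passed to regexp_tokenize, though that is an assumption to verify against your NLTK version.

```python
import re

# Alternatives are tried left to right, so the more specific ones come first.
TOKEN_RE = re.compile(r"""
    [A-Za-z](?:\.[A-Za-z])+\.?      # dotted initialisms: N.P.C, U.S.A.
  | \b(?:Dr|Prof|Mr|Mrs|Ms)\.       # a few titles (illustrative list, extend as needed)
  | \d+(?:[.,]\d+)+                 # numbers with separators: 3,500 or 2.2
  | \w+(?:['’]\w+)*                 # words, keeping inner apostrophes together
  | \.\.\.                          # an ellipsis as one token
  | [^\w\s]                         # any other punctuation mark on its own
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens found by TOKEN_RE, in order."""
    return TOKEN_RE.findall(text)


# Made-up sample, loosely modelled on the text in the question.
sample = ("Dr. Jampolis says the 3,500-calorie rule doesn't hold ... "
          "\"It's a myth,\" from N.P.C Clinic.")

tokens = tokenize(sample)
print(tokens)
```

The order of the alternatives matters: putting the word alternative first would split "Dr." into "Dr" and ".", and "3,500" into three tokens. Abbreviations that are not in the title list and have no internal dots still lose their trailing period.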