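I want to tokenize a text by separating punctuation from words while keeping abbreviations and apostrophes intact. I'm using Python and the nltk library, but I don't think my regular expression is right, because the output is still wrong.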
# coding: utf-8
import re
from nltk.tokenize import regexp_tokenize

# Adjacent string literals inside parentheses are joined into one string.
text = (
    "\"Predictions suggesting that large changes in weight will "
    "accumulate indefinitely in response to small sustained lifestyle "
    "modifications rely on the half-century-old 3,500 calorie rule, which "
    "equates a weight alteration of 2.2 lb to a 3,500 calories cumulative "
    "deficit or increment,\" write the study authors Dr. Jampolis, Dr. "
    "Chaudry, and Prof. Harlen, from N.P.C Clinic in OH. The 3,500- calorie "
    "rule \"predicts that a person who increases daily energy expenditure by "
    "100 calories by walking 1 mile per day\" will lose 50 pounds over five "
    "years, the authors say. But the true weight loss is only about 10 "
    "pounds if calorie intake doesn't increase, \"because changes in mass "
    "... alter the energy requirements of the body’s make-up.\" \"This is a "
    "myth, strictly speaking, but the smaller amount of weight loss achieved "
    "with small changes is clinically significant and should not be "
    "discounted,\" says Dr. Melina Jampolis, CNN diet and fitness expert."
)
# Current attempt: runs of non-digit word characters, or any run of non-whitespace.
print(regexp_tokenize(text, pattern=r'(?:(?!\d)\w)+|\S+'))
Any help is much appreciated.
Answer 0 (score: 0)
This should do the trick. It makes sense to use re.sub here to replace any unwanted punctuation with an empty string (i.e. '').
import re
s = 'Insert your text here'
# Remove each match of the pattern by replacing it with an empty string.
new = re.sub(r'(\"\\\")|(\\\")|[.]{3}|,', '', s)
print(new)
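To see what this substitution does on a small input, here is a minimal sketch. The sample strings, the `PATTERN` constant, and the `strip_punct` helper are made up for illustration; only the pattern itself comes from the answer. The two inputs differ in whether the data really contains backslash characters.

```python
import re

# Same pattern as in the answer; everything else here is illustrative.
PATTERN = r'(\"\\\")|(\\\")|[.]{3}|,'


def strip_punct(s):
    """Remove every match of PATTERN from s."""
    return re.sub(PATTERN, '', s)


# A raw string keeps each backslash as a real character in the data.
with_backslash = r'changes in mass ... alter, \"quoted\"'
# An ordinary string literal turns \" into a plain double quote.
without_backslash = "changes in mass ... alter, \"quoted\""

print(strip_punct(with_backslash))
print(strip_punct(without_backslash))
```

In the second case the double quotes survive, because the text holds no backslashes for the first two alternatives to match.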
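The tricky part of this regex is escaping all the backslashes. Breaking it down:
(\"\\\")
matches any "\" (a double quote, a literal backslash, then a double quote)
(\\\")
matches any \" (a literal backslash followed by a double quote)
[.]{3}
matches any ... (three consecutive periods)
,
matches any ,
The pipe (|) acts as the "or" operator between these alternatives. Note that the first two alternatives only match when the text really contains backslash characters; in the question's string literal, \" is just Python's escape for a plain double quote. Hope this meets all your requirements.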
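If the goal is to split punctuation from words while keeping abbreviations, numbers, and apostrophes together, one possible alternative is a single regex with ordered alternatives, where the more specific cases come first. The sketch below uses only the standard-library `re` module. The `TOKEN_RE` name, the `tokenize` helper, and the short list of titles (Dr, Prof, Mr, Mrs, Ms) are assumptions for illustration, not part of the answer above, and the pattern will certainly miss abbreviations that are not listed.

```python
import re

# Alternatives are tried left to right, so specific cases must come first.
TOKEN_RE = re.compile(r"""
    (?:[A-Z]\.){2,}[A-Z]?          # dotted initialisms such as N.P.C
  | \b(?:Dr|Prof|Mr|Mrs|Ms)\.      # hypothetical, incomplete list of titles
  | \d+(?:[.,]\d+)*                # numbers such as 3,500 or 2.2
  | \w+(?:[-'’]\w+)*               # words, with inner hyphens or apostrophes
  | \.\.\.                         # an ellipsis kept as a single token
  | [^\w\s]                        # any other single punctuation character
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    # All groups are non-capturing, so findall returns whole matches.
    return TOKEN_RE.findall(text)


sample = "write the study authors Dr. Jampolis, Dr. Chaudry, and Prof. Harlen, from N.P.C Clinic in OH."
print(tokenize(sample))
```

With this ordering, "Dr." and "N.P.C" stay whole while the period after "OH" becomes its own token. A word that happens to match a listed title at the end of a sentence would keep its period, which is one reason to treat this as a starting point only.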