I have a text with several words. I want to remove all derivational suffixes from the words. For example, I want to remove the suffix -ing and keep the base verb, so that "verifying" or "verified" becomes "verify", for example. I found the `strip` method in Python, which removes a given set of characters from the beginning or end of a string, but that is not what I want. Is there a library in Python that does something like this?
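A quick illustration of why `str.strip`/`str.rstrip` is the wrong tool here: it treats its argument as a *set* of characters to remove, not as a literal suffix, so it can eat far more than the suffix you meant:

```python
# str.rstrip removes ANY trailing characters drawn from the set {'i','n','g'},
# not the literal suffix "ing".
word = "singing"
print(word.rstrip("ing"))  # prints "s" -- every trailing i/n/g is stripped
```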
I tried to run the code from the proposed post, and I noticed odd trimming of several words. For example, I have the following text:
We goin all the way
Think ive caught on to a really good song ! Im writing
Lookin back on the stuff i did when i was lil makes me laughh
I sneezed on the beat and the beat got sicka
#nashnewvideo http://t.co/10cbUQswHR
Homee
So much respect for this man , truly amazing guy @edsheeran
http://t.co/DGxvXpo1OM
What a day ..
RT @edsheeran: Having some food with @ShawnMendes
#VoiceSave christina
Im gunna make the sign my signature pose
You all are so beautiful .. soooo beautiful
Thought that was a really awesome quote
Beautiful things don't ask for attention
After running the following code (I also remove non-Latin characters and URLs), I get:
we goin all the way
think ive caught on to a realli good song im write
lookin back on the stuff i did when i wa lil make me laughh
i sneez on the beat and the beat got sicka
nashnewvideo
home
so much respect for thi man truli amaz guy
what a day
rt have some food with
voicesav christina
im gunna make the sign my signatur pose
you all are so beauti soooo beauti
thought that wa a realli awesom quot
beauti thing dont ask for attent
For example, it trims "beautiful" to "beauti" and "really" to "realli". My code is the following:
import csv, re, string, nltk

reader = csv.reader(f)  # f is an already-open CSV file
results = []
stemmer = nltk.stem.porter.PorterStemmer()
for row in reader:
    # drop @mentions and URLs
    text = re.sub(r"(?:\@|https?\://)\S+", "", row[2])
    # keep only printable ASCII characters (Python 2: filter returns a str)
    text = filter(lambda x: x in string.printable, text)
    # remove punctuation, then digits and remaining non-word characters
    out = text.translate(string.maketrans("", ""), string.punctuation)
    out = re.sub("[\W\d]", " ", out.strip())
    word_list = out.split()
    str1 = ""
    for verb in word_list:
        verb = verb.lower()
        # stem_word is the old NLTK API; newer versions use .stem()
        verb = stemmer.stem_word(verb)
        str1 = str1 + " " + verb
    results.append(str1)
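This behaviour is expected: the Porter stemmer produces *stems*, which are not guaranteed to be dictionary words. A minimal sketch with the current NLTK API (the `stem` method replaced the older `stem_word`):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# Porter stems are truncated forms, not necessarily real words:
print(stemmer.stem("beautiful"))  # beauti
print(stemmer.stem("really"))     # realli (step 1c rewrites a trailing y to i)
print(stemmer.stem("writing"))    # write
```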
Answer 0 (score: 3)
Instead of a stemmer, you can use a lemmatizer. Here is an example with Python's NLTK:
from nltk.stem import WordNetLemmatizer
s = """
You all are so beautiful soooo beautiful
Thought that was a really awesome quote
Beautiful things don't ask for attention
"""
wnl = WordNetLemmatizer()
print " ".join([wnl.lemmatize(i) for i in s.split()]) #You all are so beautiful soooo beautiful Thought that wa a really awesome quote Beautiful thing don't ask for attention
In some cases it may not do what you expect:
print wnl.lemmatize('going') #going
You can then combine both approaches: stemming and lemmatization.
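One way to combine the two (a minimal sketch: the lemma table and suffix stripper below are toy stand-ins for illustration only; in real code you would use NLTK's `WordNetLemmatizer` and `PorterStemmer`): lemmatize first, and fall back to the cruder stemmer only when the lemmatizer leaves the word untouched.

```python
# Toy stand-ins: LEMMAS plays the role of WordNet, toy_stem the role of Porter.
LEMMAS = {"going": "go", "was": "be", "quotes": "quote"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

def toy_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def normalize(word):
    lemma = toy_lemmatize(word)
    # fall back to stemming only when the lemmatizer did nothing
    return toy_stem(word) if lemma == word else lemma

print(normalize("going"))   # go   (found in the lemma table)
print(normalize("writing")) # writ (suffix-stripping fallback)
```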
Answer 1 (score: 3)
Your question is a bit general, but if you have a fixed, static text, the best approach may be to write your own stemmer. The Porter and Lancaster stemmers each follow their own affix-stripping rules, while the WordNet lemmatizer only removes an affix if the resulting word is in its dictionary. You could write something like this:
import re

def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def stemmer(phrase):
    for word in phrase.split():   # split into words; iterating the string itself yields characters
        if stem(word) != word:    # only report words that actually had a suffix
            # the non-greedy (.*?) keeps the stem as short as possible, so
            # 'processes' splits as ('process', 'es'), not ('processe', 's')
            print re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', word)
So for "processing processes" you would get:
>> stemmer('processing processes')
[('process', 'ing')]
[('process', 'es')]
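The answer's point about the WordNet lemmatizer also suggests an easy hardening of the hand-rolled stemmer: only strip a suffix when the remainder is a known word. A minimal sketch (the vocabulary set here is an illustrative stand-in; in practice you might use a word list or WordNet):

```python
# Suffix stripping guarded by a vocabulary check, in the spirit of the
# WordNet lemmatizer: strip only if the result is a known word.
VOCAB = {"process", "write", "quote", "sneeze"}  # illustrative stand-in

SUFFIXES = ["ing", "ly", "ed", "ious", "ies", "ive", "es", "s", "ment"]

def safe_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in VOCAB:
                return candidate
    return word  # no suffix produced a known word; leave it alone

print(safe_stem("processing"))  # process (remainder is in VOCAB)
print(safe_stem("sicka"))       # sicka   (left untouched)
```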