I have the following Python list:
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
Now I need to stem it (each word) and get another list back. How do I do that?
Answer 0 (score: 34)
from stemming.porter2 import stem
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
documents = [[stem(word) for word in sentence.split(" ")] for sentence in documents]
What we're doing here is using a list comprehension to loop over each string in the main list, splitting it into a list of words. Then we loop over that list, stemming each word as we go, and return the new list of stems.
Note that I haven't tried this with stemming installed - I took that import from the comments and have never used the package myself. This is, however, the basic concept for splitting the list into words. Note that it produces a list of lists of words, keeping the original separation.
If you don't want that separation, you can do:
documents = [stem(word) for sentence in documents for word in sentence.split(" ")]
instead, which will leave you with one continuous list.
If you want to join the words back together at the end, you can do:
documents = [" ".join(sentence) for sentence in documents]
or do it all in one line:
documents = [" ".join([stem(word) for word in sentence.split(" ")]) for sentence in documents]
to keep the sentence structure, or
documents = " ".join(documents)
to ignore it.
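For illustration, here is a minimal sketch of how the two shapes differ (assuming the third-party stemming package is installed):
from stemming.porter2 import stem
sample = ["System and human system engineering testing of EPS"]
nested = [[stem(word) for word in sentence.split(" ")] for sentence in sample]
flat = [stem(word) for sentence in sample for word in sentence.split(" ")]
print(nested)  # a list of lists of stems, one inner list per sentence
print(flat)    # a single continuous list of stems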
Answer 1 (score: 8)
You might want to take a look at NLTK (the Natural Language Toolkit). It has a module, nltk.stem, which contains a range of different stemmers.
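For example, a quick sketch (assuming NLTK is installed):
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
porter = PorterStemmer()
snowball = SnowballStemmer("english")
print(porter.stem("engineering"))    # e.g. "engin"
print(snowball.stem("engineering"))  # e.g. "engin"
The exact output depends on which stemmer you choose.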
Answer 2 (score: 4)
OK. So, using the stemming package, you would have something like this:
from stemming.porter2 import stem
from itertools import chain
def flatten(listOfLists):
    "Flatten one level of nesting"
    return list(chain.from_iterable(listOfLists))

def stemall(documents):
    return flatten([[stem(word) for word in line.split(" ")] for line in documents])
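Applied to the question's documents list, this returns one flat list of stems. A usage sketch, again assuming the stemming package is installed:
stems = stemall(documents)
print(stems[:4])  # stems of the first four words of the first sentence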
Answer 3 (score: 3)
You can use NLTK:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
final = [[ps.stem(token) for token in sentence.split(" ")] for sentence in documents]
NLTK provides a lot of functionality for IR systems - check it out.
Answer 4 (score: 1)
You can use Whoosh (http://whoosh.readthedocs.io/):
from whoosh.analysis import CharsetFilter, StemmingAnalyzer
from whoosh.support.charset import accent_map
my_analyzer = StemmingAnalyzer() | CharsetFilter(accent_map)
tokens = my_analyzer("hello you, comment ça va ?")
words = [token.text for token in tokens]
print(' '.join(words))
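By default the StemmingAnalyzer tokenizes, lowercases, drops common English stopwords, and stems each token, so the printed result is the normalized text. As a hypothetical sketch, the same analyzer could be applied to the question's list:
stemmed_docs = [" ".join(token.text for token in my_analyzer(doc)) for doc in documents]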
Answer 5 (score: 1)
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# Use a concrete list of words; "list" is a built-in name and was undefined here.
list_stem = [ps.stem(word) for word in documents[0].split(" ")]
Answer 6 (score: 0)
You can use either the PorterStemmer or the LancasterStemmer for stemming, as in the sketch below.
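A minimal comparison, assuming NLTK is installed (Lancaster tends to produce shorter, more aggressive stems):
from nltk.stem import PorterStemmer, LancasterStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()
for word in ["engineering", "trees", "ordering"]:
    print(word, porter.stem(word), lancaster.stem(word))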