Question

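I've started using NLTK with Python to analyze a chat corpus. First I want to find the most frequently used words, and then I want to use LDA to identify the topics of the conversations. I clean the text like this: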

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

stop = set(stopwords.words('english'))
stop.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[',']', '{', '}', '@', '#', 'http', 'https', '/', '://', '_'])
p_stemmer = PorterStemmer()
texts = ['bla bla xxxxxx', 'hahahah', 'haha xx', ... 'hello xx']

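Many of the words I analyze are things like xxxxx, xxx and x, or hahaha and hahahaha. After preprocessing I end up with separate entries for xxxx, xx, and so on. How can I treat xxxx and xx (or hahaha and ha) as the same word? Is there a function that lets me do this?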

Answer 1

Use a regular expression to check all of these tokens:

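import re

texts = ['bla bla', 'hahahah', 'xxx', 'hello']

# One or more repetitions of "ha", or one or more repetitions of "xx"
pattern = re.compile('(?:ha)+|(?:xx)+')

for t in texts:
    # match() only anchors at the start of the string, so a token such as
    # 'hat' would also match; use fullmatch() to test the whole token
    if pattern.match(t):
        print('matched')
    else:
        print('not matched')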

This program checks whether each string starts with one or more repetitions of ha or xx, then prints matched or not matched accordingly.
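
To actually count the variants as one word, each matching token can be replaced with a single canonical form before counting. The following is only a minimal sketch under a few assumptions: the helper name normalize, the canonical forms 'haha' and 'xx', and the two patterns are all made up for illustration and are not part of NLTK, and the sample texts list is a shortened stand-in for the real corpus. It uses fullmatch() so that ordinary words that merely start with "ha" are left alone.

```python
import re
from collections import Counter

# Hypothetical patterns; adjust them to what actually occurs in the corpus.
LAUGH = re.compile(r'(?:ha)+h?', re.IGNORECASE)  # ha, haha, hahahah, ...
X_RUN = re.compile(r'x+', re.IGNORECASE)         # x, xx, xxxxx, ...


def normalize(token):
    """Collapse laugh variants to 'haha' and runs of x to 'xx'."""
    if LAUGH.fullmatch(token):
        return 'haha'
    if X_RUN.fullmatch(token):
        return 'xx'
    return token


# Sample data for illustration only
texts = ['bla bla xxxxxx', 'hahahah', 'haha xx', 'hello xx']

# Split on whitespace, normalize each token, then count
tokens = [normalize(t) for text in texts for t in text.split()]
counts = Counter(tokens)

print(counts['xx'], counts['haha'])
```

The normalization step would probably go before stop-word removal and stemming, so that the stemmer and the later LDA step only ever see the canonical tokens.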
