What is an efficient way to compare thousands of questions for high similarity?

Asked: 2019-01-31 01:31:49

Tags: python pandas nlp information-retrieval

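I'm a beginner in text analytics and information retrieval, and I want to find questions that are 90% or more similar to each other. More specifically, I'm working in Python with a pandas DataFrame structured like this: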

---------------------------
qid  |questiontext          |
---------------------------
00001|Why do we exist?
00002|Is there life on Mars?
00003|What happens after death?
...

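I've already done some preprocessing, such as removing stop words and stemming. Where do I go from here? Comparing all n^2 pairs of questions seems to take a very long time. Should I use a vector space model?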

Any answers are welcome, including code examples. Thank you for your time!

1 answer:

Answer 0 (score: 0)

I'm not sure what you mean by similarity, but if you're looking for string similarity, you can use fuzzywuzzy:

import pandas as pd
from fuzzywuzzy import fuzz, process

# sample data
df = pd.DataFrame({'id':[1,2,3,4], 'text':['fuzzy wuzzy was a bear',
                                             'wuzzy fuzzy was a bear',
                                             'Did someone see a fuzzy bear',
                                             'some string']})

# create a choice list from text column
choices = df['text'].values.tolist()

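# apply fuzzywuzzy to each row using a lambda expression
# limit=2 returns the two best matches; index [1] takes the second one,
# because the best match is normally the row's own text
# set scorer and limit to whatever is appropriate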
df['close string'] = df['text'].apply(lambda x: process.extract(x, choices, limit=2,
                                                                scorer=fuzz.ratio)[1])

# boolean indexing to find matches of >= 90
df[df['close string'].apply(lambda x: x[1]) >= 90]

    id                   text                  close string
0   1   fuzzy wuzzy was a bear  (wuzzy fuzzy was a bear, 91)
1   2   wuzzy fuzzy was a bear  (fuzzy wuzzy was a bear, 91)

fuzzywuzzy documentation

It's probably not the fastest option, though:

2000 loops, best of 3: 2.96 ms per loop
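
Since the question also asks about a vector model, here is a rough sketch of that route: TF-IDF weighted bag-of-words vectors compared by cosine similarity, with an inverted index so that only questions sharing at least one term are ever compared. It is written with the standard library only so it runs as-is; in practice something like scikit-learn's `TfidfVectorizer` with sparse matrices would likely be the more usual choice. The names `tokenize`, `tfidf_vectors`, and `similar_questions` are made up for illustration, and the asker's own stop-word removal and stemming would go into `tokenize`.

```python
import math
import re
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(text):
    # Placeholder tokenizer: lowercase alphanumeric runs only
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Build L2-normalised TF-IDF vectors as {term: weight} dicts."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        # Smoothed IDF, so no term gets a zero or negative weight
        vec = {t: tf * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
               for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def similar_questions(texts, threshold=0.9):
    """Return (i, j, cosine) for pairs with cosine similarity >= threshold."""
    vectors = tfidf_vectors(texts)

    # Inverted index: term -> ids of the questions that contain it
    index = defaultdict(list)
    for i, vec in enumerate(vectors):
        for term in vec:
            index[term].append(i)

    # Only pairs sharing a term can have non-zero cosine similarity
    candidates = set()
    for ids in index.values():
        candidates.update(combinations(ids, 2))

    results = []
    for i, j in sorted(candidates):
        # Vectors are unit length, so the dot product is the cosine
        sim = sum(w * vectors[j].get(t, 0.0) for t, w in vectors[i].items())
        if sim >= threshold:
            results.append((i, j, round(sim, 3)))
    return results


questions = ["Is there life on Mars?",
             "is there life on mars",
             "What happens after death?"]
print(similar_questions(questions))
```

Keep in mind that this measures something different from `fuzz.ratio`: word order is ignored, so 'fuzzy wuzzy was a bear' and 'wuzzy fuzzy was a bear' would count as identical. Very common terms still produce many candidate pairs, which is one more reason to remove stop words first.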