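I'm a beginner in text analysis and information retrieval, and I want to find questions that are 90% or more similar to each other. More specifically, I'm using Python to process a pandas DataFrame structured like this: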
---------------------------
qid |questiontext |
---------------------------
00001|Why do we exist?
00002|Is there life on Mars?
00003|What happens after death?
.........................
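I have already done some preprocessing, such as removing stop words and stemming. Where do I go from here? Comparing all n^2 pairs of questions seems to take a very long time. Should I use a vector model?
Any answers are welcome, including code examples. Thank you for your time!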
Answer 0 (score: 0)
I'm not sure what you mean by similarity, but if you are looking for string similarity, you can use fuzzywuzzy:
import pandas as pd
from fuzzywuzzy import fuzz, process

# sample data
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'text': ['fuzzy wuzzy was a bear',
                            'wuzzy fuzzy was a bear',
                            'Did someone see a fuzzy bear',
                            'some string']})

# create a choice list from the text column
choices = df['text'].values.tolist()

# apply fuzzywuzzy to each row using a lambda expression
# set scorer and limit to whatever is appropriate
# with limit=2, index [0] is the string matched against itself, so [1] is the closest other string
df['close string'] = df['text'].apply(
    lambda x: process.extract(x, choices, limit=2, scorer=fuzz.ratio)[1])

# boolean indexing to find matches with a score >= 90
df[df['close string'].apply(lambda x: x[1]) >= 90]
   id                    text                  close string
0   1  fuzzy wuzzy was a bear  (wuzzy fuzzy was a bear, 91)
1   2  wuzzy fuzzy was a bear  (fuzzy wuzzy was a bear, 91)
It is probably not the fastest option:
2000 loops, best of 3: 2.96 ms per loop
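Since the question also asks about a vector model, here is a minimal sketch of that alternative, not part of the fuzzywuzzy approach above. It builds TF-IDF vectors by hand with the standard library only and uses an inverted index so that only questions sharing at least one term are ever compared. The function names `tfidf_vectors` and `similar_pairs`, the whitespace tokenizer, the smoothed idf formula, and the 0.9 cosine threshold are all assumptions made for illustration. A cosine of 0.9 is not equivalent to a `fuzz.ratio` score of 90.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # smoothed idf (an assumption), so very common terms keep a small weight
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def similar_pairs(docs, threshold=0.9):
    """Return (i, j, cosine) for document pairs with cosine >= threshold."""
    vectors = tfidf_vectors(docs)
    # inverted index: term -> list of (document index, weight)
    index = defaultdict(list)
    for i, vec in enumerate(vectors):
        for term, weight in vec.items():
            index[term].append((i, weight))
    # accumulate dot products only for pairs that share at least one term
    scores = defaultdict(float)
    for postings in index.values():
        for a in range(len(postings)):
            i, wi = postings[a]
            for b in range(a + 1, len(postings)):
                j, wj = postings[b]
                scores[(i, j)] += wi * wj
    return [(i, j, s) for (i, j), s in sorted(scores.items()) if s >= threshold]


questions = [
    "fuzzy wuzzy was a bear",
    "wuzzy fuzzy was a bear",
    "Did someone see a fuzzy bear",
    "some string",
]
pairs = similar_pairs(questions)
for i, j, score in pairs:
    print(questions[i], "|", questions[j], "|", round(score, 3))
```

Because this is a bag-of-words model, word order is ignored, so the first two sample strings come out as identical. The inverted index skips pairs with no terms in common, but a term that appears in most questions still produces close to n^2 work, which is one reason removing stop words first helps.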