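I am investigating how to identify patterns in text questions that share the same answer. For example, consider the questions in the data array below. The first three, data[0:3], share the analysis "'All Tabs' panel code removed in bug 670684", while the last three, data[3:6], share a different analysis: "Please post the related Report IDs from about:crashes."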
data = [
"""
If you are navigating through the list of open tabs inside the All+Tabs panel and you wanna filter by a term you have to select the search text field first.
It would be nice if any entered character is automatically routed to the search field and the filter gets applied.
""",
"""
In maximized mode there is something like 3 pixels padding on the right side of "All tabs" panel.
It doesn't exist on the left side of panel and in not maximized mode.
""",
"""
When you have the All+Tabs panel open it would be great if you can press Cmd/Ctrl+F to focus the search text field. Right now the panel gets hidden and the Find toolbar is shown without focus.
IMO using the command inside the All+Tabs panel would make more sense.
""",
"""
Steps to reproduce:
Nothing... had multiple windows and tiles open... for about 4 hours
Actual results:
Crashed without warning
""",
"""
Firefox crashes at leat 6 times a day. Installed latest version but still crashing. Goes very slow before it crashes.
""",
"""
Steps to reproduce:
W have installed Firefox 18 (as we did with all previous version) on Solaris 10 SPAC 64b
Actual results:
When we tried to start it form a console, it crached with a message: Segmentation fault.
And it produced a core dump
Expected results:
Firefox should have open correctly
"""
]
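Evaluating the text similarity of these questions with scikit-learn (tf-idf vectors and cosine similarity) did not help. For example, even though the first three questions, data[0:3], have the same solution, their highest pairwise similarity is only 0.27%.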
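For reference, the similarity computation looks roughly like the sketch below. It is a minimal reconstruction, not my exact script: the docs list holds shortened, hypothetical stand-ins for the reports (in practice I pass data), the helper name pairwise_tfidf_similarity is made up for this post, and the use of English stop words is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical, shortened stand-ins for the reports; replace with `data`.
docs = [
    "Typing in the All+Tabs panel should go to the search field.",
    "Extra padding on the right side of the All tabs panel when maximized.",
    "Firefox crashes several times a day without warning.",
]


def pairwise_tfidf_similarity(texts):
    """Return the cosine-similarity matrix of the tf-idf vectors of `texts`."""
    # Assumption: English stop words are removed before weighting.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    # Entry [i, j] is the cosine similarity between texts i and j.
    return cosine_similarity(tfidf)


sim = pairwise_tfidf_similarity(docs)
print(sim.round(2))
```

Each row of sim compares one report with all the others, so the off-diagonal entries within a group are the values I was hoping would be high.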
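I would therefore like to use a different approach to identify patterns in the submitted questions, so that when a future question with similar characteristics comes in, I can recommend the same analysis. Ideally, such a pattern would be something like "mentions the All+Tabs panel" or "the application crashed".
Assume that at this point I can already cluster the questions (unstructured text files) that have similar analyses. What strategy would you suggest for identifying possible patterns within the array of questions?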
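To make the kind of pattern I have in mind concrete, here is a rough sketch, assuming the questions are already grouped by analysis. It lists terms that appear in most reports of one group and in no report of any other group. The group labels, the shortened report texts, the tiny stop-word set, the min_share threshold, and the function name characteristic_terms are all hypothetical and only illustrate the idea; I am not claiming this is the right strategy.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stop-word set, only for this illustration.
STOP = {"a", "the", "of", "on", "in", "it", "with", "without", "to", "and", "is"}


def tokens(text):
    """Return the set of lowercase alphabetic words in `text`, minus stop words."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOP


def characteristic_terms(groups, min_share=0.6):
    """Map each group label to the terms found in at least `min_share` of
    that group's reports and in no report belonging to any other group."""
    token_sets = {label: [tokens(t) for t in texts] for label, texts in groups.items()}
    result = {}
    for label, sets in token_sets.items():
        # Number of reports in this group that contain each term.
        counts = Counter(term for s in sets for term in s)
        # Every term seen in any report of the other groups.
        others = set().union(
            *(s for other, ss in token_sets.items() if other != label for s in ss)
        )
        result[label] = sorted(
            term
            for term, n in counts.items()
            if n / len(sets) >= min_share and term not in others
        )
    return result


# Hypothetical, shortened reports grouped by the analysis they received.
groups = {
    "all_tabs": [
        "Search field in the All+Tabs panel needs focus.",
        "Padding on the right of the All tabs panel.",
        "Cmd+F inside the All+Tabs panel hides the panel.",
    ],
    "crash": [
        "Firefox crashed without warning.",
        "Firefox crashes six times a day.",
        "It crashed with a segmentation fault.",
    ],
}

patterns = characteristic_terms(groups)
print(patterns)
```

A new question could then be matched against each group's term list, although exact word matching already misses variants such as "crashed" versus "crashes", which is part of why I am asking for a better strategy.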
Edit: added a brief explanation of how the text similarity of the questions is computed.