Removing similar documents in Python

Time: 2017-03-20 12:07:26

Tags: python algorithm nlp

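I have a folder with subtitles for TV series. I want to keep one subtitle file per episode from that folder. My problem is that some subtitles are for the same episode but have different names, such as: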

/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.S09E02.720p.HDTV.x264-MOMENTUM.HI.srt
/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.902.720p.HDTV.x264.MOMENTUM.srt
/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.9X02.HDTV.XviD-MOMENTUM.HI.srt
/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.S09E02.HDTV.XviD-MOMENTUM.srt

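So they are very similar, but not 100% identical.

How can I remove the duplicate documents and keep only the subtitles of distinct episodes?
I would attach what I have tried, but unfortunately I am clueless...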

1 Answer:

Answer 0: (score: 6)

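You can use cosine similarity between the documents.

Assuming that similar documents have a high cosine similarity, you can then apply a threshold above which two documents are considered the same.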
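For reference, the standard definition of cosine similarity is given below. The notation is an assumption introduced here: a and b are the two documents' word-count vectors, and a_w is the number of times word w occurs in the first document.

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{w} a_w b_w}{\sqrt{\sum_{w} a_w^2} \, \sqrt{\sum_{w} b_w^2}}
```

For count vectors the value lies between 0 (no words in common) and 1 (identical word proportions).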

For example, if these are your documents:

1. "the child went home today, and his mother waited for him"
2. "My car is big, so said my mother"
3. "The kid went to his house today, while his mama waited for him to come"

Using the code from vpekar's answer, I do the following:

>>> v1 = text_to_vector("the child went home today, and his mother waited for him")
>>> v2 = text_to_vector("My car is big, so said my mother")
>>> v3 = text_to_vector("The kid went to his house today, while his mama waited for him to come")
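The two helpers called above are not reproduced in this answer. A minimal sketch of what they might look like follows. It assumes each document is turned into a bag of raw word counts (case-sensitive, tokenized with the regular expression `\w+`). That tokenization is an assumption, not necessarily the exact code that was used.

```python
import math
import re
from collections import Counter

# Assumption: a "word" is any run of letters, digits or underscores.
WORD = re.compile(r"\w+")


def text_to_vector(text):
    """Return a bag-of-words vector: a mapping from each word to its count."""
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return numerator / (norm1 * norm2)
```

Because nothing is lowercased in this sketch, "The" and "the" count as different words.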

The cosine similarities between the vectors are:

>>> get_cosine(v1,v2)
0.10660035817780521

>>> get_cosine(v1,v3)
0.48420012470625223

>>> get_cosine(v2,v3)
0.0

So you can clearly see that documents 1 and 3 are the most similar, and are therefore probably subtitles of the same episode. To summarize:

1. You need to make (n choose 2) comparisons, i.e. check every possible pair of documents.
2. If the cosine similarity between two documents is higher than a threshold, which you will have to find by trial and error,
    the subtitles are probably of the same episode and you should remove one of them.
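The two steps above can be sketched as follows. This is only an illustration under stated assumptions: the names `cosine` and `find_duplicates` are hypothetical, the word-count tokenization is assumed, and the threshold of 0.4 is an arbitrary value picked for the toy documents, not a recommendation.

```python
import math
import re
from collections import Counter
from itertools import combinations

WORD = re.compile(r"\w+")  # assumed tokenization: runs of word characters


def cosine(text1, text2):
    """Cosine similarity between the word-count vectors of two texts."""
    v1, v2 = Counter(WORD.findall(text1)), Counter(WORD.findall(text2))
    numerator = sum(v1[w] * v2[w] for w in set(v1) & set(v2))
    denominator = (math.sqrt(sum(c * c for c in v1.values()))
                   * math.sqrt(sum(c * c for c in v2.values())))
    return numerator / denominator if denominator else 0.0


def find_duplicates(docs, threshold):
    """Return the names of documents to drop, keeping the first of each similar pair.

    docs maps a name (for example a file path) to the document text.
    """
    to_drop = set()
    for a, b in combinations(docs, 2):  # every unordered pair: n choose 2
        if a in to_drop or b in to_drop:
            continue  # one of the two is already marked as a duplicate
        if cosine(docs[a], docs[b]) > threshold:
            to_drop.add(b)  # keep a, drop the later document
    return to_drop


docs = {
    "doc1": "the child went home today, and his mother waited for him",
    "doc2": "My car is big, so said my mother",
    "doc3": "The kid went to his house today, while his mama waited for him to come",
}
duplicates = find_duplicates(docs, threshold=0.4)  # 0.4 is illustrative only
kept = [name for name in docs if name not in duplicates]
```

For the subtitle folder, `docs` could be built by reading each .srt file into a string keyed by its path. Skipping documents that are already marked keeps one representative per group of near-duplicates, at the cost of not comparing a dropped file against the remaining ones.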