How can I intelligently remove the similar parts from a set of strings?

Asked: 2015-03-19 18:58:18

Tags: python html parsing nlp

For example, this is the input I have:

input = [
   '<html><head><title>Albert Einstein - Minipedia</title></head><body><b>Welcome to Minipedia! You are viewing page 1</b> Albert Einstein was a scientist</body></html>',
   '<html><head><title>Ludwig Van Beethoven - Minipedia</title></head><body><b>Welcome to Minipedia! You are viewing page 2</b> Ludwig van Beethoven was a Musician</body></html>',
   '<html><head><title>Red - Minipedia</title></head><body><b>Welcome to Minipedia! You are viewing page 3</b> Red is a color.</body></html>'
]

The output I am looking for is:

output = [
    ['Albert Einstein', 'Albert Einstein was a scientist'],
    ['Ludwig Van Beethoven', 'Ludwig Van Beethoven was a musician'],
    ['Red', 'Red is a color']
]

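The logic I am looking for is this: if substrings overlap significantly across the documents (that is, the edit distance between them is small enough), they should be stripped out and used to delimit the remaining strings, which differ enough from one document to the next to be the actual content.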
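As a rough illustration of that logic, here is a minimal sketch built on `difflib.SequenceMatcher` from the Python standard library. It is not a dedicated library for this task, and it aligns tokens rather than computing a true edit distance. The names `TOKEN`, `shared_mask`, `strip_shared` and the `min_chars` threshold are hypothetical choices made for this example only.

```python
import re
from difflib import SequenceMatcher

# Hypothetical tokenizer: an HTML tag or a whitespace-delimited word.
TOKEN = re.compile(r"<[^>]+>|[^<\s]+")


def shared_mask(doc, other, min_chars):
    """Flag tokens of `doc` lying in a long enough block also found in `other`."""
    mask = [False] * len(doc)
    matcher = SequenceMatcher(None, doc, other, autojunk=False)
    for start, _, size in matcher.get_matching_blocks():
        block = doc[start:start + size]
        # Very short matches (a stray "a") are treated as coincidence,
        # not as part of the shared template.
        if len(" ".join(block)) >= min_chars:
            mask[start:start + size] = [True] * size
    return mask


def strip_shared(documents, min_chars=4):
    """Return, per document, the text runs not shared with every other document."""
    docs = [TOKEN.findall(text) for text in documents]
    results = []
    for i, doc in enumerate(docs):
        masks = [shared_mask(doc, other, min_chars)
                 for j, other in enumerate(docs) if j != i]
        pieces, current = [], []
        for pos, token in enumerate(doc):
            if masks and all(mask[pos] for mask in masks):
                # Template token: it closes the current unique run.
                if current:
                    pieces.append(" ".join(current))
                    current = []
            else:
                current.append(token)
        if current:
            pieces.append(" ".join(current))
        results.append(pieces)
    return results
```

Called on the three pages above, this sketch should give `['Albert Einstein', '1', 'Albert Einstein was a scientist']` for the first page. The page number survives as its own piece because it also differs between documents, so reaching the desired output would still need an extra filtering step, and the text is returned exactly as it appears in the input (no case normalization or punctuation stripping).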

Is there a library for this?

0 answers:

No answers yet