Given the following input:
Text id
$aKropotkin$bPetr Alekseevich$cKniaz',$f1842-1921. 34
$aKropotkin$bPetr Alekseevich$cKniaz',$f1842-1921. 98
$aKropotkin$bPetr Alekseevich$ckniaz',$f1842-1921. 152
$aKropotkin$bPetr Alekseevich$ckniaz',$f1842-1921. 245
$aKropotkin$bPetr Alekseevich$ckniaz,$f1842-1921 365
$aKropotkin$bPetr Alekseevich$ckniaz,$f1842-1921. 654
$aDescartes$bRene$f1596-1650. 964
$aDescartes$bRene$f1596-1650. 1364
$aDescartes$bRene$f1596-1650. 2547
$aDescartes$bRene$f1596-1650. 3547
$aDescartes$bRene$f1596-1650. 3678
$aDescartes$bRene$f1596-1650 54656
$aDescartes$bRené$f1596-1650 698545
$aDescartes$bRené$f1596-1650. 65455233
$aVoltaire,$f1694-1778. 54666
$aVoltaire,$f1694-1778 365421
$aVoltaire$f1694-1778. 654564
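I only need to create clusters of true duplicates, based on the text in the first column.
I tried the following sample code, but all the texts end up in a single cluster: https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/
I need an approach that uses an algorithm with no false positives at all, producing output like this:
Cluster 1: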
$aKropotkin$bPetr Alekseevich$cKniaz',$f1842-1921. 34,98,152,245,365,654
$aKropotkin$bPetr Alekseevich$ckniaz',$f1842-1921.
$aKropotkin$bPetr Alekseevich$ckniaz,$f1842-1921
$aKropotkin$bPetr Alekseevich$ckniaz,$f1842-1921.
Cluster 2:
$aDescartes$bRene$f1596-1650. 964,1364,2547,3547,3678,54656,698545,65455233
$aDescartes$bRene$f1596-1650.
$aDescartes$bRene$f1596-1650.
$aDescartes$bRene$f1596-1650.
$aDescartes$bRene$f1596-1650.
$aDescartes$bRene$f1596-1650
$aDescartes$bRené$f1596-1650
$aDescartes$bRené$f1596-1650.
Cluster 3:
$aVoltaire,$f1694-1778. 54666,365421,654564
$aVoltaire,$f1694-1778
$aVoltaire$f1694-1778.
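To make the target concrete, here is a minimal sketch of one way such clusters could be produced: build a normalized key for each text and group rows whose keys are exactly equal. The normalization rules (strip accents, lowercase, drop punctuation except the `$` subfield marker) are my assumptions about what counts as a "true duplicate", and the names `normalize`, `cluster` and `rows` are made up for illustration. Because grouping relies on exact key equality, two records only merge if they are identical under these rules.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Build a comparison key; the rules here are assumptions, adjust as needed."""
    # Decompose accented characters and drop the combining marks (é -> e).
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, then keep only word characters, whitespace and '$'.
    cleaned = re.sub(r"[^\w\s$]", "", no_accents.lower())
    # Collapse runs of whitespace into single spaces.
    return " ".join(cleaned.split())


def cluster(rows):
    """Group (text, id) pairs whose normalized keys are exactly equal."""
    groups = defaultdict(list)
    for text, row_id in rows:
        groups[normalize(text)].append((text, row_id))
    return list(groups.values())


# A few rows taken from the input above.
rows = [
    ("$aKropotkin$bPetr Alekseevich$cKniaz',$f1842-1921.", 34),
    ("$aKropotkin$bPetr Alekseevich$ckniaz,$f1842-1921", 365),
    ("$aDescartes$bRene$f1596-1650.", 964),
    ("$aDescartes$bRené$f1596-1650", 698545),
    ("$aVoltaire,$f1694-1778.", 54666),
    ("$aVoltaire$f1694-1778.", 654564),
]

for number, members in enumerate(cluster(rows), start=1):
    ids = ",".join(str(row_id) for _, row_id in members)
    print("Cluster %d: %s" % (number, ids))
    # Print each distinct spelling once, in first-seen order.
    for text in dict.fromkeys(text for text, _ in members):
        print("  " + text)
```

Keeping the `$` markers in the key is deliberate in this sketch, so that subfield boundaries still have to match.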
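Edit:
Here is what I have tried, which I think is the closest to what I am trying to do (but I am asking here for an elegant and efficient solution):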
# -*- coding: utf-8 -*-
import re
import string

from unidecode import unidecode

# Matches any single ASCII punctuation character.
PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))


class Fingerprinter(object):
    '''
    Python implementation of Google Refine fingerprinting algorithm described here:
    https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
    Requires the unidecode module: https://github.com/iki/unidecode
    '''

    def __init__(self, string):
        self.string = self._preprocess(string)

    def _preprocess(self, string):
        '''
        Strip leading and trailing whitespace, lowercase the string, remove all punctuation,
        in that order.
        '''
        return PUNCTUATION.sub('', string.strip().lower())

    def _latinize(self, string):
        '''
        Replaces unicode characters with closest Latin equivalent. For example,
        Alejandro González Iñárritu becomes Alejandro Gonzalez Inarritu.
        '''
        # Python 3 strings are already unicode, so no decode step is needed.
        return unidecode(string)

    def _unique_preserving_order(self, seq):
        seen = set()
        seen_add = seen.add
        return [x for x in seq if not (x in seen or seen_add(x))]

    def get_fingerprint(self):
        '''
        Gets conventional fingerprint.
        '''
        return self._latinize(' '.join(
            self._unique_preserving_order(
                sorted(self.string.split())
            )
        ))

    def get_ngram_fingerprint(self, n=1):
        '''
        Gets ngram fingerprint based on n-length shingles of the string.
        Default is 1.
        '''
        return self._latinize(''.join(
            self._unique_preserving_order(
                sorted([self.string[i:i + n] for i in range(len(self.string) - n + 1)])
            )
        ))


if __name__ == '__main__':
    f = Fingerprinter('Tom Cruise')
    print(f.get_fingerprint())
    print(f.get_ngram_fingerprint(n=1))
    f = Fingerprinter('Cruise, Tom')
    print(f.get_fingerprint())
    print(f.get_ngram_fingerprint(n=1))
    f = Fingerprinter('Paris')
    print(f.get_fingerprint())
    print(f.get_ngram_fingerprint(n=2))
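One thing worth noting about the two fingerprints with respect to the "no false positives" requirement: the n-gram fingerprint with n=1 only keeps the set of characters, so records that differ only in the order of their digits can collide. The sketch below is a simplified, standard-library-only restatement of the two fingerprints (the latinize step is left out, and the function names are mine); the second record is a hypothetical one I invented with different dates, not part of my data.

```python
import re
import string

PUNCTUATION = re.compile("[%s]" % re.escape(string.punctuation))


def preprocess(text):
    # Strip, lowercase, then remove ASCII punctuation.
    return PUNCTUATION.sub("", text.strip().lower())


def word_fingerprint(text):
    # Sorted, de-duplicated whitespace-separated tokens.
    return " ".join(sorted(set(preprocess(text).split())))


def ngram_fingerprint(text, n=1):
    # Sorted, de-duplicated n-character shingles.
    s = preprocess(text)
    grams = {s[i:i + n] for i in range(len(s) - n + 1)}
    return "".join(sorted(grams))


real = "$aDescartes$bRene$f1596-1650."
# Hypothetical record: same name, different (made-up) dates.
other = "$aDescartes$bRene$f1569-1605."

# The word fingerprint keeps the two records apart ...
print(word_fingerprint(real) == word_fingerprint(other))
# ... but the 1-gram fingerprint sees the same set of characters in both.
print(ngram_fingerprint(real, n=1) == ngram_fingerprint(other, n=1))
```

So for strict duplicate detection the word-level fingerprint (or an exact match on a normalized key) looks like the safer choice, while small n-gram sizes trade precision for recall.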