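I am calculating tf-idf values manually in code and then checking them against sklearn's TfidfVectorizer. Strangely, the values don't match, and I can't figure out why. Below are the code I wrote and the sample data. Can someone explain what I am doing wrong? I treat each string as one document and the words in it as tokens.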
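import math
from sklearn.feature_extraction.text import TfidfVectorizer

data = ['aron carey', 'renee marie parilak teate',
        'colt carey', 'renee m teate',
        'ciara acton', 'adrian adams', 'tyson', 'tyson mclure']

def CalcTfIdf(data):
    tfIdf1 = []
    count = 0  # 1-based document number
    for s1 in data:
        count = count + 1
        for tok in s1.split():
            df1 = 0
            # calculate term frequency within this document
            tf = s1.split().count(tok)
            # calculate document frequency: number of documents containing tok
            for doc in data:
                if tok in doc.split():
                    df1 = df1 + 1
            # base-10 log with +1 in the denominator, no normalization
            idf = math.log10(len(data) / (1 + df1))
            tfIdf = tf * idf
            tfIdf1.append((count, tok, tfIdf))
    return tfIdf1

def CalcTfIdfLibrary(data):
    # token_pattern keeps single-character tokens such as 'm'
    vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
    X = vectorizer.fit_transform(data)
    print(X)

print(CalcTfIdf(data))
CalcTfIdfLibrary(data)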
For the 7th document, which contains only the single word 'tyson', sklearn gives a tf-idf value of 1, whereas my calculation gives 0.42. How can it be 1?
Result 1 (my code): [(1, 'aron', 0.6020599913279624), (1, 'carey', 0.4259687322722811), (2, 'renee', 0.4259687322722811), (2, 'marie', 0.6020599913279624), (2, 'parilak', 0.6020599913279624), (2, 'teate', 0.4259687322722811), (3, 'colt', 0.6020599913279624), (3, 'carey', 0.4259687322722811), (4, 'renee', 0.4259687322722811), (4, 'm', 0.6020599913279624), (4, 'teate', 0.4259687322722811), (5, 'ciara', 0.6020599913279624), (5, 'acton', 0.6020599913279624), (6, 'adrian', 0.6020599913279624), (6, 'adams', 0.6020599913279624), (7, 'tyson', 0.4259687322722811), (8, 'tyson', 0.4259687322722811), (8, 'mclure', 0.6020599913279624)]
Result 2 (sklearn):
(0, 3) 0.7664298449085388
(0, 4) 0.6423280258820045
(1, 11) 0.45419450284733365
(1, 8) 0.5419477406385818
(1, 10) 0.5419477406385818
(1, 12) 0.45419450284733365
(2, 4) 0.6423280258820045
(2, 6) 0.7664298449085388
(3, 11) 0.540442546068122
(3, 12) 0.540442546068122
(3, 7) 0.6448594488714666
(4, 5) 0.7071067811865475
(4, 0) 0.7071067811865475
(5, 2) 0.7071067811865475
(5, 1) 0.7071067811865475
(6, 13) 1.0
(7, 13) 0.6423280258820045
(7, 9) 0.7664298449085388
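One way to narrow down the gap is to recompute the sklearn-style numbers by hand. The sketch below is only an approximation of what TfidfVectorizer appears to do with its default settings, under these assumptions: the idf is smoothed and uses the natural log, idf = ln((1 + n) / (1 + df)) + 1, and each document vector is then divided by its L2 norm. The helper name `sklearn_style_tfidf` is hypothetical, and plain `split()` stands in for the tokenizer, which is only equivalent here because the sample data is lowercase and whitespace-separated.

```python
import math
from collections import Counter

data = ['aron carey', 'renee marie parilak teate',
        'colt carey', 'renee m teate',
        'ciara acton', 'adrian adams', 'tyson', 'tyson mclure']

def sklearn_style_tfidf(docs):
    """Hypothetical helper: smoothed natural-log idf, then L2 normalization per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each token occurs
    df = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        # assumed default: idf = ln((1 + n) / (1 + df)) + 1
        raw = {tok: tf[tok] * (math.log((1 + n_docs) / (1 + df[tok])) + 1)
               for tok in tf}
        # assumed default: scale the document vector to unit L2 length
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({tok: v / norm for tok, v in raw.items()})
    return rows

rows = sklearn_style_tfidf(data)
print(rows[0])  # 'aron carey'
print(rows[6])  # 'tyson', a single-token document

# The manual formula from the question, for comparison
manual_tyson = math.log10(len(data) / (1 + 2))  # 'tyson' occurs in 2 documents
print(manual_tyson)
```

Under these assumptions, a document with a single distinct token always ends up with the value 1, because a one-element vector divided by its own length has unit length regardless of the idf. The manual code differs in three places: it uses log base 10, it has no "+ 1" added to the idf, and it does not normalize, which would account for the 0.42 versus 1 discrepancy.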