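I have a DataFrame that contains text descriptions of several people. In addition, I have 4 descriptions: a, b, c, d. For each person's text, I want to compare it against each of the 4 descriptions using cosine similarity and store the scores in the same DataFrame in 4 new columns: a, b, c, d.
How can I do this the pandas way, without using a for loop? I was thinking of using the apply function, but I don't know how to reference the 'text' column and the 4 descriptions a, b, c, d inside the applied function.
Thanks a lot for your help!
What I have tried: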
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
person_one = [' '.join(['table','car','mouse'])]
person_two = [' '.join(['computer','card','can','mouse'])]
person_three = [' '.join(['chair','table','whiteboard','window','button'])]
person_four = [' '.join(['queen','king','joker','phone'])]
description_a = [' '.join(['table','yellow','car','king'])]
description_b = [' '.join(['bottle','whiteboard','queen'])]
description_c = [' '.join(['chair','car','car','phone'])]
description_d = [' '.join(['joker','blue','earphone','king'])]
mystuff = [('person 1',person_one),
('person 2',person_two),
('person 3',person_three),
('person 4',person_four)
]
labels = ['person','text']
df = pd.DataFrame.from_records(mystuff,columns = labels)
df = df.reindex(columns = ['person','text','a','b','c','d'])
def trying(cell,jd):
    vectorizer = CountVectorizer(analyzer='word', max_features=5000).fit(jd)
    jd_vector = vectorizer.transform(jd)
    person_vector = vectorizer.transform(cell['text'])
    score = cosine_similarity(jd_vector,person_vector)
    return score
df['a'] = df['a'].apply(trying(description_a))
df['b'] = df['b'].apply(trying(description_b))
df['c'] = df['c'].apply(trying(description_c))
df['d'] = df['d'].apply(trying(description_d))
This gives me an error:
df['a'] = df['a'].apply(trying(description_a))
TypeError: trying() missing 1 required positional argument: 'jd'
The output should look like this:
   person    text                                         a    b    c    d
0  person 1  [table, car, mouse]                          0.3  0.2  0.5  0.7
1  person 2  [computer, card, can, mouse]                 0.2  0.1  0.9  0.7
2  person 3  [chair, table, whiteboard, window, button]   0.3  0.5  0.1  0.4
3  person 4  [queen, king, joker, phone]                  0.2  0.4  0.3  0.5
Answer 0 (score: 3)
I can't comment yet, but to fix the error:
df['a'] = df['a'].apply(trying(description_a))
TypeError: trying() missing 1 required positional argument: 'jd'
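you need to pass the function itself and supply the extra argument through args, like this:
df['a'] = df['a'].apply(trying, args=(description_a,))
The first argument of trying will be each value of the Series you call apply on, and the remaining arguments follow in the order given in the args tuple. The trailing comma matters: args must be a tuple, and (description_a) without a comma is just description_a. Note also that trying indexes cell['text'], so it appears to expect a whole row rather than a single cell; in that case calling df.apply(trying, axis=1, args=(description_a,)) is probably what you want.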
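As an illustration of how apply forwards extra arguments, here is a minimal sketch on toy data. The names `scale`, `row_total`, `factor` and `offset` are made up for this example and are not part of the question's code.

```python
import pandas as pd

def scale(value, factor):
    # `value` is one element of the Series; `factor` is supplied through `args`.
    return value * factor

s = pd.Series([1, 2, 3])

# Pass the function object (no parentheses) and the extra positional
# arguments as a tuple; the trailing comma makes it a one-element tuple.
scaled = s.apply(scale, args=(10,))

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

def row_total(row, offset):
    # With axis=1 the first argument is a whole row,
    # so columns can be looked up by name.
    return row["x"] + row["y"] + offset

df["total"] = df.apply(row_total, axis=1, args=(100,))
```

The second form (DataFrame.apply with axis=1) is the one that lets the function read a named column such as 'text'.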
Hope this helps.
Answer 1 (score: 0)
How about this:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
person_one = ['table','car','mouse']
person_two = ['computer','card','can','mouse']
person_three = ['chair','table','whiteboard','window','button']
person_four = ['queen','king','joker','phone']
description_a = ['table','yellow','car','king']
description_b = ['bottle','whiteboard','queen']
description_c = ['chair','car','car','phone']
description_d = ['joker','blue','earphone','king']
descriptors = {
    'a' : description_a,
    'b' : description_b,
    'c' : description_c,
    'd' : description_d
}
mystuff = [('person 1',person_one),
('person 2',person_two),
('person 3',person_three),
('person 4',person_four)
]
labels = ['person','text']
df = pd.DataFrame.from_records(mystuff,columns = labels)
vocabulary_data = [
    person_one,
    person_two,
    person_three,
    person_four,
    description_a,
    description_b,
    description_c,
    description_d,
]
data = [set(sentence) for sentence in vocabulary_data]
vocabulary = set.union(*data)
cv = CountVectorizer(vocabulary=vocabulary)
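def similarity(row, desc):
    # Each token becomes one count row; summing the rows gives a single (1, n_vocab) vector.
    # toarray() yields plain ndarrays, because some scikit-learn versions reject np.matrix.
    person_vec = cv.fit_transform(row['text']).toarray().sum(axis=0, keepdims=True)
    desc_vec = cv.fit_transform(desc).toarray().sum(axis=0, keepdims=True)
    return cosine_similarity(person_vec, desc_vec).item()
# Loop only over the four descriptions; each column is filled row by row via apply.
for key, description in descriptors.items():
    df[key] = df.apply(lambda x: similarity(x, description), axis=1)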
I did use a for loop, but only to iterate over the different descriptions. The main computation is done through apply.
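If avoiding both the loop and apply is the goal, one possible alternative is to vectorize all texts once and let cosine_similarity compute every person/description pair in a single call. This is only a sketch under some assumptions: the dictionaries `people` and `descriptions` are names introduced here, and the pass-through `analyzer` is my own choice because the texts are already lists of tokens.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same toy data as above, held in dictionaries (names assumed for this sketch).
people = {
    "person 1": ["table", "car", "mouse"],
    "person 2": ["computer", "card", "can", "mouse"],
    "person 3": ["chair", "table", "whiteboard", "window", "button"],
    "person 4": ["queen", "king", "joker", "phone"],
}
descriptions = {
    "a": ["table", "yellow", "car", "king"],
    "b": ["bottle", "whiteboard", "queen"],
    "c": ["chair", "car", "car", "phone"],
    "d": ["joker", "blue", "earphone", "king"],
}

df = pd.DataFrame({"person": list(people), "text": list(people.values())})

# The texts are already tokenized, so the analyzer just returns the token list.
cv = CountVectorizer(analyzer=lambda tokens: tokens)
cv.fit(list(people.values()) + list(descriptions.values()))

person_counts = cv.transform(df["text"])                 # one row per person
desc_counts = cv.transform(list(descriptions.values()))  # one row per description

# scores[i, j] is the cosine similarity between person i and description j.
scores = cosine_similarity(person_counts, desc_counts)

# Attach the scores as columns a, b, c, d, aligned on the existing index.
scores_df = pd.DataFrame(scores, index=df.index, columns=list(descriptions))
df = df.join(scores_df)
```

Because the whole score matrix comes from one call, this avoids fitting or transforming inside a per-row function.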