Apply a function to a pandas column using information from another column

Time: 2017-05-29 07:18:03

Tags: python pandas apply

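I have a DataFrame that contains text descriptions of a number of people. In addition, I have 4 descriptions: a, b, c, d. For each person's text description, I want to compare it against each of the 4 descriptions using cosine similarity and store the scores in 4 new columns of the same DataFrame: a, b, c, d.

How can I do this the pandas way, without a for loop? I was thinking of using apply, but I don't know how to reference both the 'text' column and the 4 descriptions a, b, c, d inside the applied function.

Thanks a lot for your help!!

What I tried: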

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

person_one = [' '.join(['table','car','mouse'])]
person_two = [' '.join(['computer','card','can','mouse'])]
person_three = [' '.join(['chair','table','whiteboard','window','button'])]
person_four = [' '.join(['queen','king','joker','phone'])]

description_a = [' '.join(['table','yellow','car','king'])]
description_b = [' '.join(['bottle','whiteboard','queen'])]
description_c = [' '.join(['chair','car','car','phone'])]
description_d = [' '.join(['joker','blue','earphone','king'])]

mystuff = [('person 1',person_one),
           ('person 2',person_two),
           ('person 3',person_three),
           ('person 4',person_four)
           ]

labels = ['person','text']

df = pd.DataFrame.from_records(mystuff,columns = labels)
df = df.reindex(columns = ['person','text','a','b','c','d'])

def trying(cell,jd):
    vectorizer = CountVectorizer(analyzer='word', max_features=5000).fit(jd)
    jd_vector = vectorizer.transform(jd)
    person_vector = vectorizer.transform(cell['text'])
    score = cosine_similarity(jd_vector,person_vector)

    return score


df['a'] = df['a'].apply(trying(description_a))
df['b'] = df['b'].apply(trying(description_b))
df['c'] = df['c'].apply(trying(description_c))
df['d'] = df['d'].apply(trying(description_d))

This gives me an error:

df['a'] = df['a'].apply(trying(description_a))
TypeError: trying() missing 1 required positional argument: 'jd'

The output should look like this:

     person                                        text   a   b   c   d
0  person 1                         [table, car, mouse] 0.3 0.2 0.5 0.7
1  person 2                [computer, card, can, mouse] 0.2 0.1 0.9 0.7
2  person 3  [chair, table, whiteboard, window, button] 0.3 0.5 0.1 0.4
3  person 4                 [queen, king, joker, phone] 0.2 0.4 0.3 0.5

2 answers:

Answer 0 (score: 3)

I can't comment yet, but to fix the error:

df['a'] = df['a'].apply(trying(description_a))
TypeError: trying() missing 1 required positional argument: 'jd'

you need to pass the function itself and supply the extra argument through args, like this:

df['a'] = df['a'].apply(trying, args=(description_a,))

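The first argument passed to trying is each value of the Series that apply is called on; the remaining arguments follow in the order given in the args tuple (note the trailing comma, which makes it a one-element tuple). Because trying reads cell['text'], it actually needs a whole row rather than a single value from column 'a', so it should be called through df.apply(..., axis=1) on the DataFrame instead.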
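As a minimal sketch of the row-wise variant, the snippet below passes the function object to DataFrame.apply with axis=1 and forwards the description through args. The name score_row, the shortened two-row DataFrame, and storing each text as a plain string (rather than a one-element list, as in the question) are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative data: a shortened version of the question's DataFrame,
# with each text kept as a plain string (an assumption of this sketch).
df = pd.DataFrame({
    'person': ['person 1', 'person 4'],
    'text': ['table car mouse', 'queen king joker phone'],
})
description_a = 'table yellow car king'


def score_row(row, jd):
    # row is one DataFrame row (a Series) because apply is called with axis=1;
    # jd is the extra positional argument supplied through args.
    vectorizer = CountVectorizer(analyzer='word').fit([jd])
    jd_vector = vectorizer.transform([jd])
    person_vector = vectorizer.transform([row['text']])
    # cosine_similarity returns a 1x1 array here; take the scalar out of it.
    return cosine_similarity(jd_vector, person_vector)[0, 0]


# Pass the function object itself (do not call it);
# the trailing comma makes args a one-element tuple.
df['a'] = df.apply(score_row, axis=1, args=(description_a,))
print(df)
```

Writing trying(description_a) calls the function immediately with a single argument, which is what raises the TypeError; passing the bare name lets apply make the call once per row. As in the question's code, the vocabulary here is fitted on the description only, so words that appear only in a person's text are ignored.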

Hope this helps.

Answer 1 (score: 0)

How about this:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


person_one = ['table','car','mouse']
person_two = ['computer','card','can','mouse']
person_three = ['chair','table','whiteboard','window','button']
person_four = ['queen','king','joker','phone']

description_a = ['table','yellow','car','king']
description_b = ['bottle','whiteboard','queen']
description_c = ['chair','car','car','phone']
description_d = ['joker','blue','earphone','king']

descriptors = {
    'a' : description_a,
    'b' : description_b,
    'c' : description_c,
    'd' : description_d
}

mystuff = [('person 1',person_one),
           ('person 2',person_two),
           ('person 3',person_three),
           ('person 4',person_four)
           ]

labels = ['person','text']
df = pd.DataFrame.from_records(mystuff,columns = labels)

vocabulary_data =[
    person_one,
    person_two,
    person_three,
    person_four,
    description_a,
    description_b,
    description_c,
    description_d,
]

data = [set(sentence) for sentence in vocabulary_data]
vocabulary = set.union(*data)
cv = CountVectorizer(vocabulary=vocabulary)


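def similarity(row, desc):
    # Sum the per-word count vectors into one dense (1, n_vocabulary) row per side.
    # Dense ndarrays are used because newer scikit-learn releases reject np.matrix input.
    row_vec = cv.fit_transform(row['text']).toarray().sum(axis=0, keepdims=True)
    desc_vec = cv.fit_transform(desc).toarray().sum(axis=0, keepdims=True)
    return cosine_similarity(row_vec, desc_vec).item()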

for key, description in descriptors.items():
    df[key] = df.apply(lambda x: similarity(x, description), axis=1)

I did use a for loop, but only to iterate over the different descriptions. The main computation is done through apply.
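If avoiding apply altogether is acceptable, one possible alternative (not part of the original answer) is to vectorize every text once and let cosine_similarity compute the full person-by-description matrix in a single call. The sketch below assumes the texts are stored as space-separated strings and that a single vocabulary shared by all texts is wanted; the names people, descriptions and scores are illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same words as in the question, stored as plain strings (an assumption).
people = {
    'person 1': 'table car mouse',
    'person 2': 'computer card can mouse',
    'person 3': 'chair table whiteboard window button',
    'person 4': 'queen king joker phone',
}
descriptions = {
    'a': 'table yellow car king',
    'b': 'bottle whiteboard queen',
    'c': 'chair car car phone',
    'd': 'joker blue earphone king',
}

df = pd.DataFrame({'person': list(people), 'text': list(people.values())})

# Fit one shared vocabulary so every vector has the same columns.
all_texts = list(df['text']) + list(descriptions.values())
vectorizer = CountVectorizer().fit(all_texts)
person_vectors = vectorizer.transform(list(df['text']))
description_vectors = vectorizer.transform(list(descriptions.values()))

# scores[i, j] is the cosine similarity between person i and description j.
scores = cosine_similarity(person_vectors, description_vectors)

# Attach the scores as columns a, b, c, d, aligned on the existing index.
score_frame = pd.DataFrame(scores, columns=list(descriptions), index=df.index)
df = df.join(score_frame)
print(df)
```

Because the vocabulary covers all texts, words that appear only in a person's text still contribute to that vector's norm, so the scores can differ from a version that fits the vocabulary on each description alone.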