Question

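I'm trying to compute a correlation in pandas, and it's giving me some trouble. Basically, I want to answer the following question: given a DataFrame with a sentence and a value in each row, which word is associated with the highest values? And which with the lowest?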

Trivial example:

Sentence      | Score
"hello there" | 100
"hello kid"   | 95
"there kid"   | 5

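I would expect to see a high correlation value between the word "hello" and the score here. Hopefully this makes sense. If this is possible natively in pandas, I'd really appreciate knowing how!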

Let me know if anything is unclear.

Answer 1

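Here is one way. For each word, take the mean score over all the sentences in which it appears. For example, "hello" gets 97.5 [(100 + 95) / 2], "there" gets 52.5 [(100 + 5) / 2], and so on.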

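from collections import defaultdict
import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'Score': {0: 100, 1: 95, 2: 5},
                             'Sentence': {0: 'hello there', 1: 'hello kid', 2: 'there kid'}})

# Split each sentence into a list of words
df['WordList'] = df['Sentence'].str.split(' ')

# Collect, for every word, the scores of the sentences it appears in
d = defaultdict(list)

for idx, row in df.iterrows():
    for word in row['WordList']:
        d[word].append(row['Score'])

# Average the collected scores per word (as plain Python floats)
d = {k: float(np.mean(v)) for k, v in d.items()}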

{'hello': 97.5, 'there': 52.5, 'kid': 50.0}
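
The same per-word averages can probably also be computed without an explicit loop. The following is only a sketch, assuming pandas 0.25 or later (needed for `explode`); the names `Word` and `word_means` are made up for illustration and are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Sentence": ["hello there", "hello kid", "there kid"],
    "Score": [100, 95, 5],
})

# "Word" and "word_means" are illustrative names, not from the original answer.
word_means = (
    df.assign(Word=df["Sentence"].str.split())  # list of words per sentence
      .explode("Word")                          # one row per (word, score) pair
      .groupby("Word")["Score"]
      .mean()                                   # mean score per word
      .sort_values(ascending=False)
)
print(word_means)
```

As with the loop, a word that occurs twice in one sentence contributes that sentence's score twice.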

Answer 2

I'm not sure pandas is what you're looking for, but you can do this:

import pandas as pd

df = pd.DataFrame([ ["hello there", 100],
                    ["hello kid",   95],
                    ["there kid",   5]
                  ], columns = ['Sentence','Score'])

s_corr = df.Sentence.str.get_dummies(sep=' ').corrwith(df.Score/df.Score.max())
print(s_corr)

This gives you:

hello    0.998906
kid     -0.539949
there   -0.458957

For details, see the pandas documentation.
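
To answer the original "which word is best, which is worst" question directly, one option is to take `idxmax` and `idxmin` of the correlation Series. This is a sketch rather than the only way to do it; `indicators`, `best_word` and `worst_word` are illustrative names. It also drops the division by `df.Score.max()`, on the assumption that only the Pearson correlation is wanted, since that is unchanged by rescaling the scores.

```python
import pandas as pd

df = pd.DataFrame(
    [["hello there", 100], ["hello kid", 95], ["there kid", 5]],
    columns=["Sentence", "Score"],
)

# One 0/1 indicator column per word (illustrative name: indicators)
indicators = df["Sentence"].str.get_dummies(sep=" ")

# Pearson correlation of each word's indicator column with the score
s_corr = indicators.corrwith(df["Score"])

best_word = s_corr.idxmax()   # word most positively correlated with Score
worst_word = s_corr.idxmin()  # word most negatively correlated with Score
print(best_word, worst_word)
```

With only three rows these correlations are not statistically meaningful; they simply rank the words for this toy example.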
