Question

I have a DataFrame containing text and a result:

             Text    Result
0  some text...      True
1  another one...    False

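I have a function that extracts features from the text. It returns a dict with about 1000 keys; the keys are words and the values are True/False depending on whether the word appears in the text.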

words = ["some", "text", "another", "one", "other", "words"]
def extract(text):
    result = dict()
    for w in words:
        result[w] = (w in text)
    return result

The result I expect is:

             Text    some   text   another  one    other  words  Result
0  some text...      True   True   False    False  False  False  True
1  another one...    False  False  True     True   False  False  False

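But I don't know how to apply this function to the DataFrame. So far I have created the columns with a default value of False, but I don't know how to fill in the True values.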

for feature in words:
    df[feature] = False

I assume there is a better way to do this in pandas?

Answer 1

Use pd.Series.str.get_dummies together with pd.DataFrame.reindex:

exp = (
    df.Text.str.get_dummies(' ')
      .reindex(columns=words, fill_value=0)
      .astype(bool)
)

df.drop(columns='Result').join(exp).join(df.Result)

          Text   some   text  another    one  other  words  Result
0    some text   True   True    False  False  False  False    True
1  another one  False  False     True   True  False  False   False

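Explanation

get_dummies produces a dummy column for every word it finds, which is straightforward. I then use reindex so that the columns are exactly the words we care about. fill_value and astype(bool) are there to match the OP's output. Using drop and join(df.Result) is a concise way to move Result to the end of the DataFrame.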
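
One caveat worth checking, shown here as a minimal sketch: str.get_dummies(' ') splits on spaces and compares whole tokens, while the extract function in the question tests substrings with `in`. With the question's original strings, which end in "...", the two approaches can disagree. The sample frame below is assumed from the question.

```python
import pandas as pd

words = ["some", "text", "another", "one", "other", "words"]
# Sample data assumed from the question, including the trailing "..."
df = pd.DataFrame({"Text": ["some text...", "another one..."],
                   "Result": [True, False]})

# Whole-token matching: split on spaces, then keep only the words we care about
tokens = (
    df["Text"].str.get_dummies(" ")
      .reindex(columns=words, fill_value=0)
      .astype(bool)
)

# Substring matching, the same test as `w in text` in the question's extract()
substrings = pd.DataFrame(
    {w: [w in t for t in df["Text"]] for w in words}
)

# "text..." is a token of its own, so the token approach misses "text"
print(tokens["text"].tolist())      # [False, False]
print(substrings["text"].tolist())  # [True, False]
```

Substring matching has its own quirk: it reports "other" as present in "another one...", so which behavior is wanted depends on the data.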

Answer 2

You can apply a function to a DataFrame column like this:

def func(value):  # some function that you want to apply to each value in a column
    return None

new_row = df['column_name'].apply(func)

After that, you can add new_row to the existing DataFrame.

There is also a similar function for applying a function to the entire DataFrame.
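
As a rough sketch of the row-wise variant, assuming the similar function meant here is DataFrame.apply with axis=1: the function receives each row as a Series, so it can read any column. The extract function and sample data are taken from the question; the name row_features is hypothetical.

```python
import pandas as pd

words = ["some", "text", "another", "one", "other", "words"]
df = pd.DataFrame({"Text": ["some text...", "another one..."],
                   "Result": [True, False]})

def extract(text):
    # Same substring test as in the question
    return {w: (w in text) for w in words}

def row_features(row):
    # `row` is a Series holding every column of one row; returning a Series
    # makes apply() assemble the results into a DataFrame
    return pd.Series(extract(row["Text"]))

features = df.apply(row_features, axis=1)
result_df = df.join(features)
print(result_df.columns.tolist())
```

Row-wise apply calls a Python function once per row, so it is convenient rather than fast; it is mainly useful when the new columns depend on more than one existing column.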

Edit:

import pandas as pd

df = pd.DataFrame(['some text...', 'another one...'], columns=['Text'])
words = ["some", "text", "another", "one", "other", "words"]

def extract(text):
    # Map each word to True/False depending on whether it occurs in the text
    return {w: (w in text) for w in words}

# Turn the per-row dicts into a DataFrame, with the columns in the order of `words`
new_cols = pd.DataFrame(df['Text'].apply(extract).tolist(), columns=words)
result_df = pd.concat([df, new_cols], axis=1)
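
If only substring flags are needed, a loop over the words with Series.str.contains avoids calling a Python function for every row. This is a sketch of an alternative, not part of the answer above; regex=False is passed so that each word is treated as a literal string.

```python
import pandas as pd

df = pd.DataFrame(['some text...', 'another one...'], columns=['Text'])
words = ["some", "text", "another", "one", "other", "words"]

# One vectorized substring check per word; each call fills a whole column
for w in words:
    df[w] = df['Text'].str.contains(w, regex=False)

print(df)
```

This also replaces the question's loop that pre-creates every column with False, since each column is assigned its final values directly.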

Execute a function that adds columns and fills them depending on other columns in Pandas
