Question

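I want to update many columns based on the values in a single column. This is easy with a loop, but when there are many columns and many rows my application takes too long. What is the most elegant way to get the count of each letter?

Desired output: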

   Things         count_A     count_B    count_C     count_D
['A','B','C']         1            1         1          0
['A','A','A']         3            0         0          0
['B','A']             1            1         0          0
['D','D']             0            0         0          2

Answer 1

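The most elegant option is CountVectorizer from sklearn.

I will first show how it works, and then do everything in one line so you can see how elegant it is.

First, step by step:

Let's create some data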

raw = ['ABC', 'AAA', 'BA', 'DD']

things = [list(s) for s in raw]

Then import the packages and initialize the count vectorizer

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

cv = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)

Next, we generate a count matrix

matrix = cv.fit_transform(things)

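# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available since 1.0)
names = ["count_"+n for n in cv.get_feature_names_out()]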

and store it as a data frame

df = pd.DataFrame(data=matrix.toarray(), columns=names, index=raw)

This produces the following data frame:

     count_A  count_B  count_C  count_D
ABC        1        1        1        0
AAA        3        0        0        0
BA         1        1        0        0
DD         0        0        0        2
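
To make clear what the count matrix contains, here is a minimal pure-Python sketch of the same counting using collections.Counter. It is not part of the original answer and is not how CountVectorizer is implemented internally; it is only meant as a cross-check of the table above. The names vocab, rows and header are chosen here for illustration.

```python
from collections import Counter

raw = ['ABC', 'AAA', 'BA', 'DD']
things = [list(s) for s in raw]

# Vocabulary: every distinct letter, sorted alphabetically
vocab = sorted({letter for row in things for letter in row})

# One row of counts per input list, one column per letter in the vocabulary
rows = []
for row in things:
    counts = Counter(row)
    # A Counter returns 0 for letters that do not occur in this row
    rows.append([counts[letter] for letter in vocab])

header = ["count_" + letter for letter in vocab]
print(header)
for label, row_counts in zip(raw, rows):
    print(label, row_counts)
```

Each inner list in rows corresponds to one row of the data frame, in the same column order as header.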

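Elegant version:

Everything in one line (with scikit-learn 1.0 or later, get_feature_names_out() replaces the removed get_feature_names())

df = pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names_out()], index=raw)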

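Timing:

You mentioned that you are working with a fairly large dataset, so I used the %%timeit magic to estimate the run time.

The earlier answer by @piRSquared (which looks very good!)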

pd.concat([s, s.apply(lambda x: pd.Series(x).value_counts()).fillna(0)], axis=1)

100 loops, best of 3: 3.27 ms per loop

My answer:

pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names_out()], index=raw)

1000 loops, best of 3: 1.08 ms per loop

In my test, CountVectorizer was about 3 times faster.
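
The timings above come from the %%timeit magic in IPython. Outside a notebook, a comparable measurement can be sketched with the standard-library timeit module. The count_letters helper below is a hypothetical stand-in for whichever approach you want to time, not code from either answer, and the numbers it prints will differ from those quoted here.

```python
import timeit
from collections import Counter

raw = ['ABC', 'AAA', 'BA', 'DD']
things = [list(s) for s in raw]

def count_letters(rows):
    """Hypothetical stand-in for the code under test: per-row letter counts."""
    return [Counter(row) for row in rows]

# repeat=3 and number=1000 roughly mirror "1000 loops, best of 3"
timings = timeit.repeat(lambda: count_letters(things), repeat=3, number=1000)

# Each entry is the total time for 1000 calls; report the best run per call
best_per_loop = min(timings) / 1000
print(f"best of 3: {best_per_loop * 1e6:.2f} microseconds per loop")
```

With only four short rows, fixed overhead dominates the measurement, so it is worth repeating it on data of realistic size before drawing conclusions.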

Answer 2

Option 1
apply + value_counts

s = pd.Series([list('ABC'), list('AAA'), list('BA'), list('DD')], name='Things')

pd.concat([s, s.apply(lambda x: pd.Series(x).value_counts()).fillna(0)], axis=1)

Option 2
Using pd.DataFrame(s.tolist()) + stack / groupby / unstack

pd.concat([s,
           pd.DataFrame(s.tolist()).stack() \
             .groupby(level=0).value_counts() \
             .unstack(fill_value=0)],
          axis=1)
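
Both options label the new columns with the bare letters (A, B, ...) rather than the count_A, count_B, ... names in the desired output. One possible way to get there, sketched here on top of Option 1 and not taken from the original answer, is DataFrame.add_prefix. The astype(int) call is also an addition, since fillna(0) leaves the counts as floats.

```python
import pandas as pd

s = pd.Series([list('ABC'), list('AAA'), list('BA'), list('DD')], name='Things')

counts = (
    s.apply(lambda x: pd.Series(x).value_counts())  # one column per distinct letter
     .fillna(0)              # letters missing from a row become 0
     .astype(int)            # counts come back as floats because of the NaNs
     .add_prefix('count_')   # A -> count_A, B -> count_B, ...
)

result = pd.concat([s, counts], axis=1)
print(result)
```

The same add_prefix call can be appended after unstack(fill_value=0) in Option 2.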

