Question

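Hi all, I have a CSV file that contains data in the following format:

A   a
A   b
B   f
B   g
B   e
B   h
C   d
C   e
C   f

The first column contains items, and the second column contains the features available for each item, taken from the feature vector [a, b, c, d, e, f, g, h]. I want to convert this into an occurrence matrix, with one row per item, one column per feature, and a 1 wherever the item has that feature.

Can anyone tell me how to do this using pandas?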

Answer 1

Here is another way to do it, using pd.get_dummies().

import pandas as pd

# your data
# =======================
df

  col1 col2
0    A    a
1    A    b
2    B    f
3    B    g
4    B    e
5    B    h
6    C    d
7    C    e
8    C    f

# processing
# ===================================
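# one-hot encode col2, then take the per-item maximum of each indicator column
# (dtype=int keeps the 1/0 output on pandas versions where get_dummies returns booleans)
pd.get_dummies(df.col2, dtype=int).groupby(df.col1).max()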

      a  b  d  e  f  g  h
col1                     
A     1  1  0  0  0  0  0
B     0  0  0  1  1  1  1
C     0  0  1  1  1  0  0
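
The question's feature vector also lists c, which never occurs in the sample data, so the result above has no c column. A minimal sketch of one way to include it, assuming the full feature list is known in advance (the names features and occurrence are illustrative, not from the original answer):

```python
import pandas as pd

# Sample data from the question; the column names col1/col2 follow this answer.
df = pd.DataFrame({
    "col1": list("AABBBBCCC"),
    "col2": list("abfgehdef"),
})

# Full feature vector from the question ('c' never occurs in the data).
features = list("abcdefgh")

occurrence = (
    pd.get_dummies(df["col2"], dtype=int)        # one indicator column per observed feature
      .groupby(df["col1"]).max()                 # 1 if the item has the feature at least once
      .reindex(columns=features, fill_value=0)   # add absent features as all-zero columns
)
print(occurrence)
```

reindex only adds or reorders columns here; the values for features that are present are unchanged.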

Answer 2

It's not clear whether your data has a typo, but you can use crosstab for this:

In [95]:
pd.crosstab(index=df['A'], columns = df['a'])

Out[95]:
a  b  d  e  f  g  h
A                  
A  1  0  0  0  0  0
B  0  0  1  1  1  1
C  0  1  1  1  0  0

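In your sample data, the value a in the second column has been used as that column's name, but in your expected output it appears as a value in the column.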
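
The mismatch described above can be reproduced with a short sketch. It assumes the file is whitespace-separated and has no header row; the variable names and the labels item and feature are illustrative only:

```python
import io
import pandas as pd

raw = "A a\nA b\nB f\nB g\nB e\nB h\nC d\nC e\nC f"

# Default header handling: the first data row ("A a") becomes the column names,
# so that row is missing from the data.
with_header = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# header=None keeps every row as data; the column names are supplied explicitly.
no_header = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None,
                        names=["item", "feature"])

print(list(with_header.columns), len(with_header))
print(list(no_header.columns), len(no_header))
```

Reading with the default settings silently drops the (A, a) pair, which is why the first crosstab result has no a column.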

Edit

OK, I fixed the input data so that it now produces the correct result:

In [98]:
import pandas as pd
import io
t="""A a
A b
B f
B g
B e
B h
C d
C e
C f"""
# header=None stops the first row from being consumed as column names
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, names=['A','a'])
df

Out[98]:
   A  a
0  A  a
1  A  b
2  B  f
3  B  g
4  B  e
5  B  h
6  C  d
7  C  e
8  C  f

In [99]:
ct = pd.crosstab(index=df['A'], columns = df['a'])
ct

Out[99]:
a  a  b  d  e  f  g  h
A                     
A  1  1  0  0  0  0  0
B  0  0  0  1  1  1  1
C  0  0  1  1  1  0  0

Answer 3

This approach produces the same result as a scipy sparse COO matrix, and does so much faster:

from scipy import sparse

df['col1'] = df['col1'].astype("category")
df['col2'] = df['col2'].astype("category")
df['ones'] = 1
user_items = sparse.coo_matrix((df.ones.astype(float),
                               (df.col1.cat.codes,
                                df.col2.cat.codes)))
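
The sparse matrix carries no row or column labels. As a rough sketch of how to check it against the other answers, the category lists can be used to rebuild a labelled dense frame (the name dense and the sample frame are assumptions for illustration; this is only sensible for small data):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Sample data from the question, stored as categoricals as in the snippet above.
df = pd.DataFrame({
    "col1": pd.Categorical(list("AABBBBCCC")),
    "col2": pd.Categorical(list("abfgehdef")),
})

# One entry of 1.0 per (item, feature) pair, addressed by category codes.
user_items = sparse.coo_matrix(
    (np.ones(len(df)),
     (df["col1"].cat.codes.to_numpy(), df["col2"].cat.codes.to_numpy()))
)

# Category order matches the integer codes, so it gives the row/column labels.
dense = pd.DataFrame(
    user_items.toarray().astype(int),
    index=df["col1"].cat.categories,
    columns=df["col2"].cat.categories,
)
print(dense)
```

Note that if the same (item, feature) pair appears more than once in the input, converting the COO matrix sums the duplicate entries, so the cell holds a count rather than a 0/1 flag.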

Convert a two-column data frame to an occurrence matrix in pandas
