Question

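I have managed to get the final result I need, but there must be a more efficient way to do it. Let me walk you through it:

I have 100 columns of opinion data covering 20 categories.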

This is what the data looks like

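In the image above, HEALTH and JOB are 2 of the 20 categories. Candidates were asked to rate how important each category is to them personally. For each one they either strongly disagree (1), disagree (2), have no opinion (3), agree (4) or strongly agree (5).

What I want is to create one new column per category and stack the values, so that a single column holds the candidate's answer instead of it being spread across 5 columns. It has already been established that no candidate gave two answers for the same category. The green columns in the image above show the desired result.

This is the inefficient route I took:

The data is a csv file read in with pandas.

I created a list of columns for each category, so 20 lists in total: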

df.columns
health = [col for col in df.columns if 'HEALTH' in col]
job = [col for col in df.columns if 'JOB' in col]

I then created 20 new columns in the dataframe, using the code below to take the max value across the columns in the relevant list.

df['HEALTH'] = df[health].max(axis=1)
df['JOB'] = df[job].max(axis=1)

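The final step was to drop the 100 original columns, leaving only the 20 new single columns with each candidate's answers stacked.

This was done with the code below, using a list of the original opinion columns (op_cols):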

df.drop(op_cols, axis=1, inplace=True)
df.info()
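
The three steps above (list the columns, take the row-wise max, drop the originals) can be condensed into a single loop over the category names. This is only a minimal sketch: the sample data is made up, and it assumes every opinion column is named `<CATEGORY>_<ANSWER>` (for example `HEALTH_SD`), which may not match the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two categories, five answer columns each.
# The <CATEGORY>_<ANSWER> naming scheme is an assumption.
df = pd.DataFrame({
    'RESPID': [1, 2, 3],
    'HEALTH_SD': [1, np.nan, np.nan],
    'HEALTH_D': [np.nan, np.nan, np.nan],
    'HEALTH_N': [np.nan, 3, np.nan],
    'HEALTH_A': [np.nan, np.nan, np.nan],
    'HEALTH_SA': [np.nan, np.nan, 5],
    'JOB_SD': [np.nan, np.nan, np.nan],
    'JOB_D': [2, np.nan, np.nan],
    'JOB_N': [np.nan, np.nan, np.nan],
    'JOB_A': [np.nan, 4, 4],
    'JOB_SA': [np.nan, np.nan, np.nan],
})

categories = ['HEALTH', 'JOB']  # extend to all 20 category names

op_cols = []  # collects the original opinion columns to drop at the end
for cat in categories:
    # Match on the prefix plus underscore so that, e.g., 'JOB' cannot
    # accidentally pick up an unrelated column containing 'JOB'.
    cols = [c for c in df.columns if c.startswith(cat + '_')]
    # Each row has at most one non-NaN answer per category,
    # so the row-wise max is simply that answer.
    df[cat] = df[cols].max(axis=1)
    op_cols.extend(cols)

df = df.drop(op_cols, axis=1)
print(df)
```

Matching on the prefix with `startswith` rather than `in` is a small safety measure in case one category name happens to be a substring of another column name.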

I am teaching myself Python 2.7, so any suggestions on how to make these steps more efficient would be much appreciated.

Answer 1

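Consider using pandas' reshaping function, wide_to_long(). You will need to add a numeric column, shown here as KEY. Afterwards, rename the final columns (to drop the trailing underscore) and sort by category as needed: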

import pandas as pd
import numpy as np

df = pd.DataFrame({'RESPID': [1,1,1,1,1],
                   'HEALTH_SD': [1,np.nan, np.nan, np.nan, np.nan],
                   'HEALTH_D': [np.nan, 2, np.nan, np.nan, np.nan],
                   'HEALTH_N': [np.nan, np.nan, 3, np.nan, np.nan],
                   'HEALTH_A': [np.nan, np.nan, np.nan, 4, np.nan],
                   'HEALTH_SA': [np.nan, np.nan, np.nan, np.nan, 5],
                   'JOB_SD': [1, np.nan, np.nan, np.nan, np.nan],
                   'JOB_D': [np.nan, 3, np.nan, np.nan, np.nan],
                   'JOB_N': [np.nan, np.nan, 2, np.nan, np.nan],
                   'JOB_A': [np.nan, np.nan, np.nan, 5, np.nan],
                   'JOB_SA': [np.nan, np.nan, np.nan, np.nan, 4]})
print(df[['RESPID', 'HEALTH_SD', 'HEALTH_D', 'HEALTH_N', 'HEALTH_A', 'HEALTH_SA',
          'JOB_SD', 'JOB_D', 'JOB_N', 'JOB_A', 'JOB_SA']])
#   RESPID  HEALTH_SD  HEALTH_D  HEALTH_N  HEALTH_A  HEALTH_SA  JOB_SD  JOB_D  JOB_N  JOB_A  JOB_SA
#0       1          1       NaN       NaN       NaN        NaN       1    NaN    NaN    NaN     NaN
#1       1        NaN         2       NaN       NaN        NaN     NaN      3    NaN    NaN     NaN
#2       1        NaN       NaN         3       NaN        NaN     NaN    NaN      2    NaN     NaN
#3       1        NaN       NaN       NaN         4        NaN     NaN    NaN    NaN      5     NaN
#4       1        NaN       NaN       NaN       NaN          5     NaN    NaN    NaN    NaN       4

df['KEY'] = 1
rdf = pd.wide_to_long(df, ['HEALTH_', 'JOB_'], i='RESPID', j='CATEG').dropna().reset_index()
print(rdf)

#   RESPID CATEG  KEY  HEALTH_  JOB_
#0       1     A    1        4     5
#1       1     D    1        2     3
#2       1     N    1        3     2
#3       1    SA    1        5     4
#4       1    SD    1        1     1
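
For reference, here is a sketch of how the same reshape might look on a more recent pandas release, where wide_to_long() accepts the `sep` and `suffix` arguments and expects the `i` column to identify each row uniquely. The sample data, the `ANSWER` label, and the assumption of one row per respondent are all illustrative and not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per respondent, so RESPID is unique.
df = pd.DataFrame({
    'RESPID': [1, 2, 3],
    'HEALTH_SD': [1, np.nan, np.nan],
    'HEALTH_N': [np.nan, 3, np.nan],
    'HEALTH_SA': [np.nan, np.nan, 5],
    'JOB_D': [2, np.nan, np.nan],
    'JOB_A': [np.nan, 4, 4],
})

# Reshape to long form: one row per (respondent, answer label).
# suffix=r'\w+' lets the non-numeric labels (SD, D, N, A, SA) match.
long_df = pd.wide_to_long(df, stubnames=['HEALTH', 'JOB'],
                          i='RESPID', j='ANSWER',
                          sep='_', suffix=r'\w+')

# Collapse back to one row per respondent. max() skips NaN, so each
# category keeps the single answer the respondent actually gave.
result = long_df.groupby(level='RESPID').max()
print(result)
```

Aggregating with groupby(...).max() instead of dropna() is a deliberate choice here: a respondent may answer HEALTH under one label and JOB under another, in which case no long-form row is complete and dropna() would discard all of them.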
