大熊猫长到没有多维表

时间:2017-12-10 15:19:28

标签: pandas

我有一张看起来像这样的桌子。完整的示例代码和源数据位于底部,包括提供的两个答案。

           id   technology     question  response
0        subj1  technology1       Q1         3  
1        subj1  technology2       Q1         4
...
10       subj1  technology3      Q3          6
11       subj1  technology4       Q3         2
12       subj1  technology4       Q4         7
13       subj1  technology3       Q4         5
14       subj1  technology1       Q4         5
15       subj1  technology2       Q4         9
16       subj2  technology2       Q1         1
17       subj2  technology1       Q1         4
...
29       subj2  technology3       Q4         0

我想要的是一张表格,其中包含“问题”的不同值。列成为自己的列,并且问题列单元格中的值是给定主题和技术的给定问题的响应值,如下所示(仅举例说明):

         id      technology      Q1 Q2  Q3  Q4
0        subj1  technology1      3  3   2   1
1        subj1  technology2      4  4   3   1
...
10       subj1  technology3      6  3   7   2
...
16       subj2  technology2      4  5   7   3

如果我像这样转动表格,我可以接近这一点(根据目前的建议,请注意以下完整代码中的改进版本):

source_data_df_pvt1 = pd.pivot_table(source_data_df, index = ['id'],  
                                    columns = ['technology', 'question'], 
                                    values = 'response', aggfunc='first')

这给了我这个多维数据框:

technology technology1                   technology2                      technology3                technology4                technology5
question            Q1   Q2    Q3     Q4          Q1    Q2   Q3    Q4            Q1  Q2   Q3    Q4          Q1   Q2   Q3   Q4          Q1
id                                                                                                                                     
subj1              3.0   9.0   7.0   5.0         4.0   5.0  3.0   9.0           3.0  8.0  6.0  5.0         5.0  8.0  2.0  7.0         NaN
subj2              4.0   9.0   8.0   7.0         1.0   5.0  8.0  20.0        20.0  9.0  4.0  0.0         3.0  0.0  8.0  6.0         NaN
subj3             14.0   NaN  10.0   0.0         NaN   NaN  NaN   NaN         NaN  NaN  NaN  NaN         NaN  NaN  NaN  NaN        15.0
subj4             13.0   4.0   5.0  11.0        17.0   NaN  NaN   NaN         0.0  NaN  NaN  NaN         0.0  NaN  NaN  NaN         NaN
subj5              3.0  20.0   4.0   8.0         2.0  20.0  3.0   2.0         3.0  5.0  7.0  5.0         4.0  2.0  7.0  5.0         NaN
subj6              2.0   8.0   1.0   6.0         0.0   7.0  4.0   1.0        20.0  6.0  1.0  0.0         6.0  8.0  7.0  3.0         NaN

我不希望我的数据帧是多维的,我只想要一个维度表。

Pandas可以吗?

完整的示例代码和输出,包括工作解决方案:

import pandas as pd
import numpy as np

pd.set_option('display.width', 1000)

#See https://gist.github.com/NathanDotTo/e506c1946c23234d2c24a2bd27e570a0#file-technology_skills-csv
sample_data = "../test_data/technology_skills.csv"

#The sample data has these columns:
column_names = ["id", "technology", "question", "response"]

source_data_df = pd.read_csv(sample_data, names = column_names, header = None)

#Change response to be numeric
pd.to_numeric(source_data_df.response)

#Strip white spaces from questions
source_data_df['question'] = source_data_df['question'].str.strip()

#Pivot the table to create columns from response values for each question     

source_data_df_pvt1 = pd.pivot_table(source_data_df, index = ['id','technology'],  
                                columns = [ 'question'], 
                                values = 'response', 
                                aggfunc='first',
                                fill_value=np.nan).reset_index()

print('source_data_df_pvt1 *******************')
print (source_data_df_pvt1)

source_data_df_pvt2 = source_data_df.pivot_table(index=['id','technology'],
                                 columns='question',
                                 values='response',
                                 aggfunc='sum',
                                 fill_value=np.nan).reset_index()

print('source_data_df_pvt2 *******************')
print (source_data_df_pvt2)

结果如下:

source_data_df_pvt1 *******************
question     id   technology  Q1    Q2    Q3    Q4
0         subj1  technology1   3   9.0   7.0   5.0
1         subj1  technology2   4   5.0   3.0   9.0
2         subj1  technology3   3   8.0   6.0   5.0
3         subj1  technology4   5   8.0   2.0   7.0
4         subj2  technology1   4   9.0   8.0   7.0
5         subj2  technology2   1   5.0   8.0  20.0
6         subj2  technology3  20   9.0   4.0   0.0
7         subj2  technology4   3   0.0   8.0   6.0
8         subj3  technology1  14   NaN  10.0   0.0
9         subj3  technology5  15   NaN   NaN   NaN 
10        subj4  technology1  13   4.0   5.0  11.0
11        subj4  technology2  17   NaN   NaN   NaN
12        subj4  technology3   0   NaN   NaN   NaN
13        subj4  technology4   0   NaN   NaN   NaN
14        subj5  technology1   3  20.0   4.0   8.0
15        subj5  technology2   2  20.0   3.0   2.0
16        subj5  technology3   3   5.0   7.0   5.0
17        subj5  technology4   4   2.0   7.0   5.0
18        subj6  technology1   2   8.0   1.0   6.0
19        subj6  technology2   0   7.0   4.0   1.0
20        subj6  technology3  20   6.0   1.0   0.0
21        subj6  technology4   6   8.0   7.0   3.0

source_data_df_pvt2 *******************
question     id   technology  Q1    Q2    Q3    Q4
0         subj1  technology1   3   9.0   7.0   5.0
1         subj1  technology2   4   5.0   3.0   9.0
2         subj1  technology3   3   8.0   6.0   5.0
3         subj1  technology4   5   8.0   2.0   7.0
4         subj2  technology1   4   9.0   8.0   7.0
5         subj2  technology2   1   5.0   8.0  20.0
6         subj2  technology3  20   9.0   4.0   0.0
7         subj2  technology4   3   0.0   8.0   6.0
8         subj3  technology1  14   NaN  10.0   0.0
9         subj3  technology5  15   NaN   NaN   NaN
10        subj4  technology1  13   4.0   5.0  11.0
11        subj4  technology2  17   NaN   NaN   NaN
12        subj4  technology3   0   NaN   NaN   NaN
13        subj4  technology4   0   NaN   NaN   NaN
14        subj5  technology1   3  20.0   4.0   8.0
15        subj5  technology2   2  20.0   3.0   2.0
16        subj5  technology3   3   5.0   7.0   5.0
17        subj5  technology4   4   2.0   7.0   5.0
18        subj6  technology1   2   8.0   1.0   6.0
19        subj6  technology2   0   7.0   4.0   1.0
20        subj6  technology3  20   6.0   1.0   0.0
21        subj6  technology4   6   8.0   7.0   3.0

作为奖励,就简化使用而言,这是我想要达到的目的。对于两种透视样式,它的工作方式相同。

for row in source_data_df_pvt1.itertuples():
    print(row)
    print(row.id)
    print(row.technology)
    print(row.Q1)
    print(row.Q2)
    print(row.Q3)
    print(row.Q4)

1 个答案:

答案 0 :(得分:0)

set_index使用unstack

df = (source_data_df.set_index(['id','technology','question'])['response']
                   .unstack(fill_value=0)
                   .reset_index())

但如果得到错误:

  

ValueError:索引包含重复的条目,无法重塑

这意味着三元组idtechnologyquestion中存在重复项,因此必须删除重复项或按first汇总:

source_data_df = source_data_df.drop_duplicates(['id','technology','question'])
df = (source_data_df.set_index(['id','technology','question'])['response']
                   .unstack(fill_value=0)
                   .reset_index())

与...相同:

df = pd.pivot_table(source_data_df, index = ['id','technology'],  
                                    columns = [ 'question'], 
                                    values = 'response', 
                                    aggfunc='first',
                                    fill_value=0).reset_index()
print (df)
question     id   technology  Q1  Q3  Q4
0         subj1  technology1   3   0   5
1         subj1  technology2   4   0   9
2         subj1  technology3   0   6   5
3         subj1  technology4   0   2   7
4         subj2  technology1   4   0   0
5         subj2  technology2   1   0   0
6         subj2  technology3   0   0   0