I have a table that looks like this. The complete example code and source data are at the bottom, including the two answers provided.
id technology question response
0 subj1 technology1 Q1 3
1 subj1 technology2 Q1 4
...
10 subj1 technology3 Q3 6
11 subj1 technology4 Q3 2
12 subj1 technology4 Q4 7
13 subj1 technology3 Q4 5
14 subj1 technology1 Q4 5
15 subj1 technology2 Q4 9
16 subj2 technology2 Q1 1
17 subj2 technology1 Q1 4
...
29 subj2 technology3 Q4 0
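What I want is a table in which each distinct value of the "question" column becomes a column of its own, and each cell in those question columns holds the response for that question for the given subject and technology, like this (illustrative values only):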
id technology Q1 Q2 Q3 Q4
0 subj1 technology1 3 3 2 1
1 subj1 technology2 4 4 3 1
...
10 subj1 technology3 6 3 7 2
...
16 subj2 technology2 4 5 7 3
I can get close to this if I pivot the table like this (based on the suggestions so far; note the improved version in the full code below):
source_data_df_pvt1 = pd.pivot_table(source_data_df, index = ['id'],
columns = ['technology', 'question'],
values = 'response', aggfunc='first')
This gives me a DataFrame with multi-level columns:
technology technology1 technology2 technology3 technology4 technology5
question Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1
id
subj1 3.0 9.0 7.0 5.0 4.0 5.0 3.0 9.0 3.0 8.0 6.0 5.0 5.0 8.0 2.0 7.0 NaN
subj2 4.0 9.0 8.0 7.0 1.0 5.0 8.0 20.0 20.0 9.0 4.0 0.0 3.0 0.0 8.0 6.0 NaN
subj3 14.0 NaN 10.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 15.0
subj4 13.0 4.0 5.0 11.0 17.0 NaN NaN NaN 0.0 NaN NaN NaN 0.0 NaN NaN NaN NaN
subj5 3.0 20.0 4.0 8.0 2.0 20.0 3.0 2.0 3.0 5.0 7.0 5.0 4.0 2.0 7.0 5.0 NaN
subj6 2.0 8.0 1.0 6.0 0.0 7.0 4.0 1.0 20.0 6.0 1.0 0.0 6.0 8.0 7.0 3.0 NaN
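I don't want my DataFrame to have multi-level columns; I just want a flat table with a single level of columns.
Is this possible with pandas?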
Complete example code and output, including the working solutions:
import pandas as pd
import numpy as np
pd.set_option('display.width', 1000)
#See https://gist.github.com/NathanDotTo/e506c1946c23234d2c24a2bd27e570a0#file-technology_skills-csv
sample_data = "../test_data/technology_skills.csv"
#The sample data has these columns:
column_names = ["id", "technology", "question", "response"]
source_data_df = pd.read_csv(sample_data, names = column_names, header = None)
#Change response to be numeric (assign the result; to_numeric does not modify in place)
source_data_df['response'] = pd.to_numeric(source_data_df['response'])
#Strip white spaces from questions
source_data_df['question'] = source_data_df['question'].str.strip()
#Pivot the table to create columns from response values for each question
source_data_df_pvt1 = pd.pivot_table(source_data_df, index = ['id','technology'],
columns = [ 'question'],
values = 'response',
aggfunc='first',
fill_value=np.nan).reset_index()
print('source_data_df_pvt1 *******************')
print (source_data_df_pvt1)
source_data_df_pvt2 = source_data_df.pivot_table(index=['id','technology'],
columns='question',
values='response',
aggfunc='sum',
fill_value=np.nan).reset_index()
print('source_data_df_pvt2 *******************')
print (source_data_df_pvt2)
The results are as follows:
source_data_df_pvt1 *******************
question id technology Q1 Q2 Q3 Q4
0 subj1 technology1 3 9.0 7.0 5.0
1 subj1 technology2 4 5.0 3.0 9.0
2 subj1 technology3 3 8.0 6.0 5.0
3 subj1 technology4 5 8.0 2.0 7.0
4 subj2 technology1 4 9.0 8.0 7.0
5 subj2 technology2 1 5.0 8.0 20.0
6 subj2 technology3 20 9.0 4.0 0.0
7 subj2 technology4 3 0.0 8.0 6.0
8 subj3 technology1 14 NaN 10.0 0.0
9 subj3 technology5 15 NaN NaN NaN
10 subj4 technology1 13 4.0 5.0 11.0
11 subj4 technology2 17 NaN NaN NaN
12 subj4 technology3 0 NaN NaN NaN
13 subj4 technology4 0 NaN NaN NaN
14 subj5 technology1 3 20.0 4.0 8.0
15 subj5 technology2 2 20.0 3.0 2.0
16 subj5 technology3 3 5.0 7.0 5.0
17 subj5 technology4 4 2.0 7.0 5.0
18 subj6 technology1 2 8.0 1.0 6.0
19 subj6 technology2 0 7.0 4.0 1.0
20 subj6 technology3 20 6.0 1.0 0.0
21 subj6 technology4 6 8.0 7.0 3.0
source_data_df_pvt2 *******************
question id technology Q1 Q2 Q3 Q4
0 subj1 technology1 3 9.0 7.0 5.0
1 subj1 technology2 4 5.0 3.0 9.0
2 subj1 technology3 3 8.0 6.0 5.0
3 subj1 technology4 5 8.0 2.0 7.0
4 subj2 technology1 4 9.0 8.0 7.0
5 subj2 technology2 1 5.0 8.0 20.0
6 subj2 technology3 20 9.0 4.0 0.0
7 subj2 technology4 3 0.0 8.0 6.0
8 subj3 technology1 14 NaN 10.0 0.0
9 subj3 technology5 15 NaN NaN NaN
10 subj4 technology1 13 4.0 5.0 11.0
11 subj4 technology2 17 NaN NaN NaN
12 subj4 technology3 0 NaN NaN NaN
13 subj4 technology4 0 NaN NaN NaN
14 subj5 technology1 3 20.0 4.0 8.0
15 subj5 technology2 2 20.0 3.0 2.0
16 subj5 technology3 3 5.0 7.0 5.0
17 subj5 technology4 4 2.0 7.0 5.0
18 subj6 technology1 2 8.0 1.0 6.0
19 subj6 technology2 0 7.0 4.0 1.0
20 subj6 technology3 20 6.0 1.0 0.0
21 subj6 technology4 6 8.0 7.0 3.0
As a bonus, this is the simplified usage I was aiming for. It works the same way for both pivot styles.
for row in source_data_df_pvt1.itertuples():
print(row)
print(row.id)
print(row.technology)
print(row.Q1)
print(row.Q2)
print(row.Q3)
print(row.Q4)
Answer 0 (score: 0)
df = (source_data_df.set_index(['id','technology','question'])['response']
.unstack(fill_value=0)
.reset_index())
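However, if you get the error:
ValueError: Index contains duplicate entries, cannot reshape
it means that the id, technology, question triples contain duplicates, so you must either remove the duplicates or aggregate with first: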
source_data_df = source_data_df.drop_duplicates(['id','technology','question'])
df = (source_data_df.set_index(['id','technology','question'])['response']
.unstack(fill_value=0)
.reset_index())
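To make the duplicate problem concrete, here is a minimal sketch on a small made-up frame (toy_df and its values are hypothetical, not taken from the CSV in the question). It shows unstack failing when a triple appears twice, and the reshape working once only the first row per triple is kept:

```python
import pandas as pd

# Hypothetical toy data: the (subj1, technology1, Q1) triple appears twice.
toy_df = pd.DataFrame({
    "id": ["subj1", "subj1", "subj1", "subj2"],
    "technology": ["technology1", "technology1", "technology1", "technology1"],
    "question": ["Q1", "Q1", "Q2", "Q1"],
    "response": [3, 8, 9, 4],
})

keys = ["id", "technology", "question"]

# unstack cannot put two values into one cell, so it raises ValueError.
try:
    toy_df.set_index(keys)["response"].unstack(fill_value=0)
    unstack_failed = False
except ValueError as err:
    unstack_failed = True
    print("unstack failed:", err)

# Keep only the first row of each triple, then reshape to wide form.
deduped = toy_df.drop_duplicates(keys)
wide = (deduped.set_index(keys)["response"]
        .unstack(fill_value=0)
        .reset_index())
print(wide)
```

Note that drop_duplicates keeps the first occurrence by default, so the second response for the repeated triple is silently discarded. If that loses information you care about, an explicit aggregation may be the safer choice.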
Dropping the duplicates and unstacking gives the same result as pivot_table with aggfunc='first':
df = pd.pivot_table(source_data_df, index = ['id','technology'],
columns = [ 'question'],
values = 'response',
aggfunc='first',
fill_value=0).reset_index()
print (df)
question id technology Q1 Q3 Q4
0 subj1 technology1 3 0 5
1 subj1 technology2 4 0 9
2 subj1 technology3 0 6 5
3 subj1 technology4 0 2 7
4 subj2 technology1 4 0 0
5 subj2 technology2 1 0 0
6 subj2 technology3 0 0 0
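The question label at the left of the header row is only the name of the columns axis, not an extra column level. As a rough sketch, again on a hypothetical toy frame rather than the real CSV, the following removes that label with rename_axis and also shows where aggfunc='first' and aggfunc='sum' stop agreeing once a triple is duplicated:

```python
import pandas as pd

# Hypothetical toy data with one duplicated (id, technology, question) triple.
toy_df = pd.DataFrame({
    "id": ["subj1", "subj1", "subj1", "subj2"],
    "technology": ["technology1", "technology1", "technology1", "technology1"],
    "question": ["Q1", "Q1", "Q2", "Q1"],
    "response": [3, 8, 9, 4],
})

def pivot_with(aggfunc):
    # Same call shape as in the answer, parameterised on the aggregation.
    return pd.pivot_table(toy_df, index=["id", "technology"],
                          columns="question", values="response",
                          aggfunc=aggfunc, fill_value=0).reset_index()

by_first = pivot_with("first")  # keeps 3 for the duplicated triple
by_sum = pivot_with("sum")      # adds 3 + 8 for the duplicated triple

# Drop the "question" name from the columns axis to get a plain header.
flat = by_first.rename_axis(None, axis=1)
print(flat)

# Plain column names allow attribute access in itertuples.
for row in flat.itertuples():
    print(row.id, row.technology, row.Q1, row.Q2)
```

With no duplicated triples the two aggregations return the same values, which is consistent with the identical outputs shown in the question. Also keep in mind that fill_value=0 makes a missing response indistinguishable from a real response of 0; leaving the gaps as NaN preserves that distinction.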