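I have an unbalanced data frame, and I am trying to balance it before unstacking. The key point is that the number of rows where df.Question == "Q007_C02", i.e. (df.Question == "Q007_C02").sum(), is the number of rows the new data should have. So if any level of df.Question has more rows than df.Question == "Q007_C02", I keep only its first rows, up to that number; if a level of df.Question has fewer rows than df.Question == "Q007_C02", I need to repeat its values. After that I want to unstack or transpose the data.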
df = pd.DataFrame({"Question":["Q007_A00","Q007_B00","Q007_C01","Q007_C01","Q007_C01","Q007_C01","Q007_C01","Q007_C01","Q007_C01","Q007_C02","Q007_C02","Q007_C02","Q007_C02","Q007_C02"],
"Key": ["Y","N",1,4,5,2,8,9,3,"Text 1","Text 2","Text 3","Text 4","Text 5"]})
df
Key Question
0 Y Q007_A00
1 N Q007_B00
2 1 Q007_C01
3 4 Q007_C01
4 5 Q007_C01
5 2 Q007_C01
6 8 Q007_C01
7 9 Q007_C01
8 3 Q007_C01
9 Text 1 Q007_C02
10 Text 2 Q007_C02
11 Text 3 Q007_C02
12 Text 4 Q007_C02
13 Text 5 Q007_C02
As you can see, there are 5 rows where df.Question == "Q007_C02", so 5 is used as the default number of rows. This is the output I want:
Q007_A00 Q007_B00 Q007_C01 Q007_C02
0 Y N 1 Text 1
1 Y N 4 Text 2
2 Y N 5 Text 3
3 Y N 2 Text 4
4 Y N 8 Text 5
Answer 0 (score: 1)
Here is a solution that works for your sample data.
import pandas as pd
df = pd.DataFrame({"Question":["Q007_A00","Q007_B00","Q007_C01","Q007_C01","Q007_C01","Q007_C01","Q007_C01","Q007_C01","Q007_C01","Q007_C02","Q007_C02","Q007_C02","Q007_C02","Q007_C02"],
"Key": ["Y","N",1,4,5,2,8,9,3,"Text 1","Text 2","Text 3","Text 4","Text 5"]})
#create a new index column based on which row each item should occupy in the balanced table
df = df.sort_values('Question', kind='mergesort') #the dataframe must be sorted for this to work; a stable sort keeps the order within each Question
new_index = []
for c in df.groupby('Question')['Key'].count():
    new_index.extend(range(c)) #number the rows 0..c-1 within each Question
# for the example code, new_index is this list [0, 0, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4]
balanced = df.set_index([new_index, 'Question']) #set the dataframe index to have two levels, index and Question
balanced = balanced.unstack() #unstack on the last index level, which is Question
balanced.columns = balanced.columns.droplevel(0) #the column index is a MultiIndex of (Key, Question), remove the top level
balanced = balanced.dropna(subset=['Q007_C02']) #limits the dataframe to the number of rows in column Q007_C02
balanced = balanced.ffill() #fill missing values based on the last valid value (fillna(method='ffill') is deprecated in recent pandas)
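The key to using unstack() is to create an index that holds, for each entry, the row it should occupy in the balanced data frame. The for loop builds this new index from the count() of Key entries in each Question group. Once you have this index, the rest is just manipulating the data frame to get the desired structure.
I feel there may be a better way to get the index, but I can't think of one right now.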
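One candidate for building that per-group row index without the loop is groupby(...).cumcount(), which numbers the rows inside each group starting from 0. The sketch below is not part of the original answer. It assumes a reasonably recent pandas, and the names n_rows and row are made up for illustration. It also swaps set_index/unstack for pivot and uses reindex to trim to the Q007_C02 row count.

```python
import pandas as pd

df = pd.DataFrame({
    "Question": ["Q007_A00", "Q007_B00"] + ["Q007_C01"] * 7 + ["Q007_C02"] * 5,
    "Key": ["Y", "N", 1, 4, 5, 2, 8, 9, 3,
            "Text 1", "Text 2", "Text 3", "Text 4", "Text 5"],
})

# Target number of rows: how many rows the reference question has.
# (n_rows is a hypothetical helper name, not from the original answer.)
n_rows = int((df["Question"] == "Q007_C02").sum())

# cumcount() numbers the rows within each Question group 0, 1, 2, ...
# in their original order, so no sorting or explicit loop is needed.
row = df.groupby("Question").cumcount()

balanced = (
    df.assign(row=row)  # "row" is a hypothetical column name
      .pivot(index="row", columns="Question", values="Key")
      .reindex(range(n_rows))  # drop surplus rows; add missing ones as NaN
      .ffill()                 # repeat the last valid value in short columns
)

# Remove the axis labels left over from pivot, purely for display.
balanced.index.name = None
balanced.columns.name = None
print(balanced)
```

Because reindex is used instead of dropna, this version should also pad the table if some other question has fewer rows than Q007_C02. It has only been checked against the sample data here.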