Question

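I have an unbalanced DataFrame, and I am trying to balance it before unstacking. The key point is that the number of rows where df.Question == "Q007_C02" is the number of rows in the new data. If any level of df.Question has more rows than that, I keep only its first rows up to that count; if a level has fewer rows, I need to repeat its values. After that I want to unstack or transpose the data.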

df = pd.DataFrame({"Question":["Q007_A00","Q007_B00","Q007_C01","Q007_C01","Q007_C01","Q007_C01","Q007_C01","Q007_C01","Q007_C01","Q007_C02","Q007_C02","Q007_C02","Q007_C02","Q007_C02"],
               "Key": ["Y","N",1,4,5,2,8,9,3,"Text 1","Text 2","Text 3","Text 4","Text 5"]})
df

    Key     Question
0   Y       Q007_A00
1   N       Q007_B00
2   1       Q007_C01
3   4       Q007_C01
4   5       Q007_C01
5   2       Q007_C01
6   8       Q007_C01
7   9       Q007_C01
8   3       Q007_C01
9   Text 1  Q007_C02
10  Text 2  Q007_C02
11  Text 3  Q007_C02
12  Text 4  Q007_C02
13  Text 5  Q007_C02

You can see that there are 5 rows where df.Question == "Q007_C02", so 5 is the number of rows to use. This is the output I want:

  Q007_A00  Q007_B00    Q007_C01    Q007_C02
0   Y          N            1        Text 1
1   Y          N            4        Text 2
2   Y          N            5        Text 3
3   Y          N            2        Text 4
4   Y          N            8        Text 5
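A note on the row count: `len(df.Question == "Q007_C02")` measures the length of the boolean mask, which equals the total number of rows in the frame, not the number of matches. Below is a minimal sketch of counting the matching rows, using the sample data from this question; the names `mask`, `total_rows` and `n_rows` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Question": ["Q007_A00", "Q007_B00"] + ["Q007_C01"] * 7 + ["Q007_C02"] * 5,
    "Key": ["Y", "N", 1, 4, 5, 2, 8, 9, 3,
            "Text 1", "Text 2", "Text 3", "Text 4", "Text 5"],
})

mask = df.Question == "Q007_C02"   # boolean Series, one entry per row of df

total_rows = len(mask)             # length of the mask, i.e. every row in df
n_rows = int(mask.sum())           # True counts as 1, so this counts the matches
```

Summing the mask is one common way to count matches; `df.Question.value_counts()["Q007_C02"]` should give the same number.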

Answer 1

Here is a solution that works for your sample data.

import pandas as pd

df = pd.DataFrame({"Question":["Q007_A00","Q007_B00","Q007_C01","Q007_C01","Q007_C01","Q007_C01","Q007_C01","Q007_C01","Q007_C01","Q007_C02","Q007_C02","Q007_C02","Q007_C02","Q007_C02"],
               "Key": ["Y","N",1,4,5,2,8,9,3,"Text 1","Text 2","Text 3","Text 4","Text 5"]})

#create a new index giving the row each item should occupy in the balanced table
df = df.sort_values('Question', kind='stable')  #the dataframe must be sorted for this to work; a stable sort keeps the original order within each Question
new_index = []
for c in df.groupby('Question')['Key'].count():
    new_index.extend(range(c))
# for the example code, new_index is this list [0, 0, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4]

balanced = df.set_index([new_index, 'Question']) #set the dataframe index to have two levels, index and Question
balanced = balanced.unstack()                    #unstack on the last index level, which is Question
balanced.columns = balanced.columns.droplevel(0) #the column index is a MultiIndex of (Key, Question), remove the top level
balanced = balanced.dropna(subset=['Q007_C02'])  #limits the dataframe to the number of rows in column Q007_C02
balanced = balanced.ffill()                      #fill missing values with the last valid value (fillna(method='ffill') is deprecated in recent pandas)

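The key to using unstack() is to create an index that holds, for each entry, the row it should occupy in the balanced DataFrame. The for loop builds this new index from the count() of Key values in each Question group. Once you have this index, the rest is just manipulating the DataFrame to get the desired structure.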
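To illustrate the role of that index, here is a small sketch on a made-up three-row frame (the names `toy`, `g`, `v` and `row` are hypothetical and not part of the question's data). Each (row, group) pair in the index becomes one cell of the wide table, and pairs that do not exist show up as NaN.

```python
import pandas as pd

# Group "a" has one value, group "b" has two.
toy = pd.DataFrame({
    "g": ["a", "b", "b"],
    "v": ["Y", 1, 4],
    "row": [0, 0, 1],   # position of each value inside its own group
})

# The last index level ("g") becomes the columns; "row" stays as the index.
wide = toy.set_index(["row", "g"])["v"].unstack()

# Group "a" has no entry for row 1, so that cell is missing.
missing_cell = pd.isna(wide.loc[1, "a"])
```

This is why the solution above drops the rows beyond the length of Q007_C02 and forward-fills the rest: the gaps left by unstacking are exactly the cells that need trimming or repeating.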

I feel there may be a better way to get the index, but I can't think of one right now.
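One possible alternative for building the index, offered as a sketch rather than as part of the original answer: `groupby(...).cumcount()` numbers the rows inside each group without sorting or an explicit loop. The variable names `target`, `n_rows` and `position`, the helper column `row`, and the use of `pivot` plus `reindex` in place of `unstack` plus `dropna` are my own choices here.

```python
import pandas as pd

df = pd.DataFrame({
    "Question": ["Q007_A00", "Q007_B00"] + ["Q007_C01"] * 7 + ["Q007_C02"] * 5,
    "Key": ["Y", "N", 1, 4, 5, 2, 8, 9, 3,
            "Text 1", "Text 2", "Text 3", "Text 4", "Text 5"],
})

target = "Q007_C02"
n_rows = int((df["Question"] == target).sum())   # rows wanted in the result

# cumcount numbers rows within each Question as 0, 1, 2, ... in their
# existing order, which replaces the sort and the for loop.
position = df.groupby("Question").cumcount()

balanced = (
    df.assign(row=position)
      .pivot(index="row", columns="Question", values="Key")
      .reindex(range(n_rows))   # trim longer groups, add empty rows if needed
      .ffill()                  # repeat the last valid value in shorter groups
)
```

Because `reindex(range(n_rows))` fixes the row labels to 0 through n_rows - 1, longer groups are cut to their first n_rows values and shorter ones are padded before the forward fill, which matches the behaviour asked for in the question on this sample.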

How to duplicate or remove rows based on a condition in Python
