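This refers to a question that was answered earlier for SAS: "SAS - transpose multiple variables in rows to columns".
What is new here is that the number of values per group is no longer fixed at two but varies. Here is an example: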
   acct     la   ln  seq1  seq2
0  9999  20.01  100     1    10
1  9999  19.05    1     1    10
2  9999  30.00    1     1    10
3  9999  26.77  100     2    11
4  9999  24.96    1     2    11
5  8888  38.43  218     3    20
6  8888  37.53    1     3    20
The output I want is:
   acct     la   ln  seq1  seq2    la0    la1  la2  la3  ln0  ln1  ln2
5  8888  38.43  218     3    20  38.43  37.53  NaN  NaN  218    1  NaN
0  9999  20.01  100     1    10  20.01  19.05   30  NaN  100    1    1
3  9999  26.77  100     2    11  26.77  24.96  NaN  NaN  100    1  NaN
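In SAS I could use a fairly simple proc summary for this, but I want to do it in Python because I can no longer use SAS.
I have worked out a solution that I can reuse for my problem, but I wonder whether pandas has an easier option that I have overlooked. Here is my solution. It would be interesting to see a faster way if someone has one!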
# write multiple row to col based on groupby
import pandas as pd
from pandas import DataFrame
import numpy as np
data = DataFrame({
    "acct": [9999, 9999, 9999, 9999, 9999, 8888, 8888],
    "seq1": [1, 1, 1, 2, 2, 3, 3],
    "seq2": [10, 10, 10, 11, 11, 20, 20],
    "la": [20.01, 19.05, 30, 26.77, 24.96, 38.43, 37.53],
    "ln": [100, 1, 1, 100, 1, 218, 1]
})
# group the variables by some classes
grouped = data.groupby(["acct", "seq1", "seq2"])
def rows_to_col(column, size):
    # collect the group keys (head) and the per-group frames (contain)
    head = []
    contain = []
    for key, group in grouped:
        head.append(key)
        contain.append(group.copy())  # copy so that new columns do not touch `data`
    # list the values of `column` for each group
    contain_transpose = []
    for i in range(len(contain)):
        contain_transpose.append(contain[i][column].tolist())
    # determine the length of the longest sublist
    length = len(max(contain_transpose, key=len))
    # pad every shorter sublist with NaN up to the longest length
    for i in range(len(contain_transpose)):
        missing = length - len(contain_transpose[i])
        contain_transpose[i].extend([np.nan] * missing)
    # create the columns for the transposed values
    for i in range(len(contain)):
        for j in range(size):
            contain[i][column + str(j)] = np.nan
    # assign the transposed values to the columns
    for i in range(len(contain)):
        for j in range(length):
            contain[i][column + str(j)] = contain_transpose[i][j]
    # keep only the first row of each group
    concat_list = []
    for i in range(len(contain)):
        concat_list.append(contain[i][:1])
    return pd.concat(concat_list)  # concatenate the one-row frames
# fill in column name and expected size of the column
data_la = rows_to_col("la", 4)
data_ln = rows_to_col("ln", 3)
# merge the two data frames together
cols_use = data_ln.columns.difference(data_la.columns)
data_final = pd.merge(data_la, data_ln[cols_use], left_index=True, right_index=True, how="outer")
data_final.drop(["la", "ln"], axis=1)  # returns a new frame; data_final itself keeps la and ln
Answer 0 (score: 1)
Note that:
In [58]:
print(grouped.la.apply(lambda x: pd.Series(data=x.values)).unstack())
                     0      1    2
acct seq1 seq2
8888 3    20     38.43  37.53  NaN
9999 1    10     20.01  19.05   30
     2    11     26.77  24.96  NaN
and
In [59]:
print(grouped.ln.apply(lambda x: pd.Series(data=x.values)).unstack())
                   0  1    2
acct seq1 seq2
8888 3    20     218  1  NaN
9999 1    10     100  1    1
     2    11     100  1  NaN
Therefore:
In [60]:
df2 = pd.concat((grouped.la.apply(lambda x: pd.Series(data=x.values)).unstack(),
                 grouped.ln.apply(lambda x: pd.Series(data=x.values)).unstack()),
                keys=['la', 'ln'], axis=1)
print(df2)
                    la                ln
                     0      1    2     0  1    2
acct seq1 seq2
8888 3    20     38.43  37.53  NaN   218  1  NaN
9999 1    10     20.01  19.05   30   100  1    1
     2    11     26.77  24.96  NaN   100  1  NaN
The only catch is that the column index is a MultiIndex. If we don't want that, we can flatten the names to la0, la1, ... with:
df2.columns = [name + str(i) for name, i in df2.columns]
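Putting the steps of this answer together on the question's sample data, a minimal end-to-end sketch might look like the following. The helper name `widen` and the final join back onto the first row of each group are my own assumptions for illustration, not something the answer prescribes; the join is only there to get closer to the layout asked for in the question.

```python
import pandas as pd

data = pd.DataFrame({
    "acct": [9999, 9999, 9999, 9999, 9999, 8888, 8888],
    "seq1": [1, 1, 1, 2, 2, 3, 3],
    "seq2": [10, 10, 10, 11, 11, 20, 20],
    "la": [20.01, 19.05, 30, 26.77, 24.96, 38.43, 37.53],
    "ln": [100, 1, 1, 100, 1, 218, 1],
})
keys = ["acct", "seq1", "seq2"]
grouped = data.groupby(keys)


def widen(column):
    # Hypothetical helper: one row per group, one column per position
    # within the group; shorter groups are padded with NaN by unstack().
    return grouped[column].apply(lambda x: pd.Series(x.values)).unstack()


# Side-by-side blocks for "la" and "ln" under a two-level column index.
df2 = pd.concat([widen("la"), widen("ln")], keys=["la", "ln"], axis=1)

# Flatten ("la", 0) -> "la0", ("ln", 2) -> "ln2", and so on.
df2.columns = [name + str(i) for name, i in df2.columns]

# Assumption: keep the first row of every group and attach the wide columns.
first_rows = data.drop_duplicates(subset=keys).set_index(keys)
result = first_rows.join(df2).reset_index()
```

Unlike the function in the question, this sketch only creates as many numbered columns as the longest group needs, so there is no `la3` column here.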
I don't know what you think, but I prefer the SAS PROC TRANSPOSE syntax for readability. In this particular case the pandas syntax is concise but less readable.
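If readability is the main concern, one alternative that may read closer to a transpose is to number the rows within each group and then unstack on that number. This is only a sketch of a different technique (`groupby(...).cumcount()` followed by `unstack`), not the method shown above; the column name `pos` is a made-up helper name.

```python
import pandas as pd

data = pd.DataFrame({
    "acct": [9999, 9999, 9999, 9999, 9999, 8888, 8888],
    "seq1": [1, 1, 1, 2, 2, 3, 3],
    "seq2": [10, 10, 10, 11, 11, 20, 20],
    "la": [20.01, 19.05, 30, 26.77, 24.96, 38.43, 37.53],
    "ln": [100, 1, 1, 100, 1, 218, 1],
})
keys = ["acct", "seq1", "seq2"]

# "pos" (hypothetical name) is the 0-based position of a row inside its group.
numbered = data.assign(pos=data.groupby(keys).cumcount())

# Move "pos" from the row index into the columns; missing positions become NaN.
wide = numbered.set_index(keys + ["pos"])[["la", "ln"]].unstack("pos")

# Flatten the two-level column index into la0, la1, ..., ln0, ln1, ...
wide.columns = [f"{name}{i}" for name, i in wide.columns]
```

Here `wide` is indexed by `acct`, `seq1` and `seq2`; calling `wide.reset_index()` would turn those back into ordinary columns.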