I have a large pandas DataFrame with several columns, but let's focus on two of them:
import pandas as pd
df = pd.DataFrame([['hey how are you', 'fine thanks',1],
['good to know', 'yes, and you',2],
['I am fine','ok',3],
['see you','bye!',4]],columns=list('ABC'))
df
Out:
                 A             B  C
0  hey how are you   fine thanks  1
1     good to know  yes, and you  2
2        I am fine            ok  3
3          see you          bye!  4
How can I collapse two specific columns of this DataFrame into a single column, carrying along the values of the other column? For example:
                 A  C
0  hey how are you  1
1      fine thanks  1
2     good to know  2
3     yes, and you  2
4        I am fine  3
5               ok  3
6          see you  4
7             bye!  4
I tried:
df = df['A'].stack()
df = df.groupby(level=0)
df
However, it does not work. Any ideas on how to get the new format?
Answer 0 (score: 1)
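You can flatten() (or reshape(-1, )) the DataFrame's values, which are stored as a numpy array. Select columns A and B first so that the values of C are not mixed in:
pd.DataFrame(df[['A', 'B']].values.flatten(), columns=['A'])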
A
0 hey how are you
1 fine thanks
2 good to know
3 yes, and you
4 I am fine
5 ok
6 see you
7 bye!
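The result above holds only the text, not column C. A minimal sketch of one way to carry C along, assuming the question's df: each C value is repeated once per flattened column with np.repeat. The names cols and out are only illustrative and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['hey how are you', 'fine thanks', 1],
                   ['good to know', 'yes, and you', 2],
                   ['I am fine', 'ok', 3],
                   ['see you', 'bye!', 4]], columns=list('ABC'))

cols = ['A', 'B']  # the columns to collapse into one (illustrative name)

out = pd.DataFrame({
    # Row-major flatten: A0, B0, A1, B1, ...
    'A': df[cols].to_numpy().flatten(),
    # Repeat each C value once per collapsed column so the rows stay aligned
    'C': np.repeat(df['C'].to_numpy(), len(cols)),
})
print(out)
```

This relies on the row-major order of flatten(), so each pair of consecutive rows comes from the same original row.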
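Note: the default behavior of np.ndarray.flatten and np.ndarray.reshape is what you want here: the column index of the original array varies faster than the row index. This is called row-major (C-style) order. To make the row index vary faster than the column index instead, pass order='F' for column-major (Fortran-style) order. Docs: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ndarray.flatten.html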
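To make the difference between the two orders concrete, here is a small sketch on a toy array. The array contents and variable names are made up for illustration.

```python
import numpy as np

# Toy 3x2 array: the letter marks the column, the digit marks the row
arr = np.array([['a1', 'b1'],
                ['a2', 'b2'],
                ['a3', 'b3']])

row_major = arr.flatten()           # default order='C': walk along each row first
col_major = arr.flatten(order='F')  # Fortran order: walk down each column first

print(row_major.tolist())  # ['a1', 'b1', 'a2', 'b2', 'a3', 'b3']
print(col_major.tolist())  # ['a1', 'a2', 'a3', 'b1', 'b2', 'b3']
```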
Answer 1 (score: 1)
This drops the column names, but gets the job done:
import pandas as pd
df = pd.DataFrame([['hey how are you', 'fine thanks'],
['good to know', 'yes, and you'],
['I am fine','ok'],
['see you','bye!']],columns=list('AB'))
df.stack().reset_index(drop=True)
0 hey how are you
1 fine thanks
2 good to know
3 yes, and you
4 I am fine
5 ok
6 see you
7 bye!
dtype: object
The default stack behavior keeps the column names as an inner index level:
df.stack()
0  A    hey how are you
   B        fine thanks
1  A       good to know
   B       yes, and you
2  A          I am fine
   B                 ok
3  A            see you
   B               bye!
dtype: object
If there are more columns, you can choose which ones to stack simply by selecting them first:
df[["A", "B"]].stack()
With extra columns things get trickier: you need to align the indexes by dropping one level (the one that holds the column names):
df["C"] = range(4)
stacked = df[["A", "B"]].stack()
stacked.index = stacked.index.droplevel(level=1)
stacked
0 hey how are you
0 fine thanks
1 good to know
1 yes, and you
2 I am fine
2 ok
3 see you
3 bye!
dtype: object
Now we can combine it with the C column:
pd.concat([stacked, df["C"]], axis=1)
                 0  C
0  hey how are you  0
0      fine thanks  0
1     good to know  1
1     yes, and you  1
2        I am fine  2
2               ok  2
3          see you  3
3             bye!  3
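One caveat: the stacked Series has duplicated index labels, and as far as I know recent pandas versions may refuse to pd.concat along axis=1 in that situation, raising an error about reindexing a non-unique index. A minimal sketch of an alternative that joins C on the index instead, using the question's df (where C is 1 to 4); the name result and the final reset_index are my additions, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame([['hey how are you', 'fine thanks', 1],
                   ['good to know', 'yes, and you', 2],
                   ['I am fine', 'ok', 3],
                   ['see you', 'bye!', 4]], columns=list('ABC'))

# Stack A and B, then drop the inner level that holds the column names
stacked = df[['A', 'B']].stack()
stacked.index = stacked.index.droplevel(level=1)

# Left-join C on the (duplicated) row labels, then renumber the rows 0..7
result = stacked.to_frame('A').join(df['C']).reset_index(drop=True)
print(result)
```

The join is many-to-one: each duplicated label on the left picks up the single matching C value on the right.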
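Answer 2 (score: -2)
What you probably need is pandas.concat.
It accepts "a sequence or mapping of Series, DataFrame, or Panel objects", so you can pass a list of the objects you get by selecting columns from your DataFrame (each one is a pd.Series when you index a single column):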
df3 = pd.concat([df['A'], df['B']])
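Note that concatenating this way puts all the A values before all the B values and keeps the original row labels, so it does not yet match the interleaved layout the question asks for. A minimal sketch of getting there, assuming the question's df: the stable sort on the index and the join with C are my additions, and the names interleaved and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([['hey how are you', 'fine thanks', 1],
                   ['good to know', 'yes, and you', 2],
                   ['I am fine', 'ok', 3],
                   ['see you', 'bye!', 4]], columns=list('ABC'))

df3 = pd.concat([df['A'], df['B']])  # index is 0,1,2,3,0,1,2,3

# A stable sort on the row labels interleaves the columns
# while keeping A before B within each original row
interleaved = df3.sort_index(kind='mergesort')

# Attach C by joining on the row labels, then renumber the rows
result = interleaved.to_frame('A').join(df['C']).reset_index(drop=True)
print(result)
```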