Reshaping a CSV with Pandas: joining two subsets of a DataFrame

Date: 2019-04-08 07:58:03

Tags: python pandas csv dataframe reshape

My .csv looks like this:

   Res          X      XB          XC           O       P
  A312      76.55     -           -           -       -  
  B313      175.4   62.28       32.62       8.189   121.2
  J314      176.5   53.34       40.77       8.277   124.6
  L315      177.9   55.29       41.44       8.427   125.5
  T316      174.7   59.47       63.43       8.264   116.1
  ...
  G378      10.2    58.91       40.13       7.646   126.7 

I would like to reshape it like this:

   312 A   X   76.55
   313 B   X   175.4
   313 B   XB  62.28
   313 B   XC  32.62
   ...
   378 G   O   7.646
   378 G   P   126.7

My code so far:

import pandas as pd

df1 = pd.read_csv("my_file.csv", delim_whitespace = True, index_col = False, na_values = "-")
df2 = pd.read_csv("my_file.csv", delim_whitespace = True, index_col = False, na_values = "-")

df1['Pos'] = df1['Res'].str[1:].astype(int)
df1['AA'] = df1['Res'].str[0]
df2.drop('Res', axis = 1, inplace = True)
a = df2.stack(level = -1)
b = df1[["Pos", "AA"]]
print(a)
print(b)

This produces:

Output of print(a):

0   X      76.550
1   X     175.400
    XB     62.280
    XC     32.620
    O       8.189
    P     121.200
...
62  X      10.200
    XB     58.910
    XC     40.130
    O       7.646
    P     126.700

Output of print(b):

0   312  A
1   313  B
2   314  J
3   315  L
...
62  378  G

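Any ideas on how to do the last step, i.e. joining the two DataFrames a and b, so that I end up with the format I want? I have tried several pandas functions, such as pd.merge, pd.join and pd.concat. None of them seemed to work...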

2 Answers:

Answer 0: (score: 1)

You want melt:

import pandas as pd

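# na_values="-" reads the "-" placeholders as NaN so that dropna() below can remove them
df = pd.read_csv("my_file.csv", delim_whitespace=True, index_col=False, na_values="-")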

df['Res'] = df['Res'].str[0]
reshaped = df.melt('Res', ['X', 'XB', 'XC', 'O', 'P'])
print(reshaped.dropna().sort_values('Res').reset_index(drop=True))

Output:

   Res variable  value
0    A        X  76.55
1    B        O  8.189
2    B        P  121.2
3    B        X  175.4
4    B       XB  62.28
5    B       XC  32.62
6    J        O  8.277
7    J        P  124.6
8    J        X  176.5
9    J       XB  53.34
10   J       XC  40.77
11   L        O  8.427
12   L        P  125.5
13   L        X  177.9
14   L       XB  55.29
15   L       XC  41.44
16   T        O  8.264
17   T        P  116.1
18   T        X  174.7
19   T       XB  59.47
20   T       XC  63.43
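
Note that the snippet above keeps only the residue letter in Res, so the position (312, 313, ...) is lost. Below is a minimal sketch of a melt-based variant that keeps both parts, assuming the file looks like the sample in the question. The inline CSV_TEXT, the output column names cat and value, and the use of sep=r"\s+" in place of delim_whitespace=True are illustrative choices, not part of the original answer.

```python
import io

import pandas as pd

# Inline copy of the first rows from the question, so the sketch runs without my_file.csv.
CSV_TEXT = """\
Res   X      XB     XC     O      P
A312  76.55  -      -      -      -
B313  175.4  62.28  32.62  8.189  121.2
J314  176.5  53.34  40.77  8.277  124.6
"""

# sep=r"\s+" splits on runs of whitespace; na_values="-" turns the dashes into NaN.
df = pd.read_csv(io.StringIO(CSV_TEXT), sep=r"\s+", na_values="-")

# Split e.g. "A312" into the residue letter and its numeric position.
df["Pos"] = df["Res"].str[1:].astype(int)
df["AA"] = df["Res"].str[0]

long_df = (
    df.drop(columns="Res")
      .melt(id_vars=["Pos", "AA"], var_name="cat", value_name="value")
      .dropna(subset=["value"])
      # mergesort is stable, so melt's column order (X, XB, XC, O, P)
      # is kept within each position.
      .sort_values("Pos", kind="mergesort")
      .reset_index(drop=True)
)
print(long_df)
```

Sorting on Pos alone with a stable algorithm keeps the columns in their original order within each residue, which is the order shown in the desired output of the question.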

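Answer 1: (score: 1)

Your solution needs only a few changes: first use DataFrame.pop to extract the Res column, so that a separate drop('Res', axis = 1, inplace = True) call is no longer needed; then create a MultiIndex with DataFrame.set_index and call DataFrame.stack; finally clean up the result with reset_index and rename: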

import pandas as pd

df1 = pd.read_csv("my_file.csv", delim_whitespace = True, index_col = False, na_values = "-")

df1['Pos'] = df1['Res'].str[1:].astype(int)
df1['AA'] = df1.pop('Res').str[0]

df = (df1.set_index(['Pos', 'AA'])
         .stack()
         .reset_index(name='new')
         .rename(columns={'level_2':'cat'}))

print(df)
    Pos AA cat      new
0   312  A   X   76.550
1   313  B   X  175.400
2   313  B  XB   62.280
3   313  B  XC   32.620
4   313  B   O    8.189
5   313  B   P  121.200
6   314  J   X  176.500
7   314  J  XB   53.340
8   314  J  XC   40.770
9   314  J   O    8.277
10  314  J   P  124.600
11  315  L   X  177.900
12  315  L  XB   55.290
13  315  L  XC   41.440
14  315  L   O    8.427
15  315  L   P  125.500
16  316  T   X  174.700
17  316  T  XB   59.470
18  316  T  XC   63.430
19  316  T   O    8.264
20  316  T   P  116.100
21  378  G   X   10.200
22  378  G  XB   58.910
23  378  G  XC   40.130
24  378  G   O    7.646
25  378  G   P  126.700
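
For completeness, the join that the question originally asked about can also be done directly. The sketch below is one possible way, not taken from either answer: it moves the column-name level of the stacked Series out of the index, so that both objects share the original row number as their index, and then joins on that index. The inline CSV_TEXT and the names row, cat, value, a_flat and joined are assumptions made for illustration.

```python
import io

import pandas as pd

# Inline copy of the first rows from the question, so the sketch runs without my_file.csv.
CSV_TEXT = """\
Res   X      XB     XC     O      P
A312  76.55  -      -      -      -
B313  175.4  62.28  32.62  8.189  121.2
J314  176.5  53.34  40.77  8.277  124.6
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), sep=r"\s+", na_values="-")

# b: one row per residue, indexed by the original row number (0, 1, 2, ...).
b = pd.DataFrame({"Pos": df["Res"].str[1:].astype(int), "AA": df["Res"].str[0]})

# a: stacked values with a two-level index (row number, column name).
# The explicit dropna() makes sure NaN entries are gone regardless of how
# the installed pandas version's stack() treats missing values.
a = df.drop(columns="Res").stack().dropna()

# Name the Series and its index levels, then turn the column-name level into
# a regular column so that only the row number remains in the index.
a_flat = a.rename("value").rename_axis(["row", "cat"]).reset_index(level="cat")

# Index-on-index join: each row of b is repeated once per matching row of a_flat.
joined = b.join(a_flat).reset_index(drop=True)
print(joined)
```

The attempts with merge, join and concat in the question most likely failed because a carries a two-level index while b has a single-level one; once the extra level is moved into a column, a plain index join is enough.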