Python Pandas: merge, join, concat

Asked: 2017-12-09 04:51:56

Tags: pandas join merge concat

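I have a dataframe with non-unique GEO_IDs. Each row holds one attribute (FTYPE, 1 of 6 possible values) for a GEO_ID in a separate column, along with the length associated with that FTYPE.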

df

FID  GEO_ID                FTYPE  Length_km
0    1400000US06001400100  428    3.291467766
1    1400000US06001400100  460    7.566487367
2    1400000US06001401700  460    0.262190266
3    1400000US06001401700  566    10.49899202
4    1400000US06001403300  428    0.138171389
5    1400000US06001403300  558    0.532913513

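How can I create 6 new columns for FTYPE (1 or 0, indicating whether the GEO_ID has that FTYPE) and 6 new columns for FTYPE_Length, so that each row has a unique GEO_ID?

I would like my new dataframe to have a structure like this (with all 6 FTYPEs):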

FID  GEO_ID                FTYPE_428  FTYPE_428_length  FTYPE_460  FTYPE_460_length
0    1400000US06001400100  1          3.291467766       1          7.566487367

What I have tried so far is something like this:

import pandas as pd
fname = "filename.csv"
df = pd.read_csv(fname)
nhd = [334, 336, 420, 428, 460, 558, 556]
df1 = df.loc[df['FTYPE']==nhd[0]]
df2 = df.loc[df['FTYPE']==nhd[1]]
df3 = df.loc[df['FTYPE']==nhd[2]]
df4 = df.loc[df['FTYPE']==nhd[3]]
df5 = df.loc[df['FTYPE']==nhd[4]]
df6 = df.loc[df['FTYPE']==nhd[5]]
df7 = df.loc[df['FTYPE']==nhd[6]]
df12 = df1.merge(df2, how='left', left_on='GEO_ID', right_on='GEO_ID')
df23 = df12.merge(df3,how='left', left_on='GEO_ID', right_on='GEO_ID')
df34 = df23.merge(df4,how='left', left_on='GEO_ID', right_on='GEO_ID')
df45 = df34.merge(df5,how='left', left_on='GEO_ID', right_on='GEO_ID')
df56 = df45.merge(df6,how='left', left_on='GEO_ID', right_on='GEO_ID')
df67 = df56.merge(df7,how='left', left_on='GEO_ID', right_on='GEO_ID')
cols = [0,4,7,10,13,16,19]
df67.drop(df67.columns[cols],axis=1,inplace=True)
df67.columns =['GEO_ID','334','len_334','336','len_336','420','len_420','428','len_428','460','len_460','558','len_558','566','len_566']

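But there is a problem with this approach: it reduces the rows to those that have the first two FTYPEs. Is there a way to merge on multiple columns at once?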
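The row loss can be reproduced on a small frame. This is a minimal sketch on made-up data (the GEO_ID values `a`, `b`, `c` and the names `toy`, `left`, `right` are illustrative, not from the real file): a merge with `how='left'` keeps only the keys present in the left-hand frame, whereas `how='outer'` keeps the keys from both sides.

```python
import pandas as pd

# Hypothetical toy data: "a" has both FTYPEs, "b" only 428, "c" only 460.
toy = pd.DataFrame({
    "GEO_ID": ["a", "a", "b", "c"],
    "FTYPE": [428, 460, 428, 460],
    "Length_km": [1.0, 2.0, 3.0, 4.0],
})

left = toy.loc[toy["FTYPE"] == 428, ["GEO_ID", "Length_km"]]
right = toy.loc[toy["FTYPE"] == 460, ["GEO_ID", "Length_km"]]

# how="left" keeps only the GEO_IDs found in `left`, so "c" is dropped.
left_merged = left.merge(right, how="left", on="GEO_ID",
                         suffixes=("_428", "_460"))

# how="outer" keeps every GEO_ID found in either frame.
outer_merged = left.merge(right, how="outer", on="GEO_ID",
                          suffixes=("_428", "_460"))

print(sorted(left_merged["GEO_ID"]))
print(sorted(outer_merged["GEO_ID"]))
```

Chaining several left merges starting from the subset for a single FTYPE therefore keeps only the GEO_IDs that appear in that first subset.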

It might be easier to write a for loop that goes through each row and uses a condition to fill in the values, like this:

nhd = [334, 336, 420, 428, 460, 558, 556]
for x in nhd:
    df[str(x)] = None
    df["length_"+str(x)] = None
df.head()
for geoid in df["GEO_ID"]:
    #print geoid
    for x in nhd:
        df.ix[(df['FTYPE']==x) & (df['GEO_ID'] == geoid)][str(nhd)] = 1

But this takes far too long, and Pandas probably has a one-liner that does the same thing.

Any help with this is appreciated!

Thanks, Solomon

1 answer:

Answer 0: (score: 1)

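I don't see the point of your _length columns: they seem to carry the same information as simply whether or not the matching value is null, which makes them redundant. They're easy enough to create, though.

We could cram this into one line if we insisted, but what would be the point? This is SO, not codegolf. So I might do something like this: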

df = df.pivot(index="GEO_ID", columns="FTYPE", values="Length_km")
df.columns = "FTYPE_" + df.columns.astype(str)

has_value = df.notnull().astype(int)
has_value.columns += '_length'

final = pd.concat([df, has_value], axis=1).sort_index(axis='columns')

This gives me (using input data that contains only 5 distinct FTYPEs):

In [49]: final
Out[49]: 
                      FTYPE_334  FTYPE_334_length  FTYPE_428  \
GEO_ID                                                         
1400000US06001400100        NaN                 0   3.291468   
1400000US06001401700        NaN                 0        NaN   
1400000US06001403300        NaN                 0   0.138171   
1400000US06001403400    0.04308                 1        NaN   

                      FTYPE_428_length  FTYPE_460  FTYPE_460_length  \
GEO_ID                                                                
1400000US06001400100                 1   7.566487                 1   
1400000US06001401700                 0   0.262190                 1   
1400000US06001403300                 1        NaN                 0   
1400000US06001403400                 0        NaN                 0   

                      FTYPE_558  FTYPE_558_length  FTYPE_566  FTYPE_566_length  
GEO_ID                                                                          
1400000US06001400100        NaN                 0        NaN                 0  
1400000US06001401700        NaN                 0  10.498992                 1  
1400000US06001403300   0.532914                 1   1.518864                 1  
1400000US06001403400        NaN                 0        NaN                 0
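
In the output above, the FTYPE_<n> columns hold the lengths and the FTYPE_<n>_length columns hold the 0/1 flags, which is the reverse of the layout the question asks for. A self-contained sketch of the same pivot idea with the names swapped to match the question might look like the following. It uses only the six sample rows from the question, and it swaps `pivot` for `pivot_table` with `aggfunc="sum"` so that a repeated (GEO_ID, FTYPE) pair would not raise an error; summing such duplicates is an assumption, not something the question specifies. The names `lengths`, `flags` and `wide` are illustrative.

```python
import pandas as pd

# Sample rows copied from the question (FID omitted).
df = pd.DataFrame({
    "GEO_ID": ["1400000US06001400100", "1400000US06001400100",
               "1400000US06001401700", "1400000US06001401700",
               "1400000US06001403300", "1400000US06001403300"],
    "FTYPE": [428, 460, 460, 566, 428, 558],
    "Length_km": [3.291467766, 7.566487367, 0.262190266,
                  10.49899202, 0.138171389, 0.532913513],
})

# One row per GEO_ID, one column per FTYPE, NaN where the pair is absent.
# Assumption: lengths of duplicate (GEO_ID, FTYPE) pairs should be summed.
lengths = df.pivot_table(index="GEO_ID", columns="FTYPE",
                         values="Length_km", aggfunc="sum")

# 1 where a length exists for that GEO_ID/FTYPE pair, 0 otherwise.
flags = lengths.notna().astype(int)

# Name the columns the way the question asks for.
flags.columns = [f"FTYPE_{c}" for c in lengths.columns]
lengths.columns = [f"FTYPE_{c}_length" for c in lengths.columns]

# Interleave flag and length columns, then turn GEO_ID back into a column.
wide = (pd.concat([flags, lengths], axis=1)
          .sort_index(axis="columns")
          .reset_index())

print(wide)
```

Only the FTYPEs that occur in the data get columns here; if all six are required regardless, the pivoted frame could be reindexed against the full list of FTYPE codes before the flags are computed.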