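I have a DataFrame with non-unique GEO_IDs. Each GEO_ID has an attribute (FTYPE, one of 6 values) in a separate column, together with the length associated with each FTYPE.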
df
FID GEO_ID FTYPE Length_km
0 1400000US06001400100 428 3.291467766
1 1400000US06001400100 460 7.566487367
2 1400000US06001401700 460 0.262190266
3 1400000US06001401700 566 10.49899202
4 1400000US06001403300 428 0.138171389
5 1400000US06001403300 558 0.532913513
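How can I create 6 new columns for FTYPE (1 and 0 indicating whether the row has that FTYPE) and 6 new columns for FTYPE_Length, so that each row has a unique GEO_ID?
I would like my new DataFrame to have a structure like this (with 6 FTYPEs):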
FID GEO_ID FTYPE_428 FTYPE_428_length FTYPE_460 FTYPE_460_length
0 1400000US06001400100 1 3.291467766 1 7.566487367
What I have tried so far is something like this:
import pandas as pd
fname = "filename.csv"
df = pd.read_csv(fname)
nhd = [334, 336, 420, 428, 460, 558, 556]
df1 = df.loc[df['FTYPE']==nhd[0]]
df2 = df.loc[df['FTYPE']==nhd[1]]
df3 = df.loc[df['FTYPE']==nhd[2]]
df4 = df.loc[df['FTYPE']==nhd[3]]
df5 = df.loc[df['FTYPE']==nhd[4]]
df6 = df.loc[df['FTYPE']==nhd[5]]
df7 = df.loc[df['FTYPE']==nhd[6]]
df12 = df1.merge(df2, how='left', left_on='GEO_ID', right_on='GEO_ID')
df23 = df12.merge(df3,how='left', left_on='GEO_ID', right_on='GEO_ID')
df34 = df23.merge(df4,how='left', left_on='GEO_ID', right_on='GEO_ID')
df45 = df34.merge(df5,how='left', left_on='GEO_ID', right_on='GEO_ID')
df56 = df45.merge(df6,how='left', left_on='GEO_ID', right_on='GEO_ID')
df67 = df56.merge(df7,how='left', left_on='GEO_ID', right_on='GEO_ID')
cols = [0,4,7,10,13,16,19]
df67.drop(df67.columns[cols],axis=1,inplace=True)
df67.columns =['GEO_ID','334','len_334','336','len_336','420','len_420','428','len_428','460','len_460','558','len_558','566','len_566']
But this approach has a problem: it reduces the rows to those that have the first two FTYPEs. Is there a way to merge multiple columns at once?
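To illustrate the row loss, here is a minimal sketch on made-up data (the GEO_IDs "A", "B", "C" and the lengths are hypothetical, not from the real file). It assumes the cause is the `how='left'` chain: a left merge keeps only the keys present in the left-hand frame, so any GEO_ID missing from the first frame never makes it into the result, while `how='outer'` keeps every key.

```python
import pandas as pd

# Hypothetical toy data: "A", "B", "C" stand in for real GEO_IDs.
df = pd.DataFrame({
    "GEO_ID": ["A", "A", "B", "C"],
    "FTYPE": [428, 460, 460, 428],
    "Length_km": [3.0, 7.5, 0.25, 0.5],
})

# One frame per FTYPE, as in the attempt above (only two FTYPEs here).
df428 = df.loc[df["FTYPE"] == 428, ["GEO_ID", "Length_km"]]
df460 = df.loc[df["FTYPE"] == 460, ["GEO_ID", "Length_km"]]

# Left merge: only GEO_IDs that occur in df428 survive, so "B" is lost.
left = df428.merge(df460, how="left", on="GEO_ID", suffixes=("_428", "_460"))

# Outer merge: every GEO_ID from either frame is kept, with NaN where missing.
outer = df428.merge(df460, how="outer", on="GEO_ID", suffixes=("_428", "_460"))

print(sorted(left["GEO_ID"]))   # ['A', 'C']
print(sorted(outer["GEO_ID"]))  # ['A', 'B', 'C']
```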
It might be easier to write a for loop that goes through each row and uses a condition to fill in the values, like this:
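nhd = [334, 336, 420, 428, 460, 558, 556]
# add an empty flag column and an empty length column per FTYPE
for x in nhd:
    df[str(x)] = None
    df["length_"+str(x)] = None
df.head()
for geoid in df["GEO_ID"]:
    #print geoid
    for x in nhd:
        # single .loc assignment (df.ix is removed in current pandas); column is str(x), not str(nhd)
        df.loc[(df['FTYPE']==x) & (df['GEO_ID'] == geoid), str(x)] = 1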
But this takes far too long, and pandas probably has a one-liner that does the same thing.
Any help with this is appreciated!
Thanks, Solomon
Answer 0 (score: 1)
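I don't really see the point of your _length columns: they seem to carry the same information as simply whether or not the matching value is null, which makes them redundant. They're easy enough to create, though.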
We could cram this into one line if we insisted, but what's the point? This is SO, not codegolf. So I might do something like this:
df = df.pivot(index="GEO_ID", columns="FTYPE", values="Length_km")
df.columns = "FTYPE_" + df.columns.astype(str)
has_value = df.notnull().astype(int)
has_value.columns += '_length'
final = pd.concat([df, has_value], axis=1).sort_index(axis='columns')
which gives me (using your input data, which has only 5 distinct FTYPEs):
In [49]: final
Out[49]:
                      FTYPE_334  FTYPE_334_length  FTYPE_428  \
GEO_ID
1400000US06001400100        NaN                 0   3.291468
1400000US06001401700        NaN                 0        NaN
1400000US06001403300        NaN                 0   0.138171
1400000US06001403400    0.04308                 1        NaN

                      FTYPE_428_length  FTYPE_460  FTYPE_460_length  \
GEO_ID
1400000US06001400100                 1   7.566487                 1
1400000US06001401700                 0   0.262190                 1
1400000US06001403300                 1        NaN                 0
1400000US06001403400                 0        NaN                 0

                      FTYPE_558  FTYPE_558_length  FTYPE_566  FTYPE_566_length
GEO_ID
1400000US06001400100        NaN                 0        NaN                 0
1400000US06001401700        NaN                 0  10.498992                 1
1400000US06001403300   0.532914                 1   1.518864                 1
1400000US06001403400        NaN                 0        NaN                 0
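
Note that in the output above the lengths sit in the FTYPE_x columns and the 0/1 flags in the FTYPE_x_length columns, which is the reverse of the layout asked for in the question. If the question's naming is wanted, a minimal sketch along the same lines might look like the following. The `ftypes` list is an assumption (the question's own list has 556 where the sample data shows 566), and the names `lengths`, `flags`, `order` and `wide` are just illustrative. The sample rows are the six shown in the question.

```python
import pandas as pd

# The six sample rows shown in the question.
df = pd.DataFrame({
    "GEO_ID": ["1400000US06001400100", "1400000US06001400100",
               "1400000US06001401700", "1400000US06001401700",
               "1400000US06001403300", "1400000US06001403300"],
    "FTYPE": [428, 460, 460, 566, 428, 558],
    "Length_km": [3.291467766, 7.566487367, 0.262190266,
                  10.49899202, 0.138171389, 0.532913513],
})

# Assumed list of FTYPE codes to report; adjust to the real set.
ftypes = [334, 336, 420, 428, 460, 558, 566]

# One row per GEO_ID, one column per FTYPE, holding the length.
# reindex adds all-NaN columns for FTYPEs that never occur in the data.
lengths = (df.pivot(index="GEO_ID", columns="FTYPE", values="Length_km")
             .reindex(columns=ftypes))

# 1 where a length exists, 0 otherwise.
flags = lengths.notnull().astype(int)

# Name the columns the way the question asks for.
lengths.columns = ["FTYPE_%d_length" % c for c in ftypes]
flags.columns = ["FTYPE_%d" % c for c in ftypes]

# Interleave flag and length columns, then turn GEO_ID back into a column.
order = [name for c in ftypes
         for name in ("FTYPE_%d" % c, "FTYPE_%d_length" % c)]
wide = pd.concat([flags, lengths], axis=1)[order].reset_index()

print(wide[["GEO_ID", "FTYPE_428", "FTYPE_428_length"]])
```

One caveat: `pivot` raises a ValueError if the same (GEO_ID, FTYPE) pair appears more than once. In that case something like `df.pivot_table(index="GEO_ID", columns="FTYPE", values="Length_km", aggfunc="sum")` may be a better fit, assuming that summing the lengths is what you want.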