Question

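I have two dataframes: df_geo and df_event. I want to create two new columns in df_event. The dataframes look similar to the following, although other columns have been removed for simplicity: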

import pandas as pd

data_geo =  [['040','01','000','00000','00000','00000','Alabama'],
             ['050','01','001','00000','00000','00000','Autauga County'],
             ['050','01','097','00000','00000','00000','Mobile County'],
             ['050','01','101','00000','00000','00000','Montgomery County'],
             ['050','01','115','00000','00000','00000','St. Clair County'],
             ['040','09','000','00000','00000','00000','Connecticut'],
             ['061','09','001','04720','00000','00000','Bethel town'],
             ['040','17','000','00000','00000','00000','Illinois'],
             ['061','17','109','05638','00000','00000','Bethel township'],
             ['050','17','163','00000','00000','00000','St. Clair County']] 

df_geo = pd.DataFrame(data_geo, columns = ['summary_level', 'state_fips','county_fips','subdivision_code_fips','place_code_fips','city_code_fips','area_name']) 

df_geo.info()

RangeIndex: 43847 entries, 0 to 43846
Data columns (total 7 columns):
summary_level            43847 non-null object
state_fips               43847 non-null object
county_fips              43847 non-null object
subdivision_code_fips    43847 non-null object
place_code_fips          43847 non-null object
city_code_fips           43847 non-null object
area_name                43847 non-null object

data_event = [['event_id','_','Alabama'], 
              ['event_id','_','Connecticut'],
              ['event_id','Autauga County','Alabama'],
              ['event_id','Fairfield County','Connecticut'],
              ['event_id','Fairbanks North Star Borough','Alaska']] 

df_event = pd.DataFrame(data_event, columns = ['event_id','county','state']) 

df_event.info()

RangeIndex: 1261 entries, 0 to 1260
Data columns (total 3 columns):
event_id                1261 non-null object
county                   999 non-null object
state                   1261 non-null object
dtypes: object(3)

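The goal is to create a function that takes county and state from df_event as input and creates two new columns in the same dataframe. The new columns are based on the values of state_fips and county_fips in df_geo. An example is shown below: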

inputA fun('df_geo','Connecticut','Fairfield County'):   

resultA = ['event_id','Connecticut','Fairfield County','09','001']
                                                       ^New columns

inputB fun('df_geo','Alaska','Fairbanks North Star Borough'):   

resultB = ['event_id','Alaska','Fairbanks North Star Borough','02','090']
                                                              ^New columns

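This is a problem because I also need to use this function on a list of 1,200 (and growing) events, so it has to work in a lambda function or in some other function that can be mapped over the entire dataframe.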

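This is complicated by the same county name appearing in several states (for example, "St. Clair County"). Although their area_name values are identical, their state_fips values differ.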

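St. Clair County, Illinois has a state_fips of 17, the same as every other county in Illinois and the state itself. St. Clair County, Alabama has a state_fips of 01, the same as every other county in Alabama, and so on...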
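To make the ambiguity concrete, here is a small sketch using a few rows from the sample data above (trimmed to four columns for brevity). It only illustrates that area_name alone is not a unique key, while the pair (state_fips, area_name) is:

```python
import pandas as pd

# A few rows from the sample geography table, trimmed to the columns used here.
df_geo = pd.DataFrame(
    [['050', '01', '115', 'St. Clair County'],
     ['050', '17', '163', 'St. Clair County'],
     ['050', '01', '001', 'Autauga County']],
    columns=['summary_level', 'state_fips', 'county_fips', 'area_name'])

# The county name on its own appears more than once...
name_is_unique = df_geo['area_name'].is_unique

# ...but the combination of state code and name does not repeat.
pair_is_unique = not df_geo.duplicated(['state_fips', 'area_name']).any()

# Both St. Clair County rows, each with a different state_fips.
st_clair = df_geo.loc[df_geo['area_name'] == 'St. Clair County',
                      ['state_fips', 'county_fips']]

print(name_is_unique, pair_is_unique)
print(st_clair)
```

Any lookup keyed on the county name alone will therefore have to pick one of the two rows arbitrarily.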

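I would like to use the same search-and-map function all the way down to city_code_fips. At that level all search terms have to match exactly, so that I don't get "Bethel town" when I meant to look up "Bethel township". Exact input also matters because states like Louisiana use a different name for their county-level geographies.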
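As a brief illustration of why exact matching matters, the sketch below compares an equality test with a substring test on a few names from the sample data. The variable names are only for illustration:

```python
import pandas as pd

names = pd.Series(['Bethel town', 'Bethel township', 'St. Clair County'])

# Exact comparison: only the row whose name is exactly 'Bethel town'.
exact = names[names == 'Bethel town']

# Substring search: also matches 'Bethel township', which is not wanted here.
loose = names[names.str.contains('Bethel town', regex=False)]

print(exact.tolist())
print(loose.tolist())
```

Lookups built on an index (for example with Series.map or merge) compare keys for equality, so they behave like the first case.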

In df_event, "_" means the county is unknown.

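df_event['event_id'] is a unique string. Some rows in the dataframe are nearly identical but have different IDs, indicating that the event occurred more than once. This has no effect on state_fips or county_fips.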

I know this is a multi-step process, but all help is appreciated. Thanks.

Answer 1

You can use df.merge for this.
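A minimal sketch of what a merge-based approach could look like, not necessarily the answerer's exact code. It assumes that summary_level '040' marks state rows and '050' marks county rows, as the sample data suggests, and it uses a trimmed version of the question's tables. The intermediate names (states, counties, merged) are made up for illustration:

```python
import pandas as pd

# Trimmed sample of the question's geography table.
df_geo = pd.DataFrame(
    [['040', '01', '000', 'Alabama'],
     ['050', '01', '001', 'Autauga County'],
     ['050', '01', '115', 'St. Clair County'],
     ['040', '17', '000', 'Illinois'],
     ['050', '17', '163', 'St. Clair County']],
    columns=['summary_level', 'state_fips', 'county_fips', 'area_name'])

df_event = pd.DataFrame(
    [['e1', '_', 'Alabama'],
     ['e2', 'Autauga County', 'Alabama'],
     ['e3', 'St. Clair County', 'Illinois']],
    columns=['event_id', 'county', 'state'])

# Step 1: state name -> state_fips, using only the state-level rows
# (assumed to be summary_level '040').
states = (df_geo.loc[df_geo['summary_level'] == '040', ['area_name', 'state_fips']]
          .rename(columns={'area_name': 'state'}))
merged = df_event.merge(states, on='state', how='left')

# Step 2: (state_fips, county name) -> county_fips, using only the
# county-level rows (assumed to be summary_level '050').
counties = (df_geo.loc[df_geo['summary_level'] == '050',
                       ['state_fips', 'area_name', 'county_fips']]
            .rename(columns={'area_name': 'county'}))
merged = merged.merge(counties, on=['state_fips', 'county'], how='left')

print(merged)
```

Merging on both state_fips and county keeps the two St. Clair counties apart, and the left joins leave county_fips missing for events whose county is "_".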

Answer 2

If the area_name column contains duplicates, first remove them with DataFrame.drop_duplicates:

df_geo = df_geo.drop_duplicates('area_name')

Then use Series.map, which is faster than merge and should therefore be preferred:

df_event['state_fips'] = df_event['state'].map(df_geo.set_index('area_name')['state_fips'])
df_event['county_fips'] = df_event['county'].map(df_geo.set_index('area_name')['county_fips'])
print(df_event)
  unique_str                        county        state state_fips county_fips
0   Event Id                             _      Alabama         01         NaN
1   Event Id                             _  Connecticut         09         NaN
2   Event Id                Autauga County      Alabama         01         001
3   Event Id              Fairfield County  Connecticut         09         001
4   Event Id  Fairbanks North Star Borough       Alaska         02         090
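Note that dropping duplicates on area_name alone keeps only the first "St. Clair County" row, so with the sample data an Illinois event would pick up Alabama's county code. A possible variant is sketched below. It assumes that summary_level '040' marks state rows and '050' marks county rows, as in the sample data, and it keys the county lookup on state_fips together with the name. The helper names (state_lookup, county_lookup, keys) are hypothetical:

```python
import pandas as pd

# Trimmed sample of the question's geography table.
df_geo = pd.DataFrame(
    [['040', '01', '000', 'Alabama'],
     ['050', '01', '115', 'St. Clair County'],
     ['040', '17', '000', 'Illinois'],
     ['050', '17', '163', 'St. Clair County']],
    columns=['summary_level', 'state_fips', 'county_fips', 'area_name'])

df_event = pd.DataFrame(
    [['e1', 'St. Clair County', 'Alabama'],
     ['e2', 'St. Clair County', 'Illinois'],
     ['e3', '_', 'Alabama']],
    columns=['event_id', 'county', 'state'])

# State name -> state_fips, built from the state-level rows only.
state_rows = df_geo[df_geo['summary_level'] == '040']
state_lookup = state_rows.set_index('area_name')['state_fips']
df_event['state_fips'] = df_event['state'].map(state_lookup)

# (state_fips, county name) -> county_fips, built from the county-level rows only.
county_rows = df_geo[df_geo['summary_level'] == '050']
county_lookup = county_rows.set_index(['state_fips', 'area_name'])['county_fips']

# Look up each event's (state_fips, county) pair; pairs that are not in the
# lookup, such as the unknown county '_', come back as NaN.
keys = pd.MultiIndex.from_frame(df_event[['state_fips', 'county']])
df_event['county_fips'] = county_lookup.reindex(keys).to_numpy()

print(df_event)
```

This keeps the map-style lookup but requires each (state_fips, area_name) pair to be unique among the county rows, since reindex does not accept a duplicated index.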

Map values of new columns by searching another dataframe
