pandas: match strings in df1['a'] against df2['a'] and assign the corresponding df2['b'] value to df1['new']

Time: 2020-03-18 00:18:03

Tags: python pandas dataframe

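Background

I am looking at the coronavirus dataset from CSSEGISandData (github.com/CSSEGISandData/COVID-19.git). I am trying to create a Plotly choropleth map that shows the number of cases in each US county.

Here is a sample of the CSV data from CSSEGISandData. I have concatenated multiple days into a single file: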

         Province/State Country/Region         Last Update  Confirmed  \
259               Chicago             US 2020-01-24 17:00:00        1.0   
3028           Orange, CA             US 2020-02-01 19:53:00        1.0   
2445       San Benito, CA             US 2020-02-03 03:53:02        2.0   
3181      San Antonio, TX             US 2020-02-13 18:53:02        1.0   
4762  Humboldt County, CA             US 2020-02-21 05:13:09        1.0   

      Deaths  Recovered  Latitude  Longitude            file  \
259      0.0        0.0       NaN        NaN  01-24-2020.csv   
3028     0.0        0.0       NaN        NaN  02-01-2020.csv   
2445     0.0        0.0       NaN        NaN  02-24-2020.csv   
3181     0.0        0.0   29.4241   -98.4936  03-04-2020.csv   
4762     0.0        0.0       NaN        NaN  02-27-2020.csv  

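Problem

I want to adapt this example (https://plot.ly/python/mapbox-county-choropleth/) to use the counties in my dataframe. To do that, I first need to:

  1. Match every county in the file to its FIPS code.

I found a list of FIPS codes here (https://github.com/kjhealy/fips-codes):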

   fips            name state
0     0   UNITED STATES   NaN
1  1000         ALABAMA   NaN
2  1001  Autauga County    AL
3  1003  Baldwin County    AL
4  1005  Barbour County    AL
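
One detail about the table above: the codes are read as integers, so a leading zero (as in 01001 for Autauga County) is lost. A minimal sketch of restoring it, assuming the choropleth example expects five-character string IDs; the `fips_str` column name is made up for illustration and the rows are typed in from the sample rather than downloaded:

```python
import pandas as pd

# Hand-typed copy of the five sample rows shown above.
county = pd.DataFrame({
    "fips": [0, 1000, 1001, 1003, 1005],
    "name": ["UNITED STATES", "ALABAMA", "Autauga County",
             "Baldwin County", "Barbour County"],
    "state": [None, None, "AL", "AL", "AL"],
})

# Convert to text and left-pad with zeros to a width of five characters.
county["fips_str"] = county["fips"].astype(str).str.zfill(5)
print(county[["fips", "fips_str", "name"]])
```

Alternatively, passing `dtype={"fips": str}` to `pd.read_csv` keeps the column as text from the start, although codes stored without the leading zero in the file would still need padding.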

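How can I create a new column in the pandas dataframe that holds the correct FIPS code?

Here is my code for importing the COVID data and the FIPS codes: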

!git clone https://github.com/CSSEGISandData/COVID-19.git

#@title Import and option to print more data
import glob

import pandas as pd

# Get the coronavirus data for the US
path = r'/content/COVID-19/csse_covid_19_data/csse_covid_19_daily_reports'  # use your path
all_files = glob.glob(path + "/*.csv")  # collect all daily report files
li = []
for filename in all_files:
    daily = pd.read_csv(filename, index_col=None, header=0)
    li.append(daily)
df = pd.concat(li, axis=0, ignore_index=True, sort=False)  # one dataframe
filter_USA = df['Country/Region'] == 'US'
USA = df[filter_USA].copy()  # copy so a new column can be added without a SettingWithCopyWarning
print(USA.head())


# Get the county data for the US with fips
county_url = 'https://raw.githubusercontent.com/kjhealy/fips-codes/master/state_and_county_fips_master.csv'
county = pd.read_csv(county_url)
print(county.head())
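
The sample at the top includes a `file` column that the import code above does not create. A minimal sketch of one way to record it, assuming the column simply holds the name of the CSV each row came from; `load_daily_reports` is a hypothetical helper name, not part of the original code:

```python
import glob
import os

import pandas as pd


def load_daily_reports(path):
    """Read every CSV under `path` into one DataFrame and tag each row
    with the base name of the file it came from."""
    frames = []
    # sorted() makes the row order reproducible; glob order is not guaranteed.
    for filename in sorted(glob.glob(os.path.join(path, "*.csv"))):
        daily = pd.read_csv(filename, index_col=None, header=0)
        daily["file"] = os.path.basename(filename)
        frames.append(daily)
    # pd.concat raises ValueError if no CSV files were found.
    return pd.concat(frames, axis=0, ignore_index=True, sort=False)
```

It would be called with the same `path` as above, for example `df = load_daily_reports(path)`.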

Now I need to match the Province/State values in USA to the county names and assign the corresponding fips value.

# Something like this (pseudocode):
# for each value in USA['Province/State']: find the matching county['name']
# USA['fips'] = the fips value of the matching county row
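
The pseudocode above could be sketched as a left merge on two keys. This assumes the Province/State values have the form "County Name, ST" and that the name must match the `name` column exactly; the small frames below are hand-typed stand-ins for `USA` and `county`, not the real downloads, and `USA_with_fips` is an illustrative name:

```python
import pandas as pd

# Toy stand-ins for the two frames in the question.
USA = pd.DataFrame({
    "Province/State": ["San Diego County, CA", "Humboldt County, CA", "Chicago, IL"],
})
county = pd.DataFrame({
    "fips": [6023, 6073],
    "name": ["Humboldt County", "San Diego County"],
    "state": ["CA", "CA"],
})

# Split "San Diego County, CA" into a county-name part and a two-letter state part.
parts = USA["Province/State"].str.extract(r"^(?P<name>.*?),\s*(?P<state>[A-Z]{2})\b")
keyed = USA.assign(name=parts["name"].str.strip(), state=parts["state"])

# A left merge keeps every COVID row; rows without a matching county get NaN in `fips`.
# Note that merge returns a fresh 0..n-1 index rather than the original one.
merged = keyed.merge(county[["fips", "name", "state"]], on=["name", "state"], how="left")
USA_with_fips = merged.drop(columns=["name", "state"])
print(USA_with_fips)
```

Merging on both name and state matters because the same county name occurs in several states. City entries such as "Chicago, IL" stay unmatched with this approach.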

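Edit

Adding a longer excerpt of the dataframe to show the different problems with the names:

  • Inconsistent letter case
  • Some are city names rather than counties
  • Some have extra text appended
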
3027                                Los Angeles, CA
2369                                Santa Clara, CA
2003                                 San Benito, CA
2310                                    Madison, WI
2470                                    Seattle, WA
6175                                    Chicago, IL
2237                           San Diego County, CA
2805                           San Diego County, CA
1765                                San Antonio, TX
3657                            Humboldt County, CA
737                                 Santa Clara, CA
3629                           San Diego County, CA
2468                          Sacramento County, CA
1543                                    Ashland, NE
1549                                     Travis, CA
1560                                   Lackland, TX
420            Lackland, TX (From Diamond Princess)
410              Travis, CA (From Diamond Princess)
404               Omaha, NE (From Diamond Princess)
6436              Omaha, NE (From Diamond Princess)
4289           Lackland, TX (From Diamond Princess)
5047             Travis, CA (From Diamond Princess)
2421    Unassigned Location (From Diamond Princess)
1769                                      Tempe, AZ
303     Unassigned Location (From Diamond Princess)
4981    Unassigned Location (From Diamond Princess)
5015                          Sacramento County, CA
4208    Unassigned Location (From Diamond Princess)
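
The name problems listed above could be reduced with a small cleaning step before any matching. This is only a sketch under stated assumptions: `normalize_place` is a hypothetical helper, the parenthetical suffix is assumed to always sit at the end of the string, and a name without "County" is assumed to be a county name, which is wrong for cities such as Chicago or Seattle.

```python
import re

import pandas as pd


def normalize_place(raw):
    """Return (lower-cased county name, state code) for a raw Province/State
    string, or (None, None) when no trailing ', ST' part is found."""
    # Drop a trailing parenthetical note such as "(From Diamond Princess)".
    text = re.sub(r"\s*\(.*?\)\s*$", "", str(raw)).strip()
    match = re.match(r"^(.*?),\s*([A-Za-z]{2})$", text)
    if match is None:
        return None, None
    name, state = match.group(1).strip(), match.group(2).upper()
    # Assumption: a bare name refers to a county, so append the suffix.
    if not name.lower().endswith(" county"):
        name = name + " County"
    # Lower-case the key so that letter case no longer affects matching.
    return name.lower(), state


sample = pd.Series([
    "San Diego County, CA",
    "Travis, CA (From Diamond Princess)",
    "Unassigned Location (From Diamond Princess)",
])
cleaned = pd.DataFrame(
    sample.map(normalize_place).tolist(),
    columns=["name_key", "state"],
    index=sample.index,
)
print(cleaned)
```

To use the result as a join key, the `name` column of the FIPS table would need the same lower-casing, for example `county["name"].str.lower()`. City names would still need a separate city-to-county lookup, which this sketch does not attempt.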

0 answers:

No answers