Subtracting multiple columns between data frames of different shapes, matching on multiple columns

Date: 2020-03-27 22:58:34

Tags: python pandas

I am looking at the following three datasets from JHU

https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv

https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv

https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv

which have the following form

 'Province/State'  'Country/Region'  'Lat'  'Long'  '1/22/20'  '1/23/20' ...
        NaN              Italy         x       y        0          0

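I want to compute the number of active cases for each province, country and region, using the formula active = confirmed - (recovered + deaths)

Previously, when the datasets all had the same shape, I could do the following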

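df_active = df_confirmed.copy()
# columns 0-3 are identifiers; subtract only the date columns (position 4 onward)
df_active.iloc[:, 4:] = df_confirmed.iloc[:, 4:] - (df_recovered.iloc[:, 4:] + df_deaths.iloc[:, 4:])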

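Now they no longer contain data for the same countries/regions, and the number of date columns is not always the same either.

So I need to do the following

1) determine which date columns are common to all 3 DFs,

2) where the province and country/region columns match, compute active = confirmed - (recovered + deaths)

For point 1), I can do the following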

## collect shape[1] (the column count) of each DF in a list
df_shape_list = []
df_shape_list.append(df_confirmed.shape[1])
...  
min_common_columns = min(df_shape_list)

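So I need to subtract columns 4:min_common_columns, but how do I do that only for the rows where the province and country columns match across all 3 DFs?

1 answer:

Answer 0: (score: 1)

Consider using melt to reshape the wide data into long format, then merge on location and date. Then run the required formula: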
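On point 1), counting columns only works if every frame's date columns start on the same day and have no gaps. A minimal sketch of matching the date columns by name instead, using `Index.intersection` on small made-up frames (the toy values and the names `id_cols` and `common_dates` are assumptions for illustration, not part of the original post):

```python
import pandas as pd

# Hypothetical toy frames standing in for the three JHU tables.
id_cols = ['Province/State', 'Country/Region', 'Lat', 'Long']
df_confirmed = pd.DataFrame([[None, 'Italy', 43.0, 12.0, 5, 9, 14]],
                            columns=id_cols + ['1/22/20', '1/23/20', '1/24/20'])
df_deaths = pd.DataFrame([[None, 'Italy', 43.0, 12.0, 0, 1, 2]],
                         columns=id_cols + ['1/22/20', '1/23/20', '1/24/20'])
# The recovered table is one date column short.
df_recovered = pd.DataFrame([[None, 'Italy', 43.0, 12.0, 1, 2]],
                            columns=id_cols + ['1/22/20', '1/23/20'])

# Keep only the date columns whose names appear in all three frames.
common_dates = (df_confirmed.columns.drop(id_cols)
                .intersection(df_deaths.columns)
                .intersection(df_recovered.columns))
print(list(common_dates))
```

Selecting `df[common_dates]` from each frame then gives date columns that are guaranteed to refer to the same days, regardless of their position.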

from functools import reduce
import pandas as pd

df_confirmed = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/"
                           "csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")

df_deaths = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/"
                        "csse_covid_19_time_series/time_series_covid19_deaths_global.csv")

df_recovered = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/"
                           "csse_covid_19_time_series/time_series_covid19_recovered_global.csv")


# MELT EACH DF IN LIST COMPREHENSION
df_list = [df.melt(id_vars = ['Province/State', 'Country/Region', 'Lat', 'Long'],
                   var_name = 'Date', value_name = val) 
           for df, val in zip([df_confirmed, df_deaths, df_recovered], 
                              ['confirmed', 'deaths', 'recovered'])]

# CHAIN MERGE
df_long = reduce(lambda x,y: pd.merge(x, y, on=['Province/State', 'Country/Region', 'Lat', 'Long', 'Date']),
                 df_list)

# SIMPLE ARITHMETIC
df_long['active'] = df_long['confirmed'] - (df_long['recovered'] + df_long['deaths'])
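To see how this copes with frames of different shapes without downloading anything, here is a minimal sketch of the same melt-and-merge steps on made-up frames. The numbers, the 'Atlantis' row and the names `toy_long` and `id_cols` are hypothetical. Because `pd.merge` defaults to `how='inner'`, only location/date combinations present in every frame survive:

```python
from functools import reduce
import pandas as pd

id_cols = ['Province/State', 'Country/Region', 'Lat', 'Long']

# Hypothetical toy data: 'recovered' lacks one location and one date column.
confirmed = pd.DataFrame([[None, 'Italy', 43.0, 12.0, 10, 20],
                          [None, 'Atlantis', 0.0, 0.0, 3, 4]],
                         columns=id_cols + ['1/22/20', '1/23/20'])
deaths = pd.DataFrame([[None, 'Italy', 43.0, 12.0, 1, 2],
                       [None, 'Atlantis', 0.0, 0.0, 0, 1]],
                      columns=id_cols + ['1/22/20', '1/23/20'])
recovered = pd.DataFrame([[None, 'Italy', 43.0, 12.0, 4]],
                         columns=id_cols + ['1/22/20'])

# Wide to long: one row per location and date.
frames = [df.melt(id_vars=id_cols, var_name='Date', value_name=name)
          for df, name in zip([confirmed, deaths, recovered],
                              ['confirmed', 'deaths', 'recovered'])]

# Inner join on location and date; unmatched rows are dropped.
toy_long = reduce(lambda x, y: pd.merge(x, y, on=id_cols + ['Date']), frames)
toy_long['active'] = toy_long['confirmed'] - (toy_long['recovered'] + toy_long['deaths'])
print(toy_long)
```

Rows whose 'Province/State' is missing still match each other, because pandas joins null keys with null keys. Note that 'Lat' and 'Long' are part of the join key here, so a location whose coordinates differ between files would be dropped as well.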

Output (sorted by active, in descending order)

df_long.sort_values(['active'], ascending=False).head(10)

#       Province/State Country/Region      Lat     Long     Date  confirmed  deaths  recovered  active
# 15229            NaN             US  37.0902 -95.7129  3/27/20     101657    1581        869   99207
# 14998            NaN             US  37.0902 -95.7129  3/26/20      83836    1209        681   81946
# 15141            NaN          Italy  43.0000  12.0000  3/27/20      86498    9134      10950   66414
# 14767            NaN             US  37.0902 -95.7129  3/25/20      65778     942        361   64475
# 14910            NaN          Italy  43.0000  12.0000  3/26/20      80589    8215      10361   62013
# 14679            NaN          Italy  43.0000  12.0000  3/25/20      74386    7503       9362   57521
# 14448            NaN          Italy  43.0000  12.0000  3/24/20      69176    6820       8326   54030
# 14536            NaN             US  37.0902 -95.7129  3/24/20      53740     706        348   52686
# 15205            NaN          Spain  40.0000  -4.0000  3/27/20      65719    5138       9357   51224
# 14217            NaN          Italy  43.0000  12.0000  3/23/20      63927    6077       7024   50826
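If the result is wanted in the original wide layout, one possible alternative (not the approach used above) is to index each frame by province and country, then subtract only over the rows and columns that all three share. This is a rough sketch on made-up frames; the helper `to_indexed`, the toy values, and the choice to replace missing provinces with an empty string are all assumptions for illustration:

```python
import pandas as pd

id_cols = ['Province/State', 'Country/Region', 'Lat', 'Long']
keys = ['Province/State', 'Country/Region']

# Hypothetical toy data: 'recovered' lacks one location and one date column.
confirmed = pd.DataFrame([[None, 'Italy', 43.0, 12.0, 10, 20],
                          [None, 'Atlantis', 0.0, 0.0, 3, 4]],
                         columns=id_cols + ['1/22/20', '1/23/20'])
deaths = pd.DataFrame([[None, 'Italy', 43.0, 12.0, 1, 2],
                       [None, 'Atlantis', 0.0, 0.0, 0, 1]],
                      columns=id_cols + ['1/22/20', '1/23/20'])
recovered = pd.DataFrame([[None, 'Italy', 43.0, 12.0, 4]],
                         columns=id_cols + ['1/22/20'])

def to_indexed(df):
    # Hypothetical helper: drop Lat/Long, fill missing provinces with ''
    # so the keys compare cleanly, and index by (province, country).
    out = df.drop(columns=['Lat', 'Long']).copy()
    out['Province/State'] = out['Province/State'].fillna('')
    return out.set_index(keys)

c, d, r = (to_indexed(df) for df in (confirmed, deaths, recovered))

# Rows and date columns present in all three frames.
rows = c.index.intersection(d.index).intersection(r.index)
cols = c.columns.intersection(d.columns).intersection(r.columns)

# Restrict every frame to the shared block, then apply the formula.
active = (c.reindex(index=rows, columns=cols)
          - (r.reindex(index=rows, columns=cols)
             + d.reindex(index=rows, columns=cols)))
print(active)
```

This keeps one row per location and one column per date, at the cost of ignoring 'Lat' and 'Long' entirely.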