I am looking at the following three datasets from JHU,
which have this form:
'Province/State' 'Country/Region' 'Lat' 'Long' '1/22/20' '1/23/20' ...
NaN Italy x y 0 0
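I want to use the formula active = confirmed - (recovered + deaths)
to compute the number of active cases for each province, country, and region.
Previously the datasets all had the same shape, so I could do the following: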
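df_active = df_confirmed.copy()
# the first four columns are identifiers; columns 4 onward are the date columns
df_active.iloc[:, 4:] = df_confirmed.iloc[:, 4:] - (df_recovered.iloc[:, 4:] + df_deaths.iloc[:, 4:])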
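Now they no longer contain data for the same countries/regions, and the number of date columns is not always the same.
So I need to do the following:
1) determine which date columns all 3 DataFrames have in common,
2) where the province and country/region columns match, compute active = confirmed - (recovered + deaths).
For point 1), I can do the following: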
# append each DataFrame's column count (shape[1]) to a list
df_shape_list = []
df_shape_list.append(df_confirmed.shape[1])
...
min_common_columns = min(df_shape_list)
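So I need to do the subtraction over columns 4:min_common_columns,
but how do I do that only where the province and country columns match across all 3 DataFrames?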
Answer 0 (score: 1)
Consider using melt to reshape the wide data into long format, and then merge the three results on location and date. Then run the desired formula:
from functools import reduce
import pandas as pd
df_confirmed = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/"
"csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")
df_deaths = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/"
"csse_covid_19_time_series/time_series_covid19_deaths_global.csv")
df_recovered = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/"
"csse_covid_19_time_series/time_series_covid19_recovered_global.csv")
# MELT EACH DF IN LIST COMPREHENSION
df_list = [df.melt(id_vars = ['Province/State', 'Country/Region', 'Lat', 'Long'],
var_name = 'Date', value_name = val)
for df, val in zip([df_confirmed, df_deaths, df_recovered],
['confirmed', 'deaths', 'recovered'])]
# CHAIN MERGE
df_long = reduce(lambda x,y: pd.merge(x, y, on=['Province/State', 'Country/Region', 'Lat', 'Long', 'Date']),
df_list)
# SIMPLE ARITHMETIC
df_long['active'] = df_long['confirmed'] - (df_long['recovered'] + df_long['deaths'])
Output (sorted by active cases in descending order)
df_long.sort_values(['active'], ascending=False).head(10)
# Province/State Country/Region Lat Long Date confirmed deaths recovered active
# 15229 NaN US 37.0902 -95.7129 3/27/20 101657 1581 869 99207
# 14998 NaN US 37.0902 -95.7129 3/26/20 83836 1209 681 81946
# 15141 NaN Italy 43.0000 12.0000 3/27/20 86498 9134 10950 66414
# 14767 NaN US 37.0902 -95.7129 3/25/20 65778 942 361 64475
# 14910 NaN Italy 43.0000 12.0000 3/26/20 80589 8215 10361 62013
# 14679 NaN Italy 43.0000 12.0000 3/25/20 74386 7503 9362 57521
# 14448 NaN Italy 43.0000 12.0000 3/24/20 69176 6820 8326 54030
# 14536 NaN US 37.0902 -95.7129 3/24/20 53740 706 348 52686
# 15205 NaN Spain 40.0000 -4.0000 3/27/20 65719 5138 9357 51224
# 14217 NaN Italy 43.0000 12.0000 3/23/20 63927 6077 7024 50826
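To see why the melt-then-merge approach copes with mismatched rows and date columns, here is a minimal sketch on toy data. The frames `toy_confirmed`, `toy_deaths`, and `toy_recovered` and all their values are made up for illustration and are not taken from the JHU files. It assumes pandas' default inner join in `pd.merge`, which keeps only key combinations present in every frame, and relies on pandas matching missing `Province/State` values to each other when merging.

```python
from functools import reduce
import pandas as pd

ID_COLS = ['Province/State', 'Country/Region', 'Lat', 'Long']

# Hypothetical toy frames: same layout as the JHU files, invented numbers.
toy_confirmed = pd.DataFrame({
    'Province/State': [None, 'Ontario'],
    'Country/Region': ['Italy', 'Canada'],
    'Lat': [43.0, 51.0],
    'Long': [12.0, -85.0],
    '1/22/20': [10, 1],
    '1/23/20': [20, 2],
    '1/24/20': [30, 3],
})
toy_deaths = pd.DataFrame({
    'Province/State': [None, 'Ontario'],
    'Country/Region': ['Italy', 'Canada'],
    'Lat': [43.0, 51.0],
    'Long': [12.0, -85.0],
    '1/22/20': [1, 0],
    '1/23/20': [2, 0],
    '1/24/20': [3, 0],
})
# The recovered frame lacks the Ontario row and the last date column.
toy_recovered = pd.DataFrame({
    'Province/State': [None],
    'Country/Region': ['Italy'],
    'Lat': [43.0],
    'Long': [12.0],
    '1/22/20': [2],
    '1/23/20': [4],
})

# Reshape each frame from wide (one column per date) to long (one row per date).
toy_list = [
    df.melt(id_vars=ID_COLS, var_name='Date', value_name=val)
    for df, val in zip([toy_confirmed, toy_deaths, toy_recovered],
                       ['confirmed', 'deaths', 'recovered'])
]

# Inner-join the three long frames on location and date.
toy_long = reduce(lambda x, y: pd.merge(x, y, on=ID_COLS + ['Date']), toy_list)

toy_long['active'] = toy_long['confirmed'] - (toy_long['recovered'] + toy_long['deaths'])
print(toy_long)
```

Only the Italy rows for the two dates shared by all three frames should survive the merge; the Ontario rows and the 1/24/20 column drop out because the recovered frame has no match for them. If you would rather keep unmatched rows and inspect them, passing `how='outer'` to `pd.merge` is one option, at the cost of NaN values in the arithmetic.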