I have the following data frame:
import pandas as pd
import numpy as np
data = pd.DataFrame({
'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C'],
'country': ['POL', 'POL', 'POL', 'POL', 'USA', 'USA', 'POL', 'USA', 'USA', 'USA', 'FRA', np.nan],
'date': ['2018-08-01', '2018-09-01', '2018-10-01', '2018-11-01', '2018-09-01', '2018-10-01', '2018-06-01', '2018-07-01', '2018-08-01', '2018-09-01', '2018-09-01', np.nan],
'feature_proj': [100, 100, 100, 100, 100, 100, 106, 106, 106, 106, 106, 110],
'feature_country': [1, 1, 1, 1, 2, 2, 3, 4, 4, 4, 5, np.nan],
'feature_date': [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, np.nan]
})
which I want to join with:
forecastFor = pd.DataFrame({
'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D'],
'country': ['POL', 'POL', 'POL', 'POL', 'USA', 'USA', 'POL', 'USA', 'USA', 'GER', 'POL', 'USA'],
'date': ['2018-07-01', '2018-09-01', '2018-10-01', '2018-11-01', '2018-09-01', '2018-10-01', '2018-06-01', '2018-07-01', '2018-08-01', '2018-10-01', '2018-11-01', '2018-11-01'],
'hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
})
in such a way that I end up with:
expected = pd.DataFrame({
'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
'country': ['POL', 'POL', 'POL', 'POL', 'USA', 'USA', 'POL', 'USA', 'USA', 'GER', 'POL'],
'date': ['2018-07-01', '2018-09-01', '2018-10-01', '2018-11-01', '2018-09-01', '2018-10-01', '2018-06-01', '2018-07-01', '2018-08-01', '2018-10-01', '2018-11-01'],
'hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
'feature_proj': [100, 100, 100, 100, 100, 100, 106, 106, 106, 110, 110],
'feature_country': [1, 1, 1, 1, 2, 2, 3, 4, 4, np.nan, np.nan],
'feature_date': [np.nan, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, np.nan, np.nan]
})
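So, I have a data frame (`data`) with features at different levels: some at project level, some at project & country level, and some at project & country & date level. I also have a second data frame (`forecastFor`) that holds a value (named `hours` here) for each project-country-date tuple. I want to join the two so that:
- the records are the same as in `forecastFor`, with the feature columns added. There should be no additional records, but records may be dropped if the `proj` column does not match;
- on the `proj` column the join is of type `inner`: every project must match, and records without a match are excluded from the result;
- on `country` and `date` the match is of type `left` (pandas drops records that have NA in the columns the join is done on, but I still want the `proj` match to be kept even if `country` is NA).
Any ideas?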
Answer 0 (score: 0)
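pd.merge(forecastFor, data, how='left')
gets you close, although rows without an exact (proj, country, date) match lose every feature column and project D is kept: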
   proj country        date  hours  feature_proj  feature_country  feature_date
0     A     POL  2018-07-01      1           NaN              NaN           NaN
1     A     POL  2018-09-01      2         100.0              1.0        1001.0
2     A     POL  2018-10-01      3         100.0              1.0        1002.0
3     A     POL  2018-11-01      4         100.0              1.0        1003.0
4     A     USA  2018-09-01      5         100.0              2.0        1004.0
5     A     USA  2018-10-01      6         100.0              2.0        1005.0
6     B     POL  2018-06-01      7         106.0              3.0        1006.0
7     B     USA  2018-07-01      8         106.0              4.0        1007.0
8     B     USA  2018-08-01      9         106.0              4.0        1008.0
9     C     GER  2018-10-01     10           NaN              NaN           NaN
10    C     POL  2018-11-01     11           NaN              NaN           NaN
11    D     USA  2018-11-01     12           NaN              NaN           NaN
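One way to see which rows keep this result from matching the expected frame is the `indicator` parameter of `merge`, which adds a `_merge` column marking each row as `both` or `left_only`. The sketch below rebuilds the question's frames with only the key columns and one feature column for brevity; the names `merged` and `unmatched` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Key columns from the question's `data`, plus one feature column for brevity.
data = pd.DataFrame({
    'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C'],
    'country': ['POL', 'POL', 'POL', 'POL', 'USA', 'USA',
                'POL', 'USA', 'USA', 'USA', 'FRA', np.nan],
    'date': ['2018-08-01', '2018-09-01', '2018-10-01', '2018-11-01',
             '2018-09-01', '2018-10-01', '2018-06-01', '2018-07-01',
             '2018-08-01', '2018-09-01', '2018-09-01', np.nan],
    'feature_date': [1000, 1001, 1002, 1003, 1004, 1005,
                     1006, 1007, 1008, 1009, 1010, np.nan],
})
forecastFor = pd.DataFrame({
    'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D'],
    'country': ['POL', 'POL', 'POL', 'POL', 'USA', 'USA',
                'POL', 'USA', 'USA', 'GER', 'POL', 'USA'],
    'date': ['2018-07-01', '2018-09-01', '2018-10-01', '2018-11-01',
             '2018-09-01', '2018-10-01', '2018-06-01', '2018-07-01',
             '2018-08-01', '2018-10-01', '2018-11-01', '2018-11-01'],
    'hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
})

# With no `on` argument, merge joins on the shared columns: proj, country, date.
# indicator=True adds a `_merge` column ('both' or 'left_only' for a left join).
merged = pd.merge(forecastFor, data, how='left', indicator=True)

# Rows of forecastFor that found no exact (proj, country, date) match.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['proj', 'country', 'date', 'hours']])
```

With the question's data, the unmatched rows are those with `hours` 1, 10, 11 and 12: the first three still need their coarser-level features, and the last one (project D) should be dropped.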
Answer 1 (score: 0)
Sorry, this turned out to be simple; I figured out the answer right after writing the question:
# One lookup table per level; dropna() discards rows with missing keys or values.
projLevelFeaturesData = data[['proj', 'feature_proj']].drop_duplicates()
countryLevelFeaturesData = data[['proj', 'country', 'feature_country']].drop_duplicates().dropna()
dateLevelFeaturesData = data[['proj', 'country', 'date', 'feature_date']].drop_duplicates().dropna()
# The inner join on proj drops unknown projects; the left joins keep every remaining row.
projJoined = forecastFor.merge(projLevelFeaturesData, on=['proj'], how='inner')
countryJoined = projJoined.merge(countryLevelFeaturesData, on=['proj', 'country'], how='left')
joined = countryJoined.merge(dateLevelFeaturesData, on=['proj', 'country', 'date'], how='left')
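To check this approach against the expected frame from the question, the three merges can be wrapped in a helper and compared with `pd.testing.assert_frame_equal`. This is only a sketch: the function name `join_by_levels` and its parameter names are hypothetical, and `check_dtype=False` is passed because integer versus float dtypes can differ depending on where NaN values end up.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C'],
    'country': ['POL', 'POL', 'POL', 'POL', 'USA', 'USA',
                'POL', 'USA', 'USA', 'USA', 'FRA', np.nan],
    'date': ['2018-08-01', '2018-09-01', '2018-10-01', '2018-11-01',
             '2018-09-01', '2018-10-01', '2018-06-01', '2018-07-01',
             '2018-08-01', '2018-09-01', '2018-09-01', np.nan],
    'feature_proj': [100, 100, 100, 100, 100, 100, 106, 106, 106, 106, 106, 110],
    'feature_country': [1, 1, 1, 1, 2, 2, 3, 4, 4, 4, 5, np.nan],
    'feature_date': [1000, 1001, 1002, 1003, 1004, 1005,
                     1006, 1007, 1008, 1009, 1010, np.nan],
})
forecastFor = pd.DataFrame({
    'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D'],
    'country': ['POL', 'POL', 'POL', 'POL', 'USA', 'USA',
                'POL', 'USA', 'USA', 'GER', 'POL', 'USA'],
    'date': ['2018-07-01', '2018-09-01', '2018-10-01', '2018-11-01',
             '2018-09-01', '2018-10-01', '2018-06-01', '2018-07-01',
             '2018-08-01', '2018-10-01', '2018-11-01', '2018-11-01'],
    'hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
})
expected = pd.DataFrame({
    'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'country': ['POL', 'POL', 'POL', 'POL', 'USA', 'USA',
                'POL', 'USA', 'USA', 'GER', 'POL'],
    'date': ['2018-07-01', '2018-09-01', '2018-10-01', '2018-11-01',
             '2018-09-01', '2018-10-01', '2018-06-01', '2018-07-01',
             '2018-08-01', '2018-10-01', '2018-11-01'],
    'hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'feature_proj': [100, 100, 100, 100, 100, 100, 106, 106, 106, 110, 110],
    'feature_country': [1, 1, 1, 1, 2, 2, 3, 4, 4, np.nan, np.nan],
    'feature_date': [np.nan, 1001, 1002, 1003, 1004, 1005,
                     1006, 1007, 1008, np.nan, np.nan],
})


def join_by_levels(forecast, features):
    """Inner join on proj, then left joins on the finer-grained keys."""
    # One lookup table per feature level, without rows that contain NaN.
    proj_level = features[['proj', 'feature_proj']].drop_duplicates()
    country_level = (features[['proj', 'country', 'feature_country']]
                     .drop_duplicates().dropna())
    date_level = (features[['proj', 'country', 'date', 'feature_date']]
                  .drop_duplicates().dropna())
    return (
        forecast
        .merge(proj_level, on='proj', how='inner')
        .merge(country_level, on=['proj', 'country'], how='left')
        .merge(date_level, on=['proj', 'country', 'date'], how='left')
    )


joined = join_by_levels(forecastFor, data)
# Raises AssertionError if the values differ from the expected frame.
pd.testing.assert_frame_equal(joined, expected, check_dtype=False)
```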
Answer 2 (score: 0)
You just need to filter `forecastFor` down to the rows whose `proj` appears in `data` before doing the left merge, like this:
forecastFor=forecastFor[forecastFor['proj'].isin(data.proj.unique())]
df=forecastFor.merge(data, on=['proj','country','date'], how='left')
Output:
   proj country        date  hours  feature_proj  feature_country  feature_date
0     A     POL  2018-07-01      1           NaN              NaN           NaN
1     A     POL  2018-09-01      2         100.0              1.0        1001.0
2     A     POL  2018-10-01      3         100.0              1.0        1002.0
3     A     POL  2018-11-01      4         100.0              1.0        1003.0
4     A     USA  2018-09-01      5         100.0              2.0        1004.0
5     A     USA  2018-10-01      6         100.0              2.0        1005.0
6     B     POL  2018-06-01      7         106.0              3.0        1006.0
7     B     USA  2018-07-01      8         106.0              4.0        1007.0
8     B     USA  2018-08-01      9         106.0              4.0        1008.0
9     C     GER  2018-10-01     10           NaN              NaN           NaN
10    C     POL  2018-11-01     11           NaN              NaN           NaN
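A quick cell-by-cell comparison with the expected frame from the question shows where this output differs. The sketch below is illustrative only: the names `filtered` and `mismatch` and the NaN-aware comparison are assumptions of this example, not part of the answer.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C'],
    'country': ['POL', 'POL', 'POL', 'POL', 'USA', 'USA',
                'POL', 'USA', 'USA', 'USA', 'FRA', np.nan],
    'date': ['2018-08-01', '2018-09-01', '2018-10-01', '2018-11-01',
             '2018-09-01', '2018-10-01', '2018-06-01', '2018-07-01',
             '2018-08-01', '2018-09-01', '2018-09-01', np.nan],
    'feature_proj': [100, 100, 100, 100, 100, 100, 106, 106, 106, 106, 106, 110],
    'feature_country': [1, 1, 1, 1, 2, 2, 3, 4, 4, 4, 5, np.nan],
    'feature_date': [1000, 1001, 1002, 1003, 1004, 1005,
                     1006, 1007, 1008, 1009, 1010, np.nan],
})
forecastFor = pd.DataFrame({
    'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D'],
    'country': ['POL', 'POL', 'POL', 'POL', 'USA', 'USA',
                'POL', 'USA', 'USA', 'GER', 'POL', 'USA'],
    'date': ['2018-07-01', '2018-09-01', '2018-10-01', '2018-11-01',
             '2018-09-01', '2018-10-01', '2018-06-01', '2018-07-01',
             '2018-08-01', '2018-10-01', '2018-11-01', '2018-11-01'],
    'hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
})
expected = pd.DataFrame({
    'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'country': ['POL', 'POL', 'POL', 'POL', 'USA', 'USA',
                'POL', 'USA', 'USA', 'GER', 'POL'],
    'date': ['2018-07-01', '2018-09-01', '2018-10-01', '2018-11-01',
             '2018-09-01', '2018-10-01', '2018-06-01', '2018-07-01',
             '2018-08-01', '2018-10-01', '2018-11-01'],
    'hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'feature_proj': [100, 100, 100, 100, 100, 100, 106, 106, 106, 110, 110],
    'feature_country': [1, 1, 1, 1, 2, 2, 3, 4, 4, np.nan, np.nan],
    'feature_date': [np.nan, 1001, 1002, 1003, 1004, 1005,
                     1006, 1007, 1008, np.nan, np.nan],
})

# Filter on proj first, then do a single left merge on all three keys.
filtered = forecastFor[forecastFor['proj'].isin(data['proj'].unique())]
df = filtered.merge(data, on=['proj', 'country', 'date'], how='left')

# A cell counts as a mismatch unless the values are equal or both are NaN.
mismatch = (df != expected) & ~(df.isna() & expected.isna())

print(mismatch.index[mismatch.any(axis=1)].tolist())    # rows that differ
print(mismatch.columns[mismatch.any(axis=0)].tolist())  # columns that differ
```

With the question's data, the differences are in rows 0, 9 and 10: `feature_proj` is NaN there (and `feature_country` is NaN in row 0), because a single left merge on all three keys cannot supply the coarser-level features when the exact country or date is missing from `data`.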