Left join with NA

Time: 2019-01-16 13:55:40

Tags: python pandas join

I have the following data frame:

import pandas as pd
import numpy as np

data = pd.DataFrame({
    'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C'],
    'country': ['POL', 'POL', 'POL', 'POL', 'USA', 'USA', 'POL', 'USA', 'USA', 'USA', 'FRA', np.nan],
    'date': ['2018-08-01', '2018-09-01', '2018-10-01', '2018-11-01', '2018-09-01', '2018-10-01', '2018-06-01', '2018-07-01', '2018-08-01', '2018-09-01', '2018-09-01', np.nan],
    'feature_proj': [100, 100, 100, 100, 100, 100, 106, 106, 106, 106, 106, 110],
    'feature_country': [1, 1, 1, 1, 2, 2, 3, 4, 4, 4, 5, np.nan],
    'feature_date': [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, np.nan]
})


which I would like to join with:

forecastFor = pd.DataFrame({
    'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D'],
    'country': ['POL', 'POL', 'POL', 'POL', 'USA', 'USA', 'POL', 'USA', 'USA', 'GER', 'POL', 'USA'],
    'date': ['2018-07-01', '2018-09-01', '2018-10-01', '2018-11-01', '2018-09-01', '2018-10-01', '2018-06-01', '2018-07-01', '2018-08-01', '2018-10-01', '2018-11-01', '2018-11-01'],
    'hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
})


in such a way that I end up with:

expected = pd.DataFrame({
    'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'country': ['POL', 'POL', 'POL', 'POL', 'USA', 'USA', 'POL', 'USA', 'USA', 'GER', 'POL'],
    'date': ['2018-07-01', '2018-09-01', '2018-10-01', '2018-11-01', '2018-09-01', '2018-10-01', '2018-06-01', '2018-07-01', '2018-08-01', '2018-10-01', '2018-11-01'],
    'hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'feature_proj': [100, 100, 100, 100, 100, 100, 106, 106, 106, 110, 110],
    'feature_country': [1, 1, 1, 1, 2, 2, 3, 4, 4, np.nan, np.nan],
    'feature_date': [np.nan, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, np.nan, np.nan]
})


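So I have a data frame (data) with features at different levels: project-level features, project & country-level features, and project & country & date-level features. I also have a second data frame (forecastFor) that holds a value (here named hours) for each project-country-date tuple. I want to join the two so that: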

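  • The result has the same records as forecastFor, with the feature columns added. There should be no additional records, but records may be dropped if the proj column does not match.
  • On the proj column, the join should be of type inner: every record should match, and records that do not match should not be included in the result.
  • The matches on country and date should be of type left (pandas drops records that have NA in a column on which the join is done, but I would still like to have feature_proj matched even if country is NA).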

Any ideas?

3 answers:

Answer 0: (score: 0)

pd.merge(forecastFor, data, how='left')

gets you close...

0   A   POL 2018-07-01  1   NaN NaN NaN
1   A   POL 2018-09-01  2   100.0   1.0 1001.0
2   A   POL 2018-10-01  3   100.0   1.0 1002.0
3   A   POL 2018-11-01  4   100.0   1.0 1003.0
4   A   USA 2018-09-01  5   100.0   2.0 1004.0
5   A   USA 2018-10-01  6   100.0   2.0 1005.0
6   B   POL 2018-06-01  7   106.0   3.0 1006.0
7   B   USA 2018-07-01  8   106.0   4.0 1007.0
8   B   USA 2018-08-01  9   106.0   4.0 1008.0
9   C   GER 2018-10-01  10  NaN NaN NaN
10  C   POL 2018-11-01  11  NaN NaN NaN
11  D   USA 2018-11-01  12  NaN NaN NaN
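
To see where a plain left merge falls short of the requirements in the question, the indicator option of merge can flag which rows found no partner. The following is only a minimal sketch on a cut-down toy version of the data; the names features and forecast are made up for this illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for `data` and `forecastFor` (hypothetical, reduced to two key columns).
features = pd.DataFrame({
    'proj': ['A', 'C'],
    'country': ['POL', np.nan],
    'feature_proj': [100, 110],
})
forecast = pd.DataFrame({
    'proj': ['A', 'C', 'D'],
    'country': ['POL', 'GER', 'USA'],
    'hours': [1, 2, 3],
})

# With no `on=`, merge joins on all shared columns, here proj and country.
# indicator=True adds a `_merge` column saying where each row came from.
merged = pd.merge(forecast, features, how='left', indicator=True)

# Rows whose (proj, country) pair has no match in `features`.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched)
```

In this toy case both the C row and the D row come back as left_only: project D is kept although it should be dropped, and project C loses its feature_proj because the match is attempted on proj and country together.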

Answer 1: (score: 0)

Sorry, this turned out to be simple; I figured out the answer right after writing the question:

projLevelFeaturesData = data[['proj', 'feature_proj']].drop_duplicates()
countryLevelFeaturesData = data[['proj', 'country', 'feature_country']].drop_duplicates().dropna()
dateLevelFeaturesData = data[['proj', 'country', 'date', 'feature_date']].drop_duplicates().dropna()
projJoined = forecastFor.merge(projLevelFeaturesData, on=['proj'], how='inner')
countryJoined = projJoined.merge(countryLevelFeaturesData, on=['proj', 'country'], how='left')
joined = countryJoined.merge(dateLevelFeaturesData, on=['proj', 'country', 'date'], how='left')
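
As a sanity check, the three merges above can be wrapped in a function and compared against the expected frame from the question. This is a sketch rather than a definitive implementation: the helper name stepwise_join is made up here, and np.nan is used in place of np.NaN so that it also runs on recent NumPy versions.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C'],
    'country': ['POL', 'POL', 'POL', 'POL', 'USA', 'USA', 'POL', 'USA', 'USA', 'USA', 'FRA', np.nan],
    'date': ['2018-08-01', '2018-09-01', '2018-10-01', '2018-11-01', '2018-09-01', '2018-10-01',
             '2018-06-01', '2018-07-01', '2018-08-01', '2018-09-01', '2018-09-01', np.nan],
    'feature_proj': [100, 100, 100, 100, 100, 100, 106, 106, 106, 106, 106, 110],
    'feature_country': [1, 1, 1, 1, 2, 2, 3, 4, 4, 4, 5, np.nan],
    'feature_date': [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, np.nan],
})
forecastFor = pd.DataFrame({
    'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D'],
    'country': ['POL', 'POL', 'POL', 'POL', 'USA', 'USA', 'POL', 'USA', 'USA', 'GER', 'POL', 'USA'],
    'date': ['2018-07-01', '2018-09-01', '2018-10-01', '2018-11-01', '2018-09-01', '2018-10-01',
             '2018-06-01', '2018-07-01', '2018-08-01', '2018-10-01', '2018-11-01', '2018-11-01'],
    'hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
})
expected = pd.DataFrame({
    'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'country': ['POL', 'POL', 'POL', 'POL', 'USA', 'USA', 'POL', 'USA', 'USA', 'GER', 'POL'],
    'date': ['2018-07-01', '2018-09-01', '2018-10-01', '2018-11-01', '2018-09-01', '2018-10-01',
             '2018-06-01', '2018-07-01', '2018-08-01', '2018-10-01', '2018-11-01'],
    'hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'feature_proj': [100, 100, 100, 100, 100, 100, 106, 106, 106, 110, 110],
    'feature_country': [1, 1, 1, 1, 2, 2, 3, 4, 4, np.nan, np.nan],
    'feature_date': [np.nan, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, np.nan, np.nan],
})


def stepwise_join(forecast, features):
    """Hypothetical helper: join each feature level on its own key set."""
    # One lookup table per level; dropna() removes rows whose keys or feature are NA.
    proj_level = features[['proj', 'feature_proj']].drop_duplicates()
    country_level = features[['proj', 'country', 'feature_country']].drop_duplicates().dropna()
    date_level = features[['proj', 'country', 'date', 'feature_date']].drop_duplicates().dropna()

    # Inner on proj drops unknown projects; the finer levels are left joins.
    out = forecast.merge(proj_level, on='proj', how='inner')
    out = out.merge(country_level, on=['proj', 'country'], how='left')
    out = out.merge(date_level, on=['proj', 'country', 'date'], how='left')
    return out.reset_index(drop=True)


joined = stepwise_join(forecastFor, data)
# Raises AssertionError if the result differs from the expected frame.
pd.testing.assert_frame_equal(joined, expected)
```

Splitting the features by level is what lets project C keep its feature_proj: the project-level lookup never looks at country or date, so the NA values in those columns cannot block the match.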

Answer 2: (score: 0)

You just need to filter forecastFor down to the rows whose proj is present in data before doing the left merge, like this:

forecastFor=forecastFor[forecastFor['proj'].isin(data.proj.unique())]
df=forecastFor.merge(data, on=['proj','country','date'], how='left')

Output

   proj country        date  hours  feature_proj  feature_country  \
0     A     POL  2018-07-01      1           NaN              NaN   
1     A     POL  2018-09-01      2         100.0              1.0   
2     A     POL  2018-10-01      3         100.0              1.0   
3     A     POL  2018-11-01      4         100.0              1.0   
4     A     USA  2018-09-01      5         100.0              2.0   
5     A     USA  2018-10-01      6         100.0              2.0   
6     B     POL  2018-06-01      7         106.0              3.0   
7     B     USA  2018-07-01      8         106.0              4.0   
8     B     USA  2018-08-01      9         106.0              4.0   
9     C     GER  2018-10-01     10           NaN              NaN   
10    C     POL  2018-11-01     11           NaN              NaN   

    feature_date  
0            NaN  
1         1001.0  
2         1002.0  
3         1003.0  
4         1004.0  
5         1005.0  
6         1006.0  
7         1007.0  
8         1008.0  
9            NaN  
10           NaN
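
In the output above, the rows for project C still have NaN in feature_proj, because the merge keys include country and date. One possible way to fill the project-level feature after the filter is a lookup with Series.map. The sketch below uses a cut-down toy version of the data, and the names features, forecast and proj_lookup are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for `data` and `forecastFor` (hypothetical, reduced columns).
features = pd.DataFrame({
    'proj': ['A', 'A', 'C'],
    'country': ['POL', 'USA', np.nan],
    'feature_proj': [100, 100, 110],
})
forecast = pd.DataFrame({
    'proj': ['A', 'C', 'D'],
    'country': ['POL', 'GER', 'USA'],
    'hours': [1, 2, 3],
})

# Same filter-then-left-merge idea as in the answer.
kept = forecast[forecast['proj'].isin(features['proj'].unique())]
merged = kept.merge(features, on=['proj', 'country'], how='left')

# Project-level lookup: one feature_proj value per proj.
proj_lookup = features.drop_duplicates('proj').set_index('proj')['feature_proj']

# Overwrite feature_proj using proj alone, ignoring country.
merged['feature_proj'] = merged['proj'].map(proj_lookup)
print(merged)
```

This only repairs the project-level column; country-level features would need the same treatment with a lookup keyed on proj and country.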