Merged DataFrame seems to be missing two rows

Time: 2018-02-22 06:22:29

Tags: python-3.x pandas dataframe indexing merge

I ran the following code:

import pandas as pd

df1 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])
df3 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

print(pd.merge(df1,df3, on='HPI'))

The output I get is:

    HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   85         3                55            52             8
2   85         3                55            53             6
3   85         2                55            52             8
4   85         2                55            53             6
5   88         2                65            50             9

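My questions are:

1) Why do I get such a large DataFrame? HPI has only 4 values, but the output has 6 rows.

2) If merge takes all the values from HPI, why are the values 80 and 88 not taken twice?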

2 answers:

Answer 0: (score: 1)

You get 85 four times because it is duplicated in the joined column HPI in both df1 and df3. 80 and 88 are unique, so the inner join returns one row for each.

Obviously, inner join means that if the join column matches in both tables, every row is matched the maximum possible number of times.

So you need to remove the duplicates before merging to get one row per HPI value:

df1 = df1.drop_duplicates('HPI')
df3 = df3.drop_duplicates('HPI')

Samples with duplicated values in the HPI column, and their outputs:

#2dupes 85
df1 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])
#2dupes 85
df3 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

#4dupes 85 - 2x2, value 85 in both columns
print(pd.merge(df1,df3, on='HPI'))
    HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   85         3                55            52             8
2   85         3                55            53             6
3   85         2                55            52             8
4   85         2                55            53             6
5   88         2                65            50             9

#2 dupes 80, 2dupes 85
df1 = pd.DataFrame({'HPI':[80,85,80,85],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])
#2dupes 85 , unique 80       
df3 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

#2dupes 80 - 2x1, 4dupes 85 - 2x2, values 80,85 in both columns
print(pd.merge(df1,df3, on='HPI'))
   HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   80         2                65            50             7
2   85         3                55            52             8
3   85         3                55            53             6
4   85         2                55            52             8
5   85         2                55            53             6
#2dupes 80
df1 = pd.DataFrame({'HPI':[80,80,82,83],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])
#2 dupes 85
df3 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

#2dupes 80 - 2x1, value 80 in both columns
print(pd.merge(df1,df3, on='HPI'))
   HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   80         3                55            50             7
#4dupes 80
df1 = pd.DataFrame({'HPI':[80,80,80,80],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])
#3 dupes 80
df3 = pd.DataFrame({'HPI':[80,80,80,85],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

#12dupes 80, 4x3, value 80 in both columns
print(pd.merge(df1,df3, on='HPI'))
    HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0    80         2                50            50             7
1    80         2                50            52             8
2    80         2                50            50             9
3    80         3                55            50             7
4    80         3                55            52             8
5    80         3                55            50             9
6    80         2                65            50             7
7    80         2                65            52             8
8    80         2                65            50             9
9    80         2                55            50             7
10   80         2                55            52             8
11   80         2                55            50             9
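
The row counts in the samples above follow one rule: for each key present on both sides, an inner join produces (occurrences on the left) x (occurrences on the right) rows. Here is a minimal sketch that checks this against the question's original data; the variable names `left_counts`, `right_counts`, `rows_per_key` and `duplicates_detected` are made up for illustration. It also shows the `validate` argument of `pd.merge`, which can flag duplicated keys instead of silently multiplying rows.

```python
import pandas as pd

df1 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])

# How often each key occurs on each side of the join
left_counts = df1['HPI'].value_counts()
right_counts = df3['HPI'].value_counts()

# Keys missing from one side become NaN and are dropped,
# which mirrors an inner join keeping only shared keys
rows_per_key = (left_counts * right_counts).dropna().astype(int).sort_index()
print(rows_per_key)

merged = pd.merge(df1, df3, on='HPI')
print(len(merged), rows_per_key.sum())

# validate='one_to_one' asks pandas to raise if a key is duplicated
duplicates_detected = False
try:
    pd.merge(df1, df3, on='HPI', validate='one_to_one')
except pd.errors.MergeError as exc:
    duplicates_detected = True
    print('duplicate keys detected:', exc)
```

The per-key products add up to the length of the merged frame, which is why 85 contributes four rows while 80 and 88 contribute one each.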

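Answer 1: (score: 1)

As jezrael wrote, you have 6 rows because the value HPI=85 is not unique in df1 and df3. In contrast, HPI=80 and HPI=88 each appear only once in df1 and in df3. If I make an assumption and take your index into account, I guess what you want is something like this: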

       HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
index                                                             
2001    80         2                50            50             7
2002    85         3                55            52             8
2003    88         2                65            50             9
2004    85         2                55            53             6

If that is what you want, then you can do:

pd.merge(df1, df3, left_index=True, right_index=True, on='HPI')

But I am only guessing, so I don't know whether this is the output you want.
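
One caveat on the call above: recent pandas releases, as far as I can tell, reject `on` combined with `left_index`/`right_index` and raise a MergeError. Below is a minimal sketch of one alternative, assuming the goal is to match rows on both the year index and HPI. The column name `year` and the variables `left`, `right` and `by_year` are made-up labels for illustration, not part of the original code.

```python
import pandas as pd

df1 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])

# Move the index into an ordinary column so it can act as a join key
left = df1.rename_axis('year').reset_index()
right = df3.rename_axis('year').reset_index()

# Joining on both keys makes every (year, HPI) pair unique,
# so no row is multiplied; then restore the year index
by_year = left.merge(right, on=['year', 'HPI']).set_index('year')
print(by_year)
```

Because each (year, HPI) pair occurs once per frame, the result keeps four rows, one per year.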