I ran the following code:
import pandas as pd

df1 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])
print(pd.merge(df1, df3, on='HPI'))
The output I got is:
   HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   85         3                55            52             8
2   85         3                55            53             6
3   85         2                55            52             8
4   85         2                55            53             6
5   88         2                65            50             9
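My questions are:
1) Why do I get such a large DataFrame? HPI has only 4 values, yet the output has 6 rows.
2) If merge takes all the values from HPI, why are the values 80 and 88 not taken twice?
Answer 0 (score: 1)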
You get 85 four times because it is duplicated in the joined column HPI in both df1 and df3. 88 and 80 are unique, so the inner join returns one row for each.
Evidently, an inner join means that if the join column matches in both tables, every row is matched the maximum possible number of times.
So the duplicates need to be removed before merging to get the correct output:
df1 = df1.drop_duplicates('HPI')
df3 = df3.drop_duplicates('HPI')
Samples with duplicate values in the HPI column, and their outputs:
# 2 dupes of 85
df1 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
# 2 dupes of 85
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])
# 4 rows for 85 (2 x 2): value 85 is in both columns
print(pd.merge(df1, df3, on='HPI'))
   HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   85         3                55            52             8
2   85         3                55            53             6
3   85         2                55            52             8
4   85         2                55            53             6
5   88         2                65            50             9
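The row counts in these samples can be predicted before merging. The following is only a minimal sketch, assuming a single join column; the variable names (left_counts, right_counts, rows_per_key) are illustrative and not part of the original answer. For each key present on both sides, an inner join emits (occurrences on the left) x (occurrences on the right) rows.

```python
import pandas as pd

# Only the join column matters for counting rows.
df1 = pd.DataFrame({'HPI': [80, 85, 88, 85]})
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85]})

# How often each key occurs on either side.
left_counts = df1['HPI'].value_counts()
right_counts = df3['HPI'].value_counts()

# Multiplying aligns on the key; keys missing from one side become NaN
# and are dropped, because an inner join discards them.
rows_per_key = (left_counts * right_counts).dropna().astype(int).sort_index()
print(rows_per_key)

# The per-key products add up to the length of the merged frame.
print(rows_per_key.sum(), len(pd.merge(df1, df3, on='HPI')))
```

For the frames in the question this gives 1 row for 80, 4 rows for 85 and 1 row for 88, which matches the 6 rows in the output.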
# 2 dupes of 80, 2 dupes of 85
df1 = pd.DataFrame({'HPI': [80, 85, 80, 85],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
# 2 dupes of 85, 80 is unique
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])
# 2 rows for 80 (2 x 1), 4 rows for 85 (2 x 2): values 80 and 85 are in both columns
print(pd.merge(df1, df3, on='HPI'))
   HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   80         2                65            50             7
2   85         3                55            52             8
3   85         3                55            53             6
4   85         2                55            52             8
5   85         2                55            53             6
# 2 dupes of 80
df1 = pd.DataFrame({'HPI': [80, 80, 82, 83],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
# 2 dupes of 85
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])
# 2 rows for 80 (2 x 1): value 80 is in both columns
print(pd.merge(df1, df3, on='HPI'))
   HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   80         3                55            50             7
# 4 dupes of 80
df1 = pd.DataFrame({'HPI': [80, 80, 80, 80],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
# 3 dupes of 80
df3 = pd.DataFrame({'HPI': [80, 80, 80, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])
# 12 rows for 80 (4 x 3): value 80 is in both columns
print(pd.merge(df1, df3, on='HPI'))
    HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0    80         2                50            50             7
1    80         2                50            52             8
2    80         2                50            50             9
3    80         3                55            50             7
4    80         3                55            52             8
5    80         3                55            50             9
6    80         2                65            50             7
7    80         2                65            52             8
8    80         2                65            50             9
9    80         2                55            50             7
10   80         2                55            52             8
11   80         2                55            50             9
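If the row multiplication seen in these samples is unwanted, it can be caught at merge time rather than spotted in the output. This is a hedged sketch, not part of the original answer: it relies on the validate argument of pd.merge, and the names duplicates_detected and deduped are illustrative only.

```python
import pandas as pd
from pandas.errors import MergeError

df1 = pd.DataFrame({'HPI': [80, 85, 88, 85], 'Int_rate': [2, 3, 2, 2]})
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85], 'Unemployment': [7, 8, 9, 6]})

# validate='one_to_one' asks merge to check that the key is unique on both sides.
try:
    pd.merge(df1, df3, on='HPI', validate='one_to_one')
    duplicates_detected = False
except MergeError as exc:
    duplicates_detected = True
    print('merge rejected:', exc)

# After dropping duplicate keys (the first occurrence is kept), the check passes.
deduped = pd.merge(df1.drop_duplicates('HPI'),
                   df3.drop_duplicates('HPI'),
                   on='HPI', validate='one_to_one')
print(deduped)
```

Note that drop_duplicates silently discards the later rows for HPI=85, so this only gives the "correct" output if those rows really are redundant.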
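Answer 1 (score: 1)
As jezrael wrote, you have 6 rows because the values for HPI=85 in df1 and df3 are not unique. By contrast, df1 and df3 each have only one value for HPI=80 and for HPI=88.
If I make an assumption and also take your index into account, I guess what you want is something like this:
       HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
index
2001    80         2                50            50             7
2002    85         3                55            52             8
2003    88         2                65            50             9
2004    85         2                55            53             6
If that is what you want, you can do this:
pd.merge(df1, df3, left_index=True, right_index=True, on='HPI')
But I am only assuming, so I don't know whether this is the output you want.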
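On recent pandas versions the call above may be rejected: as far as I know, merge raises a MergeError when on is combined with left_index/right_index. Below is a minimal sketch of one alternative, assuming the goal is to match rows on both the index and HPI. The axis label 'index' and the names left, right and merged are illustrative, not taken from the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])

# Turn the index into an ordinary column so it can act as a second join key.
left = df1.rename_axis('index').reset_index()
right = df3.rename_axis('index').reset_index()

# Joining on (index, HPI) makes every key pair unique, so no rows multiply.
merged = pd.merge(left, right, on=['index', 'HPI']).set_index('index')
print(merged)
```

Because each (index, HPI) pair occurs once on either side, the result has one row per year, 2001 through 2004.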