Python仅从另一个数据框更新索引值

时间:2020-07-07 02:04:56

标签: python pandas dataframe

我有2个数据帧df1和df2。 df2数据帧是我提取的df1的子集以进行一些清理。两个数据框都可以在索引上匹配。我在网站上看到很多合并。我不想向df1中添加更多列,并且数据帧的大小不同df1有1000行,df2有275行,所以我不想替换整个列。我想用df2 ['AgeBin']值更新df1 ['AgeBin'],这些数据帧的索引匹配。

indexes = df.loc[df.AgeBin.isin(dfage_test.AgeBin.values)].index
df1.at[indexes,'AgeBin'] = df2['AgeBin'].values

这是我想出的,但由于df的大小不同,似乎存在问题

ValueError: Must have equal len keys and value when setting with an iterable

下面是一个过分的简化。 df1有26列,df2有12列,Agebin是两个dfs中的最后一列。从理论上讲,这就是我的目标

df2
    AgeBin
0     2 
1     3 
2     1 
3     3 


df1
    AgeBin
0     NaN 
1     NaN 
2     NaN 
3     NaN 

df1 after update
    AgeBin
0     2 
1     3 
2     1 
3     3 

以下是数据框规格

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   PassengerId       1046 non-null   float64 
 1   Survived          714 non-null    category
 2   Pclass            1046 non-null   category
 3   Name              1046 non-null   object  
 4   Sex               1046 non-null   object  
 5   Age               1046 non-null   float64 
 6   SibSp             1046 non-null   float64 
 7   Parch             1046 non-null   float64 
 8   Ticket            1046 non-null   object  
 9   Fare              1046 non-null   float64 
 10  Embarked          1046 non-null   category
 11  FamilySize        1046 non-null   float64 
 12  Surname           1046 non-null   object  
 13  Title             1046 non-null   object  
 14  IsChild           1046 non-null   float64 
 15  isMale            1046 non-null   category
 16  GroupID           1046 non-null   float64 
 17  GroupSize         1046 non-null   float64 
 18  GroupType         1046 non-null   object  
 19  GroupNumSurvived  1046 non-null   float64 
 20  GroupNumPerished  1046 non-null   float64 
 21  LargeGroup        1046 non-null   float64 
 22  SplitFare         1046 non-null   float64 
 23  log10Fare         1046 non-null   float64 
 24  log10SplitFare    1046 non-null   float64 
 25  AgeBin            1046 non-null   category
dtypes: category(5), float64(15), object(6)
memory usage: 221.9+ KB
  

dfageResults.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 263 entries, 5 to 1308
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   AgeBin  263 non-null    category
dtypes: category(1)
memory usage: 12.4 KB

这是类别

67] dfageResults.groupby(["AgeBin"])["AgeBin"].count()
AgeBin
0-14      25
15-29    192
30-44     46
Name: AgeBin, dtype: int64

[68] df.groupby(["AgeBin"])["AgeBin"].count()
AgeBin
0-14     107
15-29    462
30-44    301
45-59    136
60+       40
Name: AgeBin, dtype: int64

2 个答案:

答案 0 :(得分:1)

假设df2中的所有索引都存在于df1中(据我所知)-以下内容就足够了:

df1.loc[df2.index,:]=df2

如果上述index的假设不成立-这是替代方案(相同的结果-仅更新df1中的现有索引):

df1.loc[set(df2.index).intersection(set(df1.index)),:]=df2

样品输出(具有更多代表性样品数据):

import pandas as pd
import numpy as np

df1=pd.DataFrame({"AgeBin": [1,2,3,'x', np.nan,np.nan,'a']})

df2=pd.DataFrame({"AgeBin": ['new1', 'new2', 123]}, index=[5,2,3])

print(df1)
print(df2)
df1.loc[df2.index,:]=df2
print(df1)

输出:

  AgeBin
0      1
1      2
2      3
3      x
4    NaN
5    NaN
6      a

  AgeBin
5   new1
2   new2
3    123

  AgeBin
0      1
1      2
2   new2
3    123
4    NaN
5   new1
6      a

答案 1 :(得分:0)

尝试:

print('df2')
print(df2)

print('\ndf1')
print(df1)

df1.update(df2)

print('\ndf1 after update')
print(df1)

输出:

df2
  AgeBin
0  2    
1  3    
2  1    
3  3    

df1
   AgeBin
0 NaN    
1 NaN    
2 NaN    
3 NaN    

df1 after update
  AgeBin
0  2    
1  3    
2  1    
3  3