Fill missing values in correlated columns of a pandas DataFrame simultaneously

Time: 2018-08-09 14:20:57

Tags: python python-3.x pandas dataframe

I have a dataframe with two columns, State and Code, and each column has missing values.

import pandas as pd

df = pd.DataFrame([['Alabama', 'AL'], ['Alaska', 'AK'], ['Arizona', 'AZ'], ['Arkansas', 'AR'], ['Iowa','IA'],['Hawaii','HI'], ['Idaho', 'ID'], ['Alabama', ''], ['', 'IA'], ['Alaska',''], ['', 'AZ']], columns=['State', 'Code'])

Missing values

    State    Code
7   Alabama
8            IA
9   Alaska
10           AZ

What I have tried

state_code_dict = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'Iowa':'IA',
    'Hawaii':'HI',
    'Idaho': 'ID',
}

def state_code(x):
    if (x['Code'] == ''):
        return state_code_dict[x['State']]
    else:
        return x['Code']

df['Code'] = df.apply(lambda x: state_code(x), axis=1)

This sets the missing values in Code. I would also need to update this function to set State. I am trying to simplify this.

Required output

    State    Code
7   Alabama  AL
8   Iowa     IA
9   Alaska   AK
10  Arizona  AZ

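4 answers:

Answer 0 (score: 4)

IIUC, you can use map to fill in Code first and then State, with a boolean mask so that values are assigned only where the cell is empty.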

mask = df.Code == ''
df.loc[mask, 'Code'] = df[mask].State.map(state_code_dict)

mask = df.State == ''
df.loc[mask, 'State'] = df[mask].Code.map({v:k for k,v in state_code_dict.items()})

       State Code
0    Alabama   AL
1     Alaska   AK
2    Arizona   AZ
3   Arkansas   AR
4       Iowa   IA
5     Hawaii   HI
6      Idaho   ID
7    Alabama   AL
8       Iowa   IA
9     Alaska   AK
10   Arizona   AZ
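The two mask-and-map steps in this answer can be folded into one helper so the same logic serves both directions. This is only a sketch: `fill_pair` and its parameters are hypothetical names, and it assumes missing cells are empty strings, as in the question's data.

```python
import pandas as pd

def fill_pair(df, key_col, value_col, mapping, missing=''):
    """Return a copy where `missing` cells of value_col are looked up from key_col."""
    out = df.copy()
    mask = out[value_col] == missing
    # Only the masked rows are mapped and assigned; other rows are untouched
    out.loc[mask, value_col] = out.loc[mask, key_col].map(mapping)
    return out

state_code_dict = {
    'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR',
    'Iowa': 'IA', 'Hawaii': 'HI', 'Idaho': 'ID',
}
code_state_dict = {v: k for k, v in state_code_dict.items()}

df = pd.DataFrame(
    [['Alabama', 'AL'], ['Alaska', 'AK'], ['Arizona', 'AZ'], ['Arkansas', 'AR'],
     ['Iowa', 'IA'], ['Hawaii', 'HI'], ['Idaho', 'ID'],
     ['Alabama', ''], ['', 'IA'], ['Alaska', ''], ['', 'AZ']],
    columns=['State', 'Code'])

filled = fill_pair(df, 'State', 'Code', state_code_dict)
filled = fill_pair(filled, 'Code', 'State', code_state_dict)
print(filled)
```

A key that is absent from the mapping produces NaN rather than an exception, so unmatched rows are easy to find afterwards with `filled.isna()`.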

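Answer 1 (score: 4)

You can replace the empty strings with np.nan and then use fillna together with pd.Series.map. This is similar to @RafaelC's answer, but implemented differently.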

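import numpy as np

code_state_dict = {v: k for k, v in state_code_dict.items()}

# Treat empty strings as missing so that fillna can see them
df = df.replace('', np.nan)
# Assign the result back rather than calling fillna(inplace=True) on a column selection
df['Code'] = df['Code'].fillna(df['State'].map(state_code_dict))
df['State'] = df['State'].fillna(df['Code'].map(code_state_dict))

print(df)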

       State Code
0    Alabama   AL
1     Alaska   AK
2    Arizona   AZ
3   Arkansas   AR
4       Iowa   IA
5     Hawaii   HI
6      Idaho   ID
7    Alabama   AL
8       Iowa   IA
9     Alaska   AK
10   Arizona   AZ
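If every State/Code pair already appears at least once in a complete row, the lookup dictionaries can be built from the data instead of being typed by hand. A minimal sketch under that assumption; the names `pairs`, `state_to_code` and `code_to_state` are illustrative and do not come from the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['Alabama', 'AL'], ['Alaska', 'AK'], ['Arizona', 'AZ'], ['Arkansas', 'AR'],
     ['Iowa', 'IA'], ['Hawaii', 'HI'], ['Idaho', 'ID'],
     ['Alabama', ''], ['', 'IA'], ['Alaska', ''], ['', 'AZ']],
    columns=['State', 'Code'])

df = df.replace('', np.nan)

# Rows where both columns are present define the known pairs
pairs = df.dropna().drop_duplicates()
state_to_code = dict(zip(pairs['State'], pairs['Code']))
code_to_state = dict(zip(pairs['Code'], pairs['State']))

df['Code'] = df['Code'].fillna(df['State'].map(state_to_code))
df['State'] = df['State'].fillna(df['Code'].map(code_to_state))
print(df)
```

A pair that never occurs in a complete row stays NaN, so a hand-written dictionary is still needed for values the data cannot supply.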

Answer 2 (score: 1)

To fill in Code:

df['Code'] = df.apply(lambda x: x['Code'] if x['Code']!='' else state_code_dict[x['State']],axis=1)

To fill in State:

state_code_dict2 = {v: k for k, v in state_code_dict.items()}
df['State'] = df.apply(lambda x: x['State'] if x['State']!='' else state_code_dict2[x['Code']],axis=1)

Answer 3 (score: 0)

This is similar to the question "Filling a series based on key value pairs".

Using your data:

(df.replace('', np.nan)
  .sort_values(by=['State', 'Code'], ascending=False)
  .groupby('State').ffill().bfill()
  .groupby('Code').ffill().bfill())

Output:

    Code    State
4   IA  Iowa
6   ID  Idaho
5   HI  Hawaii
3   AR  Arkansas
2   AZ  Arizona
1   AK  Alaska
9   AK  Alaska
0   AL  Alabama
7   AL  Alabama
8   IA  Iowa
10  AZ  Arizona
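A variant of the same idea that needs no sorting and no dictionary is sketched below. It swaps the ffill/bfill chain for `groupby(...).transform('first')`, which broadcasts the first non-missing value in each group. It assumes each State has exactly one Code in the data, and a recent pandas in which `transform` returns NaN for rows whose group key is missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['Alabama', 'AL'], ['Alaska', 'AK'], ['Arizona', 'AZ'], ['Arkansas', 'AR'],
     ['Iowa', 'IA'], ['Hawaii', 'HI'], ['Idaho', 'ID'],
     ['Alabama', ''], ['', 'IA'], ['Alaska', ''], ['', 'AZ']],
    columns=['State', 'Code'])

df = df.replace('', np.nan)

# First non-missing Code seen for each State, broadcast back to every row
code_by_state = df.groupby('State')['Code'].transform('first')
df['Code'] = df['Code'].fillna(code_by_state)

# Code is complete now, so the reverse lookup can fill State the same way
state_by_code = df.groupby('Code')['State'].transform('first')
df['State'] = df['State'].fillna(state_by_code)
print(df)
```

Unlike the sorted version, this keeps the original row order and column order.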