How to fill NaN values in a column with the group minimum from the same column

Date: 2018-02-20 07:30:02

Tags: python pandas dataframe

See df and df2 below. For category 2 in column 'A', I want min(20, 15)... Please help :)

How can I get df2 without a loop?

import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
                   "B": [np.nan, 10, np.nan, 20, 15, np.nan, np.nan, np.nan, np.nan, 40, np.nan]})

In[1]: df
Out[1]: 
    A     B
0   1   NaN
1   1  10.0
2   2   NaN
3   2  20.0
4   2  15.0
5   3   NaN
6   3   NaN
7   3   NaN
8   3   NaN
9   4  40.0
10  4   NaN

2 answers:

Answer 0 (score: 3)

To replace every value in the column with its group's minimum, use GroupBy.transform:

df['B'] = df.groupby('A')['B'].transform('min')
print(df)
    A     B
0   1  10.0
1   1  10.0
2   2  15.0
3   2  15.0
4   2  15.0
5   3   NaN
6   3   NaN
7   3   NaN
8   3   NaN
9   4  40.0
10  4  40.0
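
The reason transform fits here is that it returns a result aligned with the original rows, whereas a plain aggregation returns one row per group. A minimal sketch of that difference, rebuilding the question's df so it runs on its own (the names per_group and broadcast are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
    "B": [np.nan, 10, np.nan, 20, 15, np.nan, np.nan, np.nan, np.nan, 40, np.nan],
})

# Aggregation: one minimum per group, indexed by the group key in "A".
per_group = df.groupby("A")["B"].min()

# transform: the same minima, repeated so the result lines up with df's rows.
broadcast = df.groupby("A")["B"].transform("min")

print(per_group)
print(broadcast)
```

Because broadcast shares df's index, it can be assigned straight back to a column or passed to fillna.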

To replace only the NaN values with the group minimum, add fillna or use a custom lambda function:

df['B'] = df.B.fillna(df.groupby('A')['B'].transform('min'))

Alternative:

df['B'] = df.groupby('A')['B'].transform(lambda x: x.fillna(x.min()))

print(df)
    A     B
0   1  10.0
1   1  10.0
2   2  15.0
3   2  20.0
4   2  15.0
5   3   NaN
6   3   NaN
7   3   NaN
8   3   NaN
9   4  40.0
10  4  40.0
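
As a quick check, the two variants can be run side by side on a fresh copy of the question's data. This is only a sketch; the names via_fillna, via_lambda and still_missing are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
    "B": [np.nan, 10, np.nan, 20, 15, np.nan, np.nan, np.nan, np.nan, 40, np.nan],
})

# Variant 1: compute the row-aligned group minimum, then fill only the gaps.
group_min = df.groupby("A")["B"].transform("min")
via_fillna = df["B"].fillna(group_min)

# Variant 2: fill the gaps inside each group with that group's own minimum.
via_lambda = df.groupby("A")["B"].transform(lambda x: x.fillna(x.min()))

# Both variants should agree; existing values such as 20.0 are left untouched.
pd.testing.assert_series_equal(via_fillna, via_lambda, check_names=False)

# Group 3 has no observed value at all, so its rows stay NaN in both variants.
still_missing = int(via_fillna.isna().sum())
print(via_fillna)
```

The fillna variant calls a built-in aggregation once rather than a Python lambda per group, so it is likely the faster of the two on frames with many groups.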

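Answer 1 (score: 2)

As an experiment, I wanted to see whether I could do this with NumPy. It is not perfect: it does not handle negative values or, for that matter, zeros. I could change it to do so, but this is a prototype.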

b = df.B.values
a = df.A.values

a_, u_ = pd.factorize(a)
_a = a_.max() - a_

maxb = np.nanmax(b)

basis_inc = a_ * maxb
basis_dec = _a * maxb
bnan = np.isnan(b)
bfill_zero = np.where(bnan, maxb + 1, b)

ffill_min = np.minimum.accumulate(bfill_zero + basis_dec) - basis_dec
bfill_min = np.minimum.accumulate((bfill_zero + basis_inc)[::-1])[::-1] - basis_inc

gmin = np.minimum(ffill_min, bfill_min)
df.assign(B=np.where(bnan & (gmin != maxb + 1), gmin, b))

    A     B
0   1  10.0
1   1  10.0
2   2  15.0
3   2  20.0
4   2  15.0
5   3   NaN
6   3   NaN
7   3   NaN
8   3   NaN
9   4  40.0
10  4  40.0
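
Since the prototype above is described as not handling negative values or zeros, here is a minimal sketch of a different NumPy route that does not rely on offsetting by the maximum. It is not the answerer's method: it combines pd.factorize with np.fmin.at, and the function name fill_nan_with_group_min is hypothetical.

```python
import numpy as np
import pandas as pd


def fill_nan_with_group_min(a, b):
    """Return a copy of b where NaNs are replaced by the minimum of their group in a."""
    b = np.asarray(b, dtype=float)
    # codes[i] is the integer position of a[i] within the unique group keys.
    codes, uniques = pd.factorize(np.asarray(a))
    # Start every group at NaN; np.fmin returns the non-NaN operand when
    # only one side is NaN, so NaNs in b never overwrite a real minimum.
    group_min = np.full(len(uniques), np.nan)
    np.fmin.at(group_min, codes, b)
    # Fill only the missing entries; groups with no observed value stay NaN.
    return np.where(np.isnan(b), group_min[codes], b)


df = pd.DataFrame({
    "A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
    "B": [np.nan, 10, np.nan, 20, 15, np.nan, np.nan, np.nan, np.nan, 40, np.nan],
})
result = df.assign(B=fill_nan_with_group_min(df["A"], df["B"]))
print(result)
```

Because no sentinel value or offset is involved, zeros and negative numbers in B are treated like any other value.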