Using glob / merge: dropping NaN rows and certain columns from specific Excel files

Time: 2016-04-29 18:10:31

Tags: python pandas glob

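I load several Excel files in a for loop. I would like to drop the NaN rows from the final file, and to drop the repeated Company, Emails, etc. columns from every Excel file except the last one loaded.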

Here is my for loop (and the subsequent merge into a single DF) so far:

for f in glob.glob("./gowall-users-export-*.xlsx"):
    df = pd.read_excel(f)
    all_users_sheets_hosts.append(df)
    j = re.search('(\d+)', f)
    df.columns = df.columns.str.replace('.*Hosted Meetings.*', 'Hosted Meetings' + ' ' + j.group(1))

all_users_sheets_hosts = reduce(lambda left,right: pd.merge(left,right,on=['First Name', 'Last Name'], how='outer'), all_users_sheets_hosts)

Here are the first few rows of the resulting DF:

Company_x   First Name  Last Name   Emails_x    Created_x   Hosted Meetings 03112016    Facilitated Meetings_x  Attended Meetings_x Company_y   Emails_y    ... Created_x   Hosted Meetings 04122016    Facilitated Meetings_x  Attended Meetings_x Company_y   Emails_y    Created_y   Hosted Meetings 04212016    Facilitated Meetings_y  Attended Meetings_y
0   TS  X Y X@Y.com 03/10/2016  0.0 0.0 0.0 TS  X@Y.com ... 03/10/2016  0.0 0.0 2.0 NaN NaN NaN NaN NaN NaN
1   TS  X Y X@Y.com 03/10/2016  0.0 0.0 0.0 TS  X@Y.com ... 01/25/2016  0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN
2   TS  X Y X@Y.com 03/10/2016  0.0 0.0 0.0 TS  X@Y.com ... 04/06/2015  9.0 10.0    17.0    NaN NaN NaN NaN NaN NaN

1 answer:

Answer 0: (score: 0)

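To stop the Company, Emails, Created, Facilitated Meetings and Attended Meetings columns from multiplying, drop them from the right DataFrame before each merge. To drop rows that contain only NaN values, use result.dropna(how='all', axis=0).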
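As a small illustration of what how='all' does compared with the default, here is a sketch on made-up data (the names and values are invented for the example and are not from the question's files):

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 is only partly filled, row 2 is entirely NaN.
df = pd.DataFrame({
    "First Name": ["Ann", "Bob", np.nan],
    "Emails": ["ann@example.com", np.nan, np.nan],
    "Hosted Meetings": [1.0, np.nan, np.nan],
})

# how="all" removes a row only when every value in it is NaN (row 2 here).
only_empty_rows_removed = df.dropna(how="all", axis=0)

# The default, how="any", removes a row as soon as one value is NaN (rows 1 and 2).
any_nan_rows_removed = df.dropna(axis=0)

print(only_empty_rows_removed)
print(any_nan_rows_removed)
```

If the NaN rows you want to remove still carry a name or some other value, how='all' will keep them; in that case dropna(subset=[...]) on the columns that matter may be closer to what you need.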

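import functools
import glob
import re

import pandas as pd

# Collect one DataFrame per export file
all_users_sheets_hosts = []

for f in glob.glob("./gowall-users-export-*.xlsx"):
    df = pd.read_excel(f)
    all_users_sheets_hosts.append(df)
    # Use the first run of digits in the file name as the column suffix
    j = re.search(r'(\d+)', f)
    # regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
    df.columns = df.columns.str.replace(r'.*Hosted Meetings.*',
                                        'Hosted Meetings' + ' ' + j.group(1),
                                        regex=True)

# Drop rows of all NaNs from the final DataFrame in `all_users_sheets_hosts`
all_users_sheets_hosts[-1] = all_users_sheets_hosts[-1].dropna(how='all', axis=0)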

def mergefunc(left, right):
    cols = ['Company', 'Emails', 'Created', 'Facilitated Meetings', 'Attended Meetings']
    right = right.drop(cols, axis=1)
    result = pd.merge(left, right, on=['First Name', 'Last Name'], how='outer')
    return result

all_users_sheets_hosts = functools.reduce(mergefunc, all_users_sheets_hosts)

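The Company et al. columns will then exist only in the left DataFrame, so those columns will not multiply. Note, however, that if the left and right DataFrames hold different values in these columns, only the values from the first DataFrame in all_users_sheets_hosts are kept.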
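To make that caveat concrete, here is a minimal sketch with two invented frames standing in for the Excel exports (the people, companies and counts are assumptions for illustration). Bob's Company differs between the two frames, and Cy appears only in the second one:

```python
import functools
import pandas as pd

# Two made-up exports; only the column layout mirrors the question.
first = pd.DataFrame({
    "First Name": ["Ann", "Bob"],
    "Last Name": ["Lee", "Ray"],
    "Company": ["TS", "TS"],
    "Hosted Meetings 03112016": [0.0, 2.0],
})
second = pd.DataFrame({
    "First Name": ["Bob", "Cy"],
    "Last Name": ["Ray", "Orr"],
    "Company": ["Acme", "TS"],
    "Hosted Meetings 04122016": [5.0, 1.0],
})

def mergefunc(left, right):
    # Drop the shared descriptive column from the right frame so that
    # no Company_x / Company_y pair is created by the merge.
    right = right.drop(["Company"], axis=1)
    return pd.merge(left, right, on=["First Name", "Last Name"], how="outer")

merged = functools.reduce(mergefunc, [first, second])
print(merged)
```

In the result Bob keeps the Company value from the first frame, while Cy, who is missing from the first frame, ends up with NaN in Company because that column was dropped from the only frame that described him.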

Alternatively, if the left and right DataFrames have the same values in the Company et al. columns, another option is simply to merge on those columns as well:

def mergefunc(left, right):
    cols = ['First Name', 'Last Name', 'Company', 'Emails', 'Created', 
            'Facilitated Meetings', 'Attended Meetings']
    result = pd.merge(left, right, on=cols, how='outer')
    return result
all_users_sheets_hosts = functools.reduce(mergefunc, all_users_sheets_hosts)
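This second approach relies on the shared columns really being identical across files. A rough sketch with invented data (not taken from the question) shows what happens when they are not: a person whose Company differs between two frames is split into two rows, because the merge keys no longer match.

```python
import pandas as pd

# Made-up frames; Bob's Company changes between the two exports.
first = pd.DataFrame({
    "First Name": ["Ann", "Bob"],
    "Last Name": ["Lee", "Ray"],
    "Company": ["TS", "TS"],
    "Hosted Meetings 03112016": [0.0, 2.0],
})
second = pd.DataFrame({
    "First Name": ["Ann", "Bob"],
    "Last Name": ["Lee", "Ray"],
    "Company": ["TS", "Acme"],
    "Hosted Meetings 04122016": [3.0, 5.0],
})

# Merge on the name columns plus the shared descriptive column.
keys = ["First Name", "Last Name", "Company"]
merged = pd.merge(first, second, on=keys, how="outer")

# Ann matches on every key and stays on one row; Bob appears twice.
bob_rows = int((merged["First Name"] == "Bob").sum())
print(merged)
print(bob_rows)
```

If that kind of splitting shows up in your data, dropping the columns from the right frame, as in the first mergefunc, is probably the safer choice.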