Python / Pandas: Append Excel files in a loop only if each Excel file contains certain values

Time: 2018-08-20 21:01:30

Tags: python pandas numpy openpyxl pyexcel

I have the following code:

dfs = []
for f in files_xlsx:
    city_name = pd.read_excel(f, "1. City", nrows=1, parse_cols="C", header=None, skiprows=1)
    country_code = pd.read_excel(f, "1. City", nrows=1, parse_cols="C", header=None, skiprows=2)
    data = pd.read_excel(f, "1. City", parse_cols="B:J", header=None, skiprows=8)
    data['City name'] = city_name.iat[0,0]
    data['City code'] = country_code.iat[0,0]
    dfs.append(data)
df = pd.concat(dfs, ignore_index=True)

I want to run the loop only if each Excel file contains the values 91, 92, 93, 94, 95, 96, 97 at the following locations (column and row combinations): D8, E8, F8, G8, H8, I8, J8. The loop should run only if all files meet this condition.

In theory, all my Excel files have the same format. In practice, they often do not, so I want to run a check before appending them. It would be great if the code could also tell me which files satisfy the above condition. Thanks.

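1 answer:

Answer 0 (score: 0)

Consider building a helper dataframe that resembles the Excel data but whose values match your specification (row 8, with those specific values in columns D-J). Then, inside the loop, merge each file's data with this helper dataframe and append the data to your list of dataframes only if the merge returns a match.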

Note: Adjust the columns to your actual column names, replacing list('ABCDEFGHIJ') with a list of names such as ['col1','col2','col3',...] in the DataFrame() and merge() calls.

import pandas as pd

# HELPER FRAME: SEVEN FILLER ROWS, THEN THE REQUIRED 8TH ROW (INDEX 7)
cols = ['Heading code','Heading name','91','92','93','94','95','96','97']
check_df = pd.concat([pd.DataFrame([[0]*9 for _ in range(7)], columns=cols),
                      pd.DataFrame([[0,0,91,92,93,94,95,96,97]], columns=cols)],
                     ignore_index=True).reset_index()

print(check_df)    
#    index  Heading code  Heading name  91  92  93  94  95  96  97
# 0      0             0             0   0   0   0   0   0   0   0
# 1      1             0             0   0   0   0   0   0   0   0
# 2      2             0             0   0   0   0   0   0   0   0
# 3      3             0             0   0   0   0   0   0   0   0
# 4      4             0             0   0   0   0   0   0   0   0
# 5      5             0             0   0   0   0   0   0   0   0
# 6      6             0             0   0   0   0   0   0   0   0
# 7      7             0             0  91  92  93  94  95  96  97

dfs = []
for f in files_xlsx:
    # usecols REPLACES parse_cols, WHICH NEWER PANDAS VERSIONS NO LONGER ACCEPT
    city_name = pd.read_excel(f, "1. City", nrows=1, usecols="C", header=None, skiprows=1)
    country_code = pd.read_excel(f, "1. City", nrows=1, usecols="C", header=None, skiprows=2)
    # NOTE: skiprows=8 DROPS EXCEL ROWS 1-8, SO INDEX 7 BELOW IS THE 8TH ROW READ, NOT EXCEL ROW 8
    data = pd.read_excel(f, "1. City", usecols="B:J", header=None, skiprows=8)\
             .assign(city_name=city_name.iat[0,0], city_code=country_code.iat[0,0])
    data.columns = ['Heading code','Heading name','91','92','93','94','95','96','97','City name','City code']

    # INNER JOIN MERGE ON INDEX AND COLS, D-J
    tmp = data.reset_index().merge(check_df, on=['index','91','92','93','94','95','96','97'])

    # CONDITIONALLY APPEND
    if len(tmp) > 0:
        dfs.append(data)

df = pd.concat(dfs, ignore_index=True)
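As a possible alternative to the merge, the same condition could be tested by comparing the cells directly. The sketch below is not the approach from this answer: it assumes the sheet has been read with header=None and no skiprows, so that Excel row 8 sits at position 7 and columns D-J at positions 3-9. The names row8_matches and EXPECTED are hypothetical and only for illustration.

```python
import pandas as pd

EXPECTED = [91, 92, 93, 94, 95, 96, 97]  # values required in D8:J8


def row8_matches(raw, expected=EXPECTED):
    """Return True if cells D8:J8 of a raw sheet hold the expected values.

    Assumption: `raw` was read with header=None and no skiprows, so
    Excel row 8 is positional row 7 and columns D-J are positions 3-9.
    """
    if raw.shape[0] < 8 or raw.shape[1] < 10:
        return False  # sheet is too small to contain D8:J8
    cells = raw.iloc[7, 3:10].tolist()
    return cells == list(expected)


# Small in-memory stand-ins for two sheets (no Excel files needed)
good = pd.DataFrame([[0] * 10 for _ in range(9)])
good.iloc[7, 3:10] = EXPECTED
bad = pd.DataFrame([[0] * 10 for _ in range(9)])

print(row8_matches(good), row8_matches(bad))  # True False
```

With real files, the raw sheet would presumably come from something like pd.read_excel(f, "1. City", header=None), but that depends on how the workbooks are actually laid out.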

A demonstration with random data follows:

import numpy as np
import pandas as pd

np.random.seed(82118)
# LIST OF FIVE DATAFRAMES (TO RESEMBLE EXCEL DFs)
rand_dfs = [pd.DataFrame([np.random.randint(1, 100, 9) for _ in range(10)],
                    columns=['Heading code','Heading name','91','92','93','94','95','96','97'])
            for _ in range(5)]

# UPDATE TWO OF THE DATAFRAMES TO INCLUDE THE MATCHING 8TH ROW (INDEX 7)
rand_dfs[2].loc[7] = [0, 0, 91, 92, 93, 94, 95, 96, 97]
rand_dfs[4].loc[7] = [0, 0, 91, 92, 93, 94, 95, 96, 97]

final_dfs = []
for d in rand_dfs:
    tmp = d.reset_index().merge(check_df, on=['index','91','92','93','94','95','96','97'])

    if len(tmp) > 0: 
        final_dfs.append(d)

final_df = pd.concat(final_dfs, ignore_index=True)

Output (see the rows at index 7 and 17, which match the criteria)

print(final_df)

#     Heading code  Heading name  91  92  93  94  95  96  97
# 0             53            98  67   8  86  33  65  56  62
# 1             61             9  40  14  18   9  53  30  24
# 2             89            88  80  91  91  49   8  39  84
# 3             15            99  49  92  63  96  11  95  29
# 4             13            62  82  12  34  92  54  29  47
# 5             44            18  67  61  52  71  52  25  12
# 6             56            25  52  10  82  12  59  63  15
# 7              0             0  91  92  93  94  95  96  97
# 8             51            50  27  38  34  11  57  92   3
# 9             49            99  46  87  46   5  63  24   8
# 10            31            62   8  23  19  66  60  10  66
# 11            51            98  30  44  45  39  32  74  82
# 12            88            19  54  28  38  71   3  31  34
# 13            58            13  89  17  96  35  12  52  85
# 14            93            67  13  13  28  43  24   7   4
# 15            34            26  73  20  44  37  18  17  22
# 16            59             1  99   9  11   6   4  99  95
# 17             0             0  91  92  93  94  95  96  97
# 18            88             6  23  20  35  26  37  56  51
# 19            21            67  19  63  77  98  41   9  22
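The question also asks that the loop run only when every file passes, and that the code report which files pass. The demonstration above keeps only the matching frames, so the following is a minimal sketch of the stricter variant, not part of the original answer. It uses a direct row comparison instead of the merge, assumes each frame has a unique index and the columns '91' to '97', and uses hypothetical names (check_frames, concat_if_all_pass, and the file names in the example).

```python
import pandas as pd

KEY_COLS = ['91', '92', '93', '94', '95', '96', '97']
EXPECTED = [91, 92, 93, 94, 95, 96, 97]


def check_frames(frames, row=7):
    """Map each name to True/False: does `row` hold EXPECTED in KEY_COLS?"""
    report = {}
    for name, df in frames.items():
        ok = row in df.index and df.loc[row, KEY_COLS].tolist() == EXPECTED
        report[name] = bool(ok)
    return report


def concat_if_all_pass(frames, row=7):
    """Return (combined, report); combined is None unless every frame passes."""
    report = check_frames(frames, row)
    if all(report.values()):
        return pd.concat(list(frames.values()), ignore_index=True), report
    return None, report


# Two in-memory frames standing in for parsed Excel files
cols = ['Heading code', 'Heading name'] + KEY_COLS
ok_df = pd.DataFrame([[0] * 9 for _ in range(8)], columns=cols)
ok_df.loc[7, KEY_COLS] = EXPECTED
bad_df = pd.DataFrame([[0] * 9 for _ in range(8)], columns=cols)

frames = {'a.xlsx': ok_df, 'b.xlsx': bad_df}
combined, report = concat_if_all_pass(frames)
print(report)            # {'a.xlsx': True, 'b.xlsx': False}
print(combined is None)  # True
```

The report dictionary shows which inputs satisfy the condition, and nothing is concatenated unless all of them do.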