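Trying to merge into a DataFrame, but it keeps creating new columns

Time: 2017-11-16 14:24:16

Tags: python excel pandas

I am trying to open files and extract 2 columns (1 row each) from multiple spreadsheets, then merge them into a base spreadsheet. The base DataFrame (derived from a spreadsheet, of which I only need 3 columns) looks like this: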

Model |  Roadmap | Family
a       08/12/17  ROW
b       08/14/17  MACRO 
c       08/15/17  CONN 
d       08/27/17  MACRO 

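The DataFrames from the multiple spreadsheets (the model name is the spreadsheet name, and each gate has its own date, which I extract into separate DataFrames) have the following format: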

df1 (part 1, the DataFrame derived from the spreadsheet for model a, Gate 0):
Model   |  Gate 0
a         02/01/18

df1 (part 2, the DataFrame derived from the spreadsheet for model a, Gate 1):
Model   |  Gate 1
a         03/01/18

df2 (part 1):
Model   |  Gate 0
b         04/23/18

df2 (part 2):
Model   |  Gate 1
b         05/23/18

The output it produces is:

Model |  Roadmap | Family | Gate 0_x  | Gate 1_x   | Gate 0_y | Gate 1_y
a       08/12/17  ROW      02/01/18   03/01/18  
b       08/14/17  MACRO                              04/23/18  05/23/18     
c       08/15/17  CONN
d       08/27/17  MACRO 

The output I want:

  Model |  Roadmap | Family | Gate 0   | Gate 1   
   a       08/12/17  ROW     02/01/18   03/01/18
   b       08/14/17  MACRO    04/23/18  05/23/18 
   ..

Here is the code I am using:

import glob
import pandas as pd
import re
import ntpath




extension = 'xlsx'
d='Final.xlsx'
c = 'Roadmap.xlsx'
dflist = []
z=[]
result = [i for i in glob.glob('*.{}'.format(extension))]

for b in result:
    if b==c:
        base_file = pd.read_excel(b, sheet_name='Antennas', header=7)
        ind1 = base_file.set_index('Model')
        ind1 = base_file[['Model', 'Roadmap', 'Family']]
        #print(ind1)
        ind1.to_excel('Final.xlsx')
        file3 = pd.read_excel('Final.xlsx')
        file3= file3.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)



for a in result:

        if a == c:
            base_file = pd.read_excel(a, sheet_name='Antennas', header=7)
            ind1 = base_file.set_index('Model')
            ind1 = base_file[['Model', 'Roadmap', 'Family']]
            ind1.to_excel('Final.xlsx')
        elif a != d:
            gates = ['Gate 0 Complete','Gate 1 Complete'] 
            file1 = pd.read_excel('Final.xlsx')
            file1= file1.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)     
            #print(file1)
            file = pd.read_excel(a, sheet_name='Timeline')
            #print(file)
            models = pd.DataFrame([['','']], columns=['Model', gates])
            for g in gates:      
                z = file.loc[file['Task'] == g, 'Complete'].iloc[0]
                v=ntpath.basename(a)
                v = v[5:-5]
                models = pd.DataFrame([[v,z]], columns =['Model',g])
                file1 = pd.merge(file1, models, how='left', on='Model')
            file3 = pd.merge(file3, file1, how='left', on=['Model','Roadmap','Family'])
            file3.to_excel('new.xlsx')

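file3 is the DataFrame of the file I opened as the base file before the for loop. Let me know if anything is unclear.

3 Answers:

Answer 0: (score: 2)

Currently you are merging twice, but what you really need is to merge the base with each individual df and then append everything together with pd.concat.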
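To see where the Gate 0_x / Gate 0_y columns come from, here is a minimal sketch. The frames row_a and row_b are made-up stand-ins for the per-model data, not the asker's real files. When a second merge brings in a column name that already exists, pandas keeps both and adds its default suffixes.

```python
import pandas as pd

base = pd.DataFrame({"Model": ["a", "b"], "Roadmap": ["08/12/17", "08/14/17"]})
# Hypothetical per-model rows, one frame per spreadsheet
row_a = pd.DataFrame({"Model": ["a"], "Gate 0": ["02/01/18"]})
row_b = pd.DataFrame({"Model": ["b"], "Gate 0": ["04/23/18"]})

# Merging the per-model frames into the base one after another:
# the second merge meets an existing "Gate 0" column, so pandas
# disambiguates the two with the default suffixes "_x" and "_y".
chained = base.merge(row_a, how="left", on="Model").merge(row_b, how="left", on="Model")
print(list(chained.columns))

# Stacking the per-model rows first and merging once keeps a single column.
stacked = pd.concat([row_a, row_b], ignore_index=True)
merged_once = base.merge(stacked, how="left", on="Model")
print(list(merged_once.columns))
```

The second form is the shape the rest of this answer works toward: one row per model, stacked vertically, then a single left merge on Model.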

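The code below recreates the example posted above, using the same structure as the Excel files, and demonstrates the merge and append steps. You will notice that drop_duplicates is used, because the left-join merges produce rows with identical values. Keep or remove this method on your actual data.

Data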

from io import StringIO
import pandas as pd

# Base frame: one row per model
txt = '''
Model  Roadmap  Family
a      some_date  some
b      some_date  some 
c      some_date  some 
d      some_date  some
'''
# Raw string so that \s is passed to the parser as a regex, not a string escape
base_df = pd.read_table(StringIO(txt), sep=r"\s+")

# Gate data for model a
txt = '''
Model  "Gate 0" "Gate 1"
    a   some_date  some 
'''
df1 = pd.read_table(StringIO(txt), sep=r"\s+")

# Gate data for model b
txt = '''
Model  "Gate 0" "Gate 1"
    b   some_date  some 
'''
df2 = pd.read_table(StringIO(txt), sep=r"\s+")

Merge and append (using a list comprehension)

finaldf = pd.concat([pd.merge(base_df, df, how='left', on='Model') 
                    for df in [df1, df2]], ignore_index=True).drop_duplicates()

print(finaldf)
#   Model    Roadmap Family     Gate 0 Gate 1
# 0     a  some_date   some  some_date   some
# 1     b  some_date   some        NaN    NaN
# 2     c  some_date   some        NaN    NaN
# 3     d  some_date   some        NaN    NaN
# 4     a  some_date   some        NaN    NaN
# 5     b  some_date   some  some_date   some

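To integrate this into your current process, consider appending the individual models to a list, to be concatenated and merged at the end. Build base_df as in the example posted above.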

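...
dfList = []

# Per file (this part sits inside the loop over the spreadsheets).
# `models` must already be a one-row DataFrame whose 'Model' value equals v,
# otherwise the left merges below match nothing.
for g in gates:
    z = file.loc[file['Task'] == g, 'Complete'].iloc[0]
    v = ntpath.basename(a)
    v = v[5:-5]
    mod = pd.DataFrame([[v, z]], columns=['Model', g])
    models = pd.merge(models, mod, how='left', on='Model')
dfList.append(models)

# After all files are processed: stack the per-model rows and merge once with the base
finaldf = pd.merge(base_df, pd.concat(dfList), how='left', on='Model')
finaldf.to_excel('Final_Dataset.xlsx')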
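The integration step can be tried without any Excel files. In the sketch below, the timelines dictionary is a hypothetical stand-in for the Timeline sheet of each model workbook, and the dates are the ones from the question. The one-row models frame is an assumption about the setup elided by the "..." above.

```python
import pandas as pd

gates = ["Gate 0 Complete", "Gate 1 Complete"]

# Hypothetical stand-ins for the 'Timeline' sheet of each model workbook
timelines = {
    "a": pd.DataFrame({"Task": gates, "Complete": ["02/01/18", "03/01/18"]}),
    "b": pd.DataFrame({"Task": gates, "Complete": ["04/23/18", "05/23/18"]}),
}

base_df = pd.DataFrame({
    "Model": ["a", "b", "c", "d"],
    "Roadmap": ["08/12/17", "08/14/17", "08/15/17", "08/27/17"],
    "Family": ["ROW", "MACRO", "CONN", "MACRO"],
})

dfList = []
for v, file in timelines.items():
    # Assumed setup: start each model with a one-row frame holding its name,
    # so the left merges below have a key to match on.
    models = pd.DataFrame([[v]], columns=["Model"])
    for g in gates:
        z = file.loc[file["Task"] == g, "Complete"].iloc[0]
        mod = pd.DataFrame([[v, z]], columns=["Model", g])
        models = pd.merge(models, mod, how="left", on="Model")
    dfList.append(models)

# One vertical concat, then one merge with the base
finaldf = pd.merge(base_df, pd.concat(dfList, ignore_index=True), how="left", on="Model")
print(finaldf)
```

Models c and d have no gate data here, so the left merge keeps them with NaN in the gate columns instead of dropping them.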

Answer 1: (score: 1)

Figured out how to do it. Let me know if you spot any problems.

import glob
import ntpath

import pandas as pd

extension = 'xlsx'
d = 'Final.xlsx'      # intermediate file holding the base columns
c = 'Roadmap.xlsx'    # base spreadsheet
dflist = []           # one single-row DataFrame per model spreadsheet
result = glob.glob('*.{}'.format(extension))

for a in result:

    if a == c:
        # Base file: keep only the three columns needed and save them
        base_file = pd.read_excel(a, sheet_name='Antennas', header=7)
        ind1 = base_file[['Model', 'Roadmap', 'Family']]
        ind1.to_excel('Final.xlsx')
    elif a != d:
        # Model name comes from the file name: drop the 5-character prefix and '.xlsx'
        v = ntpath.basename(a)
        v = v[5:-5]
        gates = ['Gate 0 Complete', 'Gate 1 Complete', 'Gate 2 Complete']
        file1 = pd.read_excel('Final.xlsx')
        # Remove commas and quotes, then remove whitespace, so the merge keys match
        file1 = file1.replace(r'[,\"\']', '', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
        file = pd.read_excel(a, sheet_name='Timeline')
        # Start from a one-row frame holding only the model name
        models = pd.DataFrame([[v]], columns=['Model'])
        for g in gates:
            # Completion date of this gate
            z = file.loc[file['Task'] == g, 'Complete'].iloc[0]
            mod = pd.DataFrame([[v, z]], columns=['Model', g])
            # Add this gate as a new column on the model's single row
            models = pd.merge(models, mod, how='left', on='Model')
        dflist.append(models)

# Stack all model rows, then merge once with the base
file1 = pd.merge(file1, pd.concat(dflist), how='left', on='Model')
file1.to_excel('new.xlsx')
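A note on the cleaning step, as a small sketch with made-up values: the merge on Model needs the keys to be exactly equal, so stray whitespace silently produces NaN. The second pattern in the .replace(...) call, applied here with plain re.sub, removes whitespace inside values as well, which may or may not be wanted. Series.str.strip() is shown as an assumed alternative that only trims the ends.

```python
import re

import pandas as pd

# The second pattern from the posted .replace(...) call, applied with re.sub:
# every whitespace run is removed, including the ones inside a value.
pattern = r'\s*([^\s]+)\s*'
stripped = re.sub(pattern, r'\1', ' MACRO ')
collapsed = re.sub(pattern, r'\1', 'Gate 0 Complete')
print(repr(stripped), repr(collapsed))

# Why cleaning matters: a trailing space in the key breaks the match.
base = pd.DataFrame({'Model': ['a ', 'b'], 'Family': ['ROW', 'MACRO']})
gates = pd.DataFrame({'Model': ['a', 'b'], 'Gate 0': ['02/01/18', '04/23/18']})
unclean = base.merge(gates, how='left', on='Model')

# str.strip() trims only leading and trailing whitespace.
base['Model'] = base['Model'].str.strip()
clean = base.merge(gates, how='left', on='Model')
print(unclean)
print(clean)
```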

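Answer 2: (score: 0)

I am assuming that your original data is as follows:

  1. Step 0 - Part 1. You load df_base
  2. Step 0 - Part 2. You load df1, df2, etc., one df per sheet

My approach is then to perform the following steps in order:

  1. Vertically concatenate the dfs of all sheets into a single DataFrame named df_sheets
  2. Merge df_base with df_sheets to get the desired output

Based on this, my approach is: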

      import pandas as pd
      
      # STEP 0.
      cv = ['a','b','c','d']
      nr = 4
      
      # STEP 0 - Part 1. Load Base DF
      cv = cv[:nr]
      # list(...) gives the constructor a concrete list of rows rather than a zip iterator
      df_base = pd.DataFrame(list(zip(cv, ['some_date']*nr, ['some']*nr)),
                             columns=['Model','Roadmap','Family'])
      
      # STEP 0 - Part 2. Load Sheets DataFrames
      df_sheets = []
      for alph in cv:
          df_sheet = pd.DataFrame(list(zip([alph]*nr, ['some_date_'+alph]*nr, ['some_'+alph]*nr)),
                                  columns=['Model','Gate0','Gate1'])
          df_sheets.append(df_sheet)
      print('Base DF:\n{}'.format(df_base))
      
      
      # STEP 1. Vertically concatenate all sheets DataFrames together
      df_sheets = pd.concat(df_sheets, axis=0).reset_index(drop=True)
      print('\nDataFrames for all sheets (vertically concatenated into single DataFrame):\n{}'
            .format(df_sheets))
      
      
      # STEP 2. base INNER JOIN sheets USING ('Model')
      ndf = df_base.merge(df_sheets, on='Model', how='inner')
      print('\nOutput DataFrame:\n{}'.format(ndf))
      
      

The output is:

      Base DF:
        Model    Roadmap Family
      0     a  some_date   some
      1     b  some_date   some
      2     c  some_date   some
      3     d  some_date   some
      
      DataFrames for all sheets (vertically concatenated into single DataFrame):
         Model        Gate0   Gate1
      0      a  some_date_a  some_a
      1      a  some_date_a  some_a
      2      a  some_date_a  some_a
      3      a  some_date_a  some_a
      4      b  some_date_b  some_b
      5      b  some_date_b  some_b
      6      b  some_date_b  some_b
      7      b  some_date_b  some_b
      8      c  some_date_c  some_c
      9      c  some_date_c  some_c
      10     c  some_date_c  some_c
      11     c  some_date_c  some_c
      12     d  some_date_d  some_d
      13     d  some_date_d  some_d
      14     d  some_date_d  some_d
      15     d  some_date_d  some_d
      
      Output DataFrame:
         Model    Roadmap Family        Gate0   Gate1
      0      a  some_date   some  some_date_a  some_a
      1      a  some_date   some  some_date_a  some_a
      2      a  some_date   some  some_date_a  some_a
      3      a  some_date   some  some_date_a  some_a
      4      b  some_date   some  some_date_b  some_b
      5      b  some_date   some  some_date_b  some_b
      6      b  some_date   some  some_date_b  some_b
      7      b  some_date   some  some_date_b  some_b
      8      c  some_date   some  some_date_c  some_c
      9      c  some_date   some  some_date_c  some_c
      10     c  some_date   some  some_date_c  some_c
      11     c  some_date   some  some_date_c  some_c
      12     d  some_date   some  some_date_d  some_d
      13     d  some_date   some  some_date_d  some_d
      14     d  some_date   some  some_date_d  some_d
      15     d  some_date   some  some_date_d  some_d
      

Is this what you want?
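If one row per model is the goal, as in the desired output of the question, a possible variation is sketched below. It assumes the repeated sheet rows are true duplicates and that models without any sheet should still appear. The sheet data is made up, with sheets for models a and b only.

```python
import pandas as pd

df_base = pd.DataFrame({
    "Model": ["a", "b", "c", "d"],
    "Roadmap": ["some_date"] * 4,
    "Family": ["some"] * 4,
})

# Hypothetical sheets for models a and b only, each with repeated rows
df_sheets = pd.concat(
    [
        pd.DataFrame({
            "Model": [m] * 2,
            "Gate0": ["some_date_" + m] * 2,
            "Gate1": ["some_" + m] * 2,
        })
        for m in ["a", "b"]
    ],
    ignore_index=True,
)

# Keep the first row seen for each model before merging
one_row_per_model = df_sheets.drop_duplicates(subset="Model")

# A left join keeps c and d, which have no sheet, with NaN in the gate columns
ndf = df_base.merge(one_row_per_model, on="Model", how="left")
print(ndf)
```

With how='inner', as in the code above, models c and d would be dropped from this result because they have no matching sheet rows.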