Parsing csv files and combining them into another csv file in python3

Date: 2019-06-26 11:05:22

Tags: python-3.x pandas csv

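I have a list of csv files located in the same directory, and I am trying to combine two of them into a new csv file that contains the contents of both input files. Here is an example of the 2 input files: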

small_example1.csv

    CodeClass,Name,Accession,Count
    Endogenous,CCNO,NM_021147.4,18
    Endogenous,MYC,NM_002467.3,1114
    Endogenous,CD79A,NM_001783.3,178
    Endogenous,FSTL3,NM_005860.2,529

small_example2.csv

    CodeClass,Name,Accession,Count
    Endogenous,CCNO,NM_021147.4,196
    Endogenous,MYC,NM_002467.3,962
    Endogenous,CD79A,NM_001783.3,390
    Endogenous,FSTL3,NM_005860.2,67

Here is the expected output file (result.csv):

    Probe_Name,Accession,Class_Name,small_example1,small_example2
    CCNO,NM_021147.4,Endogenous,18,196
    MYC,NM_002467.3,Endogenous,1114,962
    CD79A,NM_001783.3,Endogenous,178,390
    FSTL3,NM_005860.2,Endogenous,529,67

To do so, I wrote this function in python3:

    import pandas as pd
    filenames = ['small_example1.csv', 'small_example2.csv']
    path = '/home/Joy'
    def convert(filenames):
        for file in filenames:
            df1 = pd.read_csv(file, skiprows=26, skipfooter=5, sep=',')
            df = df1.merge(df2, on=['CodeClass', 'Name', 'Accession'])
            df = df.rename(columns={'Name': 'Probe_Name',
                            'CodeClass': 'Class_Name',
                             file: file})
            df.to_csv('result.csv')

The result looks like this, and the last two columns are not what I expected (both the headers and the numbers):

        Class_Name  Probe_Name  Accession   Count_x Count_y
    0   Endogenous  CCNO    NM_021147.4 18  18
    1   Endogenous  MYC NM_002467.3 1114    1114
    2   Endogenous  CD79A   NM_001783.3 178 178
    3   Endogenous  FSTL3   NM_005860.2 529 529

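Do you know how to fix the problem?

2 answers:

Answer 0 (score: 1)

I suggest that you first load the dataframes and store them in a list, then merge them all together (with an inner or an outer join, depending on what you need):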

    import os
    from functools import reduce

    import pandas as pd

    filenames = ['small_example1.csv', 'small_example2.csv']
    path = '/home/Joy'

    def convert(filenames):
        dataframes = []

        # load all the dataframes in a list (dataframes)
        for filename in filenames:
            # skipfooter is only supported by the python parser engine
            df = pd.read_csv(filename, skiprows=26, skipfooter=5, sep=',', engine='python')
            # name the Count column after the file, without the .csv extension
            df = df.rename(columns={'Count': os.path.splitext(filename)[0]})
            dataframes.append(df)

        # merge the dataframes
        df_merged = reduce(lambda x, y: pd.merge(x, y, on=['CodeClass', 'Name', 'Accession'], how='outer'), dataframes)

        # rename the columns as you want and export the result
        # (index=False keeps the row index out of result.csv)
        df_merged = df_merged.rename(columns={'Name': 'Probe_Name', 'CodeClass': 'Class_Name'})
        df_merged.to_csv('result.csv', index=False)
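
The choice between an inner and an outer join only matters when the files do not list exactly the same probes. As a minimal sketch of the difference, the following uses two small in-memory frames instead of files; the missing FSTL3 row in the second frame is a made-up gap for illustration and does not come from the question's data.

```python
import pandas as pd

keys = ['CodeClass', 'Name', 'Accession']

# Two small frames; `b` deliberately lacks FSTL3 (hypothetical gap for illustration).
a = pd.DataFrame({
    'CodeClass': ['Endogenous', 'Endogenous'],
    'Name': ['CCNO', 'FSTL3'],
    'Accession': ['NM_021147.4', 'NM_005860.2'],
    'small_example1': [18, 529],
})
b = pd.DataFrame({
    'CodeClass': ['Endogenous'],
    'Name': ['CCNO'],
    'Accession': ['NM_021147.4'],
    'small_example2': [196],
})

# outer: keeps every probe seen in either frame, filling the gaps with NaN
outer = a.merge(b, on=keys, how='outer')

# inner: keeps only the probes present in both frames
inner = a.merge(b, on=keys, how='inner')

print(outer)
print(inner)
```

With the outer join, a count column that contains gaps is stored as floats because of the NaN values, so an inner join may be preferable when all files are expected to share the same probes.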

Answer 1 (score: 0)

You are facing two problems here: the headers and the values.
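
The header part can be reproduced in isolation. When both frames carry a non-key column with the same name, `merge` disambiguates the two with its default suffixes `_x` and `_y`, which is where `Count_x` and `Count_y` come from. A minimal sketch with one-row frames built in memory (the frame names `left` and `right` are only for illustration):

```python
import pandas as pd

keys = ['CodeClass', 'Name', 'Accession']

left = pd.DataFrame({'CodeClass': ['Endogenous'], 'Name': ['CCNO'],
                     'Accession': ['NM_021147.4'], 'Count': [18]})
right = left.assign(Count=[196])  # same probe, different count

# Both frames have a 'Count' column, so merge appends its default suffixes
clash = left.merge(right, on=keys)
print(list(clash.columns))

# Renaming 'Count' before merging avoids the clash entirely
renamed = left.rename(columns={'Count': 'small_example1'}).merge(
    right.rename(columns={'Count': 'small_example2'}), on=keys)
print(list(renamed.columns))
```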

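If you get the same values twice, it means that you have read the same file twice. You should rename the Count column at load time, and then merge each dataframe into the final one: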

    import os
    import pandas as pd

    filenames = ['small_example1.csv', 'small_example2.csv']
    path = '/home/Joy'

    def convert(filenames):
        df = None               # initialize the merged dataframe to None
        names = []              # column names derived from the file names
        for file in filenames:
            name = os.path.splitext(os.path.basename(file))[0]
            names.append(name)
            # load a new dataframe (same read options as in the question)
            # and rename its Count column after the file
            df1 = pd.read_csv(file, skiprows=26, skipfooter=5, engine='python'
                              ).rename(columns={'Count': name})
            # merge it into df
            if df is None:
                df = df1
            else:
                df = df.merge(df1, on=['CodeClass', 'Name', 'Accession'])
        # rename and reindex the columns
        result = df.rename(columns={'Name': 'Probe_Name', 'CodeClass': 'Class_Name'}
                           ).reindex(['Probe_Name', 'Accession', 'Class_Name'] + names,
                                     axis=1)
        result.to_csv('result.csv', index=False)
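
The same approach can be checked without touching the disk. The sketch below feeds the two sample tables from the question through `io.StringIO`, so it assumes files that contain only the header and data rows (no `skiprows` or `skipfooter` needed). The helper name `combine` and its `sources` argument are hypothetical and used only for this illustration.

```python
import io

import pandas as pd

CSV1 = """CodeClass,Name,Accession,Count
Endogenous,CCNO,NM_021147.4,18
Endogenous,MYC,NM_002467.3,1114
Endogenous,CD79A,NM_001783.3,178
Endogenous,FSTL3,NM_005860.2,529
"""

CSV2 = """CodeClass,Name,Accession,Count
Endogenous,CCNO,NM_021147.4,196
Endogenous,MYC,NM_002467.3,962
Endogenous,CD79A,NM_001783.3,390
Endogenous,FSTL3,NM_005860.2,67
"""


def combine(sources):
    """Merge count tables; `sources` maps a column label to a path or file-like object."""
    keys = ['CodeClass', 'Name', 'Accession']
    merged = None
    for label, src in sources.items():
        # rename Count at load time so every table contributes a distinct column
        df = pd.read_csv(src).rename(columns={'Count': label})
        merged = df if merged is None else merged.merge(df, on=keys)
    merged = merged.rename(columns={'Name': 'Probe_Name', 'CodeClass': 'Class_Name'})
    # put the columns in the order of the expected result.csv
    return merged.reindex(['Probe_Name', 'Accession', 'Class_Name'] + list(sources), axis=1)


result = combine({'small_example1': io.StringIO(CSV1),
                  'small_example2': io.StringIO(CSV2)})
csv_text = result.to_csv(index=False)
print(csv_text)
```

Passing labels explicitly, rather than deriving them from file names, keeps the column headers independent of where the files live.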