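I have a list of csv files located in the same directory, and I am trying to merge two of them into a new csv file that contains the contents of both input files. Here is an example of the 2 input files: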
small_example1.csv
CodeClass,Name,Accession,Count
Endogenous,CCNO,NM_021147.4,18
Endogenous,MYC,NM_002467.3,1114
Endogenous,CD79A,NM_001783.3,178
Endogenous,FSTL3,NM_005860.2,529
small_example2.csv
CodeClass,Name,Accession,Count
Endogenous,CCNO,NM_021147.4,196
Endogenous,MYC,NM_002467.3,962
Endogenous,CD79A,NM_001783.3,390
Endogenous,FSTL3,NM_005860.2,67
Here is the expected output file (result.csv):
Probe_Name,Accession,Class_Name,small_example1,small_example2
CCNO,NM_021147.4,Endogenous,18,196
MYC,NM_002467.3,Endogenous,1114,962
CD79A,NM_001783.3,Endogenous,178,390
FSTL3,NM_005860.2,Endogenous,529,67
To do this, I wrote this function in python3:
import pandas as pd
filenames = ['small_example1.csv', 'small_example2.csv']
path = '/home/Joy'
def convert(filenames):
    for file in filenames:
        df1 = pd.read_csv(file, skiprows=26, skipfooter=5, sep=',')
        df = df1.merge(df2, on=['CodeClass', 'Name', 'Accession'])
        df = df.rename(columns={'Name': 'Probe_Name',
                                'CodeClass': 'Class_Name',
                                file: file})
    df.to_csv('result.csv')
The result looks like this; the last two columns (both the headers and the numbers) are not what I expected.
Class_Name Probe_Name Accession Count_x Count_y
0 Endogenous CCNO NM_021147.4 18 18
1 Endogenous MYC NM_002467.3 1114 1114
2 Endogenous CD79A NM_001783.3 178 178
3 Endogenous FSTL3 NM_005860.2 529 529
Do you know how to fix this?
Answer 0: (score: 1)
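I suggest you first load the dataframes and store them in a list, then merge them all together (using an inner or outer join, depending on your needs):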
import pandas as pd
from functools import reduce
filenames = ['small_example1.csv', 'small_example2.csv']
path = '/home/Joy'
def convert(filenames):
    dataframes = []
    # load all the dataframes in a list (dataframes)
    for filename in filenames:
        # skipfooter is only supported by the python parser engine
        df = pd.read_csv(filename, skiprows=26, skipfooter=5, sep=',', engine='python')
        # name the Count column after its file so the columns do not collide
        df = df.rename(columns={'Count': filename})
        dataframes.append(df)
    # merge the dataframes
    df_merged = reduce(lambda x, y: pd.merge(x, y, on=['CodeClass', 'Name', 'Accession'], how='outer'), dataframes)
    # rename the columns as you want and export the result
    df_merged = df_merged.rename(columns={'Name': 'Probe_Name', 'CodeClass': 'Class_Name'})
    df_merged.to_csv('result.csv')
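To make the inner versus outer choice concrete, here is a small sketch on in-memory data. It assumes the files hold nothing but the table shown in the question (so no `skiprows`/`skipfooter`), and the second input deliberately omits the FSTL3 row for illustration. The helper `merge_all` and the column labels are names chosen for this example only.

```python
from functools import reduce
import io

import pandas as pd

# In-memory stand-ins for the two files; csv2 deliberately lacks FSTL3.
csv1 = """CodeClass,Name,Accession,Count
Endogenous,CCNO,NM_021147.4,18
Endogenous,MYC,NM_002467.3,1114
Endogenous,FSTL3,NM_005860.2,529
"""
csv2 = """CodeClass,Name,Accession,Count
Endogenous,CCNO,NM_021147.4,196
Endogenous,MYC,NM_002467.3,962
"""
keys = ['CodeClass', 'Name', 'Accession']

# Rename Count at load time so each frame contributes a distinct column.
frames = [
    pd.read_csv(io.StringIO(text)).rename(columns={'Count': label})
    for label, text in [('small_example1', csv1), ('small_example2', csv2)]
]


def merge_all(frames, how):
    """Fold the list of frames into one by merging on the key columns."""
    return reduce(lambda left, right: pd.merge(left, right, on=keys, how=how), frames)


inner = merge_all(frames, 'inner')  # only probes present in every file
outer = merge_all(frames, 'outer')  # every probe; missing counts become NaN
print(inner)
print(outer)
```

With the inner join the FSTL3 row is dropped because it is missing from the second input; with the outer join it is kept and its `small_example2` value is NaN.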
Answer 1: (score: 0)
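You have two problems here: the headers and the values.
If you get the same values twice, it means you have read the same file twice. You should rename the Count column at load time, and then merge each new dataframe into the one built so far: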
import pandas as pd
filenames = ['small_example1.csv', 'small_example2.csv']
path = '/home/Joy'
def convert(filenames):
    df = None  # initialize the merged dataframe to None
    for file in filenames:
        # load a new dataframe and rename its Count column after the file
        # (read options copied from the question; skipfooter needs the python engine)
        df1 = pd.read_csv(file, skiprows=26, skipfooter=5, sep=',',
                          engine='python').rename(columns={'Count': file})
        # merge it into df
        if df is None:
            df = df1
        else:
            df = df.merge(df1, on=['CodeClass', 'Name', 'Accession'])
    # rename and reindex the columns
    result = df.rename(columns={'Name': 'Probe_Name', 'CodeClass': 'Class_Name'}
                       ).reindex(['Probe_Name','Accession','Class_Name']+filenames,
                                 axis=1)
    result.to_csv('result.csv', index=False)
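Note that renaming Count to the file name yields headers such as `small_example1.csv`, whereas the expected output uses `small_example1`. The following is a minimal end-to-end sketch of the same rename-then-merge idea that strips the extension with `pathlib.Path.stem`. It assumes the files contain only the table shown in the question (no extra header or footer lines), writes the sample data to a temporary directory so it can run anywhere, and uses a hypothetical helper name, `merge_counts`.

```python
import tempfile
from pathlib import Path

import pandas as pd

KEYS = ['CodeClass', 'Name', 'Accession']


def merge_counts(paths):
    """Merge the Count column of each csv, one output column per input file."""
    merged = None
    for path in map(Path, paths):
        # Name the Count column after the file, without the .csv extension.
        df = pd.read_csv(path).rename(columns={'Count': path.stem})
        merged = df if merged is None else merged.merge(df, on=KEYS)
    merged = merged.rename(columns={'Name': 'Probe_Name', 'CodeClass': 'Class_Name'})
    order = ['Probe_Name', 'Accession', 'Class_Name'] + [Path(p).stem for p in paths]
    return merged[order]


# Sample data copied from the question.
samples = {
    'small_example1.csv': (
        "CodeClass,Name,Accession,Count\n"
        "Endogenous,CCNO,NM_021147.4,18\n"
        "Endogenous,MYC,NM_002467.3,1114\n"
        "Endogenous,CD79A,NM_001783.3,178\n"
        "Endogenous,FSTL3,NM_005860.2,529\n"
    ),
    'small_example2.csv': (
        "CodeClass,Name,Accession,Count\n"
        "Endogenous,CCNO,NM_021147.4,196\n"
        "Endogenous,MYC,NM_002467.3,962\n"
        "Endogenous,CD79A,NM_001783.3,390\n"
        "Endogenous,FSTL3,NM_005860.2,67\n"
    ),
}

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in samples.items():
        p = Path(tmp) / name
        p.write_text(text)
        paths.append(p)
    result = merge_counts(paths)
    # index=False keeps the row index out of the output file.
    result.to_csv(Path(tmp) / 'result.csv', index=False)
    written = (Path(tmp) / 'result.csv').read_text()

print(written)
```

Passing `index=False` to `to_csv` is what keeps the leading 0, 1, 2, 3 index column seen in the question's result out of the file.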