I have multiple files (20) in a directory, each containing 2 columns, for example:
transcript_id value
ENMUST001 2
ENMUST003 3
ENMUST004 5
The number of rows differs from file to file. What I want to do is merge all 20 files into one big matrix like this:
transcript_id value_file1 value_file2....value_file20
ENMUST001 2 3
ENMUST003 3 4
ENMUST004 5 0
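That is, collect all IDs from the transcript_id column together with the corresponding value from each file (with the file name as the column name), and use 0 where a file has no value for an ID.
I tried to do this with pandas: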
import os
import glob
import pandas as pd
path = 'pathtofiles'
transFiles = glob.glob(path + "*.tsv")
df_files = []
for file in transFiles:
    df = pd.read_csv(file, sep='\t')
    df.set_index('transcript_id')
    df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0)
Error:
ValueError: No objects to concatenate
I wonder whether a non-pandas approach would be better? Any pseudocode is appreciated.
Output:
df.set_index('transcript_id')
print (df.shape)
(921, 1)
(1414, 1)
(659, 1)
(696, 1)
(313, 1)
print (df.is_unique)
(921, 1)
False
(1414, 1)
False
(659, 1)
False
(696, 1)
False
(313, 1)
False
df = df.drop_duplicates(inplace=True)
df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0)
New error
ValueError: All objects passed were None
before: (921, 1)
after: (914, 1)
before: (1414, 1)
after: (1410, 1)
before: (659, 1)
after: (658, 1)
before: (696, 1)
after: (694, 1)
before: (313, 1)
after: (312, 1)
Answer 0 (score: 3)
The default behavior of set_index is inplace=False, so df.set_index('transcript_id') returns a new DataFrame and leaves df unchanged. Try replacing df.set_index('transcript_id') with df = df.set_index('transcript_id'). You can also use df = df[~df.index.duplicated(keep='first')] to remove duplicate values in the index.
import os
import glob
import pandas as pd
path = 'pathtofiles'
transFiles = glob.glob(path + "*.tsv")
df_files = []
for file in transFiles:
    df = pd.read_csv(file, sep='\t')
    df = df.set_index('transcript_id') # set index
    df = df[~df.index.duplicated(keep='first')] # remove duplicates
    df.columns = [os.path.split(file)[-1]] # set column name to filename
    df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0)
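The two errors in the question may have separate causes, so here is a minimal sketch of the same approach wrapped in a function. The name combine_tsv and its pattern parameter are hypothetical, chosen only for illustration. The sketch assumes each file has exactly the two columns shown in the question. It builds the glob pattern with os.path.join, because path + "*.tsv" matches nothing when path has no trailing separator, and an empty file list is what makes pd.concat raise "No objects to concatenate".

```python
import glob
import os

import pandas as pd


def combine_tsv(directory, pattern="*.tsv"):
    """Merge two-column TSV files into one matrix indexed by transcript_id.

    Hypothetical helper: assumes every file has a 'transcript_id' column
    and exactly one value column.
    """
    # os.path.join inserts the separator that path + "*.tsv" can leave out.
    files = sorted(glob.glob(os.path.join(directory, pattern)))
    if not files:
        # Fail early with a clearer message than pd.concat would give.
        raise FileNotFoundError(f"no files match {pattern!r} in {directory!r}")

    frames = []
    for file in files:
        df = pd.read_csv(file, sep="\t")
        df = df.set_index("transcript_id")           # returns a new frame
        df = df[~df.index.duplicated(keep="first")]  # keep first row per ID
        df.columns = [os.path.basename(file)]        # file name as column name
        frames.append(df)

    # Columns are aligned on the union of all IDs; gaps become NaN, then 0.
    return pd.concat(frames, axis=1).fillna(0)
```

The second error in the question, "All objects passed were None", comes from df = df.drop_duplicates(inplace=True): with inplace=True the method modifies df and returns None, so None is what gets appended to the list. Either drop the assignment or drop inplace=True.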