Question

I have multiple files (20) in a directory, each with 2 columns, e.g.

transcript_id value
ENMUST001     2
ENMUST003     3
ENMUST004     5

The number of rows differs from file to file. What I would like to do is merge all 20 files into one large matrix, like this:

transcript_id value_file1 value_file2....value_file20
ENMUST001     2  3 
ENMUST003     3  4
ENMUST004     5  0

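That is, collect all the IDs from the transcript_id column together with the corresponding value from each file (with the file name as the column name), and use 0 where a file has no value for an ID.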

I tried to do this with pandas:

import os
import glob
import pandas as pd
path = 'pathtofiles'
transFiles = glob.glob(path + "*.tsv")
df_files = []
for file in transFiles:
    df = pd.read_csv(file, sep='\t')
    df.set_index('transcript_id')
    df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0) 

Error:
ValueError: No objects to concatenate

I am wondering whether a non-pandas approach would be better. Any pseudocode is appreciated.

Edit

Findings

Output

df.set_index('transcript_id')
print (df.shape)

    (921, 1)
    (1414, 1)
    (659, 1)
    (696, 1)
    (313, 1)
print (df.is_unique)
    (921, 1)
    False
    (1414, 1)
    False
    (659, 1)
    False
    (696, 1)
    False
    (313, 1)
    False
df = df.drop_duplicates(inplace=True)
df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0)

New error
ValueError: All objects passed were None

Shapes printed before and after dropping duplicates

before:  (921, 1)
after:  (914, 1)
before:  (1414, 1)
after:  (1410, 1)
before:  (659, 1)
after:  (658, 1)
before:  (696, 1)
after:  (694, 1)
before:  (313, 1)
after:  (312, 1)

Answer 1

The default behaviour of set_index is inplace=False, so calling df.set_index('transcript_id') on its own returns a new DataFrame and leaves df unchanged. Try replacing df.set_index('transcript_id') with df = df.set_index('transcript_id'). You can also remove duplicate values in the index with df = df[~df.index.duplicated(keep='first')].

import os
import glob
import pandas as pd

path = 'pathtofiles'
# os.path.join adds the separator that path + "*.tsv" would miss
transFiles = glob.glob(os.path.join(path, "*.tsv"))
df_files = []
for file in transFiles:
    df = pd.read_csv(file, sep='\t')
    df = df.set_index('transcript_id')  # set index
    df = df[~df.index.duplicated(keep='first')]  # remove duplicates
    df.columns = [os.path.split(file)[-1]]  # set column name to filename
    df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0)
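To try the same steps without any files on disk, here is a minimal sketch that runs on two small in-memory tables. The names SAMPLES, build_matrix, file1.tsv and file2.tsv are made up for illustration and are not from the question; the rows mirror the example IDs, plus one deliberately duplicated ID.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two of the .tsv files (file1 repeats ENMUST003).
SAMPLES = {
    "file1.tsv": (
        "transcript_id\tvalue\n"
        "ENMUST001\t2\n"
        "ENMUST003\t3\n"
        "ENMUST003\t9\n"
        "ENMUST004\t5\n"
    ),
    "file2.tsv": (
        "transcript_id\tvalue\n"
        "ENMUST001\t3\n"
        "ENMUST003\t4\n"
    ),
}


def build_matrix(sources):
    """Combine {name: tsv_text} into one matrix with one column per source."""
    columns = []
    for name, text in sources.items():
        df = pd.read_csv(io.StringIO(text), sep="\t")
        # set_index returns a new DataFrame, so the result must be assigned
        df = df.set_index("transcript_id")
        # keep only the first row for each repeated ID
        df = df[~df.index.duplicated(keep="first")]
        # name the single value column after its source
        df.columns = [name]
        columns.append(df)
    # align on the index; IDs missing from a source become NaN, then 0
    return pd.concat(columns, axis=1).fillna(0)


matrix = build_matrix(SAMPLES)
print(matrix)
```

The same assignment rule explains the second error in the question: drop_duplicates(inplace=True) modifies the frame and returns None, so df = df.drop_duplicates(inplace=True) replaces df with None before it is appended to the list.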

