Remove duplicate seq names in pandas

Date: 2018-05-17 09:59:46

Tags: python pandas sorting duplicates

I have a dataframe; here is an example:

cluster     seq_sp1      seq_sp2
1           seq20        seq56
1           seq56        seq20
2           seq3         seq5
3           seq9         seq5
3           seq7         seq4
3           seq4         seq7

I would like to remove the duplicated sequences: in this example, seq20 seq56 is a duplicate of seq56 seq20, and seq7 seq4 is a duplicate of seq4 seq7.

I think the solution is first to sort the values across the two columns:

cluster     seq_sp1      seq_sp2
1           seq20        seq56
1           seq20        seq56
2           seq3         seq5
3           seq9         seq5
4           seq7         seq4
4           seq7         seq4

and then remove one of the two duplicated rows to get:

   cluster     seq_sp1      seq_sp2
    1           seq20        seq56
    3           seq3         seq5
    4           seq9         seq5
    6           seq7         seq4

Thanks for your help :)

The script you gave me returns:

cluster_name    qseqid  sseqid  pident_x    pident_y    length  qstart  qend    sstart  send    qspec   sspec
13  cluster_016663  EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1 0.93    93.0    1179    1   1175    1   1179    0035    0042
14  cluster_016663  EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1 0.93    93.0    1179    1   1175    1   1179    0035    0042
16  cluster_016663  EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0035_1 0.93    93.0    1179    1   1175    1   1179    0035    0042
17  cluster_016663  EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0042_1 0.93    93.0    1179    1   1175    1   1179    0035    0042
19  cluster_016663  EOG090X00GO_0042_0035_1 EOG090X00GO_0035_0035_1 0.93    93.0    1179    1   1179    1   1175    0042    0035
20  cluster_016663  EOG090X00GO_0042_0035_1 EOG090X00GO_0035_0042_1 0.93    93.0    1179    1   1179    1   1175    0042    0035
22  cluster_016663  EOG090X00GO_0042_0042_1 EOG090X00GO_0035_0035_1 0.93    93.0    1179    1   1179    1   1175    0042    0035
23  cluster_016663  EOG090X00GO_0042_0042_1 EOG090X00GO_0035_0042_1 0.93    93.0    1179    1   1179    1   1175    0042    0035

Here is the head of my first data (see the picture, where the groups of duplicates are shown in color):

    Unnamed: 0  cluster_name    qseqid  sseqid  pident_x    pident_y    length  qstart  qend    sstart  send    qspec   sspec
0   13  cluster_016663  EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1 0.93    93.0    1179    1   1175    1   1179    35  42
1   14  cluster_016663  EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1 0.93    93.0    1179    1   1175    1   1179    35  42
8   27  cluster_015764  EOG090X00LI_0035_0035_1 EOG090X00LI_0042_0042_1 0.8059999999999999  82.3    1013    1   1013    1   1008    35  42
9   28  cluster_015764  EOG090X00LI_0035_0035_1 EOG090X00LI_0042_0035_1 0.784   78.4    1013    1   1013    1   963 35  42
11  32  cluster_015764  EOG090X00LI_0042_0035_1 g1726.t1_0035_0042  0.67    58.5    1010    1   963 1   751 42  35

Here is the result I got:

Unnamed: 0  cluster_name    qseqid  sseqid  pident_x    pident_y    length  qstart  qend    sstart  send    qspec   sspec
0   13  cluster_016663  EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1 0.93    93.0    1179    1   1175    1   1179    35  42
1   14  cluster_016663  EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1 0.93    93.0    1179    1   1175    1   1179    35  42

But I actually got:

I used this code:

import numpy as np
import pandas as pd

df = pd.read_table("dataframe.txt", header=0, sep='\t')

df[['qseqid','sseqid']] = np.sort(df[['qseqid','sseqid']], axis=1)
df = df.drop_duplicates(subset=['qseqid','sseqid'])
df.to_csv("df_test", sep='\t')

[picture]

4 Answers:

Answer 0 (score: 1)

I think you need numpy.sort together with drop_duplicates; this returns the rows with their values sorted:

df[['seq_sp1','seq_sp2']] = np.sort(df[['seq_sp1','seq_sp2']], axis=1)
df = df.drop_duplicates(subset=['seq_sp1','seq_sp2'])
print (df)
   cluster seq_sp1 seq_sp2
0        1   seq20   seq56
2        2    seq3    seq5
3        3    seq5    seq9
4        3    seq4    seq7

Or use DataFrame.duplicated to build a mask, invert it with ~, and filter by boolean indexing; the output keeps the original unsorted values:

mask = pd.DataFrame(np.sort(df[['seq_sp1','seq_sp2']], axis=1), index=df.index).duplicated()
df = df[~mask]
print (df)
   cluster seq_sp1 seq_sp2
0        1   seq20   seq56
2        2    seq3    seq5
3        3    seq9    seq5
4        3    seq7    seq4

EDIT:

I tested it with the new data:

df = df[['qseqid','sseqid']]
print (df)
                     qseqid                   sseqid
13  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0035_1
14  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0042_1
16  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0035_1
17  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0042_1
19  EOG090X00GO_0042_0035_1  EOG090X00GO_0035_0035_1
20  EOG090X00GO_0042_0035_1  EOG090X00GO_0035_0042_1
22  EOG090X00GO_0042_0042_1  EOG090X00GO_0035_0035_1
23  EOG090X00GO_0042_0042_1  EOG090X00GO_0035_0042_1

df[['qseqid','sseqid']] = np.sort(df[['qseqid','sseqid']], axis=1)
df = df.drop_duplicates(subset=['qseqid','sseqid'])

print (df)
                     qseqid                   sseqid
13  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0035_1
14  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0042_1
16  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0035_1
17  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0042_1

mask = pd.DataFrame(np.sort(df[['qseqid','sseqid']], axis=1), index=df.index).duplicated()
print (~mask)
13     True
14     True
16     True
17     True
19    False
20    False
22    False
23    False
dtype: bool

df = df[~mask]
print (df)
                     qseqid                   sseqid
13  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0035_1
14  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0042_1
16  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0035_1
17  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0042_1
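
The mask approach above can be wrapped into a small self-contained sketch. The helper name `drop_unordered_duplicates` is hypothetical (it does not come from the answer), and the sample frame is rebuilt from the example in the question so the snippet runs on its own:

```python
import numpy as np
import pandas as pd


def drop_unordered_duplicates(df, cols):
    """Drop rows whose values in `cols` repeat an earlier row in any order.

    Hypothetical helper, written only to illustrate the mask approach.
    """
    # Sort the values within each row so (a, b) and (b, a) compare equal.
    sorted_pairs = np.sort(df[cols].to_numpy(), axis=1)
    # duplicated() flags every repeat after the first occurrence.
    mask = pd.DataFrame(sorted_pairs, index=df.index).duplicated()
    # Filter with the inverted mask; the original values are left untouched.
    return df[~mask]


# Sample data copied from the question.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 3, 3, 3],
    "seq_sp1": ["seq20", "seq56", "seq3", "seq9", "seq7", "seq4"],
    "seq_sp2": ["seq56", "seq20", "seq5", "seq5", "seq4", "seq7"],
})

result = drop_unordered_duplicates(df, ["seq_sp1", "seq_sp2"])
print(result)
```

Because only the mask is built from sorted values, the rows that survive keep their original column order (for example seq9, seq5 rather than seq5, seq9).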

Answer 1 (score: 1)

For example:

df_set = df.apply(lambda x: str(sorted(set(x))), 1)

In: df[~df_set.duplicated()]
Out: 
        seq_sp1 seq_sp2
cluster                
1         seq20   seq56
2          seq3    seq5
3          seq9    seq5
3          seq7    seq4
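
In the output above, cluster appears as the index, so the key is built from the two seq columns only. If cluster is a regular column instead, a possible variant (a sketch, not the answerer's code) builds the key from the two columns explicitly. It uses sorted tuples rather than the string of a set, on the assumption that a hashable, order-independent key is all that is needed:

```python
import pandas as pd

# Sample data copied from the question, with cluster as a regular column.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 3, 3, 3],
    "seq_sp1": ["seq20", "seq56", "seq3", "seq9", "seq7", "seq4"],
    "seq_sp2": ["seq56", "seq20", "seq5", "seq5", "seq4", "seq7"],
})

# One hashable, order-independent key per row, built from the two seq columns only.
keys = pd.Series(
    [tuple(sorted(pair)) for pair in zip(df["seq_sp1"], df["seq_sp2"])],
    index=df.index,
)

# Keep the first row of each key; later repeats are dropped.
deduped = df[~keys.duplicated()]
print(deduped)
```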

Answer 2 (score: 1)

You can try this:

# sort the two values of each row and join them into one string
df["seq_sorted"] = df.apply(lambda row: ",".join(x for x in sorted((row.seq_sp1,  row.seq_sp2))), axis=1)

# drop duplicates on the helper column, then remove the helper column
df = df.drop_duplicates(subset="seq_sorted").drop(["seq_sorted"], axis=1)
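
If a pair should only count as a duplicate within the same cluster, the helper-column idea can be extended by adding cluster to the subset. This is a sketch under assumptions: the extra row for cluster 5 is hypothetical and added only to show the difference, and the "," separator is assumed never to occur inside a sequence name.

```python
import pandas as pd

# Sample data from the question plus one hypothetical row (cluster 5)
# that repeats the seq20/seq56 pair in a different cluster.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 3, 3, 3, 5],
    "seq_sp1": ["seq20", "seq56", "seq3", "seq9", "seq7", "seq4", "seq56"],
    "seq_sp2": ["seq56", "seq20", "seq5", "seq5", "seq4", "seq7", "seq20"],
})

# Order-independent key: the two names sorted and joined with a separator.
df["seq_sorted"] = [
    ",".join(sorted(pair)) for pair in zip(df["seq_sp1"], df["seq_sp2"])
]

# Including "cluster" in the subset keeps a pair that recurs in another cluster.
per_cluster = (
    df.drop_duplicates(subset=["cluster", "seq_sorted"])
      .drop(columns="seq_sorted")
)
print(per_cluster)
```

With subset="seq_sorted" alone, as in the answer, the cluster 5 row would be dropped as well.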

Answer 3 (score: 0)

You can use pd.DataFrame.apply with axis=1 to apply sorted to each row, then use pd.Series.duplicated to flag the duplicates and filter them out.

# convert each sorted row to a tuple so that duplicated() can hash it
dups = df[['seq_sp1', 'seq_sp2']].apply(lambda row: tuple(sorted(row)), axis=1).duplicated()
res = df[~dups]

print(res)

   cluster seq_sp1 seq_sp2
0        1   seq20   seq56
2        2    seq3    seq5
3        3    seq9    seq5
4        3    seq7    seq4