I have a dataframe; here is an example:
cluster seq_sp1 seq_sp2
1 seq20 seq56
1 seq56 seq20
2 seq3 seq5
3 seq9 seq5
3 seq7 seq4
3 seq4 seq7
I would like to remove the duplicated sequence pairs. In this example, seq20 seq56 is a duplicate of seq56 seq20, and seq7 seq4 is a duplicate of seq4 seq7.
I think the solution would be to first sort all the columns:
cluster seq_sp1 seq_sp2
1 seq20 seq56
1 seq20 seq56
2 seq3 seq5
3 seq9 seq5
4 seq7 seq4
4 seq7 seq4
and then remove one of the two duplicated rows to get:
cluster seq_sp1 seq_sp2
1 seq20 seq56
3 seq3 seq5
4 seq9 seq5
6 seq7 seq4
Thanks for your help :)
The script you gave me returns:
cluster_name qseqid sseqid pident_x pident_y length qstart qend sstart send qspec sspec
13 cluster_016663 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1 0.93 93.0 1179 1 1175 1 1179 0035 0042
14 cluster_016663 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1 0.93 93.0 1179 1 1175 1 1179 0035 0042
16 cluster_016663 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0035_1 0.93 93.0 1179 1 1175 1 1179 0035 0042
17 cluster_016663 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0042_1 0.93 93.0 1179 1 1175 1 1179 0035 0042
19 cluster_016663 EOG090X00GO_0042_0035_1 EOG090X00GO_0035_0035_1 0.93 93.0 1179 1 1179 1 1175 0042 0035
20 cluster_016663 EOG090X00GO_0042_0035_1 EOG090X00GO_0035_0042_1 0.93 93.0 1179 1 1179 1 1175 0042 0035
22 cluster_016663 EOG090X00GO_0042_0042_1 EOG090X00GO_0035_0035_1 0.93 93.0 1179 1 1179 1 1175 0042 0035
23 cluster_016663 EOG090X00GO_0042_0042_1 EOG090X00GO_0035_0042_1 0.93 93.0 1179 1 1179 1 1175 0042 0035
Here is the head of my original data (see the picture, where the duplicated groups are shown in color):
Unnamed: 0 cluster_name qseqid sseqid pident_x pident_y length qstart qend sstart send qspec sspec
0 13 cluster_016663 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1 0.93 93.0 1179 1 1175 1 1179 35 42
1 14 cluster_016663 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1 0.93 93.0 1179 1 1175 1 1179 35 42
8 27 cluster_015764 EOG090X00LI_0035_0035_1 EOG090X00LI_0042_0042_1 0.8059999999999999 82.3 1013 1 1013 1 1008 35 42
9 28 cluster_015764 EOG090X00LI_0035_0035_1 EOG090X00LI_0042_0035_1 0.784 78.4 1013 1 1013 1 963 35 42
11 32 cluster_015764 EOG090X00LI_0042_0035_1 g1726.t1_0035_0042 0.67 58.5 1010 1 963 1 751 42 35
Here is the result I expect:
Unnamed: 0 cluster_name qseqid sseqid pident_x pident_y length qstart qend sstart send qspec sspec
0 13 cluster_016663 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1 0.93 93.0 1179 1 1175 1 1179 35 42
1 14 cluster_016663 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1 0.93 93.0 1179 1 1175 1 1179 35 42
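But that is not what I actually got. I used this code:
import numpy as np
import pandas as pd
df = pd.read_table("dataframe.txt", header=0, sep='\t')
# Sort the two IDs within each row so that reversed pairs become identical.
df[['qseqid','sseqid']] = np.sort(df[['qseqid','sseqid']], axis=1)
df = df.drop_duplicates(subset=['qseqid','sseqid'])
df.to_csv("df_test", sep='\t')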
Answer 0 (score: 1)
I think you need numpy.sort with drop_duplicates. This returns the rows with the values sorted:
import numpy as np
df[['seq_sp1','seq_sp2']] = np.sort(df[['seq_sp1','seq_sp2']], axis=1)
df = df.drop_duplicates(subset=['seq_sp1','seq_sp2'])
print(df)
cluster seq_sp1 seq_sp2
0 1 seq20 seq56
2 2 seq3 seq5
3 3 seq5 seq9
4 3 seq4 seq7
Or use DataFrame.duplicated and invert the mask with ~ to filter by boolean indexing. The output then keeps the original, unsorted values:
mask = pd.DataFrame(np.sort(df[['seq_sp1','seq_sp2']], axis=1), index=df.index).duplicated()
df = df[~mask]
print(df)
cluster seq_sp1 seq_sp2
0 1 seq20 seq56
2 2 seq3 seq5
3 3 seq9 seq5
4 3 seq7 seq4
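To make the difference between the two variants concrete, here is a minimal self-contained sketch that rebuilds the sample frame from the question and runs both. The variable names `rewritten` and `kept` are only illustrative, and the frame is typed in by hand from the example above.

```python
import numpy as np
import pandas as pd

# Sample frame copied by hand from the question.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 3, 3, 3],
    "seq_sp1": ["seq20", "seq56", "seq3", "seq9", "seq7", "seq4"],
    "seq_sp2": ["seq56", "seq20", "seq5", "seq5", "seq4", "seq7"],
})
cols = ["seq_sp1", "seq_sp2"]

# Sort the two labels inside each row so (a, b) and (b, a) become identical.
sorted_pairs = np.sort(df[cols].to_numpy(), axis=1)

# Variant 1: overwrite the columns with the sorted labels, then deduplicate.
rewritten = df.copy()
rewritten[cols] = sorted_pairs
rewritten = rewritten.drop_duplicates(subset=cols)

# Variant 2: use the sorted labels only to build a mask,
# so the surviving rows keep their original column order.
mask = pd.DataFrame(sorted_pairs, index=df.index).duplicated()
kept = df[~mask]

print(rewritten)
print(kept)
```

Both variants keep the same row labels; they differ only in whether the surviving rows show the sorted or the original values.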
EDIT:
I tested it with the new data:
df = df[['qseqid','sseqid']].copy()
print(df)
qseqid sseqid
13 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1
14 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1
16 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0035_1
17 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0042_1
19 EOG090X00GO_0042_0035_1 EOG090X00GO_0035_0035_1
20 EOG090X00GO_0042_0035_1 EOG090X00GO_0035_0042_1
22 EOG090X00GO_0042_0042_1 EOG090X00GO_0035_0035_1
23 EOG090X00GO_0042_0042_1 EOG090X00GO_0035_0042_1
df[['qseqid','sseqid']] = np.sort(df[['qseqid','sseqid']], axis=1)
df = df.drop_duplicates(subset=['qseqid','sseqid'])
print(df)
qseqid sseqid
13 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1
14 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1
16 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0035_1
17 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0042_1
The mask variant, run on the same eight rows, keeps the same rows:
mask = pd.DataFrame(np.sort(df[['qseqid','sseqid']], axis=1), index=df.index).duplicated()
print (~mask)
13 True
14 True
16 True
17 True
19 False
20 False
22 False
23 False
dtype: bool
df = df[~mask]
print (df)
qseqid sseqid
13 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1
14 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1
16 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0035_1
17 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0042_1
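If reversed pairs should only count as duplicates within the same cluster_name (an assumption; the question does not say so), the cluster label can be added to the key. The sketch below uses a small made-up frame with hypothetical IDs, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the real table; only the relevant columns.
df = pd.DataFrame({
    "cluster_name": ["cluster_A", "cluster_A", "cluster_B"],
    "qseqid": ["s1", "s2", "s2"],
    "sseqid": ["s2", "s1", "s1"],
})

# Row-wise sorted pair, plus the cluster label as a third key column.
key = pd.DataFrame(
    np.sort(df[["qseqid", "sseqid"]].to_numpy(), axis=1),
    index=df.index,
    columns=["first", "second"],
)
key["cluster_name"] = df["cluster_name"]

# A row is dropped only if the same pair already appeared in the same cluster.
deduped = df[~key.duplicated()]
print(deduped)
```

Here the second row is removed as a reversed copy of the first, while the third row survives because it belongs to a different cluster.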
Answer 1 (score: 1)
For example, with cluster as the index:
df_set = df.apply(lambda x: str(sorted(set(x))), axis=1)
In: df[~df_set.duplicated()]
Out:
seq_sp1 seq_sp2
cluster
1 seq20 seq56
2 seq3 seq5
3 seq9 seq5
3 seq7 seq4
Answer 2 (score: 1)
You can try this:
# sort the two IDs in each row and join them into one string
df["seq_sorted"] = df.apply(lambda row: ",".join(sorted((row.seq_sp1, row.seq_sp2))), axis=1)
# drop duplicates on that key, then remove the helper column
df = df.drop_duplicates(subset="seq_sorted").drop(columns="seq_sorted")
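The same string-key idea can be run without adding a temporary column, by keeping the key in a separate Series. This is a sketch on the question's sample data. The "|" separator is an assumption: it must not occur inside any sequence ID, otherwise two different pairs could produce the same key.

```python
import pandas as pd

# Sample frame copied by hand from the question.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 3, 3, 3],
    "seq_sp1": ["seq20", "seq56", "seq3", "seq9", "seq7", "seq4"],
    "seq_sp2": ["seq56", "seq20", "seq5", "seq5", "seq4", "seq7"],
})

# One order-independent string per row; "|" is an assumed separator
# that must not appear inside the IDs themselves.
key = df.apply(lambda row: "|".join(sorted((row.seq_sp1, row.seq_sp2))), axis=1)

# Keep the first row seen for each key; the original columns are untouched.
result = df[~key.duplicated()]
print(result)
```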
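Answer 3 (score: 0)
You can use pd.DataFrame.apply to apply sorted along axis=1, then use pd.Series.duplicated to find and remove the duplicates. The sorted values are wrapped in a tuple so that each row key is hashable.
dups = df[['seq_sp1', 'seq_sp2']].apply(lambda row: tuple(sorted(row)), axis=1).duplicated()
res = df[~dups]
print(res)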
cluster seq_sp1 seq_sp2
0 1 seq20 seq56
2 2 seq3 seq5
3 3 seq9 seq5
4 3 seq7 seq4
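The approaches in this thread all build an order-independent key per row and drop the rows whose key repeats. A small reusable wrapper could look like the sketch below. The function name `drop_unordered_duplicates` is hypothetical, and it builds the keys with a plain list comprehension rather than a row-wise apply.

```python
import pandas as pd


def drop_unordered_duplicates(df, cols):
    """Hypothetical helper: keep the first row for each unordered combination of `cols`."""
    # One sorted tuple per row; tuples are hashable, so duplicated() can compare them.
    keys = pd.Series(
        [tuple(sorted(values)) for values in zip(*(df[c] for c in cols))],
        index=df.index,
        dtype=object,
    )
    return df[~keys.duplicated()]


# Sample frame copied by hand from the question.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 3, 3, 3],
    "seq_sp1": ["seq20", "seq56", "seq3", "seq9", "seq7", "seq4"],
    "seq_sp2": ["seq56", "seq20", "seq5", "seq5", "seq4", "seq7"],
})
res = drop_unordered_duplicates(df, ["seq_sp1", "seq_sp2"])
print(res)
```

Because the key lives in a separate Series, the returned rows keep their original values and index labels.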