I have 2 text files, like the examples below. I named one of them first (comma separated) and the other second (tab separated).
first:
chr1,105000000,105310000,2,1,3,2
chr1,5310000,5960000,2,1,5,4
chr1,1580000,1180000,4,1,5,3
chr19,107180000,107680000,1,1,5,4
chr1,7680000,8300000,3,1,1,2
chr1,109220000,110070000,4,2,3,3
chr1,11060000,12070000,6,2,7,4
second:
AKAP8L chr19 107180100 107650000 transcript
AKAP8L chr19 15514130 15529799 transcript
AKIRIN2 chr6 88384790 88411927 transcript
AKIRIN2 chr6 88410228 88411243 transcript
AKT3 chr1 105002000 105010000 transcript
AKT3 chr1 243663021 244006886 transcript
AKT3 chr1 243665065 244013430 transcript
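In the first file, columns 2 and 3 are the start and end. In the second file, columns 3 and 4 are the start and end, respectively. I want to create a new text file from the first and second files.
In the new file, I want to count the rows in file second that match each row in file first, based on the following 3 criteria: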
1- the 1st column in file first is equal to the 2nd column in file second.
2- the 3rd column in file second is greater than the 2nd column in file first and smaller than the 3rd column in file first.
3- the 4th column in file second is also greater than the 2nd column in file first and smaller than the 3rd column in file first.
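The three criteria above can be sketched as a single predicate. This is only an illustration: the function name `is_contained` is hypothetical, and it assumes each row has already been split into a list of string fields, so the coordinates are converted to `int` before comparing.

```python
def is_contained(first_row, second_row):
    """Return True if second_row satisfies all three criteria against first_row."""
    chrom, start, end = first_row[0], int(first_row[1]), int(first_row[2])
    s_chrom, s_start, s_end = second_row[1], int(second_row[2]), int(second_row[3])
    return (
        chrom == s_chrom                # criterion 1: same chromosome
        and start < s_start < end       # criterion 2: start of second inside first
        and start < s_end < end         # criterion 3: end of second inside first
    )


# Rows taken from the sample data in the question.
first_row = "chr19,107180000,107680000,1,1,5,4".split(",")
inside = "AKAP8L chr19 107180100 107650000 transcript".split()
outside = "AKAP8L chr19 15514130 15529799 transcript".split()

print(is_contained(first_row, inside))
print(is_contained(first_row, outside))
```

The first call matches the chr19 row from the expected output; the second row is on the same chromosome but lies outside the interval.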
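The output should look like the expected output below. The first 7 columns come directly from file first, and the 9th column is the number of rows in file second that match each row in file first (based on the 3 criteria above). The 8th column is the first column of the first row in file second that matches that particular row of file first.
expected output: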
chr19,107180000,107680000,1,1,5,4,AKAP8L, 1
chr1,105000000,105310000,2,1,3,2, AKT3, 1
I tried to do this in Python and wrote the following code, but it does not return what I want.
first = open('first.csv', 'rb')
second = open('second.txt', 'rb')
first_file = []
for line in first:
    first_file.append(line.split(','))
second_file = []
for line2 in second:
    second_file.append(line.split())
count=0
final = []
for i in range(len(first_file)):
    for j in range(len(second_file)):
        first_row = first_file[i]
        second_row = second_file[j]
        first_col = first_row.split()
        second_col = second_row.split()
        if first_col[0] == second_col[1] and first_col[1] < second_col[2] < first_col[2] and first_col[1] < second_col[3] < first_col[2]
            count+=1
            final.append(first_col[i]+second_col[0]+count)
Answer 0 (score: 2)
Since you don't have column names, this doesn't look very robust, but it works, and it uses pandas:
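import pandas as pd
first = 'first.csv'
second = 'second.txt'
# Neither file has a header row, so pandas labels the columns 0, 1, 2, ...
df1 = pd.read_csv(first, header=None)
df2 = pd.read_csv(second, sep=r'\s+', header=None)
# Join on chromosome: column 0 of first, column 1 of second.
# Labels present in both frames (0-4) get a 'first'/'second' suffix; 5 and 6 stay integers.
merged = df1.merge(df2, left_on=[0], right_on=[1], suffixes=('first', 'second'))
# a, d: start and end from second; b, c: start and end from first.
a, b, c, d = merged['2second'], merged['1first'], merged['2first'], merged['3second']
# Keep only rows where the interval from second lies strictly inside the one from first.
cleaned = merged[(c>a)&(a>b)&(c>d)&(d>b)]
# Count the matching rows per row of first and per name in second.
counted = cleaned.groupby(['0first', '1first', '2first', '3first', '4first', 5, 6, '0second'])['4second'].count().reset_index()
counted.to_csv('result.csv', index=False, header=False)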
This produces a result.csv with the following contents:
chr1,105000000,105310000,2,1,3,2,AKT3,1
chr19,107180000,107680000,1,1,5,4,AKAP8L,1
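If the numeric column labels are hard to follow, the same idea can be sketched with explicit names. The names below (`chrom`, `start`, `end`, `v1` to `v4`, `gene`, `tx_start`, `tx_end`, `kind`) are assumptions made up for illustration, since the files carry no header, and the inline sample is a subset of the data in the question, read from strings instead of files.

```python
import io

import pandas as pd

# A subset of the sample data from the question, inlined for a self-contained run.
first_csv = """chr1,105000000,105310000,2,1,3,2
chr19,107180000,107680000,1,1,5,4
chr1,7680000,8300000,3,1,1,2
"""
second_txt = (
    "AKAP8L\tchr19\t107180100\t107650000\ttranscript\n"
    "AKAP8L\tchr19\t15514130\t15529799\ttranscript\n"
    "AKT3\tchr1\t105002000\t105010000\ttranscript\n"
)

# Hypothetical column names; the original files have none.
first_cols = ["chrom", "start", "end", "v1", "v2", "v3", "v4"]
second_cols = ["gene", "chrom", "tx_start", "tx_end", "kind"]

df1 = pd.read_csv(io.StringIO(first_csv), header=None, names=first_cols)
df2 = pd.read_csv(io.StringIO(second_txt), sep="\t", header=None, names=second_cols)

# Pair every row of first with every row of second on the same chromosome.
merged = df1.merge(df2, on="chrom")

# Keep pairs where the interval from second lies strictly inside the one from first.
inside = merged[
    (merged["start"] < merged["tx_start"]) & (merged["tx_start"] < merged["end"])
    & (merged["start"] < merged["tx_end"]) & (merged["tx_end"] < merged["end"])
]

# One output row per row of first: first matching gene and number of matches.
counted = (
    inside.groupby(first_cols)
    .agg(gene=("gene", "first"), matches=("gene", "size"))
    .reset_index()
)
print(counted.to_csv(index=False, header=False))
```

Note that rows of first with no match at all drop out, which is consistent with the expected output in the question. Unlike the grouping that includes the name column, this sketch groups only on the columns of first, so a row that matched several different names would still produce a single line.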
Answer 1 (score: 0)
With the same setup, it works if you do the following.
This produces the result you want.
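# Read both files; 'with' closes them automatically.
with open('first.csv', 'r') as first:
    first_file = [line.strip() for line in first if line.strip()]
with open('second.txt', 'r') as second:
    second_file = [line.split() for line in second if line.strip()]

final = []
for first_row in first_file:
    first_col = first_row.split(',')
    # Convert the coordinates to int so the comparison is numeric, not lexicographic.
    start, end = int(first_col[1]), int(first_col[2])
    count = 0           # reset the counter for every row of first
    first_match = None  # first column of the first matching row in second
    for second_col in second_file:
        if (first_col[0] == second_col[1]
                and start < int(second_col[2]) < end
                and start < int(second_col[3]) < end):
            count += 1
            if first_match is None:
                first_match = second_col[0]
    # Only rows of first with at least one match appear in the output.
    if count:
        final.append(first_row + ',' + first_match + ',' + str(count))
print(final)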
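Two details matter for a pure-Python loop like this. Fields read from a text file are strings, and strings compare character by character (for example, `"9" < "10"` is False), so the coordinates need to be converted with `int` first. And for larger files, scanning every row of second for every row of first gets slow. A minimal sketch that addresses both by grouping second by chromosome up front is shown below; the function name `count_matches` is hypothetical, and the inline sample is a subset of the data from the question.

```python
import csv
import io
from collections import defaultdict

# A subset of the sample data from the question, inlined for a self-contained run.
FIRST = (
    "chr1,105000000,105310000,2,1,3,2\n"
    "chr1,5310000,5960000,2,1,5,4\n"
    "chr19,107180000,107680000,1,1,5,4\n"
)
SECOND = (
    "AKAP8L\tchr19\t107180100\t107650000\ttranscript\n"
    "AKAP8L\tchr19\t15514130\t15529799\ttranscript\n"
    "AKT3\tchr1\t105002000\t105010000\ttranscript\n"
    "AKT3\tchr1\t243663021\t244006886\ttranscript\n"
)


def count_matches(first_lines, second_lines):
    """Return rows of first extended with the first matching name and the match count."""
    # Index second by chromosome so each row of first only scans its own chromosome.
    by_chrom = defaultdict(list)
    for gene, chrom, start, end, _kind in csv.reader(second_lines, delimiter="\t"):
        by_chrom[chrom].append((gene, int(start), int(end)))

    results = []
    for row in csv.reader(first_lines):
        lo, hi = int(row[1]), int(row[2])
        hits = [
            gene
            for gene, start, end in by_chrom.get(row[0], [])
            if lo < start < hi and lo < end < hi
        ]
        if hits:  # rows of first without any match are skipped
            results.append(row + [hits[0], str(len(hits))])
    return results


rows = count_matches(io.StringIO(FIRST), io.StringIO(SECOND))
for row in rows:
    print(",".join(row))
```

With real files, the two `io.StringIO` objects would be replaced by open file handles, for example `open('first.csv', newline='')`.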