Iterate over a column to find matches in another column

Time: 2019-04-19 14:34:41

Tags: python pandas

My file looks like this:

chr1:92092600   G[chr2:164084669[   ENSG00000189195 ENST00000342818 BTBD8   chr2:164084669
chr1:121498879  T[chr9:2781522[ ENSG00000233432 ENST00000425455 AL592494.2  chr9:2781522
chr2:101298260  ]chr3:196435392]A   ENSG00000163162 ENST00000295317 RNF149  chr3:196435392
chr2:164084669  ]chr1:92092600]G    ENSG00000237844 ENST00000429636 AC016766.1  chr1:92092600
chr9:2781522    ]chr1:121498879]T   ENSG00000080608 ENST00000490444 PUM3    chr1:121498879
chr3:196435392  A[chr2:101298260[   ENSG00000163960 ENST00000296328 UBXN7   chr2:101298260

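For each element in column 6, I want to search column 1 and, if there is a match, print the whole row. So the expected output for the first 3 elements of column 6 should look like this: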

chr2:164084669  ]chr1:92092600]G    ENSG00000237844 ENST00000429636 AC016766.1  chr1:92092600
chr9:2781522    ]chr1:121498879]T   ENSG00000080608 ENST00000490444 PUM3    chr1:121498879
chr3:196435392  A[chr2:101298260[   ENSG00000163960 ENST00000296328 UBXN7   chr2:101298260

So far I have:

import pandas as pd

pd.options.display.max_colwidth = 100
file =  open("data.txt", 'r')

chrA =[]
chrB = []
Bgenes = []

for line in file.readlines():
    chrA.append(line.split()[0])
    chrB.append(line.split()[5])
    for pos in chrB:
        if pos in chrA: 
            Bgenes.append(line)

4 Answers:

Answer 0 (score: 2)

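You can also use a list comprehension to find the matches:

with open('data.txt', 'r') as f:
    # Split each non-empty line into its whitespace-separated fields
    lines = [line.split() for line in f if line.strip()]

# Column 1 of every row, built once instead of on every iteration
first_col = [x[0] for x in lines]

for line in lines:
    try:
        # Find the row whose column 1 equals this row's column 6
        i = first_col.index(line[5])
        print(' '.join(lines[i]))
    except (ValueError, IndexError):
        # ValueError: no matching row; IndexError: fewer than 6 fields
        pass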

Output:

chr2:164084669 ]chr1:92092600]G ENSG00000237844 ENST00000429636 AC016766.1 chr1:92092600
chr9:2781522 ]chr1:121498879]T ENSG00000080608 ENST00000490444 PUM3 chr1:121498879
chr3:196435392 A[chr2:101298260[ ENSG00000163960 ENST00000296328 UBXN7 chr2:101298260
chr1:92092600 G[chr2:164084669[ ENSG00000189195 ENST00000342818 BTBD8 chr2:164084669
chr1:121498879 T[chr9:2781522[ ENSG00000233432 ENST00000425455 AL592494.2 chr9:2781522
chr2:101298260 ]chr3:196435392]A ENSG00000163162 ENST00000295317 RNF149 chr3:196435392
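
If the file is large, repeated `list.index` calls get slow, because each call scans the list from the start. A minimal sketch of the same lookup with a dictionary is shown below. It assumes every non-empty row has at least 6 fields and that, if a column 1 value repeats, the first row should win. The names `DATA`, `match_rows` and `by_first` are made up for illustration.

```python
# Sample rows copied from the question (whitespace-separated, 6 fields each)
DATA = """\
chr1:92092600   G[chr2:164084669[   ENSG00000189195 ENST00000342818 BTBD8   chr2:164084669
chr1:121498879  T[chr9:2781522[ ENSG00000233432 ENST00000425455 AL592494.2  chr9:2781522
chr2:101298260  ]chr3:196435392]A   ENSG00000163162 ENST00000295317 RNF149  chr3:196435392
chr2:164084669  ]chr1:92092600]G    ENSG00000237844 ENST00000429636 AC016766.1  chr1:92092600
chr9:2781522    ]chr1:121498879]T   ENSG00000080608 ENST00000490444 PUM3    chr1:121498879
chr3:196435392  A[chr2:101298260[   ENSG00000163960 ENST00000296328 UBXN7   chr2:101298260
"""


def match_rows(text):
    """For each row, return the row whose column 1 equals that row's column 6."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # Map column 1 -> row; if a column 1 value repeats, the first row wins
    by_first = {}
    for row in rows:
        by_first.setdefault(row[0], row)
    # Look up each column 6 value; values with no matching column 1 are skipped
    return [by_first[row[5]] for row in rows if row[5] in by_first]


matches = match_rows(DATA)
for row in matches:
    print(" ".join(row))
```

The rows come out in the order of the column 6 values, as in the output above.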

Answer 1 (score: 1)

First put the data into a pandas DataFrame, then use:

import pandas as pd

df = pd.DataFrame({"a": ["asdf", "qwer", "zxcv"],
                   "b": ["b_row_1", "b_row_2", "b_row_3"],
                   "c": ["ghjk", "qwer", "zxcv"]})

for index, row in df.iterrows():
    if row["c"] not in df["a"].tolist():
        df = df.drop(index)

The output should look like this:

      a        b     c
1  qwer  b_row_2  qwer
2  zxcv  b_row_3  zxcv
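
Dropping rows one at a time inside `iterrows` works, but the same filter can probably be written as a single boolean mask. The sketch below assumes the intent is "keep the rows whose `c` value occurs anywhere in `a`". The names `mask` and `filtered` are illustrative. One difference to be aware of: the loop checks against a DataFrame that shrinks as rows are dropped, while the mask checks against the original column. For this example both give the same rows.

```python
import pandas as pd

df = pd.DataFrame({"a": ["asdf", "qwer", "zxcv"],
                   "b": ["b_row_1", "b_row_2", "b_row_3"],
                   "c": ["ghjk", "qwer", "zxcv"]})

# Boolean mask: True where the value in "c" also occurs somewhere in "a"
mask = df["c"].isin(df["a"])

# Keep only the rows where the mask is True; the original df is left unchanged
filtered = df[mask]
print(filtered)
```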

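You can read the file into a pandas DataFrame with something like this:

# sep=r"\s+" splits on any run of whitespace, since the column spacing in the file varies
data = pd.read_csv('output_list.txt', sep=r"\s+", header=None)
# one name per column (the file in the question has 6 columns)
data.columns = ["a", "b", "c", "d", "e", "f"]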

Check out these links:

Load data from txt with pandas

How to iterate over rows in a DataFrame in pandas

Pandas dataframe drop

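Answer 2 (score: 0)

You need one separate "for" loop to collect the lines and another loop to search them.

# `file` and `Bgenes` are the variables defined in the question's code
lines = file.readlines()
for line in lines:
    for line2 in lines:
        # Column 6 of one row matches column 1 of another row
        if line.split()[5] == line2.split()[0]:
            Bgenes.append(line2)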
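
The nested loop compares every line with every other line. A rough two-pass sketch of the same "collect, then search" idea with a set is below. It assumes each non-empty line has at least 6 fields and returns each matching row once, in file order. The function name `rows_matching_column6` and the `sample` list (three rows taken from the question) are only for illustration.

```python
def rows_matching_column6(lines):
    """Return the lines whose column 1 appears in column 6 of any line."""
    rows = [line.split() for line in lines if line.strip()]
    # First pass: collect every column 6 value
    targets = {row[5] for row in rows}
    # Second pass: keep the rows whose column 1 is in that set
    return [" ".join(row) for row in rows if row[0] in targets]


sample = [
    "chr1:92092600 G[chr2:164084669[ ENSG00000189195 ENST00000342818 BTBD8 chr2:164084669",
    "chr1:121498879 T[chr9:2781522[ ENSG00000233432 ENST00000425455 AL592494.2 chr9:2781522",
    "chr2:164084669 ]chr1:92092600]G ENSG00000237844 ENST00000429636 AC016766.1 chr1:92092600",
]

result = rows_matching_column6(sample)
for row in result:
    print(row)
```

With the file from the question, `rows_matching_column6(open("data.txt"))` should work the same way, since iterating over an open file yields its lines.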

I hope this helps :)

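Answer 3 (score: 0)

I am assuming your data can be comma-separated (the commas can be added), because the whitespace in the original data is inconsistent. Here is the code; it produces the result you want.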

import pandas as pd
data1 = pd.read_csv('C:/data.csv', sep=',', header=None)
data2 = pd.read_csv('C:/data.csv', sep=',', header=None)
df1 = pd.DataFrame(data1)  # create FIRST dataframe
df2 = pd.DataFrame(data2)  # create SECOND dataframe

df1.columns = ['1','2','3','4','5','ID']  # assigning ID to column 6
df2.columns = ['ID','2','3','4','5','6']  # assigning ID to column 1

dfMerged1=pd.merge(df1, df2, on='ID', how='inner') 
dfMerged2=pd.merge(df2, dfMerged1, on='ID', how='inner')

dfCleaned=dfMerged2.iloc[:,0:6] #what you want at the end
print(dfCleaned)
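
The two merges can probably be reduced to one self-merge. The sketch below is only an illustration under some assumptions: the rows are built in memory from three lines of the question instead of being read from a CSV file, and the column names `c1` to `c6` and the variables `keys` and `matched` are made up.

```python
import pandas as pd

# Three rows from the question, already split into their 6 fields
rows = [
    ["chr1:92092600", "G[chr2:164084669[", "ENSG00000189195",
     "ENST00000342818", "BTBD8", "chr2:164084669"],
    ["chr1:121498879", "T[chr9:2781522[", "ENSG00000233432",
     "ENST00000425455", "AL592494.2", "chr9:2781522"],
    ["chr2:164084669", "]chr1:92092600]G", "ENSG00000237844",
     "ENST00000429636", "AC016766.1", "chr1:92092600"],
]
df = pd.DataFrame(rows, columns=["c1", "c2", "c3", "c4", "c5", "c6"])

# Unique column 6 values, renamed so they line up with column 1
keys = df[["c6"]].drop_duplicates().rename(columns={"c6": "c1"})

# Inner join: keeps only the rows whose column 1 is among the column 6 values
matched = df.merge(keys, on="c1", how="inner")
print(matched)
```

Because `keys` carries only the join column, the result keeps the original six columns and no `_x`/`_y` suffixes need to be cleaned up afterwards.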
