My file looks like this:
chr1:92092600 G[chr2:164084669[ ENSG00000189195 ENST00000342818 BTBD8 chr2:164084669
chr1:121498879 T[chr9:2781522[ ENSG00000233432 ENST00000425455 AL592494.2 chr9:2781522
chr2:101298260 ]chr3:196435392]A ENSG00000163162 ENST00000295317 RNF149 chr3:196435392
chr2:164084669 ]chr1:92092600]G ENSG00000237844 ENST00000429636 AC016766.1 chr1:92092600
chr9:2781522 ]chr1:121498879]T ENSG00000080608 ENST00000490444 PUM3 chr1:121498879
chr3:196435392 A[chr2:101298260[ ENSG00000163960 ENST00000296328 UBXN7 chr2:101298260
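For each element in column 6, I want to search column 1 and, if it is found there, print the whole matching row. So for the first 3 elements of column 6, the expected output should look like this: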
chr2:164084669 ]chr1:92092600]G ENSG00000237844 ENST00000429636 AC016766.1 chr1:92092600
chr9:2781522 ]chr1:121498879]T ENSG00000080608 ENST00000490444 PUM3 chr1:121498879
chr3:196435392 A[chr2:101298260[ ENSG00000163960 ENST00000296328 UBXN7 chr2:101298260
So far I have:
import pandas as pd
pd.options.display.max_colwidth = 100
file = open("data.txt", 'r')
chrA = []
chrB = []
Bgenes = []
for line in file.readlines():
    chrA.append(line.split()[0])
    chrB.append(line.split()[5])
for pos in chrB:
    if pos in chrA:
        Bgenes.append(line)
Answer 0 (score: 2)
You can also use a list comprehension to find the matches:
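with open('data.txt', 'r') as f:
    # Split every non-empty line into its whitespace-separated fields.
    lines = [line.split() for line in f if line.strip()]
# Build the list of column-1 values once instead of on every iteration.
first_col = [x[0] for x in lines]
for line in lines:
    try:
        # list.index raises ValueError when the value is not in the list.
        i = first_col.index(line[5])
        print(' '.join(lines[i]))
    except (ValueError, IndexError):
        # No match in column 1, or the row has fewer than 6 fields.
        pass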
Output:
chr2:164084669 ]chr1:92092600]G ENSG00000237844 ENST00000429636 AC016766.1 chr1:92092600
chr9:2781522 ]chr1:121498879]T ENSG00000080608 ENST00000490444 PUM3 chr1:121498879
chr3:196435392 A[chr2:101298260[ ENSG00000163960 ENST00000296328 UBXN7 chr2:101298260
chr1:92092600 G[chr2:164084669[ ENSG00000189195 ENST00000342818 BTBD8 chr2:164084669
chr1:121498879 T[chr9:2781522[ ENSG00000233432 ENST00000425455 AL592494.2 chr9:2781522
chr2:101298260 ]chr3:196435392]A ENSG00000163162 ENST00000295317 RNF149 chr3:196435392
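Calling index for every row scans the column-1 list each time, so the work grows quadratically with the number of rows. As a minimal sketch (not part of the original answer), a dictionary keyed on column 1 finds the same matches with one pass to build it and one pass to look things up. The function name rows_matching_column6 is hypothetical, the three sample rows are copied from the question's file, and the sketch assumes every non-empty row has at least six whitespace-separated fields; when a column-1 value repeats, only the first row carrying it is kept.

```python
def rows_matching_column6(rows):
    """For each row, return the row whose column 1 equals that row's column 6."""
    # Map each column-1 value to the first row that carries it.
    by_first = {}
    for row in rows:
        by_first.setdefault(row[0], row)
    # Look up every column-6 value; rows without a partner are skipped.
    return [by_first[row[5]] for row in rows if row[5] in by_first]


# Three rows taken from the question's data; the RNF149 row has no partner here.
sample = """\
chr1:92092600 G[chr2:164084669[ ENSG00000189195 ENST00000342818 BTBD8 chr2:164084669
chr2:101298260 ]chr3:196435392]A ENSG00000163162 ENST00000295317 RNF149 chr3:196435392
chr2:164084669 ]chr1:92092600]G ENSG00000237844 ENST00000429636 AC016766.1 chr1:92092600
"""

rows = [line.split() for line in sample.splitlines() if line.strip()]
for row in rows_matching_column6(rows):
    print(" ".join(row))
```

With a real file, rows would be built from open("data.txt") in the same way instead of from the inline string.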
Answer 1 (score: 1)
First load the data into a pandas DataFrame, then use:
import pandas as pd
df = pd.DataFrame({"a": ["asdf", "qwer", "zxcv"],
                   "b": ["b_row_1", "b_row_2", "b_row_3"],
                   "c": ["ghjk", "qwer", "zxcv"]})
# Keep only the rows whose "c" value also occurs somewhere in column "a".
# A boolean mask avoids dropping rows from df while iterating over it.
df = df[df["c"].isin(df["a"])]
print(df)
The output should look like this:
      a        b     c
1  qwer  b_row_2  qwer
2  zxcv  b_row_3  zxcv
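You can read the file into a pandas DataFrame with something like this:
data = pd.read_csv('output_list.txt', sep=" ", header=None)
# Supply exactly one name per column; "etc." stands in for the remaining names.
data.columns = ["a", "b", "c", "etc."]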
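To connect the toy example to the question's file, here is a hedged sketch that reads three of the question's rows and keeps those whose column 1 also appears in column 6. The column names (pos, alt, gene_id, transcript_id, gene_name, mate_pos) are invented for illustration and are not defined anywhere in the question. Reading from io.StringIO stands in for a real file path, and sep=r"\s+" is used on the assumption that the fields are separated by runs of whitespace.

```python
import io

import pandas as pd

# Three rows copied from the question; a real script would pass a file path instead.
sample = io.StringIO(
    "chr1:92092600 G[chr2:164084669[ ENSG00000189195 ENST00000342818 BTBD8 chr2:164084669\n"
    "chr2:101298260 ]chr3:196435392]A ENSG00000163162 ENST00000295317 RNF149 chr3:196435392\n"
    "chr2:164084669 ]chr1:92092600]G ENSG00000237844 ENST00000429636 AC016766.1 chr1:92092600\n"
)

# header=None because the file has no header row; sep=r"\s+" splits on any whitespace run.
data = pd.read_csv(sample, sep=r"\s+", header=None)
# Hypothetical column names, one per column.
data.columns = ["pos", "alt", "gene_id", "transcript_id", "gene_name", "mate_pos"]

# Boolean mask: True where the column-1 value occurs anywhere in column 6.
matches = data[data["pos"].isin(data["mate_pos"])]
print(matches.to_string(index=False))
```

The RNF149 row is filtered out here only because its partner row was left out of the three-row sample.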
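Answer 2 (score: 0)
You need one separate "for" loop to collect the lines and another loop to search them.
with open("data.txt") as file:
    lines = file.readlines()
Bgenes = []
for line in lines:
    for line2 in lines:
        # Collect every row whose column 1 equals this row's column 6.
        if line.split()[5] == line2.split()[0]:
            Bgenes.append(line2)
I hope this helps :)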
Answer 3 (score: 0)
I am assuming your data can be comma-separated (the commas can be added), because the whitespace in the raw data is inconsistent. Here is the code; its result is what you are after.
import pandas as pd
# Read the comma-separated file once and work on two independent copies.
data = pd.read_csv('C:/data.csv', sep=',', header=None)
df1 = data.copy()  # first DataFrame
df2 = data.copy()  # second DataFrame
df1.columns = ['1', '2', '3', '4', '5', 'ID']  # assign ID to column 6
df2.columns = ['ID', '2', '3', '4', '5', '6']  # assign ID to column 1
# The inner joins on ID keep rows whose column 1 appears in column 6.
dfMerged1 = pd.merge(df1, df2, on='ID', how='inner')
dfMerged2 = pd.merge(df2, dfMerged1, on='ID', how='inner')
dfCleaned = dfMerged2.iloc[:, 0:6]  # what you want at the end
print(dfCleaned)
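If converting the file to CSV is inconvenient, a single self-join may be enough. The sketch below is one possible variant, not the answer's own method: it reads whitespace-separated rows directly with sep=r"\s+" and pairs each row with the row whose column 1 equals its column 6. The column names and the _mate suffix are hypothetical, and the three sample rows are copied from the question.

```python
import io

import pandas as pd

# Three rows copied from the question; a real script would pass a file path instead.
sample = (
    "chr1:92092600 G[chr2:164084669[ ENSG00000189195 ENST00000342818 BTBD8 chr2:164084669\n"
    "chr2:101298260 ]chr3:196435392]A ENSG00000163162 ENST00000295317 RNF149 chr3:196435392\n"
    "chr2:164084669 ]chr1:92092600]G ENSG00000237844 ENST00000429636 AC016766.1 chr1:92092600\n"
)

# Hypothetical column names; sep=r"\s+" tolerates uneven runs of whitespace.
names = ["pos", "alt", "gene_id", "transcript_id", "gene_name", "mate_pos"]
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, names=names)

# Self-join: the left side is each row, the right side is the row whose
# column 1 ("pos") equals the left row's column 6 ("mate_pos").
pairs = df.merge(df, left_on="mate_pos", right_on="pos",
                 how="inner", suffixes=("", "_mate"))
print(pairs[["gene_name", "gene_name_mate"]])
```

Each output row then carries both the original record and its partner, so the partner's fields can be read from the columns ending in _mate.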