Background:
Table$Gene=Gene1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.928 0.00484 0.918 0.937
1 2664 304 0.822 0.00714 0.808 0.836
2 2360 104 0.786 0.00766 0.771 0.801
3 2256 48 0.769 0.00787 0.754 0.784
4 2208 40 0.755 0.00803 0.739 0.771
5 2256 48 0.769 0.00787 0.754 0.784
6 2208 40 0.755 0.00803 0.739 0.771

Table$Gene=Gene2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.938 0.00484 0.918 0.937
1 2664 304 0.822 0.00714 0.808 0.836
2 2360 104 0.786 0.00766 0.771 0.801
3 2256 48 0.769 0.00787 0.754 0.784
4 1000 40 0.744 0.00803 0.739 0.774
# There is a newline ("\n") here too; it just doesn't show up in the code.
What I want seems simple. I want to convert the file above into output that looks like this:
Gene1 0.755
Gene2 0.744
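That is, for each gene, the last number in the survival column of its section.
I have tried several approaches: regular expressions, and reading the file in as a list and calling ".next()". One example of the code I tried: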
fileopen = open(sys.argv[1]).readlines() # Read in the file as a list.
for index,line in enumerate(fileopen): # Enumerate items in list
    if "Table" in line: # Find the items with "Table" (This will have my gene name)
        line2 = line.split("=")[1] # Parse line to get my gene name
        if "\n" in fileopen[index+1]: # This is the problem section.
            print fileopen[index]
        else:
            fileopen[index+1]
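As you can see in the problem section, what I was trying to say in this attempt is:
If the next item in the list is a newline, print the current item; otherwise, make the next line the current line (which I can then split to pull out the particular number I want).
If anyone could correct this code so that I can see what I did wrong, I would appreciate it.
Answer 0 (score: 1)
It is a bit overkill, but instead of hand-writing a parser for each data item, use an existing package such as pandas to read the csv file. You only need to write a little code to identify the relevant lines in the file. Unoptimized code (it reads the file twice):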
import pandas as pd

def genetable(gene):
    l = open('gene.txt').readlines()
    l.append("\n") # add newline to end of file in case last line is not newline
    lines = len(l)
    skiprows = -1
    for (i, line) in enumerate(l):
        if "Table$Gene=Gene"+str(gene) in line:
            skiprows = i+1
        if skiprows>=0 and line=="\n":
            skipfooter = lines - i - 1
            df = pd.read_csv('gene.txt', sep='\t', engine='python', skiprows=skiprows, skipfooter=skipfooter)
            # assuming tab separated data given your inputs. change as needed
            # assert df.columns.....
            return df
    return "Not Found"
This reads all the relevant data from that file into a DataFrame.
You can then do:
genetable(2).survival # series with all survival rates
genetable(2).survival.iloc[-1] # last item in survival
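The advantage of this approach is that you have access to every item, you are more likely to catch malformed parts of the file, and you avoid using incorrect values. If this were my own code, I would add assertions on the column names before returning the pandas DataFrame, so that any error is caught early during parsing and does not propagate.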
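The column-name check mentioned above could look roughly like the following. This is only a sketch: `read_block` and `EXPECTED_COLUMNS` are hypothetical names, the column list is copied from the header shown in the question, and the sample text assumes the table is tab-separated.

```python
import io

import pandas as pd

# Column names as they appear in the question's header line (assumed exact).
EXPECTED_COLUMNS = ["time", "n.risk", "n.event", "survival",
                    "std.err", "lower 95% CI", "upper 95% CI"]


def read_block(text):
    """Parse one tab-separated table block and validate its header."""
    df = pd.read_csv(io.StringIO(text), sep="\t")
    # Fail early if the header is not what we expect.
    assert list(df.columns) == EXPECTED_COLUMNS, list(df.columns)
    return df


# Hypothetical sample: the last two rows of the Gene2 table, tab-separated.
sample = (
    "time\tn.risk\tn.event\tsurvival\tstd.err\tlower 95% CI\tupper 95% CI\n"
    "3\t2256\t48\t0.769\t0.00787\t0.754\t0.784\n"
    "4\t1000\t40\t0.744\t0.00803\t0.739\t0.774\n"
)
df = read_block(sample)
last_survival = df["survival"].iloc[-1]  # last value in the survival column
```

If the header ever changes (for example a renamed or missing column), the assertion fails at parse time instead of a wrong column being read silently.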
Answer 1 (score: 0)
This worked when I tried it:
# filelines is assumed to be the list returned by readlines() on the input file
gene = 1
for i in range(len(filelines)):
    if filelines[i].strip() == "":
        print("Gene" + str(gene) + " " + filelines[i-1].split()[3])
        gene += 1
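The loop above numbers the genes with a counter and only prints when it reaches a blank line, so it depends on the genes appearing in order and on the file ending with a blank line. A possible variant, sketched here in Python 3 with `itertools.groupby`, reads the gene name from the "Table" line and does not need the trailing blank line. The function name `last_survival_per_gene` and the shortened sample are made up for illustration, and the sketch assumes every block starts with a "Table$Gene=..." line.

```python
from itertools import groupby


def last_survival_per_gene(lines):
    """Return (gene, survival) pairs, one per blank-line-separated block."""
    results = []
    # groupby yields runs of consecutive lines that share the same key;
    # the key is True for lines with text and False for blank lines.
    for has_text, block in groupby(lines, key=lambda ln: bool(ln.strip())):
        if not has_text:
            continue
        block = list(block)
        gene = block[0].strip().split("=")[1]  # "Table$Gene=Gene1" -> "Gene1"
        survival = block[-1].split()[3]        # 4th whitespace-separated field
        results.append((gene, survival))
    return results


# Shortened sample in the question's layout, with no trailing blank line.
sample = """Table$Gene=Gene1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
5 2256 48 0.769 0.00787 0.754 0.784
6 2208 40 0.755 0.00803 0.739 0.771

Table$Gene=Gene2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
3 2256 48 0.769 0.00787 0.754 0.784
4 1000 40 0.744 0.00803 0.739 0.774"""

pairs = last_survival_per_gene(sample.splitlines())
for gene, survival in pairs:
    print(gene, survival)
```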
Answer 2 (score: 0)
You could try something like this (I copied your data into foo.dat):
In [1]: with open('foo.dat') as input:
   ...:     lines = input.readlines()
   ...:
Using with ensures that the file is closed after reading.
In [3]: lines = [ln.strip() for ln in lines]
This strips the extra whitespace.
In [5]: startgenes = [n for n, ln in enumerate(lines) if ln.startswith("Table")]
In [6]: startgenes
Out[6]: [0, 10]
In [7]: emptylines = [n for n, ln in enumerate(lines) if len(ln) == 0]
In [8]: emptylines
Out[8]: [9, 17]
Using emptylines relies on the fact that records are separated by lines containing only whitespace.
In [9]: lastlines = [n-1 for n, ln in enumerate(lines) if len(ln) == 0]
In [10]: for first, last in zip(startgenes, lastlines):
   ....:     gene = lines[first].split("=")[1]
   ....:     num = lines[last].split()[3]
   ....:     print gene, num
   ....:
Gene1 0.755
Gene2 0.744
Answer 3 (score: 0)
Here is my solution:
>>> with open('t.txt','r') as f:
...     for l in f:
...         if "Table" in l:
...             gene = l.split("=")[1][:-1]
...         elif l not in ['\n', '\r\n']:
...             surv = l.split()[3]
...         else:
...             print gene, surv
...
Gene1 0.755
Gene2 0.744
Answer 4 (score: 0)
Instead of checking for newlines, just print once you have finished reading the file:
lines = open("testgenes.txt").readlines()
table = ""
finalsurvival = 0.0
for line in lines:
    if "Table" in line:
        if table != "": # print previous survival
            print table, finalsurvival
        table = line.strip().split('=')[1]
    else:
        try:
            finalsurvival = line.split('\t')[4]
        except IndexError:
            continue
print table, finalsurvival
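The same "keep overwriting, report at the end" idea can be sketched in Python 3 with a dictionary, so that nothing has to be printed mid-loop. The function name `final_survival` and the sample lines are hypothetical. The sketch also assumes that survival is the fourth whitespace-separated field (index 3), which matches the columns shown in the question; adjust the index if your file is laid out differently.

```python
def final_survival(lines):
    """Map each gene name to the survival value of its last data row."""
    result = {}
    gene = None
    for line in lines:
        fields = line.split()
        if line.startswith("Table"):
            gene = line.strip().split("=")[1]
        elif gene is not None and len(fields) >= 4 and fields[0].isdigit():
            # Data rows start with an integer time. Each row overwrites the
            # previous one, so the last row of every table wins.
            result[gene] = fields[3]
    return result


# Hypothetical sample in the question's layout.
header = "time n.risk n.event survival std.err lower 95% CI upper 95% CI\n"
sample_lines = [
    "Table$Gene=Gene1\n",
    header,
    "5 2256 48 0.769 0.00787 0.754 0.784\n",
    "6 2208 40 0.755 0.00803 0.739 0.771\n",
    "\n",
    "Table$Gene=Gene2\n",
    header,
    "4 1000 40 0.744 0.00803 0.739 0.774\n",
]

survival_by_gene = final_survival(sample_lines)
for gene, survival in survival_by_gene.items():
    print(gene, survival)
```

Header lines and blank lines are skipped because their first field is not an integer, so no exception handling is needed.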