Question

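I'm trying to analyze some information in this text file (one of many that I want to analyze in the same format).
- The .txt file contains a table with the information I need
- The table always has 16 columns, but the number of rows varies
- The table has columns separated by pipes "|" and rows separated by: '+--------+--------+'
- I split the file (.split('+---+')) into a list ('newlist') in which each element is one row (row 1 = newlist[0])
- I cut the file off after the table ends (where '.. image::' is)
- Now I want to split the rows into their columns, which I can easily do with .split('|')
- I created some loops that work well and take the variable number of rows into account
- def row() turns an element of newlist into list_i; list_i is a list in which each element is the content of one box in that row (using split('|')). For this particular test file I can go up to row(29)
- I'm interested in the column-wise data. The next loop creates a list from the column information: def column() looks at all rows in the range (number of rows) and pulls the same index from each of them. column(9) would pull row(0)[9], row(1)[9] ... all the way to the last row
- My problem is that this works well until I get to column(9), and then it says list index out of range

Sorry, I know this has been asked many times, but I can't figure out what's wrong.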

Thanks!

Input file: https://drive.google.com/open?id=0B_JDBrcvs5VcRU1ueE5kUlVoYlk

    f = open("999A.txt")


    text_in_file = f.read().strip().split('+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+')
    f.close()

    newlist = []

    for item in text_in_file:
        newlist.append(item)
    matching = [s for s in newlist if ".. image::" in s]

    for item in newlist:
        if newlist.index(item) >= newlist.index(matching[0]):
            newlist.remove(item)


    num_rows = len(newlist) - 1


    def row(i):
        row_i = newlist[i+1]
        list_i = list(row_i.strip().split('|'))
        return list_i[1:17]

    def column(i):
        list_i = []
        for z in range(num_rows):
            list_i.append(row(z)[i])
        return list_i[1:]

    for i in range(30):
        print(row(i))
    print("columns:")
    for i in range(16):
        print(column(i))

Answer 1

The table always has 16 columns

Not true: the header row has only 8 cells, so you will get an index error on that row.

|  *L1 barcodes*  |  *L2 barcodes*  |  *L3 barcodes*  |  *L4 barcodes*  |  *L5 barcodes*  |  *L6 barcodes*  |  *L7 barcodes*  |  *L8 barcodes*  |
| CTCTCT | 27.66% | GTTTCG | 9.04%  | NNNNNN | 3.67%  | ATTCGG | 7.41%  | GACGAT | 6.90%  | GAACCC | 13.29% | GTAACA | 9.50%  | ATCGCC | 56.24% |

Sample code to see this:

with open("999A.txt") as f:
    for line in f:
        line = line.strip()
        if line.startswith("|"):
            print(line)
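
Counting the cells in the two rows quoted above makes the mismatch concrete. This is only a minimal sketch: `cells` is a helper name made up for illustration and is not part of the original code.

```python
# Hypothetical helper for illustration: split one table line into stripped cells.
def cells(line):
    # Drop the empty strings produced by the leading and trailing "|".
    return [c.strip() for c in line.strip().split("|")[1:-1]]

header = "|  *L1 barcodes*  |  *L2 barcodes*  |  *L3 barcodes*  |  *L4 barcodes*  |  *L5 barcodes*  |  *L6 barcodes*  |  *L7 barcodes*  |  *L8 barcodes*  |"
first_row = "| CTCTCT | 27.66% | GTTTCG | 9.04%  | NNNNNN | 3.67%  | ATTCGG | 7.41%  | GACGAT | 6.90%  | GAACCC | 13.29% | GTAACA | 9.50%  | ATCGCC | 56.24% |"

print(len(cells(header)))     # 8
print(len(cells(first_row)))  # 16
```

A list of 8 cells only has indices 0 through 7, so asking the header row for index 8 or higher raises IndexError.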

If you only want the rows that have the expected number of columns, check the length of the split row like this.

data = []
with open("999A.txt") as f:
    for line in f:
        line = line.strip()
        if line.startswith("|"):
            cols = line.split("|")[1:-1] # remove outside empty strings
            cols = list(map(str.strip, cols)) # strip the remaining strings
            if len(cols) == 16 and not all(x == '' for x in cols):
                # keep rows with 16 columns that are not entirely empty
                data.append(cols)

for row in data:
    # do something
    print(row)

Sample output

['CTCTCT', '27.66%', 'GTTTCG', '9.04%', 'NNNNNN', '3.67%', 'ATTCGG', '7.41%', 'GACGAT', '6.90%', 'GAACCC', '13.29%', 'GTAACA', '9.50%', 'ATCGCC', '56.24%']
['TGTGTG', '27.54%', 'ATTCCT', '5.78%', 'TTCAGA', '3.11%', 'CGAATC', '6.70%', 'ATTCGG', '6.45%', 'TGCTGT', '13.18%', 'TGCTGT', '8.64%', 'GCTATT', '9.98%']
['ACACAC', '22.70%', 'ATGTCA', '4.47%', 'AGGTTT', '3.01%', 'GACGAT', '6.36%', 'CCATTA', '6.37%', 'TTCAGA', '12.19%', 'CCTGAG', '7.82%', 'CCGAGT', '8.79%']
['GAGAGA', '16.18%', 'GTGGCC', '4.06%', 'CCTGAG', '2.71%', 'GCTATT', '6.26%', 'TTGCCG', '6.23%', 'CCTGAG', '11.42%', 'AAGCTC', '7.77%', 'TAATAG', '5.72%']
['', '', 'GNNTNG', '3.96%', 'GAACCC', '2.47%', 'AGTAGC', '6.11%', 'TAGGCT', '6.14%', 'AGGTTT', '11.39%', 'GAACCC', '7.62%', 'CCATTA', '3.70%']
['', '', 'GTGAAA', '3.47%', '', '', 'CCATTA', '6.10%', 'GCCTAA', '6.07%', 'GTAACA', '11.36%', 'CTTAAA', '7.56%', '', '']
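
Since the question is after column-wise lists, one possible way to get them from the filtered rows is to transpose `data` with `zip(*data)`. This is a sketch, assuming every kept row has exactly 16 cells (which the length check in the code above ensures); `columns` is a name introduced here for illustration.

```python
# Two rows copied from the sample output, standing in for the full `data` list.
data = [
    ['CTCTCT', '27.66%', 'GTTTCG', '9.04%', 'NNNNNN', '3.67%', 'ATTCGG', '7.41%',
     'GACGAT', '6.90%', 'GAACCC', '13.29%', 'GTAACA', '9.50%', 'ATCGCC', '56.24%'],
    ['TGTGTG', '27.54%', 'ATTCCT', '5.78%', 'TTCAGA', '3.11%', 'CGAATC', '6.70%',
     'ATTCGG', '6.45%', 'TGCTGT', '13.18%', 'TGCTGT', '8.64%', 'GCTATT', '9.98%'],
]

# zip(*data) transposes the rows: the i-th tuple holds cell i of every row.
columns = [list(col) for col in zip(*data)]

print(columns[0])  # ['CTCTCT', 'TGTGTG']
print(columns[9])  # ['6.90%', '6.45%']
```

Because the header row was filtered out before transposing, every index from 0 to 15 is valid for every row.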

You may also want to group each pair of elements in the list, to preserve the original 8 columns.

That would look like this

...
# keep rows with 16 columns that are not entirely empty
cols_iter = iter(cols)
data.append(list(zip(cols_iter, cols_iter)))

With output like this

[('CTCTCT', '27.66%'), ('GTTTCG', '9.04%'), ('NNNNNN', '3.67%'), ('ATTCGG', '7.41%'), ('GACGAT', '6.90%'), ('GAACCC', '13.29%'), ('GTAACA', '9.50%'), ('ATCGCC', '56.24%')]
[('TGTGTG', '27.54%'), ('ATTCCT', '5.78%'), ('TTCAGA', '3.11%'), ('CGAATC', '6.70%'), ('ATTCGG', '6.45%'), ('TGCTGT', '13.18%'), ('TGCTGT', '8.64%'), ('GCTATT', '9.98%')]
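
The pairing works because both arguments to `zip` are the same iterator object, so building each tuple consumes two consecutive items. A small standalone sketch, using a shortened row for illustration:

```python
# First six cells of the first sample row, shortened for illustration.
cols = ['CTCTCT', '27.66%', 'GTTTCG', '9.04%', 'NNNNNN', '3.67%']

# zip pulls from the same iterator twice per tuple, so consecutive
# items end up paired as (sequence, percentage).
cols_iter = iter(cols)
pairs = list(zip(cols_iter, cols_iter))

print(pairs)  # [('CTCTCT', '27.66%'), ('GTTTCG', '9.04%'), ('NNNNNN', '3.67%')]
```

With an odd number of items the last one would be dropped silently, which is not a concern here because only rows with 16 cells are kept.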

Extending that, you can print each element

for row in data:
    # do something
    for seq, percent in row:
        if '' not in (seq, percent):
            print(seq, percent)

Output

CTCTCT 27.66%
GTTTCG 9.04%
NNNNNN 3.67%
ATTCGG 7.41%
GACGAT 6.90%
