Question

The example data looks like this:

lst = ['ms 20 3 -s 10 \n', '17954 11302 58011\n', '\n', '$$\n', 'segsites: 10\n', 'positions: 0.0706 0.2241 0.2575 0.889 \n', '0001000010\n', '0101000010\n', '0101010010\n', '0001000010\n', '\n', '$$\n', 'segsites: 10\n', 'positions: 0.0038 0.1622 0.1972 \n', '0110000110\n', '1001001000\n', '0010000110\n', '$$\n', 'segsites: 10\n', 'positions: 0.0155 0.0779 0.2092 \n', '0000001011\n', '0000001011\n', '0000001011\n']

Each new set begins with $$. I need to parse the data so that I end up with the following list of lists.

sample = [['0001000010', '0101000010', '0101010010', '0001000010'],['0110000110', '1001001000', '0010000110'],['0000001011', '0000001011', '0000001011']] # Required Output

Attempted code

sample =[[]]
sample1 = ""
seqlist = []

for line in lst: 
    if line.startswith("$$"):
        if line in '01': #Line contains only 0's or 1
          sample1.append(line) #Append each line that with 1 and 0's in a string one after another
    sample.append(sample1.strip()) #Do this or last line is lost
print sample

Output:[[], '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

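I am new to parsing data and am trying to figure out how to do this. Suggestions on how to modify the code, along with an explanation, are appreciated.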

Answer 1

Your problem (or at least one of them) is here: if line in '01'.

For strings, in is a substring test, so this line is true only when line is '', '0', '1' or '01'. That is definitely not what you want.
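A quick check of how in behaves on strings shows why the test never succeeds on this data. This is only a small illustration; the variable line is one of the entries from lst.

```python
# `in` between two strings is a substring test, not a per-character test.
line = '0001000010\n'

print('0' in '01')            # True: '0' is a substring of '01'
print('01' in '01')           # True
print('' in '01')             # True: the empty string is a substring of any string
print(line in '01')           # False: the whole line is not a substring of '01'
print(line.strip() in '01')   # False even without the trailing newline
```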

A basic but effective approach is to test whether each string consists only of 0s and 1s:

def is_binary(string):
    # Reject the string as soon as a character other than '0' or '1' is found.
    for c in string:
        if c not in '01':
            return False
    return True

This function returns True if string can be interpreted as a binary value, and False otherwise.

Of course, you still have to deal with the '\n' at the end, but you get the main idea ;)
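Putting the pieces together, one possible way to use is_binary for the grouping might look like the sketch below. The names group_binary_lines and current are made up for illustration, lst is shortened from the question, and the sketch assumes every set starts with a '$$' line. The is_binary here also adds an empty-string check, because the loop version returns True for '' and blank lines would otherwise slip through once the '\n' is stripped.

```python
def is_binary(string):
    """Return True if string is non-empty and made up only of '0' and '1'."""
    if not string:
        return False  # treat blank lines as non-binary
    for c in string:
        if c not in '01':
            return False
    return True


def group_binary_lines(lines):
    """Collect binary lines into one sub-list per '$$' marker (hypothetical helper)."""
    groups = []
    current = None  # no group is open before the first '$$'
    for line in lines:
        line = line.strip()  # drop the trailing '\n'
        if line.startswith('$$'):
            current = []     # a '$$' line opens a new group
            groups.append(current)
        elif current is not None and is_binary(line):
            current.append(line)
    return groups


lst = ['ms 20 3 -s 10 \n', '\n', '$$\n', 'segsites: 10\n', '0001000010\n',
       '0101000010\n', '\n', '$$\n', 'positions: 0.0038 \n', '0110000110\n']
print(group_binary_lines(lst))
# [['0001000010', '0101000010'], ['0110000110']]
```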

Answer 2

I did it the following way:

import re

lst = ['ms 20 3 -s 10 \n', '17954 11302 58011\n', '\n', '$$\n', 'segsites: 10\n', 'positions: 0.0706 0.2241 0.2575 0.889 \n', '0001000010\n', '0101000010\n', '0101010010\n', '0001000010\n', '\n', '$$\n', 'segsites: 10\n', 'positions: 0.0038 0.1622 0.1972 \n', '0110000110\n', '1001001000\n', '0010000110\n', '$$\n', 'segsites: 10\n', 'positions: 0.0155 0.0779 0.2092 \n', '0000001011\n', '0000001011\n', '0000001011\n']

result = []
curr_group = []
for item in lst:
    item = item.rstrip() # Remove \n
    if '$$' in item:
        if len(curr_group) > 0: # Check to see if binary numbers have been found.
            result.append(curr_group)
            curr_group = []
    elif re.match('[01]+$', item): # Checks to see if string is binary (0s or 1s).
        curr_group.append(item)

result.append(curr_group) # Appends final group due to lack of ending '$$'. 

print(result)

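Basically, you want to iterate over the items until you find a '$$', then add any binary strings you found before it to the final result and start a new group. Each binary string you find (using the regex) should be added to the current group.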

Finally, you need to add the last group of binary numbers, because there is no trailing '$$'.
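The same loop can also be packaged as a function. This is only a sketch of the approach described above: parse_groups and BINARY are hypothetical names, re.fullmatch stands in for re.match with a trailing $, and the final group is appended only when it is non-empty, which is a small change from the code in this answer.

```python
import re

BINARY = re.compile(r'[01]+')  # one or more 0/1 characters


def parse_groups(lines):
    """Split lines into groups of binary strings, one group per '$$' block."""
    result = []
    curr_group = []
    for item in lines:
        item = item.rstrip()  # remove the trailing '\n' and spaces
        if '$$' in item:
            if curr_group:  # close the previous group, if it has any entries
                result.append(curr_group)
                curr_group = []
        elif BINARY.fullmatch(item):
            curr_group.append(item)
    if curr_group:  # the last group has no trailing '$$' to close it
        result.append(curr_group)
    return result


data = ['$$\n', 'segsites: 10\n', '0000001011\n', '0000001011\n',
        '$$\n', 'positions: 0.0155 \n', '1001001000\n']
print(parse_groups(data))
# [['0000001011', '0000001011'], ['1001001000']]
```

With an empty input the function returns an empty list rather than a list containing one empty group.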

How to parse data with binary elements into a list of lists in Python?
