Splitting a long sequence from a file into overlapping sublists in Python

Asked: 2011-07-14 01:48:39

Tags: python list overlap sequences

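I have a very long sequence in a file. I want to split it into smaller subsequences, with each subsequence overlapping the previous one, and put them all into a list. Here is an example of what I mean: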

(Apologies for the cryptic sequence; it is all on one line.)

file1.txt
abcdefessdfekgheithrfkopeifhghtryrhfbcvdfersdwtiyuyrterdhcbgjherytyekdnfiwytowihfiwoeirehjiwoqpft


list1 = ["abcdefessdfekgheithrfkopeifhght", "fhghtryrhfbcvdfersdwtiyuyrterdhc", "erdhcbgjherytyekdnfiwyt", "nfiwytowihfiwoeirehjiwoqpft"]

At the moment I can split the sequence into smaller pieces, without any overlap, using the following code:

def chunks(seq, n):
    # Average chunk size as a float; rounding spreads any remainder across the chunks.
    division = len(seq) / float(n)
    return [seq[int(round(division * i)):int(round(division * (i + 1)))] for i in range(n)]

In the code above, n specifies the number of subsequences the list is split into.

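I was thinking of grabbing the end of each subsequence and concatenating it onto the neighbouring element of the list by hard-coding it, but that would be inefficient and awkward. Is there a simple way to do this?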

In practice, I need an overlap of about 100 characters.

Thanks, everyone.

2 answers:

Answer 0: (score: 1)

>>> seq = "abcdefessdfekgheithrfkopeifhghtryrhfbcvdfersdwtiyuyrterdhcbgjherytyekdnfiwytowihfiwoeirehjiwoqpft"
>>> n = 4
>>> overlap = 5
>>> division = len(seq) // n  # integer division, so the slice indices stay ints on Python 3 as well
>>> [seq[i*division:(i+1)*division+overlap] for i in range(n)]
['abcdefessdfekgheithrfkopeifhg', 'eifhghtryrhfbcvdfersdwtiyuyrt', 'yuyrterdhcbgjherytyekdnfiwyto', 'iwytowihfiwoeirehjiwoqpft']

This may be slightly more efficient:

>>> [seq[i:i+division+overlap] for i in range(0,n*division,division)]
['abcdefessdfekgheithrfkopeifhg', 'eifhghtryrhfbcvdfersdwtiyuyrt', 'yuyrterdhcbgjherytyekdnfiwyto', 'iwytowihfiwoeirehjiwoqpft']
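
For reuse, the same idea can be wrapped in a function. The following is only a minimal sketch and not part of the original answer: the name chunks_with_overlap is hypothetical, // is used so that the slice indices stay integers on Python 3, and the last chunk is extended to the end of the sequence, which is an added assumption so that no trailing elements are lost when len(seq) is not a multiple of n.

```python
def chunks_with_overlap(seq, n, overlap):
    """Split seq into n chunks; each chunk except the last also carries
    the first `overlap` elements of the chunk that follows it."""
    if n <= 0 or overlap < 0:
        raise ValueError("n must be positive and overlap non-negative")
    division = len(seq) // n  # integer division gives a valid slice index
    chunks = []
    for i in range(n):
        start = i * division
        # The last chunk runs to the end so that leftover elements are kept.
        end = len(seq) if i == n - 1 else start + division + overlap
        chunks.append(seq[start:end])
    return chunks
```

Called as chunks_with_overlap(seq, 4, 5) on the 97-character sequence from the question, this returns the same four chunks as the list comprehension above. With overlap set to 0, the chunks join back into the original sequence.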

Answer 1: (score: 1)

If you want to split the sequence seq into subsequences of length length, with each subsequence sharing overlap characters/elements with its successor:

def split_with_overlap(seq, length, overlap):
    return [seq[i:i+length] for i in range(0, len(seq), length - overlap)]

Then test it on the original data:

>>> seq = 'abcdefessdfekgheithrfkopeifhghtryrhfbcvdfersdwtiyuyrterdhcbgjherytyekdnfiwytowihfiwoeirehjiwoqpft'

>>> split_with_overlap(seq, 31, 5)
['abcdefessdfekgheithrfkopeifhght', 'fhghtryrhfbcvdfersdwtiyuyrterdh', 'terdhcbgjherytyekdnfiwytowihfiw', 'ihfiwoeirehjiwoqpft']
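
To apply this to a sequence stored in a file, as in the question, something along the following lines might work. This is a sketch under stated assumptions: read_sequence is a hypothetical helper that joins all lines of the file and strips surrounding whitespace, and the check on overlap is an addition, since length - overlap is used as the range step and must be positive.

```python
def split_with_overlap(seq, length, overlap):
    # A step of zero makes range() raise ValueError, and a negative step
    # yields no chunks at all, so reject both cases up front.
    if not 0 <= overlap < length:
        raise ValueError("overlap must satisfy 0 <= overlap < length")
    return [seq[i:i + length] for i in range(0, len(seq), length - overlap)]


def read_sequence(path):
    # Hypothetical helper: concatenate every line, dropping newlines
    # and any leading or trailing whitespace on each line.
    with open(path) as handle:
        return "".join(line.strip() for line in handle)
```

For the roughly 100-character overlap mentioned in the question, a call could look like split_with_overlap(read_sequence("file1.txt"), 1000, 100), where the chunk length of 1000 is only an illustrative value. Note that the final chunk is usually shorter than length.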