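I currently have one very long sequence in a file. I want to split it into smaller subsequences, with each subsequence overlapping the previous one, and put them into a list. Here is an example of what I mean:
(Apologies for the cryptic sequence; it is all on one line.)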
file1.txt
abcdefessdfekgheithrfkopeifhghtryrhfbcvdfersdwtiyuyrterdhcbgjherytyekdnfiwytowihfiwoeirehjiwoqpft
list1 = ["abcdefessdfekgheithrfkopeifhght", "fhghtryrhfbcvdfersdwtiyuyrterdhc", "erdhcbgjherytyekdnfiwyt", "nfiwytowihfiwoeirehjiwoqpft"]
I can currently split the sequence into smaller pieces, without any overlap, using the following code:
def chunks(seq, n):
    # Split seq into n nearly equal, non-overlapping pieces.
    division = len(seq) / float(n)
    return [seq[int(round(division * i)):int(round(division * (i + 1)))] for i in range(n)]
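In the code above, n specifies the number of subsequences the sequence will be split into.
I was thinking of grabbing the end of each subsequence and concatenating it onto the end of the elements in the list by hard-coding it... but that would be inefficient and awkward. Is there a simple way to do this?
In practice, I need an overlap of about 100 characters.
Thanks, everyone.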
Answer 0: (score: 1)
>>> seq = "abcdefessdfekgheithrfkopeifhghtryrhfbcvdfersdwtiyuyrterdhcbgjherytyekdnfiwytowihfiwoeirehjiwoqpft"
>>> n = 4
>>> overlap = 5
>>> division = len(seq) // n  # floor division, so the slice bounds stay integers on Python 3
>>> [seq[i*division:(i+1)*division+overlap] for i in range(n)]
['abcdefessdfekgheithrfkopeifhg', 'eifhghtryrhfbcvdfersdwtiyuyrt', 'yuyrterdhcbgjherytyekdnfiwyto', 'iwytowihfiwoeirehjiwoqpft']
This version may be slightly more efficient:
>>> [seq[i:i+division+overlap] for i in range(0,n*division,division)]
['abcdefessdfekgheithrfkopeifhg', 'eifhghtryrhfbcvdfersdwtiyuyrt', 'yuyrterdhcbgjherytyekdnfiwyto', 'iwytowihfiwoeirehjiwoqpft']
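The interactive session above can be wrapped in a function. This is only a minimal sketch: the name split_into_n_with_overlap is hypothetical, it assumes 0 < n <= len(seq), and it differs from the session in one respect, namely that the last piece is extended to the end of the sequence so that leftover items are kept even when the overlap is smaller than the remainder.

```python
def split_into_n_with_overlap(seq, n, overlap):
    """Split seq into n pieces; every piece except the last also
    carries the first `overlap` items of the piece that follows it.

    Hypothetical helper, assuming 0 < n <= len(seq).
    """
    if n <= 0 or n > len(seq):
        raise ValueError("n must satisfy 0 < n <= len(seq)")
    # Floor division keeps the slice bounds integers on Python 3.
    division = len(seq) // n
    pieces = []
    for i in range(n):
        start = i * division
        # The last piece runs to the end so leftover items are not dropped.
        end = len(seq) if i == n - 1 else start + division + overlap
        pieces.append(seq[start:end])
    return pieces
```

Called with the sequence from the question, n = 4 and overlap = 5, this should return the same four pieces as the session above.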
Answer 1: (score: 1)
If you want to split a sequence seq into subsequences of length length, where each subsequence shares overlap characters/elements with its successor:
def split_with_overlap(seq, length, overlap):
    # Start a new chunk every (length - overlap) items, so neighbours share `overlap` items.
    return [seq[i:i+length] for i in range(0, len(seq), length - overlap)]
Then test it against the original data:
>>> seq = 'abcdefessdfekgheithrfkopeifhghtryrhfbcvdfersdwtiyuyrterdhcbgjherytyekdnfiwytowihfiwoeirehjiwoqpft'
>>> split_with_overlap(seq, 31, 5)
['abcdefessdfekgheithrfkopeifhght', 'fhghtryrhfbcvdfersdwtiyuyrterdh', 'terdhcbgjherytyekdnfiwytowihfiw', 'ihfiwoeirehjiwoqpft']
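One caveat with the one-liner: when length - overlap does not line up with the end of the sequence, it can emit a final chunk made up only of items the previous chunk already contains, and it raises an error or returns an empty list when length <= overlap. A possible variant that guards against both is sketched below. The name split_with_overlap_strict is hypothetical, and stopping at the first chunk that reaches the end is a design choice made here, not part of the answer above.

```python
def split_with_overlap_strict(seq, length, overlap):
    """Like split_with_overlap, but validates its arguments and stops
    as soon as a chunk reaches the end of seq (hypothetical variant)."""
    if overlap < 0 or length <= overlap:
        raise ValueError("need 0 <= overlap < length")
    step = length - overlap
    chunks = []
    for start in range(0, len(seq), step):
        chunks.append(seq[start:start + length])
        # Once a chunk covers the tail, any further chunk would only
        # repeat items that are already included.
        if start + length >= len(seq):
            break
    return chunks
```

On the sequence from the question with length 31 and overlap 5 this should give the same four chunks as above. For the roughly 100-character overlap mentioned in the question, pass overlap=100 together with a length greater than 100.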