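Apologies in advance. Can someone help me merge two nested lists of different lengths? There are countless examples of joining lists "elementwise" on Google and SO, but none of them seem to cover my case exactly. I need to do this a few thousand times, and each list is about 1 million rows long.
One list has the format: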
shortdata = [
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038"],
["2015.01.01 22:00:02","1.21034","1.21039","1.21038","1.21037"],
["2015.01.01 22:00:04","1.21032","1.21035","1.21034","1.21034"],
["2015.01.01 22:00:06","1.21021","1.21027","1.21028","1.21028"],
...
["2015.01.01 22:00:56","1.21040","1.21038","1.21039","1.21039"],
["2015.01.01 22:00:58","1.21041","1.21042","1.21047","1.21050"],
["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035"],
["2015.01.01 22:01:02","1.21047","1.21033","1.21035","1.21035"],
["2015.01.01 22:01:04","1.21045","1.21034","1.21036","1.21032"],
...
]
The other list has the format:
longdata = [
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038"],
["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035"],
...
]
I want to join the sublists so that the output is a list of combined sublists, possibly with a few empty padding columns, for example:
combineddata = [
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038", "", "", "2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038"],
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038", "", "", "2015.01.01 22:00:02","1.21034","1.21039","1.21038","1.21037"],
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038", "", "", "2015.01.01 22:00:04","1.21032","1.21035","1.21034","1.21034"],
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038", "", "", "2015.01.01 22:00:06","1.21021","1.21027","1.21028","1.21028"],
...
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038", "", "", "2015.01.01 22:00:56","1.21040","1.21038","1.21039","1.21039"],
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038", "", "", "2015.01.01 22:00:58","1.21041","1.21042","1.21047","1.21050"],
["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035", "","", "2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035"],
["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035", "", "", "2015.01.01 22:01:02","1.21047","1.21033","1.21035","1.21035"],
["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035", "", "", "2015.01.01 22:01:04","1.21045","1.21034","1.21036","1.21032"],
...
]
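The minute data is deliberately repeated on every row, because the row-by-row calculations need it.
A straight list comprehension doesn't work because the lists have different lengths; obviously there is far more 2-second data than 1-minute data.
Then I thought I could replicate the elements of the 1-minute data so that it has the same length as the 2s data, which would let me zip the two lists together. That failed too: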
expandedlist = [[x] * n for x in longdata]
But the format I get is wrong, e.g. with n = 3 for demonstration (rather than 30!):
[[['2015.01.01 22:00:00', '1.21038', '1.21038', '1.21038', '1.21038'], ['2015.01.01 22:00:00', '1.21038', '1.21038', '1.21038', '1.21038'], ['2015.01.01 22:00:00', '1.21038', '1.21038', '1.21038', '1.21038']], [['2015.01.01 22:01:00', '1.21037', '1.21037', '1.21037', '1.21037'], ['2015.01.01 22:01:00', '1.21037', '1.21037', '1.21037', '1.21037'], ['2015.01.01 22:01:00', '1.21037', '1.21037', '1.21037', '1.21037']],
...
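So there is one level of nesting too many. I tried removing the outer '[]', tried list(x) instead of [x], and tried outer parentheses '('; none of these produced anything in the expected format that could be zipped with the 2s data.
I thought maybe I could use itertools.izip_longest() with a fill value and have it "fill" the 2s rows with the required one-minute data, like: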
combinedlist = list(itertools.izip_longest(longdata, shortdata, fillvalue=<something goes here>))
print combinedlist
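I don't really understand the syntax, and even using a simple string as the fill value shows that the result looks nothing like the expected output. I get: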
[(['2015.01.01 22:00:00', '1.21038', '1.21038', '1.21038', '1.21038'], ['2015.01.01 22:00:00', '1.21038', '1.21038', '1.21038', '1.21038']), (['2015.01.01 22:01:00', '1.21037', '1.21037', '1.21037', '1.21037'], ['2015.01.01 22:00:02', '1.21038', '1.21038', '1.21038', '1.21038']), (['2015.01.01 22:02:00', '1.2105', '1.2105', '1.2105', '1.2105'], ['2015.01.01 22:00:04', '1.21038', '1.21038', '1.21038', '1.21038']), (['2015.01.01 22:03:00', '1.21043', '1.21043', '1.21043', '1.21043'], ['2015.01.01 22:00:06', '1.21038', '1.21038', '1.21038', '1.21038']), (['2015.01.01 22:04:00', '1.21049', '1.21049', '1.21049', '1.21049'], ['2015.01.01 22:00:08', '1.21038', '1.21038', '1.21038', '1.21038']), (['2015.01.01 22:05:00', '1.21043', '1.21043', '1.21038', '1.21038'], ['2015.01.01 22:00:10', '1.21038', '1.21038', '1.21038', '1.21038']), (['2015.01.01 22:06:00', '1.21037', '1.21037', '1.21037', '1.21037'], ['2015.01.01 22:00:12', '1.21038', '1.21038', '1.21038', '1.21038']), (['2015.01.01 22:07:00', '1.21041', '1.21041', '1.21041', '1.21041'], ['2015.01.01 22:00:14', '1.21038', '1.21038', '1.21038', '1.21038']), (['2015.01.01 22:08:00', '1.21037', '1.21037', '1.21037', '1.21037'], ['2015.01.01 22:00:16', '1.21038', '1.21038', '1.21038', '1.21038']), ('foo', ['2015.01.01 22:00:18', '1.21038', '1.21038', '1.21038', '1.21038']), ('foo', ['2015.01.01 22:00:20', '1.21038', '1.21038', '1.21038', '1.21038']), ('foo',...
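Finally, I thought I could put all the 1-minute data into a dictionary and then look up the leftmost 17 characters of the 2s timestamp (e.g. "2015.01.01 22:00:") in the dictionary to do the join, but that seems a bit cumbersome (?).
I also considered a bisection approach (i.e. splitting the minute data each time the 2s timestamp reaches "00"), but I'm not sure that is the fastest way.
What is the fastest (or most elegant) way to do what I'm trying to do, or do I need to write out a full loop to join the lists together?
Any help is much appreciated!
Kind regards,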
保
Answer 0: (score: 1)
If your short list is exactly n times as long as your long list (n would be 30 in your example), i.e.
longdata: [[1],[2]], shortdata: [[1.1],[1.2]...[1.n],[2.1],[2.2],...,[2.n],[3.1]...]
then you can repeat each long row n times to consume the short data via
expanded_data = (x for l in longdata for x in [l]*n)
or
expanded_data = (l for l in longdata for i in range(n))
and combineddata becomes
combineddata = [a+["",""]+b for a,b in zip(expanded_data,shortdata)]
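A minimal runnable sketch of this idea on toy data, assuming every long row corresponds to exactly n consecutive short rows (n = 3 here purely for illustration; the names `long_rows`, `short_rows`, and `expanded` are made up for this example):

```python
# Toy data: each long row is followed by exactly n short rows (an assumption).
n = 3
long_rows = [["A", "1"], ["B", "2"]]
short_rows = [["A0", "x"], ["A1", "y"], ["A2", "z"],
              ["B0", "p"], ["B1", "q"], ["B2", "r"]]

# Lazily repeat every long row n times, without adding a nesting level.
expanded = (row for row in long_rows for _ in range(n))

# Pair each repeated long row with its short row; pad with two empty columns.
combined = [a + ["", ""] + b for a, b in zip(expanded, short_rows)]

for row in combined:
    print(row)
```

If any minute has missing or extra 2-second rows, the pairing drifts silently, so this only fits strictly regular data.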
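Answer 1: (score: 1)
While iterating over the seconds data, I would keep a position in the minute data (starting at 0). Whenever I see the minute increment in the seconds data, I would advance that position in the minute data. Then I would yield the desired element: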
shortdata = [
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038"],
["2015.01.01 22:00:02","1.21034","1.21039","1.21038","1.21037"],
# ...
["2015.01.01 22:00:58","1.21041","1.21042","1.21047","1.21050"],
["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035"],
["2015.01.01 22:01:02","1.21047","1.21033","1.21035","1.21035"],
# ...
]
longdata = [
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038"],
["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035"],
# ...
]
def each_mixed_line(sh, lo):
    lo_pos = 0
    for sh_line in sh:
        # Advance to the last minute row whose timestamp is <= this row's timestamp.
        while lo_pos < len(lo)-1 and lo[lo_pos+1][0] <= sh_line[0]:
            lo_pos += 1
        yield lo[lo_pos] + [ '', '' ] + sh_line

for mixed_line in each_mixed_line(shortdata, longdata):
    print(mixed_line)
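In many cases you don't need to build the complete result list; you can consume the rows one at a time, as the print() loop does here. That reduces memory consumption and is therefore recommended. But if you do need the result list, you can build it like this: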
combineddata = list(each_mixed_line(shortdata, longdata))
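As a sketch of consuming the generator without materialising the full list, the rows can be streamed straight into a CSV writer. The in-memory `buffer` below stands in for a real output file and is an assumption of this example, as are the shortened two-column rows. Comparing the timestamps as strings works only because the format is fixed-width and zero-padded.

```python
import csv
import io

def each_mixed_line(sh, lo):
    """Yield each short row prefixed by the latest long row at or before it."""
    lo_pos = 0
    for sh_line in sh:
        # Advance while the next long row's timestamp is not after this short row's.
        while lo_pos < len(lo) - 1 and lo[lo_pos + 1][0] <= sh_line[0]:
            lo_pos += 1
        yield lo[lo_pos] + ["", ""] + sh_line

# Shortened illustrative rows (timestamp plus one value).
shortdata = [
    ["2015.01.01 22:00:00", "1.21036"],
    ["2015.01.01 22:00:58", "1.21041"],
    ["2015.01.01 22:01:00", "1.21044"],
    ["2015.01.01 22:01:02", "1.21047"],
]
longdata = [
    ["2015.01.01 22:00:00", "1.21036"],
    ["2015.01.01 22:01:00", "1.21044"],
]

# Only one combined row is held in memory at a time while writing.
buffer = io.StringIO()
writer = csv.writer(buffer)
for row in each_mixed_line(shortdata, longdata):
    writer.writerow(row)

print(buffer.getvalue())
```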
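Answer 2: (score: 0)
If you don't mind mutating the longdata variable, you can extend each of its items with the corresponding shortdata items, which is more efficient because it allocates the least new data. Here is the code: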
shortdata = [
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038"],
["2015.01.01 22:00:02","1.21034","1.21039","1.21038","1.21037"],
# ...
["2015.01.01 22:00:58","1.21041","1.21042","1.21047","1.21050"],
["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035"],
["2015.01.01 22:01:02","1.21047","1.21033","1.21035","1.21035"],
# ...
]
longdata = [
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038"],
["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035"],
# ...
]
n = 0
end = len(shortdata)
for long in longdata:
    prefix = long[0][:16]  # keep only significant part
    long.clear()  # because the first line of 'short' is same as 'long'
    while n < end:
        short = shortdata[n]
        if short[0][:16] != prefix: break
        long.extend(short + ['/'])
        n += 1
print(longdata)
Result:
[['2015.01.01 22:00:00', '1.21036', '1.21032', '1.21033', '1.21038', '/',
'2015.01.01 22:00:02', '1.21034', '1.21039', '1.21038', '1.21037', '/',
...
'2015.01.01 22:00:58', '1.21041', '1.21042', '1.21047', '1.21050', '/'],
['2015.01.01 22:01:00', '1.21044', '1.21032', '1.21033', '1.21035', '/',
'2015.01.01 22:01:02', '1.21047', '1.21033', '1.21035', '1.21035', '/',
...
'2015.01.01 22:01:58', '1.21041', '1.21042', '1.21047', '1.21050', '/'],
...
]
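You could also replace the inner while loop with an iterator over shortdata, but I'm not sure it would really speed up the code. It would need to be timed...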
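One possible reading of the iterator idea, sketched here with `itertools.groupby` rather than a hand-written iterator, and without mutating longdata. It assumes shortdata is sorted by timestamp; `minute_key`, `long_by_minute`, and `grouped` are names invented for this example, and whether it is any faster is untested.

```python
from itertools import groupby

# Shortened illustrative rows (timestamp plus one value).
shortdata = [
    ["2015.01.01 22:00:00", "1.21036"],
    ["2015.01.01 22:00:02", "1.21034"],
    ["2015.01.01 22:01:00", "1.21044"],
    ["2015.01.01 22:01:02", "1.21047"],
]
longdata = [
    ["2015.01.01 22:00:00", "1.21036"],
    ["2015.01.01 22:01:00", "1.21044"],
]

def minute_key(row):
    # The first 16 characters are "YYYY.MM.DD HH:MM", the timestamp up to the minute.
    return row[0][:16]

# Index the long rows by minute so each group can find its row directly.
long_by_minute = {minute_key(row): row for row in longdata}

grouped = []
for minute, short_rows in groupby(shortdata, key=minute_key):
    merged = []
    for short in short_rows:
        merged.extend(short + ["/"])
    # .get() returns None when a minute has no long row.
    grouped.append((long_by_minute.get(minute), merged))

for long_row, merged in grouped:
    print(long_row, merged)
```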
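Answer 3: (score: 0)
If efficiency is not a concern, you can use nested for loops. Note that this is an O(n^2) solution.
The benefit is that the logic is aligned more closely with the data: you are working with datetime objects and explicitly checking long_date <= short_date < long_date + 1 minute.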
from datetime import datetime, timedelta

td = timedelta(0, 60)  # one minute
res = []
for short in shortdata:
    s_date = datetime.strptime(short[0], '%Y.%m.%d %H:%M:%S')
    for long in longdata:
        l_date = datetime.strptime(long[0], '%Y.%m.%d %H:%M:%S')
        # Keep the pair when the short timestamp falls inside the long row's minute.
        if l_date <= s_date < l_date + td:
            res.append(long + short)
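If the quadratic cost matters, one possible variation (not part of the answer above) is to parse the long timestamps once and locate the matching minute with the standard `bisect` module, which brings the lookup down to a binary search per short row. This sketch assumes longdata is sorted by timestamp; `FMT`, `ONE_MINUTE`, and `long_dates` are names chosen for this example, and the rows are shortened for illustration.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

FMT = "%Y.%m.%d %H:%M:%S"
ONE_MINUTE = timedelta(minutes=1)

shortdata = [
    ["2015.01.01 22:00:00", "1.21036"],
    ["2015.01.01 22:00:58", "1.21041"],
    ["2015.01.01 22:01:00", "1.21044"],
    ["2015.01.01 22:05:00", "1.21050"],  # no long row covers this minute
]
longdata = [
    ["2015.01.01 22:00:00", "1.21036"],
    ["2015.01.01 22:01:00", "1.21044"],
]

# Parse the long timestamps once instead of once per short row.
long_dates = [datetime.strptime(row[0], FMT) for row in longdata]

res = []
for short in shortdata:
    s_date = datetime.strptime(short[0], FMT)
    # Index of the last long row whose timestamp is <= s_date (-1 if none).
    i = bisect_right(long_dates, s_date) - 1
    # Same check as the nested loops: long_date <= short_date < long_date + 1 minute.
    if i >= 0 and s_date < long_dates[i] + ONE_MINUTE:
        res.append(longdata[i] + short)

for row in res:
    print(row)
```

Short rows that fall outside every long row's minute are skipped, which matches the behaviour of the nested loops.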