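Joining two nested lists in Python 2.7

Time: 2018-04-17 13:55:43

Tags: python python-2.7 time-series list-comprehension itertools

Apologies if this is a basic question. Can someone help me merge two nested lists of different lengths? There are countless examples of joining lists "elementwise" on Google and SO, but none of them seems to cover my case exactly. I need to do this several thousand times, and each list is about 1 million rows long.

One list has the format: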

shortdata = [
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038"],
["2015.01.01 22:00:02","1.21034","1.21039","1.21038","1.21037"],
["2015.01.01 22:00:04","1.21032","1.21035","1.21034","1.21034"],
["2015.01.01 22:00:06","1.21021","1.21027","1.21028","1.21028"],
...
["2015.01.01 22:00:56","1.21040","1.21038","1.21039","1.21039"],
["2015.01.01 22:00:58","1.21041","1.21042","1.21047","1.21050"],
["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035"],
["2015.01.01 22:01:02","1.21047","1.21033","1.21035","1.21035"],
["2015.01.01 22:01:04","1.21045","1.21034","1.21036","1.21032"],
...
]

The other list has the format:

longdata = [
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038"],
["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035"],
...
]

I want to join the sublists so that the output is a list of combined sublists, possibly with some empty padding columns, for example:

combineddata = [
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038", "", "", "2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038"],
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038", "", "", "2015.01.01 22:00:02","1.21034","1.21039","1.21038","1.21037"],
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038", "", "", "2015.01.01 22:00:04","1.21032","1.21035","1.21034","1.21034"],
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038", "", "", "2015.01.01 22:00:06","1.21021","1.21027","1.21028","1.21028"],
...
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038", "", "", "2015.01.01 22:00:56","1.21040","1.21038","1.21039","1.21039"],
["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038", "", "", "2015.01.01 22:00:58","1.21041","1.21042","1.21047","1.21050"],
["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035", "","", "2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035"],
["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035", "", "", "2015.01.01 22:01:02","1.21047","1.21033","1.21035","1.21035"],
["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035", "", "", "2015.01.01 22:01:04","1.21045","1.21034","1.21036","1.21032"],
...
]

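The minute data is deliberately repeated on every row because the row-by-row calculations need it.

A plain list comprehension does not work because the lists have different lengths: there is obviously far more 2-second data than 1-minute data.

I then thought I could duplicate the elements of the 1-minute data so that it has the same length as the 2s data, which would let me zip the two lists together. That failed too: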

expandedlist = [[x] * n for x in longdata]

But the resulting format is wrong, e.g. with n = 3 for demonstration (rather than 30!):

[[['2015.01.01 22:00:00', '1.21038', '1.21038', '1.21038', '1.21038'], ['2015.01.01 22:00:00', '1.21038', '1.21038', '1.21038', '1.21038'], ['2015.01.01 22:00:00', '1.21038', '1.21038', '1.21038', '1.21038']], [['2015.01.01 22:01:00', '1.21037', '1.21037', '1.21037', '1.21037'], ['2015.01.01 22:01:00', '1.21037', '1.21037', '1.21037', '1.21037'], ['2015.01.01 22:01:00', '1.21037', '1.21037', '1.21037', '1.21037']], 
...

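So there is one level of nesting too many. I tried removing the outer '[]', using list(x) instead of [x], and using outer parentheses '(', but none of these produced anything in the expected format that could be zipped with the 2s data.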
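A minimal sketch of one way to avoid the extra nesting: move the repetition into a second `for` clause of the comprehension instead of multiplying `[x]`. The two sample rows and `n = 3` are for illustration only.

```python
# Illustrative 1-minute rows (values copied from the sample above).
longdata = [
    ["2015.01.01 22:00:00", "1.21036", "1.21032", "1.21033", "1.21038"],
    ["2015.01.01 22:01:00", "1.21044", "1.21032", "1.21033", "1.21035"],
]
n = 3  # demo value; the real ratio of 2s rows to 1-minute rows would be 30

# Repeat each row n times, producing one flat list of rows.
expandedlist = [row for row in longdata for _ in range(n)]

# Every entry is now a row (a list of strings), not a list of rows.
flat_ok = all(isinstance(cell, str) for row in expandedlist for cell in row)
```

The repeated entries all refer to the same list object, which is harmless as long as the rows are only read or concatenated with `+`.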

I thought I might be able to use itertools.izip_longest() with a fill value, so that it pads the 2s rows with the required one-minute data, like this:

combinedlist = list(itertools.izip_longest(longdata, shortdata, fillvalue=<something goes here>))
print combinedlist

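I don't really understand the syntax, and even passing a simple string as the fillvalue shows that the result looks nothing like the expected output. I get: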

[(['2015.01.01 22:00:00', '1.21038', '1.21038', '1.21038', '1.21038'], ['2015.01.01 22:00:00', '1.21038', '1.21038', '1.21038', '1.21038']), (['2015.01.01 22:01:00', '1.21037', '1.21037', '1.21037', '1.21037'], ['2015.01.01 22:00:02', '1.21038', '1.21038', '1.21038', '1.21038']), (['2015.01.01 22:02:00', '1.2105', '1.2105', '1.2105', '1.2105'], ['2015.01.01 22:00:04', '1.21038', '1.21038', '1.21038', '1.21038']), (['2015.01.01 22:03:00', '1.21043', '1.21043', '1.21043', '1.21043'], ['2015.01.01 22:00:06', '1.21038', '1.21038', '1.21038', '1.21038']), (['2015.01.01 22:04:00', '1.21049', '1.21049', '1.21049', '1.21049'], ['2015.01.01 22:00:08', '1.21038', '1.21038', '1.21038', '1.21038']), (['2015.01.01 22:05:00', '1.21043', '1.21043', '1.21038', '1.21038'], ['2015.01.01 22:00:10', '1.21038', '1.21038', '1.21038', '1.21038']), (['2015.01.01 22:06:00', '1.21037', '1.21037', '1.21037', '1.21037'], ['2015.01.01 22:00:12', '1.21038', '1.21038', '1.21038', '1.21038']), (['2015.01.01 22:07:00', '1.21041', '1.21041', '1.21041', '1.21041'], ['2015.01.01 22:00:14', '1.21038', '1.21038', '1.21038', '1.21038']), (['2015.01.01 22:08:00', '1.21037', '1.21037', '1.21037', '1.21037'], ['2015.01.01 22:00:16', '1.21038', '1.21038', '1.21038', '1.21038']), ('foo', ['2015.01.01 22:00:18', '1.21038', '1.21038', '1.21038', '1.21038']), ('foo', ['2015.01.01 22:00:20', '1.21038', '1.21038', '1.21038', '1.21038']), ('foo',...
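The output above follows from how `fillvalue` behaves: it is a single constant that is used once the shorter input is exhausted, so it cannot vary per row, and the pairing is purely positional. A small sketch with made-up single-cell rows; the import fallback is only there so the snippet also runs on Python 3, where the function is named `zip_longest`:

```python
try:
    from itertools import izip_longest  # Python 2
except ImportError:
    from itertools import zip_longest as izip_longest  # Python 3

long_rows = [["A"], ["B"]]                      # stand-in for the 1-minute rows
short_rows = [["a1"], ["a2"], ["b1"], ["b2"]]   # stand-in for the 2s rows

pairs = list(izip_longest(long_rows, short_rows, fillvalue="foo"))
# Rows are paired by position, not by timestamp, and the same constant
# pads every pair once the shorter list has run out.
```

Here `["B"]` ends up next to `["a2"]`, and the last two pairs get `"foo"`, which mirrors the result shown above.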

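Finally, I thought I could put all the 1-minute data into a dictionary and then look up the leftmost 17 characters of each 2s timestamp (e.g. "2015.01.01 22:00:") in that dictionary to do the join, but that seems a bit cumbersome (?).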
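For reference, the dictionary idea can be sketched in a few lines. This is only an illustration on a handful of rows taken from the samples above; the name `MINUTE_PREFIX` is made up here, and the sketch assumes every 2s row has a matching 1-minute row.

```python
longdata = [
    ["2015.01.01 22:00:00", "1.21036", "1.21032", "1.21033", "1.21038"],
    ["2015.01.01 22:01:00", "1.21044", "1.21032", "1.21033", "1.21035"],
]
shortdata = [
    ["2015.01.01 22:00:00", "1.21036", "1.21032", "1.21033", "1.21038"],
    ["2015.01.01 22:00:02", "1.21034", "1.21039", "1.21038", "1.21037"],
    ["2015.01.01 22:01:00", "1.21044", "1.21032", "1.21033", "1.21035"],
    ["2015.01.01 22:01:02", "1.21047", "1.21033", "1.21035", "1.21035"],
]

MINUTE_PREFIX = 17  # hypothetical name: length of "2015.01.01 22:00:"

# Index the 1-minute rows by their timestamp up to and including the minute.
by_minute = {row[0][:MINUTE_PREFIX]: row for row in longdata}

# Assumption: a matching minute always exists; a missing one raises KeyError.
combineddata = [
    by_minute[row[0][:MINUTE_PREFIX]] + ["", ""] + row
    for row in shortdata
]
```

Each lookup is a constant-time hash lookup, so the join is a single pass over `shortdata`.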

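I also considered a bisection approach (i.e. splitting the minute data every time the seconds in the 2s timestamp reach "00"), but I'm not sure whether that is the fastest way.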
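One possible reading of the bisection idea uses the standard `bisect` module to find, for each 2s timestamp, the last 1-minute row at or before it. This is a sketch under the assumption that `longdata` is sorted by timestamp; comparing the timestamps as strings works only because the format is fixed-width and zero-padded. The helper name `minute_row_for` is made up for this example.

```python
from bisect import bisect_right

longdata = [
    ["2015.01.01 22:00:00", "1.21036", "1.21032", "1.21033", "1.21038"],
    ["2015.01.01 22:01:00", "1.21044", "1.21032", "1.21033", "1.21035"],
]
shortdata = [
    ["2015.01.01 22:00:00", "1.21036", "1.21032", "1.21033", "1.21038"],
    ["2015.01.01 22:00:02", "1.21034", "1.21039", "1.21038", "1.21037"],
    ["2015.01.01 22:01:00", "1.21044", "1.21032", "1.21033", "1.21035"],
    ["2015.01.01 22:01:02", "1.21047", "1.21033", "1.21035", "1.21035"],
]

long_stamps = [row[0] for row in longdata]  # assumed sorted ascending


def minute_row_for(stamp):
    """Return the last 1-minute row whose timestamp is <= stamp."""
    pos = bisect_right(long_stamps, stamp) - 1
    if pos < 0:
        raise ValueError("no 1-minute row at or before %s" % stamp)
    return longdata[pos]


combineddata = [minute_row_for(row[0]) + ["", ""] + row for row in shortdata]
```

Each lookup costs O(log n); since both lists are already in time order, a single forward pass may well be cheaper, so this would need timing on real data.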

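What is the fastest (or most elegant) way to do what I'm trying to do, or do I need to write out a full loop to join the lists together?

Any help is much appreciated!

Kind regards,

4 answers:

Answer 0: (score: 1)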

If the short list is exactly n times as long as the long list (n would be 30 in your example):

longdata: [[1],[2]], shortdata: [[1.1],[1.2],...,[1.n],[2.1],[2.2],...,[2.n],[3.1],...]

then you can repeat each row of the long data n times so that it lines up with the short data, using either

expended_data = (x for l in longdata for x in [l]*n)

or

expended_data = (l for l in longdata for i in range(n))

combineddata then becomes

combineddata = [a+["",""]+b for a,b in zip(expended_data,shortdata)]
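Because this approach pairs rows purely by position, a single missing 2s row would silently shift every later pairing. A small sketch of the same idea with a sanity check added; the check and `n = 2` are illustrative additions, not part of the answer:

```python
longdata = [
    ["2015.01.01 22:00:00", "1.21036", "1.21032", "1.21033", "1.21038"],
    ["2015.01.01 22:01:00", "1.21044", "1.21032", "1.21033", "1.21035"],
]
shortdata = [
    ["2015.01.01 22:00:00", "1.21036", "1.21032", "1.21033", "1.21038"],
    ["2015.01.01 22:00:02", "1.21034", "1.21039", "1.21038", "1.21037"],
    ["2015.01.01 22:01:00", "1.21044", "1.21032", "1.21033", "1.21035"],
    ["2015.01.01 22:01:02", "1.21047", "1.21033", "1.21035", "1.21035"],
]
n = 2  # demo ratio for this tiny sample; 30 in the question

# Repeat each long row n times, lazily, and pair it with the short rows.
expanded = (l for l in longdata for _ in range(n))
combineddata = [a + ["", ""] + b for a, b in zip(expanded, shortdata)]

# Sanity check: both timestamps in a combined row should share the same
# "YYYY.MM.DD HH:MM" prefix (index 0 is the long row, index 7 the short row).
misaligned = [row for row in combineddata if row[0][:16] != row[7][:16]]
```

An empty `misaligned` list indicates that the fixed-ratio assumption held for the data at hand.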

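Answer 1: (score: 1)

While iterating over the 2-second data, I would keep a position in the minute data (starting at 0). Whenever I see the minute advance in the 2-second data, I advance that position in the minute data. Then I yield the desired element: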

shortdata = [
  ["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038"],
  ["2015.01.01 22:00:02","1.21034","1.21039","1.21038","1.21037"],
  # ...
  ["2015.01.01 22:00:58","1.21041","1.21042","1.21047","1.21050"],
  ["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035"],
  ["2015.01.01 22:01:02","1.21047","1.21033","1.21035","1.21035"],
  # ...
]

longdata = [
  ["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038"],
  ["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035"],
  # ...
]

def each_mixed_line(sh, lo):
  lo_pos = 0
  for sh_line in sh:
    while lo_pos < len(lo)-1 and lo[lo_pos+1][0] <= sh_line[0]:
      lo_pos += 1
    yield lo[lo_pos] + [ '', '' ] + sh_line

for mixed_line in each_mixed_line(shortdata, longdata):
  print(mixed_line)

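In many cases you don't need to build the complete result list and can instead process each row as it is produced, as the print() loop does. This reduces memory consumption and is therefore recommended. But if you do need the result list, you can build it like this: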

combineddata = list(each_mixed_line(shortdata, longdata))
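To illustrate consuming the generator without materialising the full list, the sketch below repeats `each_mixed_line` from above and keeps only a running aggregate. The aggregate itself (the largest absolute difference between the last column of a 2s row and the last column of its minute row) is a made-up example of a row-by-row calculation, not something taken from the question.

```python
def each_mixed_line(sh, lo):
    # Same generator as above: walk both lists once, in time order.
    lo_pos = 0
    for sh_line in sh:
        while lo_pos < len(lo) - 1 and lo[lo_pos + 1][0] <= sh_line[0]:
            lo_pos += 1
        yield lo[lo_pos] + ['', ''] + sh_line


longdata = [
    ["2015.01.01 22:00:00", "1.21036", "1.21032", "1.21033", "1.21038"],
    ["2015.01.01 22:01:00", "1.21044", "1.21032", "1.21033", "1.21035"],
]
shortdata = [
    ["2015.01.01 22:00:00", "1.21036", "1.21032", "1.21033", "1.21038"],
    ["2015.01.01 22:00:02", "1.21034", "1.21039", "1.21038", "1.21037"],
    ["2015.01.01 22:01:00", "1.21044", "1.21032", "1.21033", "1.21035"],
    ["2015.01.01 22:01:02", "1.21047", "1.21033", "1.21035", "1.21035"],
]

largest_gap = 0.0
rows_seen = 0
for mixed in each_mixed_line(shortdata, longdata):
    minute_last = float(mixed[4])    # last column of the 1-minute row
    tick_last = float(mixed[11])     # last column of the 2s row
    largest_gap = max(largest_gap, abs(tick_last - minute_last))
    rows_seen += 1
```

Only one combined row exists in memory at any time, which matters when each input has around a million rows.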

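Answer 2: (score: 0)

If you don't mind modifying the longdata variable, you can extend each of its items with the corresponding shortdata items. This is more efficient because it allocates the least new data. Here is the code: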

shortdata = [
  ["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038"],
  ["2015.01.01 22:00:02","1.21034","1.21039","1.21038","1.21037"],
  # ...
  ["2015.01.01 22:00:58","1.21041","1.21042","1.21047","1.21050"],
  ["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035"],
  ["2015.01.01 22:01:02","1.21047","1.21033","1.21035","1.21035"],
  # ...
]

longdata = [
  ["2015.01.01 22:00:00","1.21036","1.21032","1.21033","1.21038"],
  ["2015.01.01 22:01:00","1.21044","1.21032","1.21033","1.21035"],
  # ...
]

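n = 0
end = len(shortdata)
for long_row in longdata:             # 'long' would shadow a Python 2 builtin
    prefix = long_row[0][:16]         # keep only the significant part (up to the minute)
    del long_row[:]                   # empty in place; list.clear() does not exist in Python 2.7
                                      # (the first 'short' row repeats the 'long' row anyway)
    while n < end:
        short_row = shortdata[n]
        if short_row[0][:16] != prefix: break
        long_row.extend(short_row + ['/'])
        n += 1
print(longdata)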

Result:

[['2015.01.01 22:00:00', '1.21036', '1.21032', '1.21033', '1.21038', '/', 
  '2015.01.01 22:00:02', '1.21034', '1.21039', '1.21038', '1.21037', '/',
  ... 
  '2015.01.01 22:00:58', '1.21041', '1.21042', '1.21047', '1.21050', '/'], 
 ['2015.01.01 22:01:00', '1.21044', '1.21032', '1.21033', '1.21035', '/', 
  '2015.01.01 22:01:02', '1.21047', '1.21033', '1.21035', '1.21035', '/',
  ...
  '2015.01.01 22:01:58', '1.21041', '1.21042', '1.21047', '1.21050', '/'],
 ...
]

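You could also replace the inner while loop with an iterator over shortdata, but I'm not sure it would really speed up the code. It would need to be timed...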
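One way to express the iterator idea is `itertools.groupby`, which yields consecutive runs of `shortdata` rows that share the same minute prefix. This is an unbenchmarked sketch rather than a drop-in replacement: it builds a new list instead of modifying `longdata`, and it assumes `shortdata` is sorted by timestamp. The helper name `minute_key` is made up here.

```python
from itertools import groupby

shortdata = [
    ["2015.01.01 22:00:00", "1.21036", "1.21032", "1.21033", "1.21038"],
    ["2015.01.01 22:00:02", "1.21034", "1.21039", "1.21038", "1.21037"],
    ["2015.01.01 22:01:00", "1.21044", "1.21032", "1.21033", "1.21035"],
    ["2015.01.01 22:01:02", "1.21047", "1.21033", "1.21035", "1.21035"],
]


def minute_key(row):
    # "YYYY.MM.DD HH:MM" part of the timestamp
    return row[0][:16]


grouped = []
for minute, rows in groupby(shortdata, key=minute_key):
    flat = []
    for row in rows:
        flat.extend(row + ['/'])  # same '/' separator as the loop above
    grouped.append(flat)
```

Each entry of `grouped` has the same flattened layout as the result shown above, one entry per minute.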

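Answer 3: (score: 0)

If efficiency is not a concern, you can use nested for loops. Note that this is an O(n^2) solution: every short row is compared against every long row.

The benefit is that the logic is aligned more closely with the data: you work with datetime objects and explicitly check long_date <= short_date < long_date + 1 minute.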

from datetime import datetime, timedelta

td = timedelta(0, 60)  # one minute
res = []

for short in shortdata:
    s_date = datetime.strptime(short[0], '%Y.%m.%d %H:%M:%S')
    for long in longdata:
        l_date = datetime.strptime(long[0], '%Y.%m.%d %H:%M:%S')
        # keep the pair when the short timestamp falls within the long row's minute
        if l_date <= s_date < l_date + td:
            res.append(long + short)