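I have a very large list of 1s and 0s:
x = [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]
The full list is here.
I want to create a new list y in which 1s are kept only when they occur in a run of 10 or more consecutive 1s; all other 1s should be replaced with zeros.
For example, based on x above, y should become:
y = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]
So far I have the following: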
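import numpy as np
import itertools
nx = np.array(x)
# Indices after which the value changes
print(np.argwhere(np.diff(nx)).squeeze())
answer = []
# Run-length encode the array as (value, run length) pairs
for key, group in itertools.groupby(nx):
    answer.append((key, len(list(group))))
print(answer)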
This gives me:
[0 3 8 14] # A
[(1, 1), (0, 3), (1, 5), (0, 6), (1, 10)] # B
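#A means that the value changes after positions 0, 3, and so on.
#B means there is one 1, then three 0s, then five 1s, then six 0s, followed by ten 1s.
How do I proceed to the last step of creating y, where the 1s are replaced with 0s depending on the run length?
PS: I am humbled by the brilliant solutions from all the great people here.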
Answer 0 (score: 6)
Do the check while iterating over the groups. Something like:
>>> from itertools import groupby
>>> x = [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]
>>> result = []
>>> for k, g in groupby(x):
...     if k:
...         g = list(g)
...         if len(g) < 10:
...             g = len(g)*[0]
...     result.extend(g)
...
>>> result
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
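The same loop can be wrapped in a small reusable function. This is only a sketch: the name keep_long_runs and the min_run parameter are made up for illustration and are not part of the answer above.

```python
from itertools import groupby


def keep_long_runs(values, min_run=10):
    """Return a new list where 1s in runs shorter than min_run become 0."""
    result = []
    for key, group in groupby(values):
        run = list(group)
        # Only runs of truthy values (the 1s) can be zeroed out
        if key and len(run) < min_run:
            run = [0] * len(run)
        result.extend(run)
    return result


x = [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]
y = keep_long_runs(x)
print(y)
```

Making the threshold a parameter keeps the hard-coded 10 out of the loop body, which may be convenient if the required run length changes.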
Note that for a dataset of this size, this is faster than the corresponding pandas solution:
In [11]: from itertools import groupby
In [12]: %%timeit
...: result = []
...: for k, g in groupby(x):
...:     if k:
...:         g = list(g)
...:         if len(g) < 10:
...:             g = len(g)*[0]
...:     result.extend(g)
...:
181 µs ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [13]: %%timeit s = pd.Series(x)
...: s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
...:
4.03 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
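Note that this is generous to the pandas solution, because it does not count the time spent converting from a list to a pd.Series or converting back. Including those conversions: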
In [14]: %%timeit
...: s = pd.Series(x)
...: s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
...: s = s.tolist()
...:
4.92 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Answer 1 (score: 4)
Here is another NumPy approach. Note the benchmarks at the bottom of this post:
import numpy as np
import pandas as pd
from itertools import groupby
import re
from timeit import timeit
def f_pp(data):
    # Mark every boundary where a run of 1s starts or ends
    switches = np.empty((data.size + 1,), bool)
    switches[0] = data[0]
    switches[-1] = data[-1]
    switches[1:-1] = data[:-1]^data[1:]
    # Pair the boundaries up as (start, stop) of each run of 1s
    switches = np.where(switches)[0].reshape(-1, 2)
    # Keep only the runs of length >= 10
    switches = switches[switches[:, 1]-switches[:, 0] >= 10].ravel()
    # Rebuild the output as alternating blocks of 0s and 1s
    reps = np.empty((switches.size + 1,), int)
    reps[1:-1] = np.diff(switches)
    reps[0] = switches[0]
    reps[-1] = data.size - switches[-1]
    return np.repeat(np.arange(reps.size) & 1, reps)
def f_ja(data):
    result = []
    for k, g in groupby(data):
        if k:
            g = list(g)
            if len(g) < 10:
                g = len(g)*[0]
        result.extend(g)
    return result
def f_mu(s):
    s = s.copy()
    s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
    return s
def vrange(starts, stops):
    stops = np.asarray(stops)
    l = stops - starts # Lengths of each range.
    return np.repeat(stops - l.cumsum(), l) + np.arange(l.sum())
def f_ka(data):
    x = data.copy()
    d = np.where(np.diff(x) != 0)[0]
    d2 = np.diff(np.concatenate(([0], d, [x.size])))
    ind = np.where(d2 >= 10)[0] - 1
    x[vrange(d[ind] + 1, d[ind + 1] + 2)] = 0
    return x
def f_ol(data):
    return list(re.sub(b'(?<!\x01)\x01{,9}(?!\x01)', lambda m: len(m.group()) * b'\x00', bytes(data)))
n = 10_000
data = np.repeat((np.arange(n) + np.random.randint(2))&1, np.random.randint(1, 20, (n,)))
datal = data.tolist()
datap = pd.Series(data)
kwds = dict(globals=globals(), number=100)
print(np.where(f_ja(datal) != f_pp(data))[0])
print(np.where(f_ol(datal) != f_pp(data))[0])
#print(np.where(f_ka(data) != f_pp(data))[0])
print(np.where(f_mu(datap).values != f_pp(data))[0])
print('itertools.groupby: {:6.3f} ms'.format(10 * timeit('f_ja(datal)', **kwds)))
print('re: {:6.3f} ms'.format(10 * timeit('f_ol(datal)', **kwds)))
#print('numpy Kasramvd: {:6.3f} ms'.format(10 * timeit('f_ka(data)', **kwds)))
print('pandas: {:6.3f} ms'.format(10 * timeit('f_mu(datap)', **kwds)))
print('numpy pp: {:6.3f} ms'.format(10 * timeit('f_pp(data)', **kwds)))
Sample output:
[] # Delta ja, pp
[] # Delta ol, pp
[ 749 750 751 ... 98786 98787 98788] # Delta mu, pp
itertools.groupby: 5.415 ms
re: 28.197 ms
pandas: 14.972 ms
numpy pp: 0.788 ms
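Only solutions that start from scratch are considered. @Olivier's approach, @juanpa.arrivillaga's approach and mine give the same answer; @MaxU's does not. I could not get @Kazramvd's to finish reliably. (That may be my fault: I don't know pandas and did not fully understand @Kazramvd's solution.)
Note that this is only one example; other conditions (shorter lists, more switches, and so on) may change the ranking.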
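One possible explanation for the mismatch with the pandas version, offered as a guess rather than a diagnosis: grouping on the cumulative count of non-1 values puts each 0 in the same group as the 1s that follow it, so a run of nine 1s preceded by a 0 has a group size of 10 and survives. The pure-Python emulation below mimics that grouping without pandas; the helper name emulate_cumsum_grouping is hypothetical.

```python
from collections import Counter
from itertools import accumulate


def emulate_cumsum_grouping(values, min_run=10):
    # Group id = running count of non-1 values, mimicking s.ne(1).cumsum()
    group_ids = list(accumulate(int(v != 1) for v in values))
    sizes = Counter(group_ids)
    # Zero every element whose group has fewer than min_run members
    return [0 if sizes[g] < min_run else v for v, g in zip(values, group_ids)]


data = [0] + [1] * 9  # a run of only nine 1s, preceded by a single 0
print(emulate_cumsum_grouping(data))  # the nine 1s are not zeroed
```

On the short list from the question this grouping still produces the expected result, which would be consistent with the difference only showing up on the larger random data.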
Answer 2 (score: 2)
From the encoded list B, you can use a comprehension to generate the new list.
b = [(1, 1), (0, 3), (1, 5), (0, 6), (1, 10)] # B
y = sum(([num and int(rep >= 10)] * rep for num, rep in b), [])
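Note that sum(..., []) builds a new list at every step, which can become slow when there are many runs. A sketch of the same idea with itertools.chain, swapped in here for illustration and not part of the original answer:

```python
from itertools import chain

b = [(1, 1), (0, 3), (1, 5), (0, 6), (1, 10)]  # B
# Expand each (value, run length) pair, keeping 1s only for runs of 10 or more
y = list(chain.from_iterable([int(num == 1 and rep >= 10)] * rep for num, rep in b))
print(y)
```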
Alternatively, starting from the original list, this looks like something re can do, since it works with bytes.
import re
x = [1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
y = list(re.sub(b'(?<!\x01)\x01{,9}(?!\x01)', lambda m: len(m.group()) * b'\x00', bytes(x)))
Both solutions output:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
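The pattern matches at most nine consecutive \x01 bytes that are neither preceded nor followed by another \x01, so only short runs are rewritten. A minimal randomized cross-check against a groupby version, assuming Python 3 and inputs that contain only 0 and 1 (the helper names by_regex and by_groupby are made up here):

```python
import random
import re
from itertools import groupby

SHORT_RUN = re.compile(b'(?<!\x01)\x01{,9}(?!\x01)')


def by_regex(values):
    # Replace each short run of 1-bytes with the same number of 0-bytes
    return list(SHORT_RUN.sub(lambda m: len(m.group()) * b'\x00', bytes(values)))


def by_groupby(values):
    out = []
    for key, group in groupby(values):
        run = list(group)
        out.extend(run if (key == 0 or len(run) >= 10) else [0] * len(run))
    return out


random.seed(0)
for _trial in range(200):
    # Build a random sequence out of runs of length 1..19
    sample = []
    for _run in range(random.randint(0, 30)):
        sample.extend([random.randint(0, 1)] * random.randint(1, 19))
    assert by_regex(sample) == by_groupby(sample)
print("regex and groupby agree")
```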
Answer 3 (score: 2)
If you want to use NumPy, here is a vectorized approach:
d = np.where(np.diff(x) != 0)[0]
ind = np.where(np.diff(np.concatenate(([0], d, [x.size]))) >= 10)[0] - 1
x[vrange(d[ind] + 1, d[ind + 1] + 2)] = 0
If you want to use pure Python, here is an approach that combines itertools.chain, itertools.repeat and itertools.groupby in a comprehension:
chain.from_iterable(repeat(0, len(i)) if len(i) >= 10 else i for i in [list(g) for _, g in groupby(x)])
Demo:
# Python
In [28]: list(chain.from_iterable(repeat(0, len(i)) if len(i) >= 10 else i for i in [list(g) for _, g in groupby(x)]))
Out[28]: [1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# Numpy
In [161]: x = np.array([1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1, 0, 0, 1, 1, 1, 1, 1, 1 ,1, 1, 1, 1, 0, 0])
In [162]: d = np.where(np.diff(x) != 0)[0]
In [163]: d2 = np.diff(np.concatenate(([0], d, [x.size])))
In [164]: ind = np.where(d2 >= 10)[0] - 1
In [165]: def vrange(starts, stops):
...:     stops = np.asarray(stops)
...:     l = stops - starts # Lengths of each range.
...:     return np.repeat(stops - l.cumsum(), l) + np.arange(l.sum())
...:
In [166]: x[vrange(d[ind] + 1, d[ind + 1] + 2)] = 0
In [167]: x
Out[167]:
array([1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
For vrange I used this answer, but I think there may be a more optimized way.
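To make the helper's purpose concrete: vrange(starts, stops) returns the concatenation of the half-open ranges [start, stop) as a single index array. A plain-Python sketch that produces the same values (the name vrange_py is hypothetical, and this version is not vectorized):

```python
from itertools import chain


def vrange_py(starts, stops):
    # Concatenate range(start, stop) for each (start, stop) pair
    return list(chain.from_iterable(range(a, b) for a, b in zip(starts, stops)))


print(vrange_py([1, 5], [3, 8]))  # [1, 2, 5, 6, 7]
```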
Answer 4 (score: 1)
Try this:
y = []
for pair in b:  # b is the list you called #B
    add = 0
    if pair[0] == 1 and pair[1] > 9:
        add = 1
    y.extend([add] * pair[1])
Answer 5 (score: 1)
Using pandas:
import pandas as pd
In [130]: s = pd.Series(x)
In [131]: s
Out[131]:
0 1
1 0
2 0
3 0
4 1
..
20 1
21 1
22 1
23 1
24 1
Length: 25, dtype: int64
In [132]: s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
In [133]: s.tolist()
Out[133]: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
In [134]: s
Out[134]:
0 0
1 0
2 0
3 0
4 0
..
20 1
21 1
22 1
23 1
24 1
Length: 25, dtype: int64
For your "huge" list it takes about 7 ms on my old laptop:
In [141]: len(x)
Out[141]: 5124
In [142]: %%timeit
...: s = pd.Series(x)
...: s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
...: res = s.tolist()
...:
6.56 ms ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)