I have a dataset consisting of 7 columns and roughly 900k rows. All columns are non-unique and all values are integers.
Two important conditions for the filtering:
Here is an example SQL query of the kind used to benchmark performance:
SELECT DISTINCT
col_2
FROM dataset
WHERE
c_1 in (1,9,5,6,8,18,14,7,15) AND
c_3 in (1) AND
c_4 in (61) AND
c_5 in (3) AND
c_6 in (0) AND
c_7 in (0)
The first approach I tried was SQL indexes in SQLite. That wasn't too bad, but performance dropped when the filter returned many rows.
Then I tried plain list comprehensions in Python. Performance was slightly worse than in the SQL case.
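For reference, a minimal sketch of what such a list-comprehension pass might look like, assuming the dataset is held as a list of 7-tuples (the layout and names here are my assumptions, not the asker's actual code):

# Assumed layout: dataset is a list of ~900k 7-tuples of integers.
dataset = [(18, 232, 1, 61, 3, 0, 0), (2, 5, 7, 9, 11, 13, 15)]  # stand-in sample
c_1_values = frozenset([1, 9, 5, 6, 8, 18, 14, 7, 15])
distinct_col_2 = set(row[1] for row in dataset
                     if row[0] in c_1_values
                     and row[2] == 1 and row[3] == 61
                     and row[4] == 3 and row[5] == 0 and row[6] == 0)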
Is there a better way? I'm thinking in the direction of numpy, perhaps using data structures that are more efficient than lists and SQL tables?
I'm interested in speed and performance here, not memory efficiency.
Any suggestions are welcome!
Answer 0 (score: 2)
You mentioned that each column has roughly 20 or so distinct values, except for one with 400. If memory and load time are not a concern, then I suggest creating a set per value per column.
Here is something to generate the sample dataset:
#!/usr/bin/python
from random import sample, choice
from cPickle import dump

# Generate sample dataset
value_ceiling = 1000
dataset_size = 900000
dataset_filename = 'dataset.pkl'

# number of distinct values per column
col_distrib = [400, 20, 20, 20, 20, 20, 20]
col_values = [sample(xrange(value_ceiling), x) for x in col_distrib]

dataset = []
for _ in xrange(dataset_size):
    dataset.append(tuple([choice(x) for x in col_values]))

dump(dataset, open(dataset_filename, 'wb'))
And here is something that loads the dataset, creates the per-value lookup sets for each column, and defines the search method plus a helper that builds sample searches:
#!/usr/bin/python
from random import sample, choice
from cPickle import load

dataset_filename = 'dataset.pkl'

class DataSearch(object):

    def __init__(self, filename):
        self.data = load(open(filename, 'rb'))
        self.col_sets = [dict() for _ in self.data[0]]
        self.process_data()

    def process_data(self):
        # Build an inverted index per column: value -> set of rows containing it.
        for row in self.data:
            for i, v in enumerate(row):
                self.col_sets[i].setdefault(v, set()).add(row)

    def search(self, *args):
        # args are integers, sequences of integers, or None in related column positions.
        results = []
        for i, v in enumerate(args):
            if v is None:
                continue
            elif isinstance(v, int):
                results.append(self.col_sets[i].get(v, set()))
            else:  # sequence: union the row sets of all requested values
                r = [self.col_sets[i].get(x, set()) for x in v]
                r = reduce(set.union, r[1:], r[0])
                results.append(r)
        # Intersect the smallest sets first to keep the intermediate results small.
        results.sort(key=len)
        results = reduce(set.intersection, results[1:], results[0])
        return results

    def sample_search(self, *args):
        # Each arg is None (no filter on that column) or the number of random
        # values to pick from that column's known values.
        search = []
        for i, v in enumerate(args):
            if v is None:
                search.append(None)
            else:
                search.append(sample(self.col_sets[i].keys(), v))
        return search
d = DataSearch(dataset_filename)
Using it:
>>> d.search(*d.sample_search(1,1,1,5))
set([(117, 557, 273, 437, 639, 981, 587), (117, 557, 273, 170, 53, 640, 467), (117, 557, 273, 584, 459, 127, 649)])
>>> d.search(*d.sample_search(1,1,1,1))
set([])
>>> d.search(*d.sample_search(10,None,1,1,1,1))
set([(801, 334, 414, 283, 107, 990, 221)])
>>> d.search(*d.sample_search(10,None,1,1,1,1))
set([])
>>> d.search(*d.sample_search(10,None,1,1,1,1))
set([(193, 307, 547, 549, 901, 940, 343)])
>>> import timeit
>>> timeit.Timer('d.search(*d.sample_search(10,None,1,1,1,1))','from __main__ import d').timeit(100)
1.787431001663208
1.8 seconds to do 100 searches; is that fast enough?
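To mirror the original query, the call would look like the following (assuming the positional arguments map to c_1 through c_7; with the randomly generated sample data these exact values may not occur, so the result can be empty). The DISTINCT col_2 projection is then a plain set built from column index 1:

rows = d.search((1, 9, 5, 6, 8, 18, 14, 7, 15), None, 1, 61, 3, 0, 0)
distinct_col_2 = set(row[1] for row in rows)  # SELECT DISTINCT col_2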
Answer 1 (score: 0)
Here is what I came up with:
513 $ cat filtarray.py
#!/usr/bin/python2
#
import numpy
import itertools
a = numpy.fromiter(xrange(7*900000), int)
a.shape = (900000,7)
# stuff a known match
a[33][0] = 18
a[33][2] = 1
a[33][3] = 61
# filter it, and make list, but that is not strictly necessary.
res = list(itertools.ifilter(lambda r: r[0] in (1,9,5,6,8,18,14,7,15) and r[2] == 1 and r[3] == 61, a))
print res
Running it on an Intel E8400:
512 $ time python filtarray.py
[array([ 18, 232, 1, 61, 235, 236, 237])]
python filtarray.py 5.36s user 0.05s system 99% cpu 5.418 total
Is that faster?
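For comparison, here is a vectorized sketch of the same filter using numpy.in1d (available since numpy 1.4) for the membership test instead of a per-row Python lambda; I haven't timed it on the same machine, but it avoids the per-row interpreter overhead:

import numpy

a = numpy.fromiter(xrange(7 * 900000), int)
a.shape = (900000, 7)
# One boolean mask per condition, combined with bitwise AND.
mask = (numpy.in1d(a[:, 0], (1, 9, 5, 6, 8, 18, 14, 7, 15))
        & (a[:, 2] == 1)
        & (a[:, 3] == 61))
res = a[mask]  # rows satisfying all three conditions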
Answer 2 (score: 0)
Here is a numpy version that takes about 1 second:
import numpy

x = numpy.random.randint(0, 100, (7, 900000))  # column-major: one row per column

def filter(data, filters):  # note: shadows the builtin filter
    # Build one boolean mask per filtered column, then AND them together.
    # filters[i] applies to data[i], so this covers the first six columns.
    indices = []
    for i, values in enumerate(filters):
        indices.append(numpy.any([data[i] == v for v in values], 0))
    indices = numpy.all(indices, 0)
    return data[:, indices]  # data is (7, 900000), so select along the second axis
# Usage:
filter(x, [(1,9,5,6,8,18,14,7,15), (1,), (61,), (3,), (0,), (0,)])
%timeit filter(x, [(1,9,5,6,8,18,14,7,15), (1,), (61,), (3,), (0,), (0,)])
1 loops, best of 3: 903 ms per loop
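The question's filters actually skip col_2, so a faithful mapping (my assumption; the helper name filter_by_column is hypothetical) would pair each value set with its column index, and the SELECT DISTINCT col_2 step becomes numpy.unique over row 1 of the filtered result:

import numpy

x = numpy.random.randint(0, 100, (7, 900000))

def filter_by_column(data, column_filters):
    # column_filters: list of (column_index, allowed_values) pairs,
    # so unfiltered columns (like col_2 here) are simply left out.
    masks = [numpy.any([data[i] == v for v in values], 0)
             for i, values in column_filters]
    return data[:, numpy.all(masks, 0)]

result = filter_by_column(x, [(0, (1, 9, 5, 6, 8, 18, 14, 7, 15)),
                              (2, (1,)), (3, (61,)), (4, (3,)),
                              (5, (0,)), (6, (0,))])
distinct_col_2 = numpy.unique(result[1])  # equivalent of SELECT DISTINCT col_2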