I have a huge list (45M+ data points) of values: [78, 0, 5, 150, 9000, 5, ..., 25, 9, 78422, ...]
I can easily get the minimum, the maximum, the number of values, and their sum:
file_handle = open('huge_data_file.txt', 'r')

sum_values = 0
min_value = None
max_value = None

for i, line in enumerate(file_handle):
    value = int(line.strip())
    if min_value is None or value < min_value:
        min_value = value
    if max_value is None or value > max_value:
        max_value = value
    sum_values += value

average_value = float(sum_values) / (i + 1)  # enumerate starts at 0, so the count is i + 1
However, that is not what I need. I need a list of 10 numbers such that the number of data points between each two consecutive boundaries is roughly equal, e.g. the boundary values
[0, 30, 120, 325, 912, 1570, 2522, 5002, 7025, 78422]
so that the number of data points between 0 and 30, or between 30 and 120, is close to 4.5 million. How can we do this?
=============================
I am well aware that we need to sort the data. The problem is that I cannot fit all of it in a single in-memory variable; I have to read it sequentially from a generator (the file handle).
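For reference, if the data did fit in memory the goal could be written in a few lines. This is an illustration only, reusing the file name from the code above; it is exactly the load-everything approach that does not work for 45M+ values:

# Load, sort, and take 10 boundary values (including min and max)
# that split the sorted data into 9 roughly equal-count bins.
values = sorted(int(line) for line in open('huge_data_file.txt'))
n = len(values)
boundaries = [values[k * (n - 1) // 9] for k in range(10)]
print(boundaries)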
Answer 0 (score: 2)
If you are happy with an approximation, there is a good (and fairly easy to implement) algorithm for computing quantiles from streaming data: "Space-Efficient Online Computation of Quantile Summaries" by Greenwald and Khanna.
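The Greenwald-Khanna summary takes some care to implement correctly. As a much simpler stand-in with the same bounded-memory property, plain reservoir sampling also gives usable approximate quantiles. The sketch below uses that simpler technique, not the algorithm from the paper; the function name, sample size k, and file name are only illustrative.

import random

def approx_quantile_boundaries(lines, k=100000, n_bins=9, seed=0):
    # Keep a uniform random sample of k values from the stream
    # (reservoir sampling), then read the quantiles off the sample.
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        value = int(line)
        if i < k:
            reservoir.append(value)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = value
    reservoir.sort()
    m = len(reservoir)
    return [reservoir[b * (m - 1) // n_bins] for b in range(n_bins + 1)]

# Usage with the file from the question:
# with open('huge_data_file.txt') as f:
#     print(approx_quantile_boundaries(f))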
Answer 1 (score: 1)
The dumb numpy way:
import numpy as np
# example data (produced by numpy but converted to a simple list)
datalist = list(np.random.randint(0, 10000000, 45000000))
# converted back to numpy array (start here with your data)
arr = np.array(datalist)
np.percentile(arr, 10), np.percentile(arr, 20), np.percentile(arr, 30)
# ref:
# http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html
You could also hack it together yourself wherever you like:
arr.sort()
# And then select the 10%, 20% etc. value, add some check for an equal amount
# of numbers within each bin and then calculate the average, exercise for the reader :-)
The problem is that calling this function many times slows things down, so really, just sort the array once and pick the elements yourself.
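As a sketch of that idea (reusing arr from the snippet above): np.percentile accepts a whole sequence of percentiles, so all decile boundaries can be requested in a single call, or the array can be sorted once and indexed directly.

import numpy as np

# One call returns all nine interior decile boundaries at once.
deciles = np.percentile(arr, [10, 20, 30, 40, 50, 60, 70, 80, 90])

# Or: sort once, then pick 10 boundary values (including min and max)
# that split the data into 9 roughly equal-count bins.
arr.sort()
n = len(arr)
boundaries = [arr[k * (n - 1) // 9] for k in range(10)]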
Answer 2 (score: 1)
As you said in the comments, you want a solution that can scale to datasets larger than what can be stored in RAM, so feed the data into an SQLite3 database. Even if your dataset is 10 GB and you only have 8 GB of RAM, SQLite3 can still sort the data and return it to you in order.
An SQLite3 database effectively gives you a generator over your sorted data. You may also want to look beyond Python and consider other database solutions.
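A minimal sketch of that approach with Python's built-in sqlite3 module (the database file name, the table name, and the decile arithmetic are illustrative assumptions, not taken from the answer):

import sqlite3

# Load the values into an on-disk database, let SQLite sort them,
# and pull the 10 decile boundaries out with LIMIT/OFFSET.
conn = sqlite3.connect('huge_data.db')
conn.execute('CREATE TABLE IF NOT EXISTS samples (value INTEGER)')

with open('huge_data_file.txt') as f:
    conn.executemany('INSERT INTO samples (value) VALUES (?)',
                     ((int(line),) for line in f))
conn.commit()

# An index on value keeps the ORDER BY ... OFFSET queries from
# re-sorting the whole table each time.
conn.execute('CREATE INDEX IF NOT EXISTS idx_value ON samples (value)')

n = conn.execute('SELECT COUNT(*) FROM samples').fetchone()[0]
boundaries = []
for k in range(10):
    offset = k * (n - 1) // 9
    row = conn.execute('SELECT value FROM samples ORDER BY value LIMIT 1 OFFSET ?',
                       (offset,)).fetchone()
    boundaries.append(row[0])
print(boundaries)
conn.close()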
Answer 3 (score: 1)
Here is a pure-Python implementation of a disk-based partitioned sort. It is slow, ugly code, but it works and hopefully each stage is relatively clear (the merge stage is really ugly!).
#!/usr/bin/env python
import os

def get_next_int_from_file(f):
    l = f.readline()
    if not l:
        return None
    return int(l.strip())

MAX_SAMPLES_PER_PARTITION = 1000000
PARTITION_FILENAME = "_{}.txt"

# Partition data set
part_id = 0
eof = False
with open("data.txt", "r") as fin:
    while not eof:
        print("Creating partition {}".format(part_id))
        with open(PARTITION_FILENAME.format(part_id), "w") as fout:
            for _ in range(MAX_SAMPLES_PER_PARTITION):
                line = fin.readline()
                if not line:
                    eof = True
                    break
                fout.write(line)
        part_id += 1
num_partitions = part_id

# Sort each partition
for part_id in range(num_partitions):
    print("Reading unsorted partition {}".format(part_id))
    with open(PARTITION_FILENAME.format(part_id), "r") as fin:
        samples = [int(line.strip()) for line in fin.readlines()]
    print("Disk-Deleting unsorted {}".format(part_id))
    os.remove(PARTITION_FILENAME.format(part_id))
    print("In-memory sorting partition {}".format(part_id))
    samples.sort()
    print("Writing sorted partition {}".format(part_id))
    with open(PARTITION_FILENAME.format(part_id), "w") as fout:
        fout.writelines(["{}\n".format(sample) for sample in samples])

# Merge-sort the partitions
# NB This is a very inefficient implementation!
print("Merging sorted partitions")
part_files = []
part_next_int = []
num_lines_out = 0

# Setup data structures for the merge
for part_id in range(num_partitions):
    fin = open(PARTITION_FILENAME.format(part_id), "r")
    next_int = get_next_int_from_file(fin)
    if next_int is None:
        continue
    part_files.append(fin)
    part_next_int.append(next_int)

with open("data_sorted.txt", "w") as fout:
    while part_files:
        # Find the smallest number across all files
        min_number = None
        min_idx = None
        for idx in range(len(part_files)):
            if min_number is None or part_next_int[idx] < min_number:
                min_number = part_next_int[idx]
                min_idx = idx

        # Now add that number, and move the relevant file along
        fout.write("{}\n".format(min_number))
        num_lines_out += 1
        if num_lines_out % MAX_SAMPLES_PER_PARTITION == 0:
            print("Merged samples: {}".format(num_lines_out))

        next_int = get_next_int_from_file(part_files[min_idx])
        if next_int is None:
            # Remove this partition, it's now finished
            part_files[min_idx].close()
            del part_files[min_idx:min_idx + 1]
            del part_next_int[min_idx:min_idx + 1]
        else:
            part_next_int[min_idx] = next_int

# Cleanup partition files
for part_id in range(num_partitions):
    os.remove(PARTITION_FILENAME.format(part_id))
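Once data_sorted.txt exists, the ten boundary values can be picked out with two sequential passes over the file. This small follow-up sketch is not part of the original answer; it assumes one integer per line, exactly as written by the merge above:

# First pass: count the lines. Second pass: collect the 10 boundary
# positions (including the first and last value).
with open("data_sorted.txt") as f:
    n = sum(1 for _ in f)

wanted = {k * (n - 1) // 9 for k in range(10)}
boundaries = []
with open("data_sorted.txt") as f:
    for idx, line in enumerate(f):
        if idx in wanted:
            boundaries.append(int(line))
print(boundaries)

As a design note, the hand-written merge loop could also be replaced by heapq.merge from the standard library, which lazily merges any number of already-sorted iterators.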
Answer 4 (score: 1)
My code is a proposal for finding the result without using much space. In testing, it found a quantile value in 7 minutes 51 seconds for a dataset of size 45 000 000.
import random
from bisect import bisect_left

class data():
    def __init__(self, values):
        random.shuffle(values)
        self.values = values

    def __iter__(self):
        for i in self.values:
            yield i

    def __len__(self):
        return len(self.values)

    def sortedValue(self, percentile):
        val = list(self)
        val.sort()
        num = int(len(self) * percentile)
        return val[num]

def init():
    numbers = data([x for x in range(1, 1000000)])
    print(seekPercentile(numbers, 0.1))
    print(numbers.sortedValue(0.1))

def seekPercentile(numbers, percentile):
    lower, upper = minmax(numbers)
    maximum = upper
    approx = _approxPercentile(numbers, lower, upper, percentile)
    return neighbor(approx, numbers, maximum)

def minmax(list):
    minimum = float("inf")
    maximum = float("-inf")
    for num in list:
        if num > maximum:
            maximum = num
        if num < minimum:
            minimum = num
    return minimum, maximum

def neighbor(approx, numbers, maximum):
    dif = maximum
    for num in numbers:
        if abs(approx - num) < dif:
            result = num
            dif = abs(approx - num)
    return result

def _approxPercentile(numbers, lower, upper, percentile):
    middles = []
    less = []
    magicNumber = 10000
    step = (upper - lower) / magicNumber
    for i in range(1, magicNumber - 1):
        middles.append(lower + i * step)
        less.append(0)
    for num in numbers:
        index = bisect_left(middles, num)
        if index < len(less):
            less[index] += 1
    summing = 0
    for index, testVal in enumerate(middles):
        summing += less[index]
        if summing / len(numbers) < percentile:
            print(" Change lower from " + str(lower) + " to " + str(testVal))
            lower = testVal
        if summing / len(numbers) > percentile:
            print(" Change upper from " + str(upper) + " to " + str(testVal))
            upper = testVal
            break
    precision = 0.01
    if (lower + precision) > upper:
        return lower
    else:
        return _approxPercentile(numbers, lower, upper, percentile)

init()
I have made some edits to my code, and now I believe it at least works correctly this way, even though it is not optimal.