I am working with large datasets (n > 1,000,000). The data contains order information for items. Each order, in JSON format, has a boolean field called is_buy_order. I want to split the list of orders into two separate lists depending on whether that boolean is true or false.
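I came up with an algorithm that is flawed but faster than iterating over every order.
It splits the dataset in half at a pivot, then checks either side to work out which half lies closer to the transition point (false -> true). It keeps halving until the values on either side of the pivot differ, or until pivot == 1, which I take to mean there is no transition.
Here is my current code, which I have been running against a heavily reduced dataset: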
start = time.time()
orders_file = open("resources/regions/"+x.replace(" ", "")[1:-1]+".json", 'r')
orders = orders_file.readlines()
orders_file.close()
item_buy, item_sell = [], []
pivot_found = False
print(len(orders))
if len(orders) > 1:
    while not pivot_found:
        temp_orders = orders
        pivot = len(temp_orders)//2
        if pivot == 1:
            break
        if json.loads(orders[pivot].replace("\n", ""))["is_buy_order"]:
            orders = orders[:pivot]
            buy_sell_index -= pivot
        else:
            orders = orders[pivot:]
        if json.loads(temp_orders[pivot].replace("\n", ""))["is_buy_order"] != json.loads(temp_orders[pivot-1].replace("\n", ""))["is_buy_order"]:
            pivot_found = True
            item_buy, item_sell = temp_orders[:pivot], temp_orders[pivot:]
            buy_sell_index = orders.index(item_sell[0])
print(x, time.time()-start, buy_sell_index)
If a solution needs the dataset in a different format, that can be arranged.
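Answer 0 (score: 2)
There is a way to do this with the bisect module. On its own (before Python 3.10) it does not support key functions, but you can put a wrapper around the list to get the same effect: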
from bisect import bisect
my_list = [
    {"is_buy_order": False},
    {"is_buy_order": False},
    {"is_buy_order": False},
    {"is_buy_order": False},
    {"is_buy_order": True},
    {"is_buy_order": True},
    {"is_buy_order": True},
    {"is_buy_order": True},
    {"is_buy_order": True},
    {"is_buy_order": True}
]

class KeyFuncWrapper(object):
    def __init__(self, iterable, key):
        self.it = iterable
        self.key = key

    def __len__(self):
        return len(self.it)

    def __getitem__(self, i):
        return self.key(self.it[i])

# prints 4
print(bisect(
    KeyFuncWrapper(my_list, lambda x: x["is_buy_order"]),
    False,  # value for bisect to look for
))
This works because bisect looks at the i-th element of the KeyFuncWrapper, and the KeyFuncWrapper in turn returns the key function applied to the i-th element of the underlying list.
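To connect this to the question's data, here is a minimal sketch that applies the same wrapper idea to raw JSON lines (as returned by readlines()) and slices the list at the transition point. The helper name split_orders and the sample lines are made up for illustration, and the sketch assumes that every sell order (false) comes before every buy order (true).

```python
import json
from bisect import bisect


class KeyFuncWrapper:
    """Expose key(item) for each item so bisect compares keys, not items."""

    def __init__(self, items, key):
        self.items = items
        self.key = key

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        return self.key(self.items[i])


def split_orders(lines):
    """Split JSON lines into (sell, buy), assuming sells precede buys."""
    is_buy = KeyFuncWrapper(lines, lambda line: json.loads(line)["is_buy_order"])
    # bisect (an alias of bisect_right) returns the index just past the
    # last False, which is the index of the first buy order.
    first_buy = bisect(is_buy, False)
    return lines[:first_buy], lines[first_buy:]


# Hypothetical sample input: one JSON document per line.
sample = [
    '{"order_id": 1, "is_buy_order": false}\n',
    '{"order_id": 2, "is_buy_order": false}\n',
    '{"order_id": 3, "is_buy_order": true}\n',
]
sell, buy = split_orders(sample)
print(len(sell), len(buy))  # 2 1
```

Only the lines that bisect actually probes are parsed, so roughly log2(n) calls to json.loads are made instead of n. On Python 3.10 and later the wrapper can probably be dropped in favour of the built-in form, bisect(lines, False, key=...), with the same lambda as the key.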
Answer 1 (score: 1)
This can indeed be done with a simple binary search.
def find_first_buy_order(data):
    """
    Performs a binary search on the passed data to find the first buy order.

    Parameters
    ----------
    data : array_like
        List of order dictionaries

    Raises
    ------
    ValueError
        When the data is unsorted, or no buy order exists in the data

    Returns
    -------
    int
        The index in data of the first buy order
    dict
        The first buy order
    """
    low = 0
    high = len(data)

    # Check boundary conditions first
    if not data or not data[-1]["is_buy_order"]:
        raise ValueError("There are no buy orders in the data set!")

    if data[0]["is_buy_order"]:
        return 0, data[0]

    while low != high:
        mid = low + (high - low) // 2
        previous = data[mid - 1]["is_buy_order"]
        current = data[mid]["is_buy_order"]

        if previous != current:  # current is True, previous is False
            return mid, data[mid]

        if previous:  # previous is True, we need to go left
            high = mid
        else:  # need to go right
            low = mid

    raise ValueError("Are you sure the data is sorted?")
For your dataset (which I took the liberty of converting to a list of dictionaries):
>>> idx, value = find_first_buy_order(DATA)
>>> print(idx, value)
3 {'duration': 90, 'min_volume': 1, 'system_id': 30001780, 'type_id': 34, 'location_id': 1027954902335, 'order_id': 5191398100, 'issued': '2018-08-03T01:50:59Z', 'price': 4.0, 'volume_remain': 10000000, 'range': '5', 'is_buy_order': True, 'volume_total': 10000000}
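If the goal is the two lists rather than the index, a lower-bound style variant may be simpler, since it needs no boundary checks and also copes with data that contains only buy orders or only sell orders. This is a sketch of a different formulation, not the function above. The name split_at_first_buy and the sample data are made up for illustration, and it still assumes the data is sorted with sell orders first.

```python
def split_at_first_buy(data):
    """Return (sell_orders, buy_orders), assuming every False precedes every True."""
    low, high = 0, len(data)
    # Invariant: every order before `low` is a sell order,
    # and every order from `high` onwards is a buy order.
    while low < high:
        mid = (low + high) // 2
        if data[mid]["is_buy_order"]:
            high = mid      # the first buy order is at mid or to its left
        else:
            low = mid + 1   # the first buy order is to the right of mid
    return data[:low], data[low:]


# Hypothetical miniature data set, not the one from the question.
orders = [{"is_buy_order": flag} for flag in (False, False, False, True, True)]
sell, buy = split_at_first_buy(orders)
print(len(sell), len(buy))  # 3 2
```

The search itself takes O(log n) steps. The two slices still copy the list, so if only the boundary is needed, returning low on its own avoids that cost.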