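Related questions:
I have a very large data set (more than 5 million items) and I need to get the N largest items from it. The most natural way to do this is to use a heap/priority queue that stores only the top N items. There are several good priority queue implementations for the JVM (Scala/Java), namely:
The first two are good, but they store all the items, which in my case causes critical memory overhead. The third (the Lucene implementation) does not have this drawback, but as far as I can see from the documentation it also does not support a custom comparator, which makes it useless for me.
So, my question is: is there a PriorityQueue implementation with fixed capacity and a custom comparator?
UPD. Finally, based on Peter's answer, I created my own implementation: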
import java.util.Comparator;
import java.util.TreeSet;

public class FixedSizePriorityQueue<E> extends TreeSet<E> {

    private int elementsLeft;

    public FixedSizePriorityQueue(int maxSize) {
        super(new NaturalComparator());
        this.elementsLeft = maxSize;
    }

    public FixedSizePriorityQueue(int maxSize, Comparator<E> comparator) {
        super(comparator);
        this.elementsLeft = maxSize;
    }

    /**
     * @return true if element was added, false otherwise
     */
    @Override
    public boolean add(E e) {
        if (elementsLeft == 0 && size() == 0) {
            // max size was initiated to zero => just return false
            return false;
        } else if (elementsLeft > 0) {
            // queue isn't full => add element and decrement elementsLeft
            boolean added = super.add(e);
            if (added) {
                elementsLeft--;
            }
            return added;
        } else {
            // queue is full => compare to the least element
            // (a comparator only guarantees the sign of its result, so test > 0 rather than == 1)
            int compared = super.comparator().compare(e, this.first());
            if (compared > 0 && !contains(e)) {
                // new element is larger than the least in queue and not already present
                // => pull the least and add new one to queue
                pollFirst();
                return super.add(e);
            } else {
                // new element is not larger than the least in queue, or is a duplicate => return false
                return false;
            }
        }
    }
}
(where NaturalComparator is taken from this question)
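Answer 0 (score: 16)
How can you say that Lucene doesn't support a custom comparator?
The class is abstract, and you must implement the abstract method lessThan(T a, T b).
Answer 1 (score: 14)
Although this is an old question, it may be helpful to somebody else. You can use MinMaxPriorityQueue from Guava, Google's Java library.
Answer 2 (score: 12)
You could use a SortedSet, e.g. a TreeSet with a custom comparator, and remove the smallest element when the size reaches N.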
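A minimal sketch of that idea, assuming Java 9+ for List.of. The class name TopNWithTreeSet, the method topN and the byLength comparator are made up for illustration. Note that a TreeSet silently drops elements the comparator considers equal, so this keeps the N largest distinct items.

```java
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class TopNWithTreeSet {

    /** Returns the n largest distinct items according to cmp, in ascending order. */
    static <T> TreeSet<T> topN(Iterable<T> items, int n, Comparator<? super T> cmp) {
        TreeSet<T> best = new TreeSet<>(cmp);
        if (n <= 0) {
            return best;
        }
        for (T item : items) {
            best.add(item);
            if (best.size() > n) {
                best.pollFirst(); // drop the current minimum so at most n items are held
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> words = List.of("pear", "fig", "banana", "kiwi", "apple");
        // Hypothetical custom comparator: order by length, break ties alphabetically
        Comparator<String> byLength =
                Comparator.comparingInt(String::length).thenComparing(Comparator.<String>naturalOrder());
        System.out.println(topN(words, 2, byLength)); // prints [apple, banana]
    }
}
```

Memory stays bounded by N + 1 elements regardless of how many items are streamed through.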
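Answer 3 (score: 4)
I can't think of a ready-to-use one, but you can check my implementation of this collection, which has similar requirements.
The difference is the comparator, but if you extend from PriorityQueue you will have it. On each add, check whether you have reached the limit, and if you have, remove the last item.
Answer 4 (score: 4)
Below is the implementation I used before. It follows Peter's suggestion.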
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

@interface NonThreadSafe {
}

/**
 * A priority queue implementation with a fixed size based on a {@link TreeSet}.
 * The number of elements in the queue will be at most {@code maxSize}.
 * Once the number of elements in the queue reaches {@code maxSize}, trying to add a new element
 * will remove the least element in the queue if the new element is greater than
 * the current least element. The queue will not be modified otherwise.
 */
@NonThreadSafe
public class FixedSizePriorityQueue<E> {
    private final TreeSet<E> treeSet; /* backing data structure */
    private final Comparator<? super E> comparator;
    private final int maxSize;

    /**
     * Constructs a {@link FixedSizePriorityQueue} with the specified {@code maxSize}
     * and {@code comparator}.
     *
     * @param maxSize    - The maximum size the queue can reach, must be a positive integer.
     * @param comparator - The comparator to be used to compare the elements in the queue, must be non-null.
     */
    public FixedSizePriorityQueue(final int maxSize, final Comparator<? super E> comparator) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize = " + maxSize + "; expected a positive integer.");
        }
        if (comparator == null) {
            throw new NullPointerException("Comparator is null.");
        }
        this.treeSet = new TreeSet<E>(comparator);
        this.comparator = treeSet.comparator();
        this.maxSize = maxSize;
    }

    /**
     * Adds an element to the queue. If the queue contains {@code maxSize} elements, {@code e} will
     * be compared to the least element in the queue using {@code comparator}.
     * If {@code e} is greater than the least element, that element will be removed and
     * {@code e} will be added instead. Otherwise, the queue will not be modified
     * and {@code e} will not be added.
     *
     * @param e - Element to be added, must be non-null.
     */
    public void add(final E e) {
        if (e == null) {
            throw new NullPointerException("e is null.");
        }
        if (maxSize <= treeSet.size()) {
            final E firstElm = treeSet.first();
            if (comparator.compare(e, firstElm) <= 0) {
                return; // e is not greater than the current least element
            } else {
                treeSet.pollFirst();
            }
        }
        treeSet.add(e);
    }

    /**
     * @return a sorted snapshot of the queue wrapped in
     * {@link Collections#unmodifiableList(java.util.List)}.
     */
    public List<E> asList() {
        return Collections.unmodifiableList(new ArrayList<E>(treeSet));
    }
}
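I would appreciate any feedback.
EDIT: It seems that using a TreeSet is not very efficient after all, because calls to first() appear to take sublinear rather than constant time. I changed the TreeSet to a PriorityQueue. The modified add() method looks like this: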
/**
 * Adds an element to the queue. If the queue contains {@code maxSize} elements, {@code e} will
 * be compared to the lowest element in the queue using {@code comparator}.
 * If {@code e} is greater than the lowest element, that element will be removed and
 * {@code e} will be added instead. Otherwise, the queue will not be modified
 * and {@code e} will not be added.
 *
 * @param e - Element to be added, must be non-null.
 */
public void add(final E e) {
    if (e == null) {
        throw new NullPointerException("e is null.");
    }
    if (maxSize <= priorityQueue.size()) {
        final E firstElm = priorityQueue.peek();
        if (comparator.compare(e, firstElm) <= 0) {
            return; // e is not greater than the current lowest element
        } else {
            priorityQueue.poll();
        }
    }
    priorityQueue.add(e);
}
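One side effect of this switch may be worth keeping in mind: a TreeSet drops any element its comparator reports as equal to one already stored, while a PriorityQueue keeps such duplicates. The small sketch below illustrates the difference; the class and method names are made up for illustration and are not part of the implementation above.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.TreeSet;

class DuplicateHandlingDemo {

    /** Number of elements retained by a TreeSet after adding all values. */
    static int sizeInTreeSet(int[] values) {
        TreeSet<Integer> set = new TreeSet<>(Comparator.<Integer>naturalOrder());
        for (int v : values) {
            set.add(v); // returns false and stores nothing when an equal element exists
        }
        return set.size();
    }

    /** Number of elements retained by a PriorityQueue after adding all values. */
    static int sizeInPriorityQueue(int[] values) {
        PriorityQueue<Integer> pq = new PriorityQueue<>(Comparator.<Integer>naturalOrder());
        for (int v : values) {
            pq.add(v); // always stores the element, duplicates included
        }
        return pq.size();
    }

    public static void main(String[] args) {
        int[] scores = {7, 7, 3, 7};
        System.out.println("TreeSet keeps " + sizeInTreeSet(scores));             // TreeSet keeps 2
        System.out.println("PriorityQueue keeps " + sizeInPriorityQueue(scores)); // PriorityQueue keeps 4
    }
}
```

So if several items can compare as equal (for example, equal scores), the PriorityQueue version returns all of them up to maxSize, whereas the TreeSet version returns each distinct value only once.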
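Answer 5 (score: 2)
Exactly what I was looking for. However, the implementation contains a bug:
Namely, when elementsLeft > 0 and e is already contained in the TreeSet. In that case elementsLeft is decremented, but the number of elements in the TreeSet stays the same.
I suggest replacing the corresponding lines in the add() method with:
} else if (elementsLeft > 0) {
    // queue isn't full => add element and decrement elementsLeft
    boolean added = super.add(e);
    if (added) {
        elementsLeft--;
    }
    return added;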
Answer 6 (score: 1)
Try this code:
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.locks.ReentrantLock;

public class BoundedPQueue<E extends Comparable<E>> {
    /**
     * Lock used for all public operations
     */
    private final ReentrantLock lock;
    PriorityBlockingQueue<E> queue;
    int size = 0;

    public BoundedPQueue(int capacity) {
        queue = new PriorityBlockingQueue<E>(capacity, new CustomComparator<E>());
        size = capacity;
        this.lock = new ReentrantLock();
    }

    public boolean offer(E e) {
        final ReentrantLock lock = this.lock;
        lock.lock();
        try {
            if (queue.size() >= size) {
                // queue is full: take out the current maximum (head of the max heap)
                // and keep whichever of it and e is smaller
                E vl = queue.poll();
                if (vl.compareTo(e) < 0)
                    e = vl;
            }
            return queue.offer(e);
        } finally {
            lock.unlock();
        }
    }

    public E poll() {
        final ReentrantLock lock = this.lock;
        lock.lock();
        try {
            return queue.poll();
        } finally {
            lock.unlock();
        }
    }

    public static class CustomComparator<E extends Comparable<E>> implements Comparator<E> {
        @Override
        public int compare(E o1, E o2) {
            // give me a max heap (swapping the arguments avoids overflow from negating the result)
            return o2.compareTo(o1);
        }
    }
}
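Answer 7 (score: 1)
If you have Guava, here is what I put together. I think it is fairly complete. Let me know if I missed something.
You can use Guava's ForwardingBlockingQueue so that you don't have to map all the other methods.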
import java.util.Collection;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.TimeUnit;

import com.google.common.util.concurrent.ForwardingBlockingQueue;

public class PriorityBlockingQueueDecorator<E> extends
        ForwardingBlockingQueue<E> {

    public static final class QueueFullException extends IllegalStateException {
        private static final long serialVersionUID = -9218216017510478441L;
    }

    private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    private int maxSize;
    private PriorityBlockingQueue<E> delegate;

    public PriorityBlockingQueueDecorator(PriorityBlockingQueue<E> delegate) {
        this(MAX_ARRAY_SIZE, delegate);
    }

    public PriorityBlockingQueueDecorator(int maxSize,
            PriorityBlockingQueue<E> delegate) {
        this.maxSize = maxSize;
        this.delegate = delegate;
    }

    @Override
    protected BlockingQueue<E> delegate() {
        return delegate;
    }

    @Override
    public boolean add(E element) {
        return offer(element);
    }

    @Override
    public boolean addAll(Collection<? extends E> collection) {
        boolean modified = false;
        for (E e : collection)
            if (add(e))
                modified = true;
        return modified;
    }

    @Override
    public boolean offer(E e, long timeout, TimeUnit unit)
            throws InterruptedException {
        return offer(e);
    }

    @Override
    public boolean offer(E o) {
        // reject the element once the queue already holds maxSize elements
        if (size() >= maxSize) {
            throw new QueueFullException();
        }
        return super.offer(o);
    }
}
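Answer 8 (score: 1)
Well, quite an old question, but I'm confused as to why a simpler solution hasn't been suggested yet.
Unless I'm missing something, this can be solved easily with a min-heap (Java's default PriorityQueue implementation) and a slight twist: as soon as the size of the PriorityQueue becomes greater than k (assuming we are trying to store the top k elements), poll the head.
Here is an example of what I mean: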
public void storeKLargest(int[] nums, int k) {
    PriorityQueue<Integer> pq = new PriorityQueue<>(k+1);
    for(int num: nums){
        if(pq.size() < k || pq.peek() < num)
            pq.offer(num);
        if(pq.size() == k+1)
            pq.poll();
    }
}
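I used a PriorityQueue of Integer, but it is simple enough to replace it with a custom object and pass in a custom Comparator.
Unless I'm missing something obvious, I suppose this is what the OP was looking for.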
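As a rough sketch of that generalization, assuming Java 9+ for List.of: the names TopK, kLargest and Product below are made up for illustration and do not come from the code above. The heap is ordered by the supplied comparator, so its head is always the smallest of the elements kept so far.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopK {

    /** Hypothetical item type, used only for the demo in main. */
    static final class Product {
        final String name;
        final double price;

        Product(String name, double price) {
            this.name = name;
            this.price = price;
        }
    }

    /** Returns the k largest items according to cmp, largest first. */
    static <T> List<T> kLargest(Iterable<? extends T> items, int k, Comparator<? super T> cmp) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        PriorityQueue<T> heap = new PriorityQueue<>(cmp); // min-heap with respect to cmp
        for (T item : items) {
            if (heap.size() < k || cmp.compare(item, heap.peek()) > 0) {
                heap.offer(item);
                if (heap.size() > k) {
                    heap.poll(); // evict the smallest so the heap never holds more than k items
                }
            }
        }
        List<T> result = new ArrayList<>(heap);
        result.sort(cmp.reversed());
        return result;
    }

    public static void main(String[] args) {
        List<Product> products = List.of(
                new Product("pen", 1.5), new Product("book", 12.0),
                new Product("lamp", 30.0), new Product("mug", 8.0));
        Comparator<Product> byPrice = Comparator.comparingDouble(p -> p.price);
        for (Product p : kLargest(products, 2, byPrice)) {
            System.out.println(p.name); // prints lamp, then book
        }
    }
}
```

Each element costs at most O(log k) heap work, and elements no larger than the current k-th largest are rejected after a single comparison against peek().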
Answer 9 (score: 0)
Create a PriorityQueue with a size limit. It stores the N largest numbers.