Question

I need to create an efficient algorithm that returns the unique values from an unsorted input. I don't know the length of the input.

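Since the function that calls this algorithm can stop reading at any time, I think a well-defined Iterable implementation is the right approach, so that I don't waste processing power on input that is never needed.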

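Currently I am using a Set to keep track of the values I have already read. But I don't know whether this is the most efficient algorithm, since my input may be very long.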

The code below is my current working algorithm:

import java.util.Iterator;
import java.util.HashSet;
import java.util.Set;
import java.util.NoSuchElementException;
import java.io.BufferedReader;
import java.io.StringReader;
import java.io.IOException;

public class UniqueValues implements Iterable<String> {
    private final Iterator<String> iterator;

    public UniqueValues(BufferedReader r) {
        this.iterator = new UniqueValuesIterator(r);
    }

    public Iterator<String> iterator() {
        return iterator;
    }

    static class UniqueValuesIterator implements Iterator<String> {
        private final BufferedReader r;

        private final Set<String> values = new HashSet<>();

        // When 'next' is null, need to get the next value
        private String next;

        public UniqueValuesIterator(BufferedReader r) {
            this.r = r;
        }

        public boolean hasNext() {
            // Good point from OldCurmudgeon
            if(next != null) return true;

            try {
                String line;
                while((line = r.readLine()) != null) {
                    if(values.add(line)) { // add() returns 'true' when it is not a duplicate value.
                        next = line;
                        return true;
                    }
                }
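            } catch(IOException e) {
                // Surface read failures instead of silently ending the iteration
                throw new java.io.UncheckedIOException(e);
            }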

            return false;
        }

        public String next() {
            if(next == null) {
                if(! hasNext() ) throw new NoSuchElementException();
            }

            final String temp = next;
            next = null;
            return temp;
        }

        public void remove() {
            throw new UnsupportedOperationException();
        }
    }

    // For testing
    public static void main(String... args) {
        final StringReader r = new StringReader("value1\nvalue6\nvalue1\nvalue3\nvalue3\nvalue6\nvalue1\nvalue6");

        for(final String value : new UniqueValues(new BufferedReader(r)) ) {
            System.out.println(value);
        }

        /* Output is (order is not important):
         * 
         * value1
         * value6
         * value3
         */
    }
}

Is there a better algorithm for this?

Answer 1

This looks fine, but I would be tempted to make the code less generic unless you do this often.

try(BufferedReader br = new BufferedReader(new FileReader(file))) {
    Set<String> lines = new HashSet<>();
    for(String line; (line = br.readLine()) != null;) {
        if(lines.add(line)) {
            // do something
        }
    }
}

Or, if you must return an Iterable, you could do this:

public static Set<String> uniqueLines(File file) throws IOException {
    try(BufferedReader br = new BufferedReader(new FileReader(file))) {
        Set<String> lines = new HashSet<>();
        for(String line; (line = br.readLine()) != null;)
            lines.add(line); // the Set ignores duplicates
        return lines;
    }
}
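
Note that returning a Set reads the whole input before the caller sees the first value. If lazy consumption matters, as it does in the question, one possible alternative is a stream-based iterator. This is only a sketch, assuming Java 8 or later; the class name LazyUniqueLines is made up for illustration. It relies on BufferedReader.lines() and Stream.distinct(), which still remember every unique line seen so far, so memory use grows with the number of distinct values just as with the HashSet.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Iterator;

// Hypothetical helper class, not part of the original question or answer.
public class LazyUniqueLines {

    // Returns an iterator over the distinct lines of the reader.
    // Lines are pulled from the reader as the caller advances the iterator;
    // I/O failures surface as java.io.UncheckedIOException.
    public static Iterator<String> uniqueLines(BufferedReader reader) {
        return reader.lines().distinct().iterator();
    }

    public static void main(String[] args) {
        BufferedReader r = new BufferedReader(
                new StringReader("value1\nvalue6\nvalue1\nvalue3\nvalue3\nvalue6"));
        Iterator<String> it = uniqueLines(r);
        while (it.hasNext()) {
            System.out.println(it.next()); // prints value1, value6, value3
        }
    }
}
```

The caller can simply stop calling next() once it has enough values, and the rest of the input is never deduplicated.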

Answer 2

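If your input consists only of strings, you can use a trie to keep track of them. It has O(string length) lookup and insertion time, and it can be more space-efficient than a hash map.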

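However, there is one caveat: a trie has a fairly large overhead per tree node, so it only becomes more efficient when the input is large enough and the elements are similar enough. For example, it brings no benefit for randomly generated strings.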
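
To make the idea concrete, here is a minimal sketch of a trie used as a "seen" set. The class name TrieSet and its add method are assumptions for illustration, not code from the question; add mirrors the contract of Set.add so it could replace the HashSet in the iterator above. Each node holds its own HashMap of children, which is exactly the per-node overhead mentioned in the caveat.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical trie-backed set of strings, for illustration only.
public class TrieSet {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true when a stored string ends at this node
    }

    private final Node root = new Node();

    // Returns true if the value was not present before (same contract as Set.add).
    // Runs in O(value.length()) steps, independent of how many strings are stored.
    public boolean add(String value) {
        Node node = root;
        for (int i = 0; i < value.length(); i++) {
            node = node.children.computeIfAbsent(value.charAt(i), c -> new Node());
        }
        if (node.terminal) {
            return false; // already seen
        }
        node.terminal = true;
        return true;
    }

    public static void main(String[] args) {
        TrieSet seen = new TrieSet();
        for (String s : new String[] {"value1", "value6", "value1", "value3"}) {
            if (seen.add(s)) {
                System.out.println(s); // prints value1, value6, value3
            }
        }
    }
}
```

In this example the three values share the prefix "value", so those five nodes are stored once; with unrelated random strings almost nothing is shared and the node overhead dominates.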

Algorithm to return unique values from unsorted input
