Question

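I'm new to Java, and I don't know the differences between the Java collection implementations.

I have to process up to 100K records of imported data. The list may contain duplicates. I have to put all of them into the DB. I clean the database table before the import, so there are no duplicates in the DB at the start.

I insert the data in batches using Hibernate. I want to do something like this: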

SomeCollectionClass<Integer> alreadyInsertedRecords;
//...
if (!alreadyInsertedRecords.contains(currentRecord.hashCode())) {
    save_to_database(currentRecord);
    alreadyInsertedRecords.put(currentRecord.hashCode());
} else {
    logger.log("Record no 1234 is a duplicate, skipping");
}

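Which collection class should I use to check whether a record has already been inserted into the DB?

As I said, there may be more than 100K records, so the collection should be fast to search, fast to insert into, and have a small memory footprint.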

Answer 1

You could try a HashSet. Keep in mind that the class of the contained objects must correctly implement the hashCode() and equals() methods.
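To illustrate, here is a minimal sketch of what that could look like. The ImportRecord class, its two fields, and the countUnique helper are hypothetical names made up for this example; the real record type would carry whatever columns are being imported.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical record type: two records are equal when both fields match.
final class ImportRecord {
    private final String firstName;
    private final String lastName;

    ImportRecord(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ImportRecord)) return false;
        ImportRecord other = (ImportRecord) o;
        return Objects.equals(firstName, other.firstName)
                && Objects.equals(lastName, other.lastName);
    }

    // Must be consistent with equals(): equal records produce equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(firstName, lastName);
    }
}

class HashSetDedupDemo {
    // Counts the records that would be saved; duplicates are skipped.
    static int countUnique(List<ImportRecord> records) {
        Set<ImportRecord> seen = new HashSet<>();
        int saved = 0;
        for (ImportRecord r : records) {
            // add() returns false when an equal record is already in the set.
            if (seen.add(r)) {
                saved++; // the real code would call save_to_database(r) here
            }
        }
        return saved;
    }

    public static void main(String[] args) {
        List<ImportRecord> records = Arrays.asList(
                new ImportRecord("Foo", "Bar"),
                new ImportRecord("Foo", "Bar"),
                new ImportRecord("Baz", "Qux"));
        System.out.println(countUnique(records));
    }
}
```

Without the equals() and hashCode() overrides, the two "Foo Bar" records would be compared by identity and both would be kept.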

Answer 2

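If the entries are sortable, you can use a TreeSet, which automatically discards duplicate entries, provided they implement valid compareTo() and equals() methods.

This collection also provides guaranteed log(n) time cost for the basic operations (add, remove and contains). [reference]

If you have access to a hashCode() function, you can use a HashSet. It works similarly to TreeSet (duplicates are dropped on insert), and it is faster.

Consult Hashset vs Treeset for more details about these two collections.

Use HashSet if possible.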
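As a rough sketch of the difference, the example below runs the same input through both set types. The class and method names are made up for illustration, and plain strings stand in for the real records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.TreeSet;

class SetComparisonDemo {
    // TreeSet: duplicates dropped via compareTo(), elements kept in sorted order.
    static List<String> dedupSorted(List<String> input) {
        return new ArrayList<>(new TreeSet<>(input));
    }

    // HashSet: duplicates dropped via hashCode()/equals(), no ordering guarantee.
    static int countUniqueHashed(List<String> input) {
        return new HashSet<>(input).size();
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("c", "a", "b", "a", "c");
        System.out.println(dedupSorted(input));       // sorted, no duplicates
        System.out.println(countUniqueHashed(input)); // number of distinct values
    }
}
```

If the sorted order is not needed, the ordering work done by TreeSet is pure overhead, which is one reason to prefer HashSet here.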

Answer 3

If you don't want duplicates, you can use

Set<Integer> alreadyInsertedRecords = new HashSet<Integer>();
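One caveat with storing only hashCode() values: two different records can share a hash code, in which case the second one would be skipped as a false duplicate. The sketch below shows this with the strings "Aa" and "BB", which have the same String hash code; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HashCodeCollisionDemo {
    // Mirrors the approach from the question: remember only the hash codes.
    static int uniqueByHashCode(List<String> records) {
        Set<Integer> seen = new HashSet<Integer>();
        int kept = 0;
        for (String r : records) {
            if (seen.add(r.hashCode())) {
                kept++;
            }
        }
        return kept;
    }

    // Alternative: remember the records themselves, so equals() settles collisions.
    static int uniqueByValue(List<String> records) {
        Set<String> seen = new HashSet<String>();
        int kept = 0;
        for (String r : records) {
            if (seen.add(r)) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are different strings with the same hash code.
        List<String> records = Arrays.asList("Aa", "BB");
        System.out.println(uniqueByHashCode(records)); // "BB" is wrongly skipped
        System.out.println(uniqueByValue(records));    // both are kept
    }
}
```

Storing the records (or a key built from their identifying fields) costs more memory than storing an Integer per record, but it avoids silently dropping distinct rows.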

Answer 4

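I wouldn't use a collection for this, because it can be done at the database level. You can use an insert with a "where not exists" clause.

For example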

insert into people (firstName, lastName) 
select 'Foo', 'Bar'
where not exists (
    select 1 from people where firstName = 'Foo' and lastName = 'Bar'
)

Which collection should I use to check whether a value is in a set of 100K elements?
