您好我有以下问题:
我在MultiValueMap<String, Integer>
中存储字符串和相应的整数值列表
我存储了大约1300亿个字符串,一个字符串可以包含多达500个或更多的值。
对于每个值,我将在地图上随机访问。所以最糟糕的情况是13 000 000 * 500看涨期权。现在地图的速度很好但内存开销却很高。 MultiValueMap<String, Integer>
只是HashMap/TreeMap<String, <ArrayList<Integer>>
。 HashMap和TreeMap都有很多内存开销。一旦完成,我就不会修改地图,但我需要它在程序中随机访问的速度要快且尽可能小。 (我将它存储在磁盘上并在启动时加载它,序列化的映射文件占用大约600mb但在内存中大约需要3gb?)
最有效的内存是将String存储在已排序的字符串数组中,并为值提供相应的二维int数组。因此,访问将是字符串数组上的二进制搜索并获取相应的值。
现在我有三种方法可以实现目标:
我使用一个已排序的MultivalueMap(TreeMap)来创建所有内容。在我完成获取所有值后,我通过调用map.keyset().toArray(new String[0]);
来获取字符串数组。创建一个二维int数组并从多值映射中获取所有值。
Pro:它易于实现,在创建过程中仍然很快。
Con:从Map到Arrays的复制过程中占用的内存更多。
我从一开始就使用Arrays或ArrayLists,并将所有内容存储在那里 专业:最小的内存开销。 Con:这将非常慢,因为每次添加一个新Key时我都必须对Array进行排序/复制。另外,我需要实现自己的(可能更慢)排序,以保持相应的int数组的顺序相同字符串。难以实施
我使用Arrays和MultivalueMap作为缓冲区。程序完成创建阶段的10%或20%后,我会将值添加到数组并保持顺序,然后启动一个新的Map。 Pro:足够的速度和足够的内存效率。 骗局:难以实施。
这些解决方案都不适合我。您是否知道此问题的任何其他解决方案,可能是内存高效(MultiValue)Map实现?
我知道我可以使用数据库,所以不要把它作为答案发布。我想知道如何在不使用数据库的情况下做到这一点。
答案 0 :(得分:5)
如果您切换到Guava的Multimap - 我不知道您的应用程序是否可行 - 您可能可以使用Trove并获取
ListMultimap<String, Integer> multimap = Multimaps.newListMultimap(
new HashMap<String, Collection<Integer>>(),
new Supplier<List<Integer>>() {
public List<Integer> get() {
return new TIntListDecorator();
}
});
这将使ListMultimap
使用HashMap
映射到由List
数组支持的int[]
值,这应该是内存效率的,但您需要付费由于拳击,一个小的速度惩罚。您可以为MultiValueMap
执行类似的操作,但我不知道来自哪个库。
答案 1 :(得分:2)
根据您在映射中存储的Integer值,可能会因为具有不同的Integer实例而导致大量的堆内存开销,这些实例占用的RAM比原始int值多得多。
考虑使用Map
中的String
到浮动的众多IntArrayList
实现中的一个(例如,在 Colt 或用于Java的原始集合中) ),基本上实现一个由int数组支持的List ,而不是由一组Integer实例支持。
答案 2 :(得分:2)
您可以使用compressed strings大幅减少内存使用量。
此外,还有other个更激烈的解决方案(需要重新实现):
答案 3 :(得分:2)
首先,考虑整数占用的内存。你说的范围大约是0-4000000。 24位足以表示16777216个不同的值。如果这是可以接受的,你可以使用字节数组作为整数,每个整数3个字节,并节省25%。你必须索引这样的数组:
int getPackedInt(byte[] array, int index) {
int i = index*3;
return ((array[i] & 0xFF)<<16) + ((array[i+1] & 0xFF) <<8) + (array[i+2] & 0xFF);
}
int storePackedInt(byte[] array, int index, int value) {
assert value >= 0 && value <= 0xFFFFFF;
int i = index*3;
array[i] = (byte)((value>>16) & 0xFF);
array[i+1] = (byte)((value>>8) & 0xFF);
array[i+2] = (byte)(value & 0xFF);
}
你能说一下整数的分布吗?如果它们中的许多符合16位,则可以使用每个数字具有可变字节数的编码(类似于UTF-8用于表示字符)。
接下来,考虑是否可以在字符串上节省内存。字符串有什么特点?它们通常会持续多长时间?许多字符串会共享前缀吗?根据应用程序的特性量身定制的压缩方案可以节省大量空间(如falsarella指出的那样)。或者,如果许多字符串将共享前缀,则将它们存储在某种类型的搜索中可能会更有效。 (有一种称为“patricia”的trie可能适用于此应用程序。)作为奖励,请注意在搜索字符串中搜索字符串可以更快而不是搜索哈希映射(尽管你我必须进行基准测试,看看你的申请中是否属实。)
字符串都是ASCII吗?如果是这样,用于字符串的50%的内存将被浪费,因为Java char
是16位。同样,在这种情况下,您可以考虑使用字节数组。
如果你只需要查看字符串,而不是遍历存储的字符串,你也可以考虑一些非常规的东西:哈希字符串,并只保留哈希值。由于不同的String可以散列到相同的值,因此有可能仍然可以通过搜索“找到”从未存储过的字符串。但是如果你为哈希值使用了足够的比特(以及一个好的哈希函数),那么你可以使这个机会如此无限小,几乎肯定不会在宇宙的估计寿命中发生。
最后,结构本身有内存,它包含字符串和整数。我已经建议使用trie,但是如果你决定不这样做,那么没有什么比并行数组使用更少的内存 - 一个排序的字符串数组(你可以进行二进制搜索,如你所说),以及一个并行数组整数数组。在进行二进制搜索以查找String数组的索引之后,可以使用相同的索引来访问整数数组。
在构建结构时,如果您确定搜索trie是一个不错的选择,我会直接使用它。否则,您可以执行2次传递:一次构建一组字符串(然后将它们放入一个数组并对它们进行排序),第二次传递以添加整数数组。
答案 4 :(得分:2)
如果您的密钥字符串存在模式,特别是常见的根,则a Trie可能是存储数据量显着减少的有效方法。
以下是工作TrieMap的代码。
注意:使用EntrySet
迭代Map
的常规建议 适用于Trie
。它们在Trie
中效率极低,因此请尽量避免请求。
/**
* Implementation of a Trie structure.
*
* A Trie is a compact form of tree that takes advantage of common prefixes
* to the keys.
*
* A normal HashSet will take the key and compute a hash from it, this hash will
* be used to locate the value through various methods but usually some kind
* of bucket system is used. The memory footprint resulting becomes something
* like O(n).
*
* A Trie structure essentuially combines all common prefixes into a single key.
* For example, holding the strings A, AB, ABC and ABCD will only take enough
* space to record the presence of ABCD. The presence of the others will be
* recorded as flags within the record of ABCD structure at zero cost.
*
* This structure is useful for holding similar strings such as product IDs or
* credit card numbers.
*
*/
public class TrieMap<V> extends AbstractMap<String, V> implements Map<String, V> {
/**
* Map each character to a sub-trie.
*
* Could replace this with a 256 entry array of Tries but this will handle
* multibyte character sets and I can discard empty maps.
*
* Maintained at null until needed (for better memory footprint).
*
*/
protected Map<Character, TrieMap<V>> children = null;
/**
* Here we store the map contents.
*/
protected V leaf = null;
/**
* Set the leaf value to a new setting and return the old one.
*
* @param newValue
* @return old value of leaf.
*/
protected V setLeaf(V newValue) {
V old = leaf;
leaf = newValue;
return old;
}
/**
* I've always wanted to name a method something like this.
*/
protected void makeChildren () {
if ( children == null ) {
// Use a TreeMap to ensure sorted iteration.
children = new TreeMap<Character, TrieMap<V>>();
}
}
/**
* Finds the TrieMap that "should" contain the key.
*
* @param key
*
* The key to find.
*
* @param grow
*
* Set to true to grow the Trie to fit the key.
*
* @return
*
* The sub Trie that "should" contain the key or null if key was not found and
* grow was false.
*/
protected TrieMap<V> find(String key, boolean grow) {
if (key.length() == 0) {
// Found it!
return this;
} else {
// Not at end of string.
if (grow) {
// Grow the tree.
makeChildren();
}
if (children != null) {
// Ask the kids.
char ch = key.charAt(0);
TrieMap<V> child = children.get(ch);
if (child == null && grow) {
// Make the child.
child = new TrieMap<V>();
// Store the child.
children.put(ch, child);
}
if (child != null) {
// Find it in the child.
return child.find(tail(key), grow);
}
}
}
return null;
}
/**
* Remove the head (first character) from the string.
*
* @param s
*
* The string.
*
* @return
*
* The same string without the first (head) character.
*
*/
// Suppress warnings over taking a subsequence
private String tail(String s) {
return s.substring(1, s.length());
}
/**
*
* Add a new value to the map.
*
* Time footprint = O(s.length).
*
* @param s
*
* The key defining the place to add.
*
* @param value
*
* The value to add there.
*
* @return
*
* The value that was there, or null if it wasn't.
*
*/
@Override
public V put(String key, V value) {
V old = null;
// If empty string.
if (key.length() == 0) {
old = setLeaf(value);
} else {
// Find it.
old = find(key, true).put("", value);
}
return old;
}
/**
* Gets the value at the specified key position.
*
* @param o
*
* The key to the location.
*
* @return
*
* The value at that location, or null if there is no value at that location.
*/
@Override
public V get(Object o) {
V got = null;
if (o != null) {
String key = (String) o;
TrieMap<V> it = find(key, false);
if (it != null) {
got = it.leaf;
}
} else {
throw new NullPointerException("Nulls not allowed.");
}
return got;
}
/**
* Remove the value at the specified location.
*
* @param o
*
* The key to the location.
*
* @return
*
* The value that was removed, or null if there was no value at that location.
*/
@Override
public V remove(Object o) {
V old = null;
if (o != null) {
String key = (String) o;
if (key.length() == 0) {
// Its me!
old = leaf;
leaf = null;
} else {
TrieMap<V> it = find(key, false);
if (it != null) {
old = it.remove("");
}
}
} else {
throw new NullPointerException("Nulls not allowed.");
}
return old;
}
/**
* Count the number of values in the structure.
*
* @return
*
* The number of values in the structure.
*/
@Override
public int size() {
// If I am a leaf then size increases by 1.
int size = leaf != null ? 1 : 0;
if (children != null) {
// Add sizes of all my children.
for (Character c : children.keySet()) {
size += children.get(c).size();
}
}
return size;
}
/**
* Is the tree empty?
*
* @return
*
* true if the tree is empty.
* false if there is still at least one value in the tree.
*/
@Override
public boolean isEmpty() {
// I am empty if I am not a leaf and I have no children
// (slightly quicker than the AbstaractCollection implementation).
return leaf == null && (children == null || children.isEmpty());
}
/**
* Returns all keys as a Set.
*
* @return
*
* A HashSet of all keys.
*
* Note: Although it returns Set<S> it is actually a Set<String> that has been
* home-grown because the original keys are not stored in the structure
* anywhere.
*/
@Override
public Set<String> keySet() {
// Roll them a temporary list and give them a Set from it.
return new HashSet<String>(keyList());
}
/**
* List all my keys.
*
* @return
*
* An ArrayList of all keys in the tree.
*
* Note: Although it returns List<S> it is actually a List<String> that has been
* home-grown because the original keys are not stored in the structure
* anywhere.
*
*/
protected List<String> keyList() {
List<String> contents = new ArrayList<String>();
if (leaf != null) {
// If I am a leaf, a null string is in the set.
contents.add((String) "");
}
// Add all sub-tries.
if (children != null) {
for (Character c : children.keySet()) {
TrieMap<V> child = children.get(c);
List<String> childContents = child.keyList();
for (String subString : childContents) {
// All possible substrings can be prepended with this character.
contents.add((String) (c + subString.toString()));
}
}
}
return contents;
}
/**
* Does the map contain the specified key.
*
* @param key
*
* The key to look for.
*
* @return
*
* true if the key is in the Map.
* false if not.
*/
public boolean containsKey(String key) {
TrieMap<V> it = find(key, false);
if (it != null) {
return it.leaf != null;
}
return false;
}
/**
* Represent me as a list.
*
* @return
*
* A String representation of the tree.
*/
@Override
public String toString() {
List<String> list = keyList();
//Collections.sort((List<String>)list);
StringBuilder sb = new StringBuilder();
Separator comma = new Separator(",");
sb.append("{");
for (String s : list) {
sb.append(comma.sep()).append(s).append("=").append(get(s));
}
sb.append("}");
return sb.toString();
}
/**
* Clear down completely.
*/
@Override
public void clear() {
children = null;
leaf = null;
}
/**
* Return a list of key/value pairs.
*
* @return
*
* The entry set.
*/
public Set<Map.Entry<String, V>> entrySet() {
Set<Map.Entry<String, V>> entries = new HashSet<Map.Entry<String, V>>();
List<String> keys = keyList();
for (String key : keys) {
entries.add(new Entry<String,V>(key, get(key)));
}
return entries;
}
/**
* An entry.
*
* @param <S>
*
* The type of the key.
*
* @param <V>
*
* The type of the value.
*/
private static class Entry<S, V> implements Map.Entry<S, V> {
protected S key;
protected V value;
public Entry(S key, V value) {
this.key = key;
this.value = value;
}
public S getKey() {
return key;
}
public V getValue() {
return value;
}
public V setValue(V newValue) {
V oldValue = value;
value = newValue;
return oldValue;
}
@Override
public boolean equals(Object o) {
if (!(o instanceof TrieMap.Entry)) {
return false;
}
Entry e = (Entry) o;
return (key == null ? e.getKey() == null : key.equals(e.getKey()))
&& (value == null ? e.getValue() == null : value.equals(e.getValue()));
}
@Override
public int hashCode() {
int keyHash = (key == null ? 0 : key.hashCode());
int valueHash = (value == null ? 0 : value.hashCode());
return keyHash ^ valueHash;
}
@Override
public String toString() {
return key + "=" + value;
}
}
}