Question

I use the Dico class to store the weight of a term and the ID of the document in which it appears:

public class Dico 
{
   private String m_term; // term
   private double m_weight; // weight of term
   private int m_Id_doc; // id of doc that contain term

   public Dico(int Id_Doc,String Term,double tf_ief ) 
   {
      this.m_Id_doc = Id_Doc;
      this.m_term = Term;
      this.m_weight = tf_ief;
   }
   public String getTerm()
   {
      return this.m_term;
   }

   public double getWeight()
   {
     return this.m_weight;
   }

   public void setWeight(double weight)
   {
     this.m_weight= weight;
   }

   public int getDocId()
   {
     return this.m_Id_doc;
   }                
}

I use this method to compute the final weight from a Map<String,Double> and a List<Dico>:

 public List<Dico> merge_list_map(List<Dico> list,Map<String,Double> map)
 {
    // in map each term is unique but in list i have redundancy

   List<Dico> list_term_weight = new ArrayList <>();

   for (Map.Entry<String,Double> entrySet : map.entrySet())
   {
       String key = entrySet.getKey();
       Double value = entrySet.getValue();

       for(Dico dic : list)
       {    
          String term =dic.getTerm();
          double weight = dic.getWeight();

          if(key.equals(term))
          {
             double new_weight =weight*value;                
             list_term_weight.add(new Dico(dic.getDocId(), term, new_weight));
          }                  
       } 
    }
    return list_term_weight;
 }

I have 36736 elements in the map and 1053914 elements in the list. Currently this program takes a very long time to finish: {{1}} (total time: 17 minutes 15 seconds).

How can I get from the list only the terms that are equal to the terms in the map?

Answer 1

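You can use the lookup functionality of the Map, i.e. Map.get(), since your map maps terms to weights. This should give a significant performance improvement. The only difference is that the output list follows the order of the input list, rather than the order in which the keys occur in the weights map.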

public List<Dico> merge_list_map(List<Dico> list, Map<String, Double> map)
{
    // in map each term is unique but in list i have redundancy
    List<Dico> list_term_weight = new ArrayList<>();

    for (Dico dic : list)
    {
        String term = dic.getTerm();
        double weight = dic.getWeight();

        Double value = map.get(term);  // <== fetch weight from Map
        if (value != null)
        {
            double new_weight = weight * value;

            list_term_weight.add(new Dico(dic.getDocId(), term, new_weight));

        }
    }
    return list_term_weight;
}

Basic test

List<Dico> list = Arrays.asList(new Dico(1, "foo", 1), new Dico(2, "bar", 2), new Dico(3, "baz", 3));
Map<String, Double> weights = new HashMap<String, Double>();
weights.put("foo", 2d);
weights.put("bar", 3d);
System.out.println(merge_list_map(list, weights));

Output

[Dico [m_term=foo, m_weight=2.0, m_Id_doc=1], Dico [m_term=bar, m_weight=6.0, m_Id_doc=2]]

Timing test - 10,000 elements

List<Dico> list = new ArrayList<Dico>();
Map<String, Double> weights = new HashMap<String, Double>();
for (int i = 0; i < 1e4; i++) {
    list.add(new Dico(i, "foo-" + i, i));
    if (i % 3 == 0) {
        weights.put("foo-" + i, (double) i);  // <== every 3rd has a weight
    }
}

long t0 = System.currentTimeMillis();
List<Dico> result1 = merge_list_map_original(list, weights);
long t1 = System.currentTimeMillis();
List<Dico> result2 = merge_list_map_fast(list, weights);
long t2 = System.currentTimeMillis();

System.out.println(String.format("Original: %d ms", t1 - t0));
System.out.println(String.format("Fast:     %d ms", t2 - t1));

// prove results equivalent, just different order
// requires Dico class to have hashCode/equals() - used eclipse default generator
System.out.println(new HashSet<Dico>(result1).equals(new HashSet<Dico>(result2)));

Output

Original: 1005 ms
Fast:     16 ms  <=== loads quicker
true
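
The equivalence check above needs Dico to override equals() and hashCode(), and the printed output of the basic test implies a toString() as well. Here is a minimal sketch of what those methods could look like. It is hand-written rather than the Eclipse-generated code used for the test, so the exact hash formula is an assumption, and the setter from the question is left out for brevity:

```java
import java.util.Objects;

// Sketch of the question's Dico class, reduced to what the tests rely on.
class Dico {
    private final String m_term;   // term
    private final double m_weight; // weight of term
    private final int m_Id_doc;    // id of doc that contains the term

    Dico(int idDoc, String term, double weight) {
        this.m_Id_doc = idDoc;
        this.m_term = term;
        this.m_weight = weight;
    }

    String getTerm() { return m_term; }
    double getWeight() { return m_weight; }
    int getDocId() { return m_Id_doc; }

    // Two Dico objects are equal when all three fields match.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Dico)) return false;
        Dico other = (Dico) o;
        return m_Id_doc == other.m_Id_doc
                && Double.compare(m_weight, other.m_weight) == 0
                && Objects.equals(m_term, other.m_term);
    }

    // Must be consistent with equals() so that HashSet comparisons work.
    @Override
    public int hashCode() {
        return Objects.hash(m_term, m_weight, m_Id_doc);
    }

    // Same layout as the output shown in the basic test.
    @Override
    public String toString() {
        return "Dico [m_term=" + m_term + ", m_weight=" + m_weight
                + ", m_Id_doc=" + m_Id_doc + "]";
    }
}
```

With these in place, two result lists that contain the same entries in a different order compare as equal once they are wrapped in a HashSet.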

Answer 2

Also, check the initialization of the Map (http://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html). Rehashing the map is expensive in terms of performance.

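As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table.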

If you know the number of elements in the map, or have an approximation of it, you can create your map like this:

Map<String, Double> foo = new HashMap<String, Double>(maxSize * 2);

In my experience, this can improve performance by a factor of 2 or more.
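
As a rough illustration of the rule quoted above, the initial capacity can also be derived from the expected number of entries and the load factor instead of simply doubling the size. The class and method names below are made up for this sketch, and 36736 is the map size from the question:

```java
import java.util.HashMap;
import java.util.Map;

class PresizedMapSketch {
    // Smallest int strictly greater than expectedEntries / loadFactor,
    // which is the condition the Javadoc gives for avoiding any rehash.
    static int capacityFor(int expectedEntries, double loadFactor) {
        return (int) (expectedEntries / loadFactor) + 1;
    }

    public static void main(String[] args) {
        int expectedEntries = 36_736; // map size quoted in the question
        int capacity = capacityFor(expectedEntries, 0.75);

        // The map is created once with enough room for every entry.
        Map<String, Double> weights = new HashMap<>(capacity);
        System.out.println(capacity); // prints 48982
    }
}
```

HashMap rounds the requested capacity up to a power of two internally, so maxSize * 2 and this tighter value will often end up with the same table size.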

Answer 3

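For the merge_list_map function to be efficient, you need to actually use the Map for what it is: an efficient data structure for key lookup. The way you are doing it, looping over the Map entries and looking for a match in the List, the algorithm is O(N*M), where M is the size of the map and N is the size of the list. That is certainly the worst you can get.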

If you loop through the List first and, for each term, do a lookup in the Map with Map#get(String key), you get a time complexity of O(N), since a map lookup can be considered O(1).
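
To put numbers on the two complexities, here is a small sketch that plugs in the sizes from the question. It only counts operations (string comparisons for the nested loops, map lookups for the other approach) and says nothing about how long each operation takes. The class and method names are invented for illustration:

```java
class ComplexitySketch {
    // Nested loops: every map entry is compared with every list element.
    static long nestedLoopComparisons(long mapSize, long listSize) {
        return mapSize * listSize;
    }

    // Lookup approach: one Map.get() per list element.
    static long lookups(long listSize) {
        return listSize;
    }

    public static void main(String[] args) {
        long m = 36_736L;    // map size from the question
        long n = 1_053_914L; // list size from the question

        System.out.println(nestedLoopComparisons(m, n)); // prints 38716584704
        System.out.println(lookups(n));                  // prints 1053914
    }
}
```

The nested version performs M times as many basic operations, here a factor of 36736.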

In terms of design, and if you can use Java 8, your problem can be expressed with Streams:

public static List<Dico> merge_list_map(List<Dico> dico, Map<String, Double> weights) {
    List<Dico> wDico = dico.stream()
            .filter  (d -> weights.containsKey(d.getTerm()))
            .map     (d -> new Dico(d.getDocId(), d.getTerm(), d.getWeight()*weights.get(d.getTerm())))
            .collect (Collectors.toList());
    return wDico;
}

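The new weighted list is built following a logical process:

stream(): takes the list as a stream of Dico elements.
filter(): keeps only the Dico elements whose term is in the weights map.
map(): for each filtered element, creates a new Dico() instance with the computed weight.
collect(): collects all the new instances into a new list.
The method then returns the new list containing the filtered Dico elements with their new weights.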
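
The pipeline above looks each term up twice, once in containsKey() and once in get(). Here is a possible variation that does a single lookup per element by folding the filter and map steps into one flatMap(). It is a sketch, not a measured improvement. The nested Dico class is a simplified stand-in for the one in the question, and StreamMergeSketch is a made-up name:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamMergeSketch {
    // Simplified stand-in for the question's Dico class.
    static final class Dico {
        final int docId;
        final String term;
        final double weight;

        Dico(int docId, String term, double weight) {
            this.docId = docId;
            this.term = term;
            this.weight = weight;
        }
    }

    static List<Dico> merge(List<Dico> dico, Map<String, Double> weights) {
        return dico.stream()
                .flatMap(d -> {
                    Double w = weights.get(d.term); // single lookup per element
                    if (w == null) {
                        return Stream.<Dico>empty(); // term not in the map: dropped
                    }
                    return Stream.of(new Dico(d.docId, d.term, d.weight * w));
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Dico> list = Arrays.asList(
                new Dico(1, "foo", 1), new Dico(2, "bar", 2), new Dico(3, "baz", 3));
        Map<String, Double> weights = new HashMap<>();
        weights.put("foo", 2d);
        weights.put("bar", 3d);

        for (Dico d : merge(list, weights)) {
            System.out.println(d.docId + " " + d.term + " " + d.weight);
        }
        // prints:
        // 1 foo 2.0
        // 2 bar 6.0
    }
}
```

Both forms are O(N). The difference is only one hash lookup per element instead of two.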

Performance-wise, I tested it against some text, The Narrative of Arthur Gordon Pym by E. A. Poe:

String text = null;
try (InputStream url = new URL("http://www.gutenberg.org/files/2149/2149-h/2149-h.htm").openStream())  {
    text = new Scanner(url, "UTF-8").useDelimiter("\\A").next();    
}
String[] words = text.split("[\\p{Punct}\\s]+");
System.out.println(words.length); // => 108028

Since the book has only about 100k words, I multiply the list by 10 for good measure (initDico() is a helper that builds the List<Dico> from the words):

List<Dico> dico = initDico(words);
List<Dico> bigDico = new ArrayList<>(10*dico.size());
for (int i = 0; i < 10; i++) {
    bigDico.addAll(dico);
}
System.out.println(bigDico.size()); // 1080280

Build the weights map from all the words (initWeights() builds a frequency map of the words in the book):

Map<String, Double> weights = initWeights(words);
System.out.println(weights.size()); // 9449 distinct words

Test merging the 1M words with the weights map:

long start = System.currentTimeMillis();
List<Dico> wDico = merge_list_map(bigDico, weights);
long end = System.currentTimeMillis();
System.out.println("===== Elapsed time (ms): "+(end-start)); 
// => 105 ms

The weights map is significantly smaller than yours, but that should not affect the timing, since lookups run in quasi-constant time.

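This is not a serious benchmark of the function, but it already shows that merge_list_map() should take less than 1 second (loading and building the list and the map are not part of the function).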
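
For a slightly less noisy number than a single currentTimeMillis() difference, one option is to run the call a few times untimed first and then keep the best of several timed runs. The helper below is only a sketch of that idea, with invented names. A dedicated harness such as JMH is the more rigorous choice for real benchmarks:

```java
import java.util.function.Supplier;

class TimingSketch {
    // Runs the task `warmups` times untimed, then returns the fastest
    // of `runs` timed executions, in milliseconds.
    static <T> double bestOfMillis(Supplier<T> task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.get(); // gives the JIT a chance to compile the hot code
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.get();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best / 1_000_000.0;
    }
}
```

With the variables from the test above, a call could look like TimingSketch.bestOfMillis(() -> merge_list_map(bigDico, weights), 5, 10).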

To complete the exercise, here are the initialization methods used in the test above:

private static List<Dico> initDico(String[] terms) {
    List<Dico> dico = Arrays.stream(terms)
            .map(String::toLowerCase)
            .map(s -> new Dico(0, s, 1.0)) // doc id 0 is a placeholder (assumption): the test does not use it
            .collect(Collectors.toList());
    return dico;
}

// weight of a word is the frequency*1000
private static Map<String, Double> initWeights(String[] terms) {
    Map<String, Long> wfreq = termFreq(terms);
    long total = wfreq.values().stream().reduce(0L, Long::sum);
    return wfreq.entrySet().stream()
            .collect(Collectors.toMap(Map.Entry::getKey, e -> (double)(1000.0*e.getValue()/total)));
}

private static Map<String, Long> termFreq(String[] terms) {
    Map<String, Long> wfreq = Arrays.stream(terms)
            .map(String::toLowerCase)
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    return wfreq;
}

Answer 4

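You should use the contains() method of the list. That way you avoid the second for loop. Even though contains() has O(n) complexity, you should see a small improvement. Of course, remember to re-implement equals(). Otherwise you should use a second Map, as bot suggested.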

Answer 5

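Use the lookup functionality of the Map, as Adam pointed out, and use HashMap as the Map implementation: HashMap lookup complexity is O(1). This should result in improved performance.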

Improving the speed of merging a list and a map
