Question

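I am having some trouble with Java programming involving Lists. Basically, I am trying to count the occurrences of each word per sentence, given a list that contains several sentences. The code that builds the list of sentences is as follows: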

List<List<String>> sort = new ArrayList<>();
for (String sentence : complete.split("[.?!]\\s*"))
{
    sort.add(Arrays.asList(sentence.split("[ ,;:]+"))); //put each sentences in list
}

The output of the list is as follows:

[hurricane, gilbert, head, dominican, coast]
[hurricane, gilbert, sweep, dominican, republic, sunday, civil, defense, alert, heavily, populate, south, coast, prepare, high, wind]
[storm, approach, southeast, sustain, wind, mph, mph]
[there, alarm, civil, defense, director, a, television, alert, shortly]

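The desired output should look like the following (just an example). It prints every unique word in the list and counts its occurrences sentence by sentence.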

Word: hurricane
Sentence 1: 1 times
Sentence 2: 1 times
Sentence 3: 0 times
Sentence 4: 0 times

Word: gilbert
Sentence 1: 0 times
Sentence 2: 2 times
Sentence 3: 1 times
Sentence 4: 0 times 

Word: head
Sentence 1: 3 times
Sentence 2: 2 times
Sentence 3: 0 times
Sentence 4: 0 times 

and goes on....

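With the example above, the word "hurricane" occurs once in the first sentence, once in the second sentence, and not at all in the third or fourth sentence. How can I achieve this output? I was thinking of building a 2D matrix for it. Any help would be appreciated. Thanks!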

Answer 1

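Here is a working solution. I did not take care of the printing. The result is a Map -> Word, Array, where the Array holds the count of the Word in each sentence, indexed from 0. It runs in O(N) time. Play with it here: https://repl.it/Bg6D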

    List<List<String>> sort = new ArrayList<>();
    Map<String, ArrayList<Integer>> res = new HashMap<>();

    // split by sentence
    for (String sentence : someText.split("[.?!]\\s*")) {
        sort.add(Arrays.asList(sentence.split("[ ,;:]+"))); //put each sentences in list
    }

    // put all word in a hashmap with 0 count initialized
    final int sentenceCount = sort.size();
    sort.stream().forEach(sentence -> sentence.stream().forEach(s -> res.put(s, new ArrayList<Integer>(Collections.nCopies(sentenceCount, 0)))));

    int index = 0;
    // count the occurrences of each word for each sentence.
    for (List<String> sentence: sort) {
        for (String s : sentence) {
            res.get(s).set(index, res.get(s).get(index) + 1);
        }
        index++;
    }
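
Since printing is left out above, here is a minimal sketch of how the map could be rendered in the layout the question asks for. It assumes `res` was filled as shown above; the class name `WordCountPrinter` and the method `format` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WordCountPrinter {

    // Renders a word -> per-sentence counts map as "Word: ..." / "Sentence i: n times" lines.
    static String format(Map<String, ? extends List<Integer>> res) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, ? extends List<Integer>> e : res.entrySet()) {
            out.append("Word: ").append(e.getKey()).append('\n');
            List<Integer> counts = e.getValue();
            for (int i = 0; i < counts.size(); i++) {
                // Sentences are stored 0-based but printed 1-based.
                out.append("Sentence ").append(i + 1).append(": ")
                   .append(counts.get(i)).append(" times\n");
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Tiny hand-made map standing in for the `res` built above.
        Map<String, List<Integer>> res = new LinkedHashMap<>();
        res.put("hurricane", new ArrayList<>(Arrays.asList(1, 1, 0, 0)));
        res.put("wind", new ArrayList<>(Arrays.asList(0, 1, 1, 0)));
        System.out.print(format(res));
    }
}
```

A `HashMap` does not keep insertion order, so the words come out in arbitrary order unless the map is a `LinkedHashMap` or `TreeMap`.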

Edit to answer your comment.

List<Integer> getSentence(int sentence, Map<String, ArrayList<Integer>> map) {
    return map.entrySet().stream()
            .map(e -> e.getValue().get(sentence))
            .collect(Collectors.toList());
}

Then you can call

List<Integer> sentence0List = getSentence(0, res);

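But be aware that this approach is not optimal: it runs in O(K) time, where K is the number of distinct words, because it walks every entry of the map. For small K that is totally fine, but it does not scale. You have to clarify what you are going to do with the result. If you need to call getSentence many times, this is not the right approach. In that case you will need to structure the data differently. Something like

Sentences = [
    {'word1': N, 'word2': N, ...}, // sentence 1
    {'word1': N, 'word2': N, ...}, // sentence 2
    ...
]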

That way you can easily access the word counts of each sentence.
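
A minimal sketch of that per-sentence layout, assuming the sentences are already tokenized into a `List<List<String>>` as in the question. The class name `PerSentenceCounts` and the method `count` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PerSentenceCounts {

    // Builds one word -> count map per sentence, in sentence order.
    static List<Map<String, Integer>> count(List<List<String>> sentences) {
        List<Map<String, Integer>> result = new ArrayList<>();
        for (List<String> sentence : sentences) {
            Map<String, Integer> counts = new HashMap<>();
            for (String word : sentence) {
                // Insert 1 for a new word, otherwise add 1 to the existing count.
                counts.merge(word, 1, Integer::sum);
            }
            result.add(counts);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("storm", "approach", "southeast", "sustain", "wind", "mph", "mph"),
                Arrays.asList("hurricane", "gilbert", "head", "dominican", "coast"));
        List<Map<String, Integer>> counts = count(sentences);
        // Fetching all counts of one sentence is now a single list access.
        System.out.println(counts.get(0).get("mph"));             // prints 2
        System.out.println(counts.get(1).getOrDefault("mph", 0)); // prints 0
    }
}
```

With this layout, words that do not occur in a sentence are simply absent from its map, so lookups should go through `getOrDefault(word, 0)`.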

Edit 2: Calling this method:

Map<String, Float> getFrequency(Map<String, ArrayList<Integer>> stringMap) {
    Map<String, Float> res = new HashMap<>();
    stringMap.entrySet().stream().forEach(e -> res.put(
            e.getKey(),
            e.getValue().stream().mapToInt(Integer::intValue).sum() / (float) e.getValue().size()));
    return res;
}

would return something like this:

{standard=0.25, but=0.25, industry's=0.25, been=0.25, 1500s=0.25, software=0.25, release=0.25, type=0.5, when=0.25, dummy=0.5, Aldus=0.25, only=0.25, passages=0.25, text=0.5, has=0.5, 1960s=0.25, Ipsum=1.0, five=0.25, publishing=0.25, took=0.25, centuries=0.25, including=0.25, in=0.25, like=0.25, containing=0.25, printer=0.25, is=0.25, t

Answer 2

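You can solve your problem by first creating an index of every word. You can use a HashMap and put each individual word of the text into it as a key (that way you do not need to check for duplicates).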

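Then you can iterate the HashMap and check each word against each sentence. You can use the list's indexOf method to count occurrences. As long as it returns a value greater than -1, you can count an occurrence in the sentence. This method returns only the first occurrence, so the search has to be repeated on the remainder of the list.

Some pseudocode would look like: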

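Array sentences = text.split(sentence delimiter)

for each word in text
    put word on hashmap

for each entry in hashmap
   for each sentence
       int count = 0
       int from = 0
       while (pos = sentence.subList(from, sentence.length).indexOf(entry)) > -1
          count for entry ++
          from = from + pos + 1   // resume the search right after the match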

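Note that this is very greedy and not performance-oriented at all. Oh, and also note that there are Java NLP libraries out there that may already solve your problem in a performance-oriented and reusable way.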
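
The inner counting loop of the pseudocode could be written in Java roughly as follows. This is only a sketch of the indexOf/subList idea; the class name `IndexOfCounter` and the method `countOccurrences` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class IndexOfCounter {

    // Counts how often `word` occurs in `sentence` by repeated indexOf calls.
    static int countOccurrences(List<String> sentence, String word) {
        int count = 0;
        int from = 0;
        while (from < sentence.size()) {
            // indexOf is relative to the sub-list, which starts at `from`.
            int pos = sentence.subList(from, sentence.size()).indexOf(word);
            if (pos < 0) {
                break; // no further occurrence in the rest of the sentence
            }
            count++;
            from += pos + 1; // continue searching right after the match
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList(
                "storm", "approach", "southeast", "sustain", "wind", "mph", "mph");
        System.out.println(countOccurrences(sentence, "mph"));   // prints 2
        System.out.println(countOccurrences(sentence, "coast")); // prints 0
    }
}
```

The standard library's `Collections.frequency(sentence, word)` returns the same number in a single call, which may be preferable when the explicit loop is not needed.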

Answer 3

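First, you can segment the text into sentences and then tokenize them with a text segmenter such as NLTK or the Stanford tokenizer. Splitting the string (which contains the sentences) on "[.?!]" is not a good idea. What happens when "etc." or "e.g." occurs in the middle of a sentence? Splitting a sentence around "[ ,;:]" is not a good idea either. A sentence can contain plenty of other symbols, such as quotation marks, dashes, and so on.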
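
As a rough illustration of boundary-based segmentation without a third-party library, the sketch below uses the JDK's `java.text.BreakIterator`. This is a swapped-in technique, not one of the tools named above, and how well it copes with abbreviations such as "etc." is not guaranteed; the class name `SentenceSplitter` and the method `split` are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitter {

    // Splits text into sentences using the JDK's locale-aware sentence boundary rules.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Each boundary pair delimits one sentence, including trailing whitespace.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Hurricane Gilbert headed for the coast. Is everyone prepared? Yes!")) {
            System.out.println(s);
        }
    }
}
```

A dedicated NLP tokenizer is still the better choice for real text; this only shows the shape of the step.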

After segmenting and tokenizing, you can split the sentences around spaces and store them in a List<List<String>>:

List<List<String>> sentenceList = new ArrayList<>();

Then, for your index, you can create a HashMap<String,List<Integer>>:

HashMap<String,List<Integer>> words = new HashMap<>();

The keys are all the words from all the sentences. You can update the values as follows:

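// Assumes every list in `words` already holds one 0 per sentence.
for (int i = 0; i < sentenceList.size(); i++) {
    List<String> sentence = sentenceList.get(i);
    for (String w : words.keySet()) {
        // contains() only tells whether w is present, so a repeated word still adds 1.
        if (sentence.contains(w)) {
            List<Integer> tmp = words.get(w);
            tmp.set(i, tmp.get(i) + 1);
        }
    }
}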

This solution has a time complexity of O(number_of_sentences * number_of_words), which equals O(n^2). An optimized solution is:

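// Assumes every list in `words` already holds one 0 per sentence.
for (int i = 0; i < sentenceList.size(); i++) {
    for (String w : sentenceList.get(i)) {
        List<Integer> tmp = words.get(w);
        tmp.set(i, tmp.get(i) + 1); // one increment per occurrence of w in sentence i
    }
}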

This has a time complexity of O(number_of_sentences * average_length_of_sentences). Since average_length_of_sentences is usually small, this is equivalent to O(n).
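
Putting the optimized loop together with the map initialization it relies on, a self-contained sketch might look like this. It assumes already-tokenized sentences; the class name `WordIndex` and the method `build` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordIndex {

    // Returns word -> list of counts, one slot per sentence (index 0 = first sentence).
    static Map<String, List<Integer>> build(List<List<String>> sentenceList) {
        Map<String, List<Integer>> words = new HashMap<>();
        int n = sentenceList.size();
        for (int i = 0; i < n; i++) {
            for (String w : sentenceList.get(i)) {
                // Create a zero-filled, mutable list the first time a word is seen.
                List<Integer> counts =
                        words.computeIfAbsent(w, k -> new ArrayList<>(Collections.nCopies(n, 0)));
                counts.set(i, counts.get(i) + 1);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<List<String>> sentenceList = Arrays.asList(
                Arrays.asList("hurricane", "gilbert", "head", "dominican", "coast"),
                Arrays.asList("storm", "approach", "southeast", "sustain", "wind", "mph", "mph"));
        Map<String, List<Integer>> words = build(sentenceList);
        System.out.println(words.get("mph"));       // prints [0, 2]
        System.out.println(words.get("hurricane")); // prints [1, 0]
    }
}
```

Using `computeIfAbsent` folds the "fill the map with zeros" pass into the counting pass, so each word is visited once per occurrence.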

Counting the occurrences of words in a list containing sentences
