Question

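I have a program that reads a document and searches each page for a given search term. It then returns the pages on which the word appears.

i.e. the word "brilliant" appears on the following pages: 1, 4, 6, 8

Currently I split the file into pages and store them in an ArrayList. Each element of the ArrayList contains one page of the document.

I then split each page into words and store them in a HashMap, where the KEY is the position of the word in the text (which I need for other features) and the value is the word. I then search the HashMap using: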

if (map.containsValue(searchString) == true)
    return true;
else
    return false;

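I do this for each page.

Everything works, but I was wondering whether there is a more efficient data structure I could use to store all the words on a given page along with their positions on that page? (Searching a map by value, without the key, is O(n).)

I need to be able to search this structure and find a word. Keep in mind that I also need the position for later use.

The code I use to populate the map with the positions of the words in the text is: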

// text is the page of text from a document as a string
int key = 1; // position of the word in the text
for (String element : text.split(" "))
{
    map.put(key, element);
    key++;
}

Answer 1

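Why not use a single HashMap<String,ArrayList<Position>> that maps each word to its occurrences? Every word in the text would be a key in the map, and the page numbers and positions would make up the list of entries.

Insertion is a bit tricky because the value is a list: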

ArrayList<Position> positions = words.get(word);
if (positions == null) {
  positions = new ArrayList<Position>();
  words.put(word, positions);
}
positions.add(position);

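Alternatively, you could use a Guava Multimap: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Multimap.html (especially if you are already using Guava for other purposes; I would probably avoid introducing a library dependency just for this).

Edit: changed Integer to Position (and Set to List), since I had overlooked the need for the exact position. Position should be something like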

class Position {
  int page;
  int index; 
}
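
Putting the pieces above together, here is a minimal sketch of such an index. The names WordIndex, addPage, positionsOf and pagesOf are hypothetical, the Position class gains a constructor for convenience, and splitting on whitespace is an assumption carried over from the question. It uses Map.computeIfAbsent (Java 8+) in place of the explicit null check.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// One occurrence of a word: the page it is on and its 1-based index on that page.
class Position {
    final int page;
    final int index;

    Position(int page, int index) {
        this.page = page;
        this.index = index;
    }
}

// Hypothetical helper class: maps each word to every place it occurs.
class WordIndex {
    private final Map<String, List<Position>> words = new HashMap<>();

    // Splits the page text on whitespace and records each word's position.
    void addPage(int page, String text) {
        int index = 1;
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // an empty page yields a single empty token
            }
            // Create the list on first sight of the word, then append.
            words.computeIfAbsent(word, k -> new ArrayList<>())
                 .add(new Position(page, index));
            index++;
        }
    }

    // All occurrences of the word, or an empty list if it never appears.
    List<Position> positionsOf(String word) {
        return words.getOrDefault(word, Collections.emptyList());
    }

    // The distinct pages the word appears on, in ascending order.
    TreeSet<Integer> pagesOf(String word) {
        TreeSet<Integer> pages = new TreeSet<>();
        for (Position p : positionsOf(word)) {
            pages.add(p.page);
        }
        return pages;
    }
}
```

With this layout a lookup is a single hash lookup on the word, rather than a containsValue scan of every page.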

Answer 2

I would probably use Lucene or something from Guava collections myself, but barring that, I think the most efficient structure would be:

HashMap<String, TreeMap<Integer, TreeSet<Integer>>> words;

        ^^^^^^          ^^^^^^^          ^^^^^^^
         word            page            position

Using words.get("brilliant").keySet(); would immediately give you all the pages that "brilliant" appears on. That would be O(log n) instead of O(n), if I'm not mistaken.
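
As a rough sketch of how that nested structure could be filled and queried (NestedWordIndex, add, pagesOf and positionsOn are hypothetical names, not part of any library): the outer HashMap lookup takes expected constant time, while insertions into the TreeMap and TreeSet are logarithmic in their sizes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical wrapper around the word -> page -> positions structure.
class NestedWordIndex {
    private final Map<String, TreeMap<Integer, TreeSet<Integer>>> words = new HashMap<>();

    // Records one occurrence of a word at the given page and position.
    void add(String word, int page, int position) {
        words.computeIfAbsent(word, w -> new TreeMap<>())
             .computeIfAbsent(page, p -> new TreeSet<>())
             .add(position);
    }

    // Pages the word appears on, in ascending order (empty if unknown).
    Set<Integer> pagesOf(String word) {
        TreeMap<Integer, TreeSet<Integer>> byPage = words.get(word);
        if (byPage == null) {
            return Collections.emptySet();
        }
        return byPage.keySet();
    }

    // Positions of the word on one page, in ascending order (empty if none).
    SortedSet<Integer> positionsOn(String word, int page) {
        TreeMap<Integer, TreeSet<Integer>> byPage = words.get(word);
        if (byPage == null || !byPage.containsKey(page)) {
            return Collections.emptySortedSet();
        }
        return byPage.get(page);
    }
}
```

Returning an empty set for unknown words avoids the NullPointerException that a bare words.get("brilliant").keySet() would throw when the word is absent.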

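Having read in the comments that you also need to retrieve the words before and after each search term, I think you would need a second data structure for that lookup: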

TreeMap<Integer, TreeMap<Integer, String>> positions;

        ^^^^^^^          ^^^^^^^  ^^^^^^
         page            position  word
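
One way the neighbouring-word lookup might work on a page -> position -> word map is sketched below. It relies on TreeMap.lowerEntry and TreeMap.higherEntry, and assumes the outer container is a TreeMap, since a TreeSet takes only one type parameter. PageText, put, wordBefore and wordAfter are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: stores the words of each page keyed by position.
class PageText {
    // page -> (position -> word)
    private final TreeMap<Integer, TreeMap<Integer, String>> positions = new TreeMap<>();

    void put(int page, int position, String word) {
        positions.computeIfAbsent(page, p -> new TreeMap<>()).put(position, word);
    }

    // Word with the greatest position strictly below the given one, or null.
    String wordBefore(int page, int position) {
        TreeMap<Integer, String> words = positions.get(page);
        if (words == null) {
            return null;
        }
        Map.Entry<Integer, String> entry = words.lowerEntry(position);
        return entry == null ? null : entry.getValue();
    }

    // Word with the smallest position strictly above the given one, or null.
    String wordAfter(int page, int position) {
        TreeMap<Integer, String> words = positions.get(page);
        if (words == null) {
            return null;
        }
        Map.Entry<Integer, String> entry = words.higherEntry(position);
        return entry == null ? null : entry.getValue();
    }
}
```

Note that this sketch only looks within a single page; it does not step across page boundaries.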

Alternatively, use two nested lists and treat their indices as the page and the position:

ArrayList<ArrayList<String>> positions;
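
A small sketch of the list-based variant, assuming zero-based page and position indices and whitespace splitting (PagedWords, addPage and wordAt are hypothetical names):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: outer index is the page, inner index is the position.
class PagedWords {
    private final List<List<String>> positions = new ArrayList<>();

    // Appends a page; its words are stored in reading order.
    void addPage(String text) {
        positions.add(new ArrayList<>(Arrays.asList(text.trim().split("\\s+"))));
    }

    // Word at the given zero-based page and position, or null if out of range.
    String wordAt(int page, int position) {
        if (page < 0 || page >= positions.size()) {
            return null;
        }
        List<String> words = positions.get(page);
        if (position < 0 || position >= words.size()) {
            return null;
        }
        return words.get(position);
    }
}
```

Given a hit at (page, position), the neighbouring words are then wordAt(page, position - 1) and wordAt(page, position + 1), each a constant-time list access.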

Most efficient data structure for searching for words in a text (Java)