Question

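Imagine I have a situation where I need to index sentences. Let me explain in more detail.

For example, I have these sentences:

The beautiful sky.
Beautiful sky dream.
Beautiful dream.

As far as I can tell, the index should look something like this: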

Image: http://img7.imageshack.us/img7/4029/indexarb.png

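But I also want to be able to search by any of these words.

For example, if I search by "the", it should show me the connection to "beautiful". If I search by "beautiful", it should give me the connections to (previous) "the" and (next) "sky" and "dream". If I search by "sky", it should give me the (previous) connection to "beautiful", and so on.

Any ideas? Maybe you know of an existing algorithm for this kind of problem?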

Answer 1

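Short answer

Create a struct with two vectors of links, one for previous words and one for next words. Then store the word structs in a hash table, keyed by the word itself.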
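A minimal sketch of the structure described in the short answer, in Python. The names `WordEntry` and `build_index`, the lowercasing, and the whitespace tokenization are assumptions made for illustration; sets stand in for the "vectors" so that a repeated link is stored only once.

```python
from dataclasses import dataclass, field


@dataclass
class WordEntry:
    """One record per distinct word, holding its neighbours."""
    prev: set = field(default_factory=set)  # words seen immediately before
    next: set = field(default_factory=set)  # words seen immediately after


def build_index(sentences):
    """Return a dict (hash table) keyed by the word itself."""
    index = {}
    for sentence in sentences:
        # Assumed tokenization: lowercase, drop the final period, split on spaces.
        words = sentence.lower().rstrip(".").split()
        for i, word in enumerate(words):
            entry = index.setdefault(word, WordEntry())
            if i > 0:
                entry.prev.add(words[i - 1])
            if i + 1 < len(words):
                entry.next.add(words[i + 1])
    return index


index = build_index([
    "The beautiful sky.",
    "Beautiful sky dream.",
    "Beautiful dream.",
])
print(sorted(index["beautiful"].prev))  # ['the']
print(sorted(index["beautiful"].next))  # ['dream', 'sky']
```

Looking up a word is a single dictionary access, and both directions of the question's search (previous and next) come from the same record.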

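Long answer

This is a language-parsing problem, and it is not easy to solve unless you don't mind gibberish.

I went to the park basketball court.
Will you park the car?

Your linking algorithm will create sentences like:

I went to the park.
Will you park the basketball court?

I'm not quite sure what the SEO application of this is, but I wouldn't welcome another gibberish spam site taking up search results.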
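To make the gibberish concern concrete, here is a rough sketch in Python. The helper names `next_links` and `follows_links` are made up for illustration, the two English sentences are assumed renderings of the examples in this answer, and the two test strings are additional examples chosen here, not taken from the answer. The check only asks whether every adjacent word pair in a string is a stored link.

```python
def next_links(sentences):
    """Map each word to the set of words that directly follow it."""
    links = {}
    for sentence in sentences:
        words = sentence.lower().rstrip(".?").split()
        for a, b in zip(words, words[1:]):
            links.setdefault(a, set()).add(b)
    return links


def follows_links(text, links):
    """True if every adjacent word pair in `text` is a stored link."""
    words = text.lower().split()
    return all(b in links.get(a, set()) for a, b in zip(words, words[1:]))


sentences = [
    "I went to the park basketball court.",
    "Will you park the car?",
]
links = next_links(sentences)

# "park" is followed by "basketball" in one sentence and by "the" in the
# other, so a walk over the links can switch sentences at that word.
print(follows_links("i went to the park the car", links))      # True
print(follows_links("will you park basketball court", links))  # True
```

Both strings pass the check even though neither was ever indexed, because word-to-word links alone do not remember which sentence a link came from.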

Answer 2

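I think you would want some kind of inverted index structure. You would have a hash map with the words as keys, each pointing to a list of pairs of the form (sentence_id, position). You would then store your sentences as arrays or linked lists. Your example would look like this: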

sentence[0] = ['the', 'beautiful', 'sky'];
sentence[1] = ['beautiful', 'sky', 'dream'];
sentence[2] = ['beautiful', 'dream'];

inverted_index =
{
 'the': {(0,0)},
 'beautiful': {(0,1), (1,0), (2,0)},
 'sky': {(0,2), (1,1)},
 'dream': {(1,2), (2,1)}
};

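With this structure, looking up a word takes constant time on average. Once you have identified the occurrence you want, finding the previous and next word in the given sentence can also be done in constant time.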
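The structure above could be built and queried roughly as follows. This is a sketch in Python, not the answerer's code: the function names `build_inverted_index` and `neighbours` are assumptions, sentences are stored as lists (arrays), and `None` marks the start or end of a sentence.

```python
from collections import defaultdict


def build_inverted_index(sentences):
    """Map each word to a list of (sentence_id, position) pairs."""
    inverted = defaultdict(list)
    for sid, words in enumerate(sentences):
        for pos, word in enumerate(words):
            inverted[word].append((sid, pos))
    return inverted


def neighbours(word, sentences, inverted):
    """Return (previous, next) for every occurrence of `word`."""
    result = []
    for sid, pos in inverted.get(word, []):
        words = sentences[sid]
        before = words[pos - 1] if pos > 0 else None
        after = words[pos + 1] if pos + 1 < len(words) else None
        result.append((before, after))
    return result


sentences = [
    ["the", "beautiful", "sky"],
    ["beautiful", "sky", "dream"],
    ["beautiful", "dream"],
]
inverted = build_inverted_index(sentences)
print(inverted["sky"])  # [(0, 2), (1, 1)]
print(neighbours("beautiful", sentences, inverted))
# [('the', 'sky'), (None, 'sky'), (None, 'dream')]
```

Because each posting keeps the sentence id, the neighbours returned always belong to a sentence that was actually indexed.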

Hope this helps.

Answer 3

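You could try digging into Markov chains built from the words of the sentences. You would also need the chain to work in both directions (i.e. to find the next and the previous words), that is, to store the probable words that appear just after or just before a given word.

Of course, a Markov chain is a stochastic process for generating content, but a similar approach could be used to store the information you need.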
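One way to read this suggestion is to keep transition counts in both directions. The sketch below is in Python; the name `build_transitions` and the tokenization are assumptions, and it stores raw counts, not normalized probabilities.

```python
from collections import Counter, defaultdict


def build_transitions(sentences):
    """Count, for every word, which words follow it and which precede it."""
    following = defaultdict(Counter)
    preceding = defaultdict(Counter)
    for sentence in sentences:
        # Assumed tokenization: lowercase, drop the final period, split on spaces.
        words = sentence.lower().rstrip(".").split()
        for a, b in zip(words, words[1:]):
            following[a][b] += 1
            preceding[b][a] += 1
    return following, preceding


following, preceding = build_transitions([
    "The beautiful sky.",
    "Beautiful sky dream.",
    "Beautiful dream.",
])
print(dict(following["beautiful"]))  # {'sky': 2, 'dream': 1}
print(dict(preceding["dream"]))      # {'sky': 1, 'beautiful': 1}
```

Dividing each count by the row total would give the transition probabilities of a first-order chain, if generation were ever needed.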

Answer 4

It looks like this could be stored in a very simple database with the following tables:

Words:
    Id     integer primary-key
    Word   varchar(20)
Following:
    WordId1 integer foreign-key Words(Id) indexed
    WordId2 integer foreign-key Words(Id) indexed

Then, whenever you parse a sentence, just insert the rows that do not already exist, like so:

The beautiful sky.
    Words (1, 'the')
    Words (2, 'beautiful')
    Words (3, 'sky')
    Following (1, 2)
    Following (2, 3)
Beautiful sky dream.
    Words (4, 'dream')
    Following (3, 4)
Beautiful dream.
    Following (2, 4)

Then you can query to your heart's content for which words follow or precede other words.
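A rough sketch of this schema using Python's built-in `sqlite3` module and an in-memory database. The `UNIQUE` constraints, the index names, and the helper functions `word_id`, `add_sentence`, `words_after`, and `words_before` are assumptions added here so that "insert only what does not exist" can be expressed with `INSERT OR IGNORE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Words (
        Id   INTEGER PRIMARY KEY,
        Word VARCHAR(20) UNIQUE
    );
    CREATE TABLE Following (
        WordId1 INTEGER REFERENCES Words(Id),
        WordId2 INTEGER REFERENCES Words(Id),
        UNIQUE (WordId1, WordId2)
    );
    CREATE INDEX idx_following_1 ON Following (WordId1);
    CREATE INDEX idx_following_2 ON Following (WordId2);
""")


def word_id(word):
    """Insert the word if it is new, then return its Id."""
    conn.execute("INSERT OR IGNORE INTO Words (Word) VALUES (?)", (word,))
    return conn.execute(
        "SELECT Id FROM Words WHERE Word = ?", (word,)).fetchone()[0]


def add_sentence(sentence):
    """Store each adjacent word pair of the sentence once."""
    ids = [word_id(w) for w in sentence.lower().rstrip(".").split()]
    for a, b in zip(ids, ids[1:]):
        conn.execute("INSERT OR IGNORE INTO Following VALUES (?, ?)", (a, b))


def words_after(word):
    rows = conn.execute(
        """SELECT w2.Word FROM Following f
           JOIN Words w1 ON w1.Id = f.WordId1
           JOIN Words w2 ON w2.Id = f.WordId2
           WHERE w1.Word = ? ORDER BY w2.Word""", (word,))
    return [r[0] for r in rows]


def words_before(word):
    rows = conn.execute(
        """SELECT w1.Word FROM Following f
           JOIN Words w1 ON w1.Id = f.WordId1
           JOIN Words w2 ON w2.Id = f.WordId2
           WHERE w2.Word = ? ORDER BY w1.Word""", (word,))
    return [r[0] for r in rows]


for s in ["The beautiful sky.", "Beautiful sky dream.", "Beautiful dream."]:
    add_sentence(s)

print(words_after("beautiful"))  # ['dream', 'sky']
print(words_before("sky"))       # ['beautiful']
```

Like the table design itself, this records only which word pairs occur, not which sentence each pair came from.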

Answer 5

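Using an associative array will let you parse the sentences quickly in Perl. It is much faster than you would expect, and it can be efficiently dumped into a tree-like structure for later use by a higher-level language.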

Answer 6

This oughta get you close, in C#:

using System.Collections.Generic;
using System.Linq;

class Program
{
    public class Node
    {
        private string _term;
        private Dictionary<string, KeyValuePair<Node, Node>> _related = new Dictionary<string, KeyValuePair<Node, Node>>();

        public Node(string term)
        {
            _term = term;
        }

        public void Add(string phrase, Node previous, string [] phraseRemainder, Dictionary<string,Node> existing)
        {
            Node next= null;
            if (phraseRemainder.Length > 0)
            {
                if (!existing.TryGetValue(phraseRemainder[0], out next))
                {
                    existing[phraseRemainder[0]] = next = new Node(phraseRemainder[0]);
                }
                next.Add(phrase, this, phraseRemainder.Skip(1).ToArray(), existing);
            }
            _related.Add(phrase, new KeyValuePair<Node, Node>(previous, next));

        }
    }


    static void Main(string[] args)
    {
        string [] sentences = 
            new string [] { 
                "The beautiful sky",
                "Beautiful sky dream",
                "beautiful dream"
            };

        Dictionary<string, Node> parsedSentences = new Dictionary<string,Node>();

        foreach(string sentence in sentences)
        {
            string [] words = sentence.ToLowerInvariant().Split(' ');
            Node startNode;
            if (!parsedSentences.TryGetValue(words[0],out startNode))
            {
                parsedSentences[words[0]] = startNode = new Node(words[0]);
            }
            // Always register the phrase on its first word, even for a one-word sentence.
            startNode.Add(sentence, null, words.Skip(1).ToArray(), parsedSentences);
        }
    }
}

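I took the liberty of assuming you wanted to preserve the actual initial phrases. At the end of this, you will have a list of the words in the phrases and, in each word, a list of the phrases that use that word, with references to the next and previous words in each phrase.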

Answer 7

Tree search algorithms (such as BST, etc.)

Best algorithm for indexing sentences