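I am trying to implement a set of functions that build a DAWG directly, in linear time, for some search functionality in a personal project I'm writing. I read this paper, which details the idea behind DAWGs and even gives pseudocode for linear-time construction!
However, following the pseudocode seems to produce (in my view) a trie-like structure. Specifically, common suffixes do not appear to be explicitly shared (that is, actually joined by edges in the graph). Instead, they are represented by suffix pointers, which play no part in actually traversing the graph.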
For example, take a look at this picture of the DAWG for the words in the set {tap, taps, top, tops} (from the DAWG Wikipedia page):
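To make the suffix sharing concrete, here is a minimal Python sketch that builds a plain trie for this word set and then merges subtrees that accept exactly the same remaining suffixes. This is not the algorithm from the paper, and the names `TrieNode`, `build_trie`, `count_nodes`, and `merge_suffixes` are my own, chosen for illustration. The node counts depend on how word endings are represented (here, a flag on the node), so they may not match the picture exactly.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode
        self.final = False   # True if a word ends at this node


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.final = True
    return root


def count_nodes(root):
    # Count distinct nodes reachable from root (shared nodes count once).
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.children.values())
    return len(seen)


def merge_suffixes(node, registry):
    # Canonicalize children first, then look this node up by its signature:
    # (is a word end, sorted outgoing edges to canonical children).
    for letter, child in list(node.children.items()):
        node.children[letter] = merge_suffixes(child, registry)
    signature = (
        node.final,
        tuple(sorted((letter, id(child)) for letter, child in node.children.items())),
    )
    return registry.setdefault(signature, node)


words = ["tap", "taps", "top", "tops"]
root = build_trie(words)
trie_size = count_nodes(root)
root = merge_suffixes(root, {})
dawg_size = count_nodes(root)
print(trie_size, dawg_size)  # 8 5
```

After merging, the "a" and "o" edges out of the "t" node lead to the same node, so the common suffixes "p" and "ps" are stored once and reached by real edges.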
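Now compare that with the structure you get from the algorithm detailed in the paper above (working it out by hand for this word set takes a negligible amount of time):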
Note: Edges are labeled by letters.
      Nodes are labeled by the concatenation of the labels of the primary edges used to reach them.
      Suffix pointers are not visually represented on the graph.
      primary edges: solid edges used to traverse the graph
      secondary edges: dotted edges implying a suffix relationship between the letter labeling the edge and the substring represented by the target node
builddawg(S)
    1. Create a node named source.
    2. Let activenode be source.
    3. For each word w of S do:
        A. For each letter 'a' of w do:
            Let activenode be update(activenode, a).
        B. Let activenode be source.
    4. Return source.
update(activenode, a)
    1. If activenode has an outgoing edge labeled 'a', then
        A. Let newactivenode be the node that this edge leads to.
        B. If this edge is primary, return newactivenode.
        C. Else, return split(activenode, newactivenode).
    2. Else
        A. Create a node named newactivenode.
        B. Create a primary edge labeled 'a' from activenode to newactivenode.
        C. Let currentnode be activenode.
        D. Let suffixnode be undefined.
        E. While currentnode isn't source and suffixnode is undefined do:
            i. Let currentnode be the node pointed to by the suffix
               pointer of currentnode.
            ii. If currentnode has a primary outgoing edge labeled 'a',
                then let suffixnode be the node that this edge leads to.
            iii. Else, if currentnode has a secondary outgoing edge labeled 'a' then
                a. Let childnode be the node that this edge leads to.
                b. Let suffixnode be split(currentnode, childnode).
            iv. Else, create a secondary edge from currentnode to newactivenode
                labeled 'a'.
        F. If suffixnode is still undefined, let suffixnode be source.
        G. Set the suffix pointer of newactivenode to point to suffixnode.
        H. Return newactivenode.
split(parentnode, childnode)
    1. Create a node called newchildnode.
    2. Make the secondary edge from parentnode to childnode into
       a primary edge from parentnode to newchildnode (with the same label).
    3. For every primary and secondary outgoing edge of childnode,
       create a secondary outgoing edge of newchildnode with the
       same label and leading to the same node.
    4. Set the suffix pointer of newchildnode equal to that of childnode.
    5. Reset the suffix pointer of childnode to point to newchildnode.
    6. Let currentnode be parentnode.
    7. While currentnode isn't source do:
        A. Let currentnode be the node pointed to by the
           suffix pointer of currentnode.
        B. If currentnode has a secondary edge to childnode,
           then make it a secondary edge to newchildnode (with the same label).
        C. Else, break out of the while loop.
    8. Return newchildnode.
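For reference, the pseudocode above can be transcribed into Python roughly as follows. This is a sketch rather than a vetted implementation, and it makes a few assumptions of its own: the `Node` class, storing each edge as a `(target, is_primary)` pair, passing `source` and the edge `letter` explicitly to `update` and `split`, and the `accepts` helper are not in the paper's pseudocode.

```python
class Node:
    def __init__(self):
        self.edges = {}     # letter -> (target Node, is_primary)
        self.suffix = None  # suffix pointer (stays None for source)


def split(source, parent, child, letter):
    new_child = Node()
    # Step 2: the secondary edge parent -> child becomes primary -> new_child.
    parent.edges[letter] = (new_child, True)
    # Step 3: copy all outgoing edges of child as secondary edges.
    for label, (target, _) in child.edges.items():
        new_child.edges[label] = (target, False)
    # Steps 4-5: relink the suffix pointers.
    new_child.suffix = child.suffix
    child.suffix = new_child
    # Steps 6-7: redirect secondary edges that still lead to child.
    current = parent
    while current is not source:
        current = current.suffix
        edge = current.edges.get(letter)
        if edge is not None and edge[0] is child and not edge[1]:
            current.edges[letter] = (new_child, False)
        else:
            break
    return new_child


def update(source, active, letter):
    edge = active.edges.get(letter)
    if edge is not None:                      # case 1
        target, is_primary = edge
        if is_primary:
            return target
        return split(source, active, target, letter)
    new_active = Node()                       # case 2
    active.edges[letter] = (new_active, True)
    current, suffix_node = active, None
    while current is not source and suffix_node is None:
        current = current.suffix
        edge = current.edges.get(letter)
        if edge is None:
            current.edges[letter] = (new_active, False)
        elif edge[1]:
            suffix_node = edge[0]
        else:
            suffix_node = split(source, current, edge[0], letter)
    new_active.suffix = suffix_node if suffix_node is not None else source
    return new_active


def build_dawg(words):
    source = Node()
    for word in words:
        active = source
        for letter in word:
            active = update(source, active, letter)
    return source


def accepts(source, text):
    # Hypothetical helper: can `text` be spelled by following edges from source?
    node = source
    for letter in text:
        edge = node.edges.get(letter)
        if edge is None:
            return False
        node = edge[0]
    return True


words = ["tap", "taps", "top", "tops"]
source = build_dawg(words)
print(accepts(source, "ops"), accepts(source, "tp"))  # True False
```

When I trace this sketch on {tap, taps, top, tops}, every substring of the input words (for example "ap" or "ops") can be walked from the source, not just the whole words, and the result has the trie's nodes plus the two extra nodes created by `split`.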
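The structure I get differs from the picture above. In fact, it looks almost identical to a trie, apart from the extra nodes produced by converting secondary edges into primary edges. The trie equivalent to the DAWG above is: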
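Am I just applying the algorithm incorrectly, are there several types of DAWGs, or am I simply misunderstanding what a DAWG should look like?
Most of the papers I've seen that detail DAWGs show structures that appear to be created by this algorithm, but most of the material I've read online (and the pictures I've seen) have actual edges connecting common suffixes. I don't know which to believe, or whether the two are in fact equivalent.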
Answer 0 (score: 1)
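I believe I have found a solution.
After building the DAWG, you can traverse the nodes from top to bottom and remove the subtrees of nodes with suffixPointer != source, connecting them directly to the node that suffixPointer points to.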