I'm trying to write an algorithm that starts from a text corpus (e.g., a Wikipedia dump). It first builds an array of individual characters (e.g. "a", "b", "7", "0", ".", " ") and of frequent substrings (e.g. " the ", "ly ", " un", "er", "some"). It then splits the text corpus into sequences of these substrings.
I build the array as follows (not optimal, but probably good enough):
    from collections import defaultdict

    corpus = []  # An array of strings, each string being a text
    max_len = 5  # Longest sequence of characters to look for

    strings = defaultdict(int)
    for text in corpus:
        for n in range(max_len):
            for i in range(len(text)):
                if i + n < len(text):
                    strings[text[i:i + n + 1]] += 1

    dict_size = 1000000  # Rough number of sequences to keep (assumes len(strings) > dict_size)
    min_count = sorted(strings.values(), reverse=True)[dict_size]
    for key, count in list(strings.items()):  # Copy: deleting while iterating raises RuntimeError
        if count < min_count and len(key) > 1:  # Keep frequent sequences and all individual chars
            del strings[key]
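To sanity-check the counting loop, here is the same triple loop run on a made-up two-line corpus with a smaller `max_len` (the corpus and the spot-checked substrings are just illustrative choices):

    ```python
    from collections import defaultdict

    # Hypothetical miniature corpus, just to exercise the counting loop
    corpus = ["the cat sat", "the dog sat"]
    max_len = 3  # Count substrings of length 1..3

    strings = defaultdict(int)
    for text in corpus:
        for n in range(max_len):  # n + 1 is the substring length
            for i in range(len(text)):
                if i + n < len(text):  # Skip windows that run past the end
                    strings[text[i:i + n + 1]] += 1

    print(strings["the"])          # "the" occurs once in each text
    print(strings.get(" sat", 0))  # Length 4 > max_len, so never counted
    ```

Every substring of length 1 to `max_len` is counted once per occurrence, so "the" ends up with a count of 2 across the two texts, while anything longer than `max_len` is never entered into the dictionary.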
The trickiest part seems to be splitting the text well (or optimally). I assume that longer substrings are rarer and that shorter substrings are less likely to collide with them, so my idea is a greedy approach: I make one pass over the text for each possible length, in decreasing order. I use a linked list to keep track of the split text:
    class Token:
        def __init__(self, text, next=None):
            self._text = text
            self._next = next
            self._done = False

        @property
        def next(self):
            return self._next

        @next.setter
        def next(self, token):
            self._next = token

        @property
        def text(self):
            return self._text

        @text.setter
        def text(self, text):
            self._text = text

        @property
        def done(self):
            return self._done

        @done.setter
        def done(self, status):
            self._done = status
    for text in corpus:
        head = Token(text)  # Start the list with one token of the whole text
        for n in range(max_len, 0, -1):  # Run through lengths in decreasing order
            token = head  # Go back to the start for each run
            i = 0
            while True:
                if token.done or i + n > len(token.text):  # If token is processed or end of token is reached
                    if token.next:
                        token = token.next  # Continue linked list
                        i = 0
                        continue
                    else:
                        break
                if token.text[i:i + n] in strings:
                    if i > 0:  # Split out characters before match
                        token.next = Token(token.text[i:], token.next)
                        token.text = token.text[:i]
                        token = token.next
                    token.done = True
                    if len(token.text) > n:  # Split out characters after match
                        token.next = Token(token.text[n:], token.next)
                        token.text = token.text[:n]
                        token = token.next
                        i = 0
                else:
                    i += 1  # Run through characters in the token
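As a quick check of what the greedy passes produce, here is a compact list-based equivalent of the linked-list loop above: each segment is a `(text, done)` pair, and longer matches are carved out before shorter ones. The vocabulary here is a made-up toy set, not the real dictionary:

    ```python
    def greedy_split(text, vocab, max_len):
        # Each segment is (text, done); done segments are finished tokens.
        segments = [(text, False)]
        for n in range(max_len, 0, -1):  # Longest matches first
            out = []
            for seg, done in segments:
                if done:
                    out.append((seg, done))
                    continue
                i = 0
                start = 0
                while i + n <= len(seg):
                    if seg[i:i + n] in vocab:
                        if i > start:  # Split out characters before the match
                            out.append((seg[start:i], False))
                        out.append((seg[i:i + n], True))
                        i += n
                        start = i
                    else:
                        i += 1
                if start < len(seg):  # Leftover characters after the last match
                    out.append((seg[start:], False))
            segments = out
        return [s for s, _ in segments]

    # Toy vocabulary: a few multi-character sequences plus single characters
    vocab = {" the ", "er", "s", "o", "m", "e", "t", "h", " "}
    print(greedy_split("some the other", vocab, 5))
    # → ['s', 'o', 'm', 'e', ' the ', 'o', 't', 'h', 'er']
    ```

Note how " the " (length 5) and "er" (length 2) are claimed on their passes before the length-1 pass breaks the remainder into single characters, and how concatenating the tokens reproduces the input exactly.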
This seems to work, but I'm not sure how close to optimal it actually is, and I can't think of an algorithm that doesn't run in exponential time. Initially I thought this might be a dynamic programming problem, but I don't think that works because the character space is too sparse.