Question

I wrote code that takes text tokens as input:

tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh", "Man", "-Hum", "-Zuh-UH-", "glit"]

The code should find all tokens that contain a hyphen or that are connected to each other by hyphens. Basically, the output should be:

[["Tap-", "Berlin"], ["Was-ISt"], ["das", "-ist"], ["Man", "-Hum", "-Zuh-UH-", "glit"]]

I wrote some code, but somehow I don't get back the groups of connected tokens. Try it out: http://goo.gl/iqov0q

def find_hyphens(self):
    tokens_with_hypens =[]


    for i in range(len(self.tokens)):

        hyp_leng = 0

        while self.hypen_between_two_tokens(i + hyp_leng):
            hyp_leng += 1

        if self.has_hypen_in_middle(i) or hyp_leng > 0:
            if hyp_leng == 0:
                tokens_with_hypens.append(self.tokens[i:i + 1])
            else:
                tokens_with_hypens.append(self.tokens[i:i + hyp_leng])
                i += hyp_leng - 1

    return tokens_with_hypens

What am I doing wrong? Is there a more performant solution? Thanks

Answer 1

I found 3 errors in your code:

1) Here you are comparing the last 2 characters of tok1, instead of the last character of tok1 and the first character of tok2:

if "-" in joined[len(tok1) - 2: len(tok1)]:
# instead, do this:
if "-" in joined[len(tok1) - 1: len(tok1) + 1]:

2) Here you are leaving out the last matching token. Increase the end index of the slice by 1:

tokens_with_hypens.append(self.tokens[i:i + hyp_leng])
# instead, do this:
tokens_with_hypens.append(self.tokens[i:i + 1 + hyp_leng])

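3) You cannot manipulate the index of a for i in range loop in Python. The next iteration simply retrieves the next index, overwriting your change. Instead, you can use a while loop like this: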

i = 0
while i < len(self.tokens):
    [...]
    i += 1

With these 3 corrections your test passes: http://goo.gl/fd07oL
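As a rough illustration of how the three corrections fit together, here is a standalone sketch. The original helper methods are only visible behind the link, so hyphen_between and has_hyphen_in_middle below are assumed stand-ins written as plain functions, not the asker's actual code.

```python
def hyphen_between(tokens, i):
    """Assumed helper: is there a hyphen on the boundary of tokens[i] and tokens[i + 1]?"""
    if i + 1 >= len(tokens):
        return False
    # Correction 1: last character of the first token plus first character of the second.
    boundary = tokens[i][-1:] + tokens[i + 1][:1]
    return "-" in boundary


def has_hyphen_in_middle(token):
    """Assumed helper: hyphen anywhere except the first or last character."""
    return "-" in token[1:-1]


def find_hyphens(tokens):
    groups = []
    i = 0
    while i < len(tokens):  # Correction 3: a while loop lets us move i ourselves.
        hyp_len = 0
        while hyphen_between(tokens, i + hyp_len):
            hyp_len += 1
        if hyp_len > 0:
            # Correction 2: the slice end is i + 1 + hyp_len so the last token is kept.
            groups.append(tokens[i:i + 1 + hyp_len])
            i += hyp_len  # skip the tokens already consumed by this chain
        elif has_hyphen_in_middle(tokens[i]):
            groups.append(tokens[i:i + 1])
        i += 1
    return groups


tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh",
          "Man", "-Hum", "-Zuh-UH-", "glit"]
print(find_hyphens(tokens))
```

Note that this sketch treats any hyphen on a boundary as a connection, so "a-" followed by "-b" would also be joined.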

Nevertheless, I couldn't resist writing an algorithm from scratch that solves your problem as simply as possible:

def get_hyphen_groups(tokens):
    i_start, i_end = 0, 1
    while i_start < len(tokens):
        while (i_end < len(tokens) and
              (tokens[i_end].startswith("-") ^ tokens[i_end - 1].endswith("-"))):
            i_end += 1
        yield tokens[i_start:i_end]
        i_start, i_end = i_end, i_end + 1


tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh", "Man", "-Hum", "-Zuh-UH-", "glit"]

for group in get_hyphen_groups(tokens):
    print ("".join(group))

To exclude 1-element groups, as in the expected result, wrap the yield in this if:

if i_end - i_start > 1:
    yield tokens[i_start:i_end]

To include 1-element groups whose token already contains a hyphen, change the if to this, for example:

if i_end - i_start > 1 or "-" in tokens[i_start]:
    yield tokens[i_start:i_end]
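Putting the generator and the filter together might look like the sketch below. The keep_single_hyphenated parameter is a hypothetical addition used here only to switch between the two filter variants; it is not part of the answer's original function.

```python
def get_hyphen_groups(tokens, keep_single_hyphenated=True):
    i_start, i_end = 0, 1
    while i_start < len(tokens):
        # Extend the group while exactly one side of the boundary carries the hyphen.
        while (i_end < len(tokens) and
               (tokens[i_end].startswith("-") ^ tokens[i_end - 1].endswith("-"))):
            i_end += 1
        is_chain = i_end - i_start > 1
        # Hypothetical switch: also keep lone tokens that contain a hyphen.
        is_single_hyphenated = keep_single_hyphenated and "-" in tokens[i_start]
        if is_chain or is_single_hyphenated:
            yield tokens[i_start:i_end]
        i_start, i_end = i_end, i_end + 1


tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh",
          "Man", "-Hum", "-Zuh-UH-", "glit"]
print(list(get_hyphen_groups(tokens)))
print(list(get_hyphen_groups(tokens, keep_single_hyphenated=False)))
```

With the default setting the result should match the expected output from the question; with keep_single_hyphenated=False the lone "Was-ISt" group is dropped.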

Answer 2

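One problem with your approach is that you try to change the value of i inside the for i in range(len(self.tokens)) loop. That won't work, because on each iteration i gets the next value from range, ignoring your change.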
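A small demonstration of this behaviour, using two hypothetical helper functions that are not part of the original code: both try to skip ahead when they reach a given index, but only the while version actually skips.

```python
def visited_with_for(n, skip_at, skip_by):
    """Try to skip ahead by reassigning the loop variable of a for loop."""
    visited = []
    for i in range(n):
        visited.append(i)
        if i == skip_at:
            i += skip_by  # no lasting effect: range() supplies the next value anyway
    return visited


def visited_with_while(n, skip_at, skip_by):
    """Same idea with a while loop, where we control the index ourselves."""
    visited = []
    i = 0
    while i < n:
        visited.append(i)
        if i == skip_at:
            i += skip_by  # this change survives into the next iteration
        i += 1
    return visited


print(visited_with_for(6, 1, 2))    # [0, 1, 2, 3, 4, 5]
print(visited_with_while(6, 1, 2))  # [0, 1, 4, 5]
```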

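I changed your algorithm to an iterative one that pops one item off the list and decides what to do with it. It uses a buffer to store the items that belong to a chain until the chain is complete.

The complete code is: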

class Hyper:

    def __init__(self, tokens):
        self.tokens = tokens

    def find_hyphens(self):
        tokens_with_hypens =[]

        copy = list(self.tokens)

        buffer = []
        while len(copy) > 0:
            item = copy.pop(0)
            if self.has_hyphen_in_middle(item) and item[0] != '-' and item[-1] != '-':
                # words with hyphens that are not part of a bigger chain
                tokens_with_hypens.append([item])
            elif item[-1] == '-' or (len(copy) > 0 and copy[0][0] == '-'):
                # part of a chain - append to the buffer
                buffer.append(item)
            elif len(buffer) > 0:
                # the last word in a chain - the buffer contains the complete chain
                buffer.append(item)
                tokens_with_hypens.append(buffer)
                buffer = []

        return tokens_with_hypens

    @staticmethod
    def has_hyphen_in_middle(input):
        # "in the middle" means anywhere except the first and last character
        return len(input) > 2 and "-" in input[1:-1]


tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh", "Man", "-Hum", "-Zuh-UH-", "glit"]

hyper = Hyper(tokens)

result = hyper.find_hyphens()

print(result)

print(result == [["Tap-", "Berlin"], ["Was-ISt"], ["das", "-ist"], ["Man", "-Hum", "-Zuh-UH-", "glit"]])
