Question

I'm trying to get the hierarchy of sections, subsections, and sub-subsections of a Wikipedia page.

I have a string like this:

mystr = 'a = b = = c = == d == == e == === f === === g === ==== h ==== === i === == j == == k == = l ='

In this case the page name is 'a', and the structure is as follows:

= b =
= c =
  == d ==
  == e ==
     === f ===
     === g ===
         ==== h ====
     === i ===
  == j ==
  == k ==
= l =

The equals signs mark a section, a subsection, and so on. I need a Python list containing the full hierarchy of relationships, like this:

mylist = ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g', 
          'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']

So far I've been able to find the sections, subsections, etc. by doing this:

sections = re.findall(r' = (.*?)\ =', mystr)
subsections = re.findall(r' == (.*?)\ ==', mystr)
...

But I don't know how to get from here to the desired mylist.

Answer 1

You can do it like this:
- A first function parses your string and yields tokens (level, name), such as (0, 'a'), (1, 'b').
- A second one builds the tree from there.

import re

def tokens(string):
    # The root name doesn't respect the '= name =' convention,
    # so we cut the string on the first " = " and yield the root name
    root_end = string.index(' = ') 
    root, rest = string[:root_end], string[root_end:]
    yield 0, root

    # We use a regex for the next tokens, which consist of the following groups:
    # - one or more "=" followed by an optional single space,
    # - the name, not containing any "=",
    # - and again, the exact text matched by the first group (backreference \1)

    tokens_re = re.compile(r'(=+ ?)([^=]+)\1')
    # findall will return a list:
    # [('= ', 'b '), ('= ', 'c '), ('== ', 'd '), ('== ', 'e '), ('=== ', 'f '), ...]
    for token in tokens_re.findall(rest):
        level = token[0].count('=')
        name = token[1].strip()
        yield level, name


def tree(token_list):    
    out = []
    # We keep track of the current position in the hierarchy:
    hierarchy = []
    for token in token_list:
        level, name = token
        # We cut the hierarchy below the level of our token
        hierarchy = hierarchy[:level]
        # and append the current one
        hierarchy.append(name)
        out.append('/'.join(hierarchy))
    return out


mystr = 'a = b = = c = == d == == e == === f === === g === ==== h ==== === i === == j == == k == = l ='
out = tree(tokens(mystr))
# Check that this is your expected output
assert out == ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g', 
          'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']

print(out)
# ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g', 'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']
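
To see why the slice hierarchy[:level] is enough to keep track of the position, here is a minimal sketch that records the hierarchy after each token. The trace function and the sample token list are hypothetical names introduced only for this illustration; they mirror the loop in tree() above and are not part of the original answer.

```python
def trace(token_list):
    """Return (level, name, path) for each token, mirroring the loop in tree()."""
    hierarchy = []
    steps = []
    for level, name in token_list:
        # Keep only the ancestors at depths 0..level-1; this drops any
        # deeper entries and the previous sibling at the same depth.
        hierarchy = hierarchy[:level]
        hierarchy.append(name)
        # Store a copy, since hierarchy is rebuilt on the next iteration.
        steps.append((level, name, list(hierarchy)))
    return steps


# Hypothetical sample: a shortened version of the token stream for mystr.
sample = [(0, 'a'), (1, 'c'), (2, 'e'), (3, 'g'), (4, 'h'), (3, 'i'), (1, 'l')]
for level, name, path in trace(sample):
    print(level, name, '/'.join(path))
```

Note that this relies on levels never skipping a depth on the way down. If a level-3 heading directly followed a level-1 heading, the slice would simply return the shorter list, and the heading would be attached one level higher than its "=" count suggests.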
