I am trying to extract the hierarchy of sections, subsections, and sub-subsections of a Wikipedia page.
I have a string like this:
mystr = 'a = b = = c = == d == == e == === f === === g === ==== h ==== === i === == j == == k == = l ='
In this case the page name is 'a' and the structure is as follows:
= b =
= c =
== d ==
== e ==
=== f ===
=== g ===
==== h ====
=== i ===
== j ==
== k ==
= l =
The equals signs mark sections, subsections, and so on. I need a Python list containing the full hierarchy of relationships, like this:
mylist = ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g',
'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']
So far I have been able to find the sections, subsections, etc. by doing this:
sections = re.findall(r' = (.*?)\ =', mystr)
subsections = re.findall(r' == (.*?)\ ==', mystr)
...
But I don't know how to get from here to the desired mylist.
Answer 0 (score: 0)
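You can do it like this:
- A first function parses your string and yields (level, name) tokens, such as (0, 'a'), (1, 'b')
- A second one builds the tree from those tokens.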
import re


def tokens(string):
    # The root name doesn't respect the '= name =' convention,
    # so we cut the string on the first " = " and yield the root name
    root_end = string.index(' = ')
    root, rest = string[:root_end], string[root_end:]
    yield 0, root

    # We use a regex for the next tokens, which consist of the following groups:
    # - one or more "=", optionally followed by a single space,
    # - the name, not containing any "="
    # - and again, the first group of "=..." (matched by the backreference \1)
    tokens_re = re.compile(r'(=+ ?)([^=]+)\1')

    # findall will return a list:
    # [('= ', 'b '), ('= ', 'c '), ('== ', 'd '), ('== ', 'e '), ('=== ', 'f '), ...]
    for token in tokens_re.findall(rest):
        # The number of "=" in the first group is the nesting level
        level = token[0].count('=')
        name = token[1].strip()
        yield level, name


def tree(token_list):
    out = []
    # We keep track of the current position in the hierarchy:
    hierarchy = []
    for token in token_list:
        level, name = token
        # We cut the hierarchy below the level of our token
        hierarchy = hierarchy[:level]
        # and append the current one
        hierarchy.append(name)
        out.append('/'.join(hierarchy))
    return out


mystr = 'a = b = = c = == d == == e == === f === === g === ==== h ==== === i === == j == == k == = l ='
out = tree(tokens(mystr))

# Check that this is your expected output
assert out == ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g',
               'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']
print(out)
# ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g', 'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']
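
The step that does the real work is the slice hierarchy[:level], so it may help to watch it on a small input. The following is only an illustrative sketch: trace_paths and the sample list are names made up for this example, and the (level, name) pairs are written by hand rather than parsed from a string.

```python
def trace_paths(token_list):
    """Return (level, name, path) triples so each derived path can be inspected."""
    hierarchy = []
    steps = []
    for level, name in token_list:
        # Keep only the ancestors: everything at this level or deeper is dropped.
        hierarchy = hierarchy[:level]
        hierarchy.append(name)
        steps.append((level, name, '/'.join(hierarchy)))
    return steps


# Hypothetical hand-written tokens, in the same (level, name) shape as above.
sample = [(0, 'a'), (1, 'b'), (1, 'c'), (2, 'd'), (3, 'e'), (1, 'f')]

for level, name, path in trace_paths(sample):
    print(level, name, path)
```

Going from level 3 ('e') back to level 1 ('f') truncates the list to just the root, which is why 'f' ends up directly under 'a'. Note also that slicing past the end of a list is harmless in Python, so if a heading skips a level (say from 1 straight to 3), the entry is simply attached to the deepest ancestor seen so far instead of raising an error.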