My goal is to convert the TOC of The Python Language Reference — Python 3.6.3 documentation into structured data, using the following steps:
1. Copy the contents into the file plr.md
In [1]: with open('plr.md') as file:
   ...:     content = file.read()
In [2]: content
Out[2]: '\n\n- \\1. Introduction\n - [1.1. Alternate Implementations](https://docs.python.org/3.6/reference/introduction.html#alternate-implementations)\n - [1.2. Notation](https://docs.python.org/3.6/reference/introduction.html#notation)\n- \\2. Lexical analysis\n - [2.1. Line structure](https://docs.python.org/3.6/reference/lexical_analysis.html#line-structure)\n - [2.2. Other tokens](https://docs.python.org/3.6/reference/lexical_analysis.html#other-tokens)\n
2. Get the chapters
In [47]: chapters = content.split('\n- \\')
    ...: # drop the text before the first chapter marker
    ...: chapters = chapters[1:]
In [50]: chapters[0]
Out[50]: '1. Introduction\n - [1.1. Alternate Implementations](https://docs.python.org/3.6/reference/introduction.html#alternate-implementations)\n - [1.2. Notation](https://docs.python.org/3.6/reference/introduction.html#notation)'
3. Get the chapter name and the section names of each chapter
chapter_details = chapters[0].split('\n -')
sections = chapter_details[1:]
chapter = chapter_details[0]
In [54]: chapter
Out[54]: '1. Introduction'
In [55]: sections
Out[55]:
[' [1.1. Alternate Implementations](https://docs.python.org/3.6/reference/introduction.html#alternate-implementations)',
' [1.2. Notation](https://docs.python.org/3.6/reference/introduction.html#notation)']
4. Convert the sections
def convert_section(s):
    start = s.index('[') + 1
    end = s.index(']')
    return s[start:end]
In [57]: print(convert_section(' [1.1. Alternate Implementations](https://docs.python.org/3.6/reference/introduction.html#alternate-implementations)'))
1.1. Alternate Implementations
sections = map(convert_section, sections)
sections = list(sections)
5. Create a dictionary
key = chapter
{key:sections}
{'1. Introduction':['1.1. Alternate Implementations', '1.2. Notation']}
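Steps 3 to 5 can be tried without the file. The following is a minimal sketch that uses the chapters[0] value from step 2 as a literal; the name `sample` is only for illustration.

```python
# The chapters[0] string from step 2, written out as a literal.
sample = (
    '1. Introduction\n'
    ' - [1.1. Alternate Implementations](https://docs.python.org/3.6/reference/introduction.html#alternate-implementations)\n'
    ' - [1.2. Notation](https://docs.python.org/3.6/reference/introduction.html#notation)'
)


def convert_section(s):
    # Keep only the link text between '[' and ']'.
    start = s.index('[') + 1
    end = s.index(']')
    return s[start:end]


# Step 3: the first piece is the chapter title, the rest are sections.
chapter_details = sample.split('\n -')
chapter, sections = chapter_details[0], chapter_details[1:]

# Steps 4 and 5: convert each section and build the one-entry dictionary.
chapter_dict = {chapter: [convert_section(s) for s in sections]}
print(chapter_dict)
```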
6. Encapsulate the code in a class and get the result
class TOC:
    def __init__(self, filename):
        self.filename = filename

    def read(self, filename):
        with open(filename) as file:
            content = file.read()
        return content

    def convert_section(self, s):
        # keep only the link text between '[' and ']'
        start = s.index('[') + 1
        end = s.index(']')
        return s[start:end]

    def get_chapters(self, filename):
        content = self.read(filename)
        chapters = content.split('\n- \\')
        # drop the text before the first chapter marker
        chapters = chapters[1:]
        return chapters

    def create_chapter_dict(self, chapter):
        chapter_details = chapter.split('\n -')
        sections = chapter_details[1:]
        key = chapter_details[0]
        value = map(self.convert_section, sections)
        return {key: list(value)}

    def get_chapters_dict(self):
        chapters = self.get_chapters(self.filename)
        chapters_dict = {}
        for chapter in chapters:
            chapter_dict = self.create_chapter_dict(chapter)
            chapters_dict.update(chapter_dict)
        return chapters_dict
Run it and get the result
In [89]: TOC('plr.md').get_chapters_dict()
Out[89]:
{'1. Introduction': ['1.1. Alternate Implementations', '1.2. Notation'],
'2. Lexical analysis': ['2.1. Line structure',
'2.2. Other tokens',
'2.3. Identifiers and keywords',
'2.4. Literals',
'2.5. Operators',
'2.6. Delimiters'],
'3. Data model': ['3.1. Objects, values and types',
'3.2. The standard type hierarchy',
'3.3. Special method names',
'3.4. Coroutines'],
This solution feels like overkill for such a routine, everyday task. Is there a standard or simpler way to do this kind of thing?
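One simpler direction would be to scan the text line by line with two regular expressions from the standard `re` module. This is only a sketch, and it assumes plr.md always has the layout shown in Out[2]. The names `parse_toc`, `CHAPTER`, `SECTION`, and `sample` are made up for illustration and do not come from any library.

```python
import re

# Assumed layout, taken from the repr of `content` in Out[2]:
#   chapter line:  - \1. Introduction
#   section line:   - [1.1. Notation](https://...)
CHAPTER = re.compile(r'^- \\(.+)$')
SECTION = re.compile(r'^\s+- \[([^\]]+)\]\(')


def parse_toc(text):
    """Return {chapter title: [section titles]} for the assumed layout."""
    toc = {}
    current = None
    for line in text.splitlines():
        chapter = CHAPTER.match(line)
        if chapter:
            current = chapter.group(1).strip()
            toc[current] = []
            continue
        section = SECTION.match(line)
        if section and current is not None:
            toc[current].append(section.group(1))
    return toc


# A short excerpt in the same shape as the file contents.
sample = (
    '\n\n- \\1. Introduction\n'
    ' - [1.1. Alternate Implementations](https://docs.python.org/3.6/reference/introduction.html#alternate-implementations)\n'
    ' - [1.2. Notation](https://docs.python.org/3.6/reference/introduction.html#notation)\n'
    '- \\2. Lexical analysis\n'
    ' - [2.1. Line structure](https://docs.python.org/3.6/reference/lexical_analysis.html#line-structure)\n'
)
result = parse_toc(sample)
print(result)
```

With the real file, the call would be `parse_toc(open('plr.md').read())`. Lines that match neither pattern are skipped, so no separate step is needed to drop the text before the first chapter.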