Retrieving data from a TOC

Time: 2017-11-27 09:09:40

Tags: python

My goal is to convert the TOC of "The Python Language Reference — Python 3.6.3 documentation" into structured data, using the following steps:

1. Copy the TOC content into a file named plr.md

In [1]: with open('plr.md') as file:
   ...:     content = file.read()
In [2]: content
Out[2]: '\n\n- \\1. Introduction\n  - [1.1. Alternate Implementations](https://docs.python.org/3.6/reference/introduction.html#alternate-implementations)\n  - [1.2. Notation](https://docs.python.org/3.6/reference/introduction.html#notation)\n- \\2. Lexical analysis\n  - [2.1. Line structure](https://docs.python.org/3.6/reference/lexical_analysis.html#line-structure)\n  - [2.2. Other tokens](https://docs.python.org/3.6/reference/lexical_analysis.html#other-tokens)\n

2. Get the chapters

In [47]: chapters = content.split('\n- \\')
    ...: # drop the leading text that comes before the first chapter
    ...: chapters = chapters[1:]
In [50]: chapters[0]
Out[50]: '1. Introduction\n  - [1.1. Alternate Implementations](https://docs.python.org/3.6/reference/introduction.html#alternate-implementations)\n  - [1.2. Notation](https://docs.python.org/3.6/reference/introduction.html#notation)'
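As a possible variation on this step, the split can be anchored to the start of a line with a regular expression, so that indented section lines are never mistaken for chapter boundaries. This is only a sketch: `split_chapters` and `CHAPTER_START` are names made up for illustration, and it assumes every chapter line begins with `- ` at column 0, optionally followed by the backslash escape seen in the output above.

```python
import re

# Start of a chapter line: "- " at the beginning of a line, followed by an
# optional backslash (assumed to be the Markdown escape seen in plr.md).
CHAPTER_START = re.compile(r'^- \\?', re.MULTILINE)

def split_chapters(content):
    """Split TOC text into one string per chapter (hypothetical helper)."""
    parts = CHAPTER_START.split(content)
    # parts[0] is whatever precedes the first chapter, so it is dropped.
    return [part.strip() for part in parts[1:]]

sample = (
    '\n\n- \\1. Introduction\n'
    '  - [1.1. Alternate Implementations](https://docs.python.org/3.6/reference/introduction.html#alternate-implementations)\n'
    '  - [1.2. Notation](https://docs.python.org/3.6/reference/introduction.html#notation)\n'
    '- \\2. Lexical analysis\n'
    '  - [2.1. Line structure](https://docs.python.org/3.6/reference/lexical_analysis.html#line-structure)\n'
)
chapters = split_chapters(sample)
```

Each element of `chapters` then starts with the chapter title, in the same shape as `chapters[0]` above.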

3. Get the chapter name and the section names within each chapter

chapter_details = chapters[0].split('\n  -')
sections = chapter_details[1:]
chapter = chapter_details[0]
In [54]: chapter
Out[54]: '1. Introduction'
In [55]: sections
Out[55]:
[' [1.1. Alternate Implementations](https://docs.python.org/3.6/reference/introduction.html#alternate-implementations)',
 ' [1.2. Notation](https://docs.python.org/3.6/reference/introduction.html#notation)']

4. Convert the sections

def convert_section(s):
    start = s.index('[') + 1
    end = s.index(']')
    return s[start:end]
In [57]: print(convert_section(' [1.1. Alternate Implementations](https://docs.python.org/3.6/reference/introduction.html#alternate-implementations)'))
1.1. Alternate Implementations

sections = map(convert_section, sections)
sections = list(sections)
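The index arithmetic in convert_section could probably be replaced by a regular expression that captures the link text and the URL together. A minimal sketch, assuming every section line contains exactly one Markdown inline link of the form `[title](url)`; `parse_link` and `LINK_RE` are illustrative names, not part of the code above.

```python
import re

# Matches a Markdown inline link: [title](url)
LINK_RE = re.compile(r'\[([^\]]+)\]\(([^)]+)\)')

def parse_link(s):
    """Return (title, url) for the first Markdown link in s, or None."""
    match = LINK_RE.search(s)
    return match.groups() if match else None

section = ' [1.2. Notation](https://docs.python.org/3.6/reference/introduction.html#notation)'
title, url = parse_link(section)
print(title)  # 1.2. Notation
```

Unlike `s.index('[')`, which raises ValueError on a line without brackets, this version returns None, which the caller has to handle.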

5. Create a dictionary

key = chapter
{key:sections}
 {'1. Introduction':['1.1. Alternate Implementations', '1.2. Notation']}

6. Encapsulate the code in a class and get the result

class TOC:
    def __init__(self, filename):
        self.filename = filename

    def read(self, filename):
        with open(filename) as file:
            content = file.read()
        return content

    def convert_section(self, s):
        start = s.index('[') + 1
        end = s.index(']')
        return s[start:end]

    def get_chapters(self, filename):
        content = self.read(filename)
        chapters = content.split('\n- \\')
        # drop the leading text that comes before the first chapter
        chapters = chapters[1:]
        return chapters

    def create_chapter_dict(self, chapter):
        chapter_details = chapter.split('\n  -')
        sections = chapter_details[1:]
        key = chapter_details[0]
        value = map(self.convert_section, sections)
        return {key: list(value)}

    def get_chapters_dict(self):
        chapters = self.get_chapters(self.filename)
        chapters_dict = {}
        for chapter in chapters:
            chapter_dict = self.create_chapter_dict(chapter)
            chapters_dict.update(chapter_dict)
        return chapters_dict

Run it and get the result:

In [89]: TOC('plr.md').get_chapters_dict()
Out[89]:
{'1. Introduction': ['1.1. Alternate Implementations', '1.2. Notation'],
 '2. Lexical analysis': ['2.1. Line structure',
  '2.2. Other tokens',
  '2.3. Identifiers and keywords',
  '2.4. Literals',
  '2.5. Operators',
  '2.6. Delimiters'],
 '3. Data model': ['3.1. Objects, values and types',
  '3.2. The standard type hierarchy',
  '3.3. Special method names',
  '3.4. Coroutines'],

This solution feels like overkill for an everyday, routine task. Is there a standard or simpler way to perform this kind of task?
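For comparison, here is a sketch of what a shorter, line-by-line version might look like, using only the standard `re` module. It is not a standard tool for this job, and it rests on assumptions taken from the sample above: chapter lines start with `- ` at column 0 (optionally followed by a backslash escape), and section lines are indented and start with `- [title](url)`. The name `parse_toc` is made up for illustration.

```python
import re

# Chapter line, e.g. "- \1. Introduction" (the backslash is assumed to be
# a Markdown escape and is discarded).
CHAPTER_RE = re.compile(r'^- \\?(.+)$')
# Section line, e.g. "  - [1.1. Alternate Implementations](url)".
SECTION_RE = re.compile(r'^\s+- \[([^\]]+)\]')

def parse_toc(text):
    """Map each chapter title to the list of its section titles."""
    toc = {}
    current = None
    for line in text.splitlines():
        chapter = CHAPTER_RE.match(line)
        section = SECTION_RE.match(line)
        if chapter:
            current = chapter.group(1).strip()
            toc[current] = []
        elif section and current is not None:
            toc[current].append(section.group(1))
    return toc

sample = (
    '\n\n- \\1. Introduction\n'
    '  - [1.1. Alternate Implementations](https://docs.python.org/3.6/reference/introduction.html#alternate-implementations)\n'
    '  - [1.2. Notation](https://docs.python.org/3.6/reference/introduction.html#notation)\n'
    '- \\2. Lexical analysis\n'
    '  - [2.1. Line structure](https://docs.python.org/3.6/reference/lexical_analysis.html#line-structure)\n'
)
toc = parse_toc(sample)
```

With the real file this would be called as `parse_toc(open('plr.md').read())`. Since Python 3.7 dicts keep insertion order, so the chapters come out in TOC order.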

0 answers:

No answers yet