Question

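I'm new to Python and have been working on various projects to get up to speed. At the moment I'm writing a routine that reads through the Code of Federal Regulations (CFR) and, for each paragraph, prints that paragraph's organizational hierarchy. For example, a simplified version of the CFR's XML scheme looks like this: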

<CHAPTER>
<HD SOURCE="HED">PART 229—NONDISCRIMINATION ON THE BASIS OF SEX IN EDUCATION PROGRAMS OR ACTIVITIES RECEIVING FEDERAL FINANCIAL ASSISTANCE</HD>
     <SECTION>
        <SECTNO>### 229.120</SECTNO>
        <SUBJECT>Transfers of property.</SUBJECT>
        <P>If a recipient sells or otherwise transfers property (…) subject to the provisions of ### 229.205 through 229.235(a).</P>
     </SECTION>

I'd like to be able to print this as CSV so that I can run text analysis on it:

Title 22, Volume 2, Part 229, Section 229.120, If a recipient sells or otherwise transfers property (…) subject to the provisions of ### 229.205 through 229.235(a).

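Note that I'm not taking the title and volume numbers from the XML, because they are already included in the file name in a more standardized format.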

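Because I'm such a Python newbie, the code is mostly based on the search-engine code from Udacity's computer science course. Here is the Python I've written/adapted so far: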

import os
import urllib2
from xml.dom.minidom import parseString
file_path = '/Users/owner1/Downloads/CFR-2012/title-22/CFR-2012-title22-vol1.xml'
file_name = os.path.basename(file_path) #Gets the filename from the path.
doc = open(file_path)
page = doc.read()

def clean_title(file_name): #Gets the title number from the filename.
    start_title = file_name.find('title')
    end_title = file_name.find("-", start_title+1)
    title = file_name[start_title+5:end_title]
    return title

def clean_volume(file_name): #Gets the volume number from the filename.
    start_volume = file_name.find('vol')
    end_volume = file_name.find('.xml', start_volume)
    volume = file_name[start_volume+3:end_volume]
    return volume

def get_next_section(page): #Gets all of the text between <SECTION> tags.
    start_section = page.find('<SECTION')
    if start_section == -1:
        return None, 0
    start_text = page.find('>', start_section)
    end_quote = page.find('</SECTION>', start_text + 1)
    section = page[start_text + 1:end_quote]
    return section, end_quote

def get_section_number(section): #Within the <SECTION> tag, find the section number based on the <SECTNO> tag.
    start_section_number = section.find('<SECTNO>###')
    if start_section_number == -1:
        return None, 0
    end_section_number = section.find('</SECTNO>', start_section_number)
    section_number = section[start_section_number+11:end_section_number]
    return section_number, end_section_number

def get_paragraph(section): #Within the <SECTION> tag, finds <P> paragraphs.
    start_paragraph = section.find('<P>')
    if start_paragraph == -1:
        return None, 0
    end_paragraph = section.find('</P>', start_paragraph)
    paragraph = section[start_paragraph+3:end_paragraph]
    return start_paragraph, paragraph, end_paragraph


def print_all_paragraphs(page): #This is the section that I would *like* to have print each paragraph and the citation hierarchy.
    section, endpos = get_next_section(page)
    for pragraph in section:
        title = clean_title(file_name)
        volume = clean_volume(file_name)
        section, endpos = get_next_section(page)
        section_number, end_section_number = get_section_number(section)
        start_paragraph, paragraph, end_paragraph = get_paragraph(section)
        if paragraph:
            print "Title: "+ title + " Volume: "+ volume +" Section Number: "+ section_number + " Text: "+ paragraph
            page = page[end_paragraph:]
        else:
            break

print print_all_paragraphs(page)
doc.close()

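Currently, this code has the following problems (sample output below):

It prints the first paragraph multiple times. How can I print each <P> tag together with its title number, volume number, and so on?
The CFR has empty sections that are "Reserved". These sections don't have <P> tags, so the if test fails and the loop breaks. I've tried implementing for/while loops, but for some reason when I do, the code just prints the first paragraph it finds over and over.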

Here is a sample of the output:

Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.11 Text: The Information and Privacy Coordinator shall be responsible for conducting a program for systematic declassification review of historically valuable records that were exempted from the automatic declassification provisions of section 3.3 of the Executive Order. The Information and Privacy Coordinator shall prioritize such review on the basis of researcher interest and the likelihood of declassification upon review.
Title: 22 Volume: 1 Section Number:  9.12 Text: For Department procedures regarding the access to classified information by historical researchers and certain former government personnel, see Sec. 171.24 of this Title.
Title: 22 Volume: 1 Section Number:  9.13 Text: Specific controls on the use, processing, storage, reproduction, and transmittal of classified information within the Department to provide protection for such information and to prevent access by unauthorized persons are contained in Volume 12 of the Department's Foreign Affairs Manual.
Title: 22 Volume: 1 Section Number:  9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
Title: 22 Volume: 1 Section Number:  9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
None

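Ideally, each entry after the citation information would be different.

What kind of loop should I run to print this correctly? Is there a more "Pythonic" way to do this kind of text extraction?

I know I'm a complete novice, and one of the main problems I face is that I simply don't have the vocabulary or subject knowledge to find detailed answers about parsing XML. Any recommended reading would also be welcome.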

Answer 1

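I like to solve problems like this with XPath or XSLT. You can find a great implementation in lxml (it is not in the standard distribution, so it needs to be installed). For example, the XPath //CHAPTER/HD/SECTION[SECTNO] selects all sections that contain data. From there, you use relative XPath statements to get the values you need. The multiple nested for loops disappear. XPath has a bit of a learning curve, but there are plenty of examples out there.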
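
To make the idea concrete, here is a minimal sketch rather than a definitive implementation. It is written for Python 3 (the question's code is Python 2) and uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset, instead of lxml, so it runs without installing anything. With lxml you would call xpath() on the parsed tree and could use fuller expressions. The SAMPLE string, the helper names title_and_volume and iter_rows, and the regular expression for the file name are all assumptions made for illustration. The sample also leaves out the marker that precedes the section number in the question's XML. In the question's simplified sample, <SECTION> sits beside <HD> rather than inside it, so the sketch selects sections with .//SECTION[SECTNO].

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the question's simplified XML
# (a closing </CHAPTER> is added so that it parses).
SAMPLE = """<CHAPTER>
<HD SOURCE="HED">PART 229</HD>
  <SECTION>
    <SECTNO>229.120</SECTNO>
    <SUBJECT>Transfers of property.</SUBJECT>
    <P>First paragraph.</P>
    <P>Second paragraph.</P>
  </SECTION>
  <SECTION>
    <SECTNO>229.121</SECTNO>
    <SUBJECT>[Reserved]</SUBJECT>
  </SECTION>
</CHAPTER>"""


def title_and_volume(file_name):
    """Pull title and volume out of a name like CFR-2012-title22-vol1.xml."""
    match = re.search(r"title(\d+)-vol(\d+)", file_name)
    return match.group(1), match.group(2)


def iter_rows(xml_text, file_name):
    """Yield one (title, volume, section number, text) tuple per <P>."""
    title, volume = title_and_volume(file_name)
    root = ET.fromstring(xml_text)
    # The [SECTNO] predicate keeps only sections that have a <SECTNO> child.
    for section in root.findall(".//SECTION[SECTNO]"):
        section_number = section.findtext("SECTNO", default="").strip()
        # A reserved section has no <P>, so this loop just runs zero times.
        for paragraph in section.findall("P"):
            text = "".join(paragraph.itertext()).strip()
            yield title, volume, section_number, text


rows = list(iter_rows(SAMPLE, "CFR-2012-title22-vol1.xml"))

# Write the rows as CSV; the csv module quotes commas inside the text.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Because each section's paragraphs are found relative to that section, every <P> is visited exactly once, and a section without any <P> contributes no rows instead of stopping the loop.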
