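I'm working on a project that involves building an RDBMS of the United States Code in a particular format. I obtained the entire Code from the official source, but it is poorly structured. Using some code from GitHub, I have managed to extract the US Code into text files in the format shown below.
Can a Python script write this out to a CSV or flat file in the following format?
I'm new to Python, but I've been told this can be done easily with it.
The final output would be a flat file or CSV file with the following schema:
Example: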
**Title | Text | Chapter | text | Section | Text | Section text**
1 | GENERAL PROVISIONS | 1 | RULES OF CONSTRUCTION | 2 | "County" as including "parish", and so forth | The word "county" includes a parish, or any other equivalent subdivision of a State or Territory of the United States.
The input will be a text file with data like the following.
Sample data:
-CITE-
1 USC Sec. 2 01/15/2013
-EXPCITE-
TITLE 1 - GENERAL PROVISIONS
CHAPTER 1 - RULES OF CONSTRUCTION
-HEAD-
Sec. 2. "County" as including "parish", and so forth
-STATUTE-
The word "county" includes a parish, or any other equivalent
subdivision of a State or Territory of the United States.
-SOURCE-
(July 30, 1947, ch. 388, 61 Stat. 633.)
-End-
-CITE-
1 USC Sec. 3 01/15/2013
-EXPCITE-
TITLE 1 - GENERAL PROVISIONS
CHAPTER 1 - RULES OF CONSTRUCTION
-HEAD-
Sec. 3. "Vessel" as including all means of water transportation
-STATUTE-
The word "vessel" includes every description of watercraft or
other artificial contrivance used, or capable of being used, as a
means of transportation on water.
-SOURCE-
(July 30, 1947, ch. 388, 61 Stat. 633.)
-End-
Answer 0 (score: 2)
If you would like to use a robust parser such as pyparsing instead of regular expressions, the following should work for you:
import csv, re
from pyparsing import Empty, FollowedBy, Group, LineEnd, Literal, \
    OneOrMore, Optional, Regex, SkipTo, Word
from pyparsing import alphanums, alphas, nums

def section(header, other):
    # -<header>- marker (suppressed) followed by the section's contents
    return Literal('-'+header+'-').suppress() + other

def tc(header, next_item):
    # <header> <number> - <name>
    begin = Literal(header).suppress()
    number = Word(nums)\
        .setResultsName('number')\
        .setParseAction(compress_whitespace)
    dash = Literal('-').suppress()
    name = SkipTo(Literal(next_item))\
        .setResultsName('name')\
        .setParseAction(compress_whitespace)
    return begin + number + dash + name

def compress_whitespace(s, loc, toks):
    # collapse runs of whitespace (including line breaks) into single spaces
    return [re.sub(r'\s+', ' ', tok).strip() for tok in toks]

def parse(data):
    # should match anything that looks like a header
    header = Regex(re.compile(r'-[A-Z0-9]+-'))

    # -CITE- (ignore)
    citation = SkipTo('-EXPCITE-').suppress()
    cite_section = section('CITE', citation)

    # -EXPCITE- (parse)
    # grab title number, title name, chapter number, chapter name
    title = Group(tc('TITLE', 'CHAPTER'))\
        .setResultsName('title')
    chapter = Group(tc('CHAPTER', '-HEAD-'))\
        .setResultsName('chapter')
    expcite_section = section('EXPCITE', title + chapter)

    # -HEAD- (parse)
    # two possible forms of section number:
    # > Sec. 1. <head_text>
    # > CHAPTER 1 - <head_text>
    # the trailing period is escaped so that only a literal '.' matches;
    # the parse action then strips it from the section number
    sec_number1 = Literal("Sec.").suppress() \
        + Regex(r'\d+\w?\.')\
        .setResultsName('section')\
        .setParseAction(lambda s, loc, toks: toks[0][:-1])
    sec_number2 = Literal("CHAPTER").suppress() \
        + Word(nums)\
        .setResultsName('section') \
        + Literal("-")
    sec_number = sec_number1 | sec_number2
    head_text = SkipTo(header)\
        .setResultsName('head')\
        .setParseAction(compress_whitespace)
    head = sec_number + head_text
    head_section = section('HEAD', head)

    # -STATUTE- (parse)
    statute = SkipTo(header)\
        .setResultsName('statute')\
        .setParseAction(compress_whitespace)
    statute_section = section('STATUTE', statute)

    # -End- (ignore)
    end_section = SkipTo('-End-', include=True)

    # do parsing
    parser = OneOrMore(Group(cite_section \
        + expcite_section \
        + head_section \
        + Optional(statute_section) \
        + end_section))
    result = parser.parseString(data)
    return result

def write_to_csv(parsed_data, filename):
    # newline='' lets the csv module control line endings (Python 3)
    with open(filename, 'w', newline='') as f:
        writer = csv.writer(f, lineterminator='\n')
        for item in parsed_data:
            # skip records that have no -STATUTE- section
            if 'statute' not in item:
                continue
            row = [item['title']['number'],
                   item['title']['name'],
                   item['chapter']['number'],
                   item['chapter']['name'],
                   item['section'],
                   item['head'],
                   item['statute']]
            writer.writerow(row)

# your data is assumed to be in <source.txt>
with open('source.txt', 'r') as f:
    data = f.read()
result = parse(data)
write_to_csv(result, 'output.txt')
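Output: see http://pastie.org/8654063.
This is certainly more verbose than using regular expressions, but in my opinion it is also easier to maintain and extend. (Of course, it comes with the overhead of learning the basics of pyparsing, which is not necessarily trivial.)
In response to your request: I've updated the parser to accommodate all of the text in the file you linked me to. It should now be more robust to unusual line breaks and punctuation.
As you asked, citations that enumerate sections (and lack a -STATUTE- section) are no longer included in the output.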
Answer 1 (score: 0)
1. Go through the lines of the file:
with open('workfile', 'r') as f:
    for line in f:
        ...
2. Use Python's re module to match the tags ['CITE', 'EXPCITE', 'HEAD', ...].
3. Depending on which tag matched in step 2, use re again to match the line's content. Consider preparing these matchers in a dictionary:
d = {'EXPCITE': re.compile(pattern)}
# and then later
m = d['EXPCITE'].match(string)
# get the relevant group, for example
print(m.group(0))
4. Write to the CSV output file:
with open('out.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter='|')
    writer.writerow([...])
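To make step 3 concrete, here is a minimal sketch of what such a matcher dictionary might look like for the sample data in the question. The names PATTERNS and match_line and the regular expressions themselves are my own assumptions, written against the two sample records only; real US Code files may well contain heading forms they do not cover.

```python
import re

# Hypothetical matchers, keyed by tag name (assumed, not from the answer).
PATTERNS = {
    # e.g. "TITLE 1 - GENERAL PROVISIONS" or "CHAPTER 1 - RULES OF CONSTRUCTION"
    'EXPCITE': re.compile(r'^(TITLE|CHAPTER)\s+(\w+)\s+-\s+(.+)$'),
    # e.g. 'Sec. 2. "County" as including "parish", and so forth'
    'HEAD': re.compile(r'^Sec\.\s+(\w+)\.\s+(.+)$'),
}

def match_line(tag, line):
    """Return the captured groups for a line under the given tag, or None."""
    m = PATTERNS[tag].match(line.strip())
    return m.groups() if m else None

print(match_line('EXPCITE', 'TITLE 1 - GENERAL PROVISIONS'))
print(match_line('HEAD', 'Sec. 2. "County" as including "parish", and so forth'))
```

Returning None for a non-matching line keeps the caller in charge of deciding whether to skip the line or report it.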
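Also, consider implementing a state machine to switch between steps 2 and 3; see "Python state-machine design". With this technique you can alternate between looking for the tags described in step 2 and matching the tag contents as described in step 3.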
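As a rough illustration of the state-machine idea, the sketch below uses only the standard library. The current tag is the state: a tag line switches state, any other line is appended to the current tag's text, and -End- closes the record. The helper names (records, to_row) and the regular expressions are assumptions of mine, checked only against the sample record from the question, and the output is written to an in-memory buffer rather than a real file.

```python
import csv
import io
import re

TAG = re.compile(r'^-([A-Za-z0-9]+)-$')  # a whole line such as -CITE- or -End-

def records(lines):
    """Yield one dict per record, mapping each tag to its joined text."""
    state, current = None, {}
    for raw in lines:
        line = raw.strip()
        m = TAG.match(line)
        if m and m.group(1) == 'End':      # record finished: emit and reset
            yield {tag: ' '.join(parts) for tag, parts in current.items()}
            state, current = None, {}
        elif m:                            # new tag: switch state
            state = m.group(1)
            current[state] = []
        elif state and line:               # content line for the current tag
            current[state].append(line)

def to_row(rec):
    """Build the seven-column row, or return None if the record does not fit."""
    exp = re.match(r'TITLE (\w+) - (.+?) CHAPTER (\w+) - (.+)$',
                   rec.get('EXPCITE', ''))
    head = re.match(r'Sec\. (\w+)\. (.+)$', rec.get('HEAD', ''))
    if not (exp and head and 'STATUTE' in rec):
        return None
    return list(exp.groups()) + list(head.groups()) + [rec['STATUTE']]

sample = '''-CITE-
1 USC Sec. 2 01/15/2013
-EXPCITE-
TITLE 1 - GENERAL PROVISIONS
CHAPTER 1 - RULES OF CONSTRUCTION
-HEAD-
Sec. 2. "County" as including "parish", and so forth
-STATUTE-
The word "county" includes a parish, or any other equivalent
subdivision of a State or Territory of the United States.
-SOURCE-
(July 30, 1947, ch. 388, 61 Stat. 633.)
-End-
'''

rows = [row for row in map(to_row, records(sample.splitlines())) if row]

out = io.StringIO()  # stand-in for open('out.csv', 'w', newline='')
csv.writer(out, delimiter='|', lineterminator='\n').writerows(rows)
print(out.getvalue())
```

Records without a -STATUTE- section, or whose headings do not fit the assumed patterns, are silently dropped here; logging them instead would make it easier to spot heading forms the patterns miss.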
Good luck!