Convert a tuple containing an OrderedDict with tagged parts into a table with columns named from the tagged parts

Time: 2015-04-21 20:09:40

Tags: python transpose

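Fuller title: convert a tuple containing an OrderedDict with tagged parts into a table with columns named from the tagged parts (with a variable number of tagged parts and a variable number of occurrences of each tag).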

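I know address parsing better than I know Python, which may be the root cause of the problem. How to do this may be obvious. The usaddress library may return its results this way intentionally, because the shape is useful.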

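I am using usaddress, "a python library for parsing unstructured address strings into address components, using advanced NLP methods", and it seems to work well. Here are the usaddress source and website.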

So I run it on a file like the following:

2244 NE 29TH DR
1742 NW 57TH ST
1241 NE EAST DEVILS LAKE RD 
4239 SW HWY 101, UNIT 19 
1315 NE HARBOR RIDGE 
4850 SE 51ST ST 
1501 SE EAST DEVILS LAKE RD 
1525 NE REGATTA WAY 
6458 NE MAST AVE 
4009 SW HWY 101 
814 SW 9TH ST 
1665 SALMON RIVER HWY 
3500 NE WEST DEVILS LAKE RD, UNIT 18 
1912 NE 56TH DR 
3334 NE SURF AVE 
2734 SW DUNE CT
2558 NE 33RD ST 
2600 NE 33RD ST 
5617 NW JETTY AVE 

I would like to convert these results into something more table-like (eventually a CSV file or a database).

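I am not sure which data type is returned. The documentation tells me that the tag method returns a tuple containing an OrderedDict with tagged parts. The parse method seems to return a slightly different type. This question helped me determine that it is a list containing tuples (apparently with tags). Searching for how to convert a python list with tagged parts to a table failed.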

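Searching for how to convert a tuple containing an OrderedDict did not turn up much either. This is the closest I found. I also found that pandas is good at various formatting tasks, although it is not clear to me how to apply pandas here. Many of the closest questions I found, like the opposite question or one with named tuples, have very low scores.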

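I also made some exploratory attempts to see what would work (below). I can see several ways of accessing the data, and using zip from this Matrix Transpose question gets closer to a table, because the data and the named tags are now separate, although the rows are not uniform. Is there a way to turn these results, whether the tagged list or the tuple containing an OrderedDict with tagged parts, into a table? Is there a fairly direct way to get there from the returned results?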

Here is the parse method:

## Get a library
import usaddress

## Open the file with read only permission
f = open('address_sample.txt')

## Read the first line 
line = f.readline()

## If the file is not empty keep reading line one at a time
## until the file is empty
while line:
    ## Try the parse method
    parsed = usaddress.parse(line)
    ## See what the parse results look like
    zippy = [list(i) for i in zip(*parsed)]
    print(zippy)
    ## read the next line
    line = f.readline()

## close the file
f.close()

This produces the following result (note the repetition when a tag has multiple parts):

[['2244', 'NE', '29TH', 'DR'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1742', 'NW', '57TH', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1241', 'NE', 'EAST', 'DEVILS', 'LAKE', 'RD'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['4239', 'SW', 'HWY', '101,', 'UNIT', '19'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName', 'OccupancyType', 'OccupancyIdentifier']]
[['1315', 'NE', 'HARBOR', 'RIDGE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['4850', 'SE', '51ST', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1501', 'SE', 'EAST', 'DEVILS', 'LAKE', 'RD'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['1525', 'NE', 'REGATTA', 'WAY'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['6458', 'NE', 'MAST', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['4009', 'SW', 'HWY', '101'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName']]
[['814', 'SW', '9TH', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1665', 'SALMON', 'RIVER', 'HWY'], ['AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['3500', 'NE', 'WEST', 'DEVILS', 'LAKE', 'RD,', 'UNIT', '18'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType', 'OccupancyType', 'OccupancyIdentifier']]
[['1912', 'NE', '56TH', 'DR'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['3334', 'NE', 'SURF', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2734', 'SW', 'DUNE', 'CT'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2558', 'NE', '33RD', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2600', 'NE', '33RD', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['5617', 'NW', 'JETTY', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
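
The repeated labels in the parse output can be collapsed by hand. The following is a minimal sketch, not the library's own implementation: `merge_repeated_labels` is a hypothetical helper that joins tokens sharing a label with a space, and stripping trailing commas is an assumption made so the result resembles the tag output shown further down. The sample pairs are reconstructed from the zipped output above.

```python
from collections import OrderedDict


def merge_repeated_labels(parsed):
    """Collapse (token, label) pairs into one value per label.

    Hypothetical helper for illustration only. Tokens that share a label
    are joined with a space in the order they appear. Trailing commas are
    stripped (an assumption, not documented library behavior).
    """
    merged = OrderedDict()
    for token, label in parsed:
        token = token.rstrip(',')
        if label in merged:
            merged[label] = merged[label] + ' ' + token
        else:
            merged[label] = token
    return merged


# Pairs reconstructed from the output for "1241 NE EAST DEVILS LAKE RD".
sample = [
    ('1241', 'AddressNumber'),
    ('NE', 'StreetNamePreDirectional'),
    ('EAST', 'StreetName'),
    ('DEVILS', 'StreetName'),
    ('LAKE', 'StreetName'),
    ('RD', 'StreetNamePostType'),
]

row = merge_repeated_labels(sample)
print(list(row.items()))
```

Note that this sketch also joins repeats that are not adjacent, which may not be what you want for every address.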

Here is the tag method:

## Get a library
import usaddress

## Open the file with read only permission
f = open('address_sample.txt')

## Read the first line 
line = f.readline()

## If the file is not empty keep reading line one at a time
## until the file is empty
while line:
    ## Try tag method
    tagged = usaddress.tag(line)
    ## See what the tag results look like
    items_ = list(tagged[0].items())
    zippy2 = [list(i) for i in zip(*items_)]
    print(zippy2)
    ## read the next line
    line = f.readline()

## close the file
f.close()

This produces the following output, which handles multiple parts with the same tag better by combining them:

[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2244', 'NE', '29TH', 'DR']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1742', 'NW', '57TH', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1241', 'NE', 'EAST DEVILS LAKE', 'RD']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName', 'OccupancyType', 'OccupancyIdentifier'], ['4239', 'SW', 'HWY', '101', 'UNIT', '19']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1315', 'NE', 'HARBOR', 'RIDGE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['4850', 'SE', '51ST', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1501', 'SE', 'EAST DEVILS LAKE', 'RD']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1525', 'NE', 'REGATTA', 'WAY']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['6458', 'NE', 'MAST', 'AVE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName'], ['4009', 'SW', 'HWY', '101']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['814', 'SW', '9TH', 'ST']]
[['AddressNumber', 'StreetName', 'StreetNamePostType'], ['1665', 'SALMON RIVER', 'HWY']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType', 'OccupancyType', 'OccupancyIdentifier'], ['3500', 'NE', 'WEST DEVILS LAKE', 'RD', 'UNIT', '18']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1912', 'NE', '56TH', 'DR']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['3334', 'NE', 'SURF', 'AVE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2734', 'SW', 'DUNE', 'CT']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2558', 'NE', '33RD', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2600', 'NE', '33RD', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['5617', 'NW', 'JETTY', 'AVE']]
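
To illustrate what "uniform" would mean here, the sketch below lines up two of the tagged results under one shared column list and fills missing labels with empty strings. The two OrderedDicts are copied from the output above. Ordering columns by first appearance is an arbitrary choice made for this example.

```python
from collections import OrderedDict

# Two tagged results copied from the output above.
tagged_rows = [
    OrderedDict([
        ('AddressNumber', '2244'),
        ('StreetNamePreDirectional', 'NE'),
        ('StreetName', '29TH'),
        ('StreetNamePostType', 'DR'),
    ]),
    OrderedDict([
        ('AddressNumber', '4239'),
        ('StreetNamePreDirectional', 'SW'),
        ('StreetNamePreType', 'HWY'),
        ('StreetName', '101'),
        ('OccupancyType', 'UNIT'),
        ('OccupancyIdentifier', '19'),
    ]),
]

# Build the shared column list in first-seen order.
columns = []
for row in tagged_rows:
    for label in row:
        if label not in columns:
            columns.append(label)

# One list per address, with '' where an address lacks a label.
table = [[row.get(label, '') for label in columns] for row in tagged_rows]

print(columns)
for line in table:
    print(line)
```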

1 Answer:

Answer 0: (score: 1)

Just use the csv.DictWriter class with the tag method:

from csv import DictWriter
import usaddress

tagged_lines = []
fields = set()
# Note 1: Use the 'with' statement instead of worrying about opening
# and closing your file manually
with open('address_sample.txt') as in_file:
    # Note 2: You don't need to mess with readline() and while loops; 
    # just iterate over the file handle directly, it produces lines.
    for line in in_file:
        tagged = usaddress.tag(line)[0]
        tagged_lines.append(tagged)
        fields.update(tagged.keys()) # keep track of all field names we see

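# Note 3: On Python 3 the csv module recommends newline='' for output files
with open('address_sample.csv', 'w', newline='') as out_file:
    # sorted() gives a stable column order; a bare set has no defined order
    writer = DictWriter(out_file, fieldnames=sorted(fields))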
    writer.writeheader()
    writer.writerows(tagged_lines)

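Note that this is inefficient for large files, because it holds the entire input in memory at once; the only reason for doing so is that the set of fields (i.e. the CSV column headers) is not known in advance.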

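If you know the complete set, you can do this in a single streaming pass, writing the tagged output as each line is read. Alternatively, you can make one pass over the file to build the header set, then a second pass to do the conversion.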
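
The two-pass alternative might look roughly like the sketch below. `collect_fields`, `write_tagged`, and `fake_tag` are hypothetical names introduced for this example. `fake_tag` is a stand-in so the sketch runs without usaddress installed; it is not how the library tags addresses.

```python
import csv
import io


def collect_fields(in_file, tag):
    """First pass: gather every label seen, in first-seen order."""
    fields = []
    for line in in_file:
        if not line.strip():
            continue  # skip blank lines
        for label in tag(line):
            if label not in fields:
                fields.append(label)
    return fields


def write_tagged(in_file, out_file, fields, tag):
    """Second pass: tag each line and write it out immediately."""
    # restval fills in columns that a given row does not have.
    writer = csv.DictWriter(out_file, fieldnames=fields, restval='')
    writer.writeheader()
    for line in in_file:
        if line.strip():
            writer.writerow(tag(line))


def fake_tag(line):
    """Stand-in tagger for demonstration only; NOT usaddress."""
    street, _, unit = line.strip().partition(', UNIT ')
    row = {'Street': street}
    if unit:
        row['OccupancyIdentifier'] = unit
    return row


# In-memory files keep the example self-contained.
source = io.StringIO('2244 NE 29TH DR\n4239 SW HWY 101, UNIT 19\n')
fields = collect_fields(source, fake_tag)
source.seek(0)  # rewind before the second pass
out = io.StringIO()
write_tagged(source, out, fields, fake_tag)
print(out.getvalue())
```

With the real library you would pass something like `lambda line: usaddress.tag(line)[0]` as `tag`, matching the call in the code above, and rewind or reopen the input file between the two passes. Only one row is held in memory at a time, at the cost of tagging every line twice.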