Question

I have pored over this post, but the answers don't seem to fit my needs. That said, I'm very new to Python, so that may be the problem.

Here are some rows from output.csv:
Case Parties Address
25 THOMAS ST., PORTAGE, IN
67 CHESTNUT ST., MILLBROOK, NJ
1 EMPIRE DR., AUSTIN, TX, 11225
111 WASHINGTON AVE. #404, VALPARAISO, AK
89 E. JERICHO TPKE., Scarssdale, AZ

Original post code

import usaddress
import csv

with open('output.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        addr=row['Case Parties Address']
        data = usaddress.tag(addr)
        print(data)

(OrderedDict([('AddressNumber', u'4167'), ('StreetNamePreType', u'Highway'), ('StreetName', u'319'), ('StreetNamePostDirectional', u'E'), ('PlaceName', u'Conway'), ('StateName', u'SC'), ('ZipCode', u'29526-5446')]), 'Street Address')

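Much like the earlier post, I need to output the parsed data to a csv. As far as I can tell, I need to perform the following steps:

Provide the headers as a reference list. (They're listed here in 'Details'.)
Using usaddress.tag(), parse source_csv into "data" while keeping the corresponding "keys".
Map key:data onto header_reference.
Export to output_csv with a single header row.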
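The mapping step in the list above can be sketched as follows. This is a minimal illustration, not working code from the post: `header_reference` here is a shortened, hypothetical list, `align_to_header` is a made-up helper name, and the `tagged` dict is a hand-written stand-in for the first element that `usaddress.tag()` returns.

```python
from collections import OrderedDict

# Hypothetical, shortened reference list; the real one would hold every tag.
header_reference = ["AddressNumber", "StreetName", "StreetNamePostType",
                    "OccupancyType", "OccupancyIdentifier",
                    "PlaceName", "StateName", "ZipCode"]


def align_to_header(tagged, header):
    """Return one CSV row: the value for each header column, or '' if absent."""
    return [tagged.get(column, "") for column in header]


# Hand-written stand-in for the dict part of a usaddress.tag() result.
tagged = OrderedDict([("AddressNumber", "123"), ("StreetName", "Fake"),
                      ("StreetNamePostType", "St"), ("PlaceName", "Maine"),
                      ("StateName", "PA")])

row = align_to_header(tagged, header_reference)
print(row)
```

Because every row is built against the same header list, a missing tag becomes an empty cell instead of shifting later values into the wrong column.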

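I am using the Python module usaddress to parse a large csv (200k+ rows). The module outputs the parsed data as an OrderedDict. The post mentioned above only works when every record maps to the same set of headers. However, one of the many benefits of usaddress is that it parses the data even when some fields are absent. So, for example, "123 Fake St, Maine, PA" maps perfectly onto address, city, state headers. But "123 Jumping Block, Suite 600, Maine, PA" would put "Suite 600" in the "city" column, because the values are matched statically by position. If I parse the latter on its own, usaddress gives me address, occupancy identifier (e.g. "Suite #"), city, state headers.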

The parser's online version gives me exactly the output format I need, but it can only handle 500 rows at a time.

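It seems my code won't know what each data point is until it has been routed through the module; a chicken-and-egg situation. How do I write rows to a CSV file when each row may have a different subset of the columns?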
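One way to approach the chicken-and-egg problem, sketched here under assumptions, is a two-pass write: collect every key that appears in any row first, then let `csv.DictWriter` fill the gaps. The helper name `write_ragged_rows` and the sample dicts are hypothetical; the dicts only imitate what a tagger might return.

```python
import csv
import io


def write_ragged_rows(rows, out):
    """Write dicts with differing keys to one CSV with a single header row."""
    rows = list(rows)
    # First pass: collect every key, in first-seen order.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    # Second pass: DictWriter fills any column a row lacks with restval.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames


# Hand-written stand-ins for tagged addresses.
rows = [
    {"AddressNumber": "123", "StreetName": "Fake",
     "PlaceName": "Maine", "StateName": "PA"},
    {"AddressNumber": "123", "StreetName": "Jumping Block",
     "OccupancyType": "Suite", "OccupancyIdentifier": "600",
     "PlaceName": "Maine", "StateName": "PA"},
]

buffer = io.StringIO()
columns = write_ragged_rows(rows, buffer)
print(buffer.getvalue())
```

This sketch holds all rows in memory before writing, which is usually acceptable for a few hundred thousand short dicts. The column order follows first appearance, so a fixed header list may be preferable when a stable layout matters.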

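For reference, the error I get when trying the closest solution (provided by isosceleswheel) is ValueError: I/O (...), and the traceback points to lines 107 and 90 of the csv.py library, both of which deal with fieldnames.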

with open('output.csv') as csvfile:
    reader = csv.DictReader(csvfile)

with open('myoutputfile', 'w') as o:  # this will be the new file you write to
    for row in reader:
        addr=row['Case Parties Address']
        data = usaddress.tag(addr)
        header = ','.join(data.keys()) + '\n'  # this will make a string of the header separated by comma with a newline at the end
        data_string = ','.join(data.values()) + '\n' # this will make a string of the values separated by comma with a newline at the end
        o.write(header + data_string)  # this will write the header and then the data on a new line with each field separated by commas
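A possible reading of that traceback, offered as an assumption because the message is truncated in the post: in the snippet above the first `with` block has already closed `csvfile` by the time the loop iterates `reader`, and reading from a closed file raises a ValueError. A minimal sketch that keeps both files open while the reader is consumed follows; the temporary paths and the one-row sample file are made up for the demonstration.

```python
import csv
import os
import tempfile

workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "output.csv")
target_path = os.path.join(workdir, "myoutputfile.csv")

# Create a tiny stand-in for output.csv.
with open(source_path, "w", newline="") as f:
    f.write('Case Parties Address\r\n"25 THOMAS ST., PORTAGE, IN"\r\n')

# Both files stay open for as long as the reader is being iterated.
with open(source_path, newline="") as csvfile, \
        open(target_path, "w", newline="") as o:
    reader = csv.DictReader(csvfile)
    writer = csv.writer(o)
    for row in reader:
        writer.writerow([row["Case Parties Address"]])

# Read the result back to confirm the row was copied.
with open(target_path, newline="") as f:
    copied = list(csv.reader(f))
print(copied)
```

Using `csv.writer` instead of joining values with commas also quotes any value that itself contains a comma, which addresses often do.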

Answer 1

See this GitHub issue for a solution.

Because we know all the possible tags in usaddress, we can use them to define the fields in the output.
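A rough sketch of that idea, with the tagger and the label list passed in as parameters so the example runs without usaddress installed. `write_tagged_csv`, `DEMO_LABELS`, and `demo_tagger` are hypothetical names, and the canned results only imitate the `(dict, address_type)` pair that `usaddress.tag()` returns.

```python
import csv
import io


def write_tagged_csv(addresses, tagger, labels, out):
    """Tag each address and write it under a fixed set of label columns."""
    writer = csv.DictWriter(out, fieldnames=list(labels), restval="")
    writer.writeheader()
    for address in addresses:
        tagged, _address_type = tagger(address)
        writer.writerow(tagged)  # labels missing from this row become ''


# Shortened, hypothetical label list for the demonstration.
DEMO_LABELS = ["AddressNumber", "StreetName", "StreetNamePostType",
               "OccupancyType", "OccupancyIdentifier",
               "PlaceName", "StateName"]


def demo_tagger(address):
    """Hypothetical tagger returning canned (dict, type) results."""
    canned = {
        "123 Fake St, Maine, PA": {
            "AddressNumber": "123", "StreetName": "Fake",
            "StreetNamePostType": "St",
            "PlaceName": "Maine", "StateName": "PA"},
        "123 Jumping Block, Suite 600, Maine, PA": {
            "AddressNumber": "123", "StreetName": "Jumping Block",
            "OccupancyType": "Suite", "OccupancyIdentifier": "600",
            "PlaceName": "Maine", "StateName": "PA"},
    }
    return canned[address], "Street Address"


out = io.StringIO()
write_tagged_csv(["123 Fake St, Maine, PA",
                  "123 Jumping Block, Suite 600, Maine, PA"],
                 demo_tagger, DEMO_LABELS, out)
print(out.getvalue())
```

With the library installed, the call would presumably look like `write_tagged_csv(addresses, usaddress.tag, usaddress.LABELS, f)`, assuming `usaddress.LABELS` lists every tag the parser can emit. `DictWriter` raises a ValueError for a key outside `fieldnames` by default, so an incomplete label list fails loudly rather than dropping data.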

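I can't comment on the other answer b/c I don't have enough reputation, but I'd advise against using the usaddress parse method for this task. The tag method parses an address, joins consecutive address tokens when they share the same label, and raises an error if there are non-consecutive tokens with the same label. It would be good to capture those tagging errors in the output.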
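Capturing tagging errors in the output could look roughly like this. It is a sketch under assumptions: `tag_or_flag`, the `OriginalAddress` and `TagError` column names, and the demo error class are all hypothetical. With the real library, the exception to pass in would be `usaddress.RepeatedLabelError`.

```python
def tag_or_flag(address, tagger, error_type=Exception):
    """Return a row dict; on a tagging failure, record the error instead."""
    row = {"OriginalAddress": address, "TagError": ""}
    try:
        tagged, _address_type = tagger(address)
    except error_type as exc:
        # Keep the row so the failure is visible in the output file.
        row["TagError"] = type(exc).__name__
    else:
        row.update(tagged)
    return row


class DemoRepeatedLabelError(Exception):
    """Hypothetical stand-in for usaddress.RepeatedLabelError."""


def demo_tagger(address):
    """Hypothetical tagger that fails on an arbitrary trigger character."""
    if "&" in address:
        raise DemoRepeatedLabelError(address)
    return {"PlaceName": "Chicago"}, "Ambiguous"


good = tag_or_flag("Chicago", demo_tagger, DemoRepeatedLabelError)
bad = tag_or_flag("State & Lake & State, Chicago", demo_tagger,
                  DemoRepeatedLabelError)
print(good)
print(bad)
```

Rows produced this way can be written with `csv.DictWriter` as long as the two extra column names are included in its `fieldnames`.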

Answer 2

You want to parse each address separately and store the results in a list. Then you can use a Pandas DataFrame to align the output. Something like this:

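import pandas as pd
import usaddress

data = ['Robie House, 5757 South Woodlawn Avenue, Chicago, IL 60637',
        'State & Lake, Chicago']

# tag() returns (OrderedDict, address_type); keep only the dict so that
# pandas can turn each label into a column.
tagged_addresses = [usaddress.tag(line)[0] for line in data]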

address_df = pd.DataFrame(tagged_addresses)

print(address_df)

  AddressNumber BuildingName IntersectionSeparator PlaceName SecondStreetName StateName StreetName StreetNamePostType StreetNamePreDirectional ZipCode
0          5757  Robie House                   NaN   Chicago              NaN        IL   Woodlawn             Avenue                    South   60637
1           NaN          NaN                     &   Chicago             Lake       NaN      State                NaN                      NaN     NaN

Changing OrderedDict output to CSV
