I have some sample data that looks like this:
MADISON COUNTY,,,,,,,,,,,,, "London, City of",,,,,,,,,,,,597,519
2.1,mill /s,(replacement),for,5 years,",",commencing in,2007,",",first due in calendar year,2008,",",, for current operating expenses
-,,,,,,,,,,,,, London Public Library District,,,,,,,,,,,,716,869 1.2,mill /s,(replacement),"& increase of 1.7 mills, for 15 years, commencing in 2007, first due in",,,,,,,,,, "calendar year 2008, for
current expenses -",,,,,,,,,,,,, "Range, Township of",,,,,,,,,,,,62,13
1.7,mill /s,(renewal),for,5 years,",",commencing in,2007,",",first due in calendar year,2008,",",, for fire protection -,,,,,,,,,,,,,
What I ultimately need is a list of all the "towns", so the output should be:
["London, City of", "London Public Library District", "Range, Township of"]
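I'm struggling a bit here because I don't really know how to narrow the data down to those fields. As you can see, the runs of commas are a very good starting point, but there are also unwanted strings whose commas don't follow that pattern. My first thought was to match a string of fewer than 100 characters with 5 commas on either side, but the arbitrary commas in places like this are frustrating: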
first due in",,,,,,,,,, "cale
Any clues?
Also, the data usually takes the following format:
SOME COUNTY,,,,,,,,,,,,, SOME TOWN,,,,,,,,,,,,some long string possibly with commas
,,,,,,,,,,,,, SOME TOWN,,,,,,,,,,,,some long string possibly with commas ... etc
Answer 0 (score: 0)
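It's hard for me to tell from your sample data, since I think it contains extra line breaks, but judging from your summary of the data format, the town appears to be the 14th column of each row.
Since the data is in CSV format, you don't need a regular expression; you can use the csv module to parse it instead. Extracting the town names should be as simple as: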
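import csv

# skipinitialspace=True lets the reader honour quotes that follow ", "
with open('data.csv', newline='') as f:
    for row in csv.reader(f, skipinitialspace=True):
        if len(row) > 13 and row[13]:  # skip short rows and empty cells
            print(row[13])  # the town is the 14th column (index 13)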
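
To check the column-based approach against rows shaped like the ones in the question, here is a minimal, self-contained sketch. Several things in it are assumptions rather than facts about the real file: `SAMPLE` is rebuilt from the format summary (13 commas before the town), `extract_towns` is a hypothetical helper name, and the extra rule "columns 2 through 13 must all be empty" is a guess at how to skip the levy description rows, whose 14th column would otherwise be picked up as well. `skipinitialspace=True` is used because the sample has a space between the comma and the opening quote, which the default reader settings would not treat as the start of a quoted field.

```python
import csv
import io

# Hypothetical sample, rebuilt from the format summary in the question:
# column 0, then 12 empty columns, then the town in column 13.
SAMPLE = "\n".join([
    "MADISON COUNTY" + "," * 13 + ' "London, City of"' + "," * 12 + "597,519",
    '2.1,mill /s,(replacement),for,5 years,",",commencing in,2007,",",'
    'first due in calendar year,2008,",",, for current operating expenses',
    "-" + "," * 13 + " London Public Library District" + "," * 12 + "716,869",
    "," * 13 + ' "Range, Township of"' + "," * 12 + "62,13",
])


def extract_towns(lines):
    """Return column 13 of every row whose columns 1..12 are all empty."""
    towns = []
    # skipinitialspace=True drops the space after each comma, so
    # ', "London, City of"' is parsed as one quoted field.
    for row in csv.reader(lines, skipinitialspace=True):
        # Assumed heuristic: town rows have nothing between column 0 and
        # column 13, while levy description rows do.
        if len(row) > 13 and row[13] and not any(row[1:13]):
            towns.append(row[13])
    return towns


towns = extract_towns(io.StringIO(SAMPLE))
print(towns)
```

If the real file has line breaks in the middle of records, as the sample in the question seems to, the rows would need to be rejoined before a filter like this can be trusted.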