I am practising text mining with Python on this dataset: https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat
Everything is well formatted, but some entries look like this:
6898,"RAAF Williams, Laverton Base","Laverton","Australia",\N,"YLVT",-37.86360168457031,144.74600219726562,18,10,"O","Australia/Hobart","airport","OurAirports"
6899,"Nowra Airport","Nowra","Australia","NOA","YSNW",-34.94889831542969,150.53700256347656,400,10,"O","Australia/Sydney","airport","OurAirports"
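The first entry has a comma inside its name, which produces an irregular list: the single name field is split into several elements.
My code that splits each line into a list: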
    with open(filename) as txt:
        for line in txt:
            linea = line.split(',')
            linea[3] = linea[3].strip('"')
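My main problem is linea[3], which in this case should be the country, Australia, but it returns Laverton.
I also tried the csv library, with almost no difference.
Also relevant: for that entry, my code returns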
['6898', 'RAAF Williams, Laverton Base', 'Laverton', 'Australia', '\\N', 'YLVT', '-37.86360168457031', '144.74600219726562', '18', '10', 'O', 'Australia/Hobart', 'airport', 'OurAirports']
Answer 0 (score: 2)
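Python has supported CSV parsing for a long time; see the documentation for the csv module in the standard library.
You need to set quotechar in the parser. Essentially, any comma that falls between two occurrences of the quotechar is treated as part of the field rather than as a delimiter.
For example: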
    import csv

    # newline='' lets the csv module handle line endings itself (Python 3)
    with open(filename, newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=',', quotechar='"')
        for row in csvreader:
            # do something with the row
            print(row)
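To make the difference concrete, here is a minimal sketch that parses the problematic record from the question in both ways. It assumes Python 3 and reads the record from an in-memory string instead of a file; the names SAMPLE, naive, and parsed are only illustrative.

```python
import csv
import io

# One record copied from the question; the raw string keeps the literal \N.
SAMPLE = r'6898,"RAAF Williams, Laverton Base","Laverton","Australia",\N,"YLVT",-37.86360168457031,144.74600219726562,18,10,"O","Australia/Hobart","airport","OurAirports"'

# Naive split: the comma inside the quoted name creates an extra field,
# so every later field shifts one position to the right.
naive = SAMPLE.split(',')

# csv.reader keeps commas that sit between quote characters inside the field.
parsed = next(csv.reader(io.StringIO(SAMPLE), delimiter=',', quotechar='"'))

print(len(naive), naive[3])    # 15 "Laverton"
print(len(parsed), parsed[3])  # 14 Australia
```

With csv.reader the country ends up at index 3 as expected, and the surrounding quotes are already removed, so the strip('"') call is no longer needed.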
Answer 1 (score: 0)
If you are able to switch to another package, you can read the file with pandas:
    import pandas as pd

    # header=None: the file has no header row, so columns are numbered 0..13
    df = pd.read_csv(filename, sep=',', header=None)
    print(df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 6898 RAAF Williams, Laverton Base Laverton Australia \N YLVT -37.863602 144.746002 18 10 O Australia/Hobart airport OurAirports
1 6899 Nowra Airport Nowra Australia NOA YSNW -34.948898 150.537003 400 10 O Australia/Sydney airport OurAirports
    # This gives a structure similar to what the csv package returns
    # (one row of fields per record), here as a NumPy array.
    # as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
    df.to_numpy()
[[6898 'RAAF Williams, Laverton Base' 'Laverton' 'Australia' '\\N' 'YLVT'
-37.86360168457031 144.74600219726562 18 10 'O' 'Australia/Hobart'
'airport' 'OurAirports']
[6899 'Nowra Airport' 'Nowra' 'Australia' 'NOA' 'YSNW' -34.948898315429695
150.53700256347656 400 10 'O' 'Australia/Sydney' 'airport' 'OurAirports']]
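As a possible refinement, the sketch below names the columns and treats the \N marker as a missing value. It assumes pandas is installed and reads the two sample records from an in-memory string; the column names in COLUMNS are hypothetical labels chosen for this illustration, not something defined in the file itself.

```python
import io

import pandas as pd

# Two records copied from the question; the raw string keeps the literal \N.
SAMPLE = r'''6898,"RAAF Williams, Laverton Base","Laverton","Australia",\N,"YLVT",-37.86360168457031,144.74600219726562,18,10,"O","Australia/Hobart","airport","OurAirports"
6899,"Nowra Airport","Nowra","Australia","NOA","YSNW",-34.94889831542969,150.53700256347656,400,10,"O","Australia/Sydney","airport","OurAirports"
'''

# Hypothetical column names, used only to make the example readable.
COLUMNS = ["id", "name", "city", "country", "iata", "icao", "lat", "lon",
           "altitude", "utc_offset", "dst", "tz", "type", "source"]

# header=None because the data has no header row;
# na_values turns the literal \N marker into NaN.
df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS,
                 na_values=["\\N"])

# Plain Python list of lists, comparable to what csv.reader yields.
rows = df.values.tolist()

print(df.loc[0, "country"])  # Australia
print(df["iata"].isna().tolist())
```

Accessing fields by name (df["country"]) avoids depending on positional indexes such as linea[3].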