Question

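I have a database with the latitude/longitude of every major airport in the world. I only need the subset of them (specifically those in the USA) that is listed in a separate .csv file.

This csv file has two columns that I extracted data from: the origin airport code (IATA code) and the destination airport code (also IATA).

My database has a column for IATA, and essentially I want to query that database to get the lat/long coordinates of every airport in the two lists I have.

Here's my code: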

import os

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///airport_coordinates.db')

# The dataframe that contains the IATA codes for the airports I need
airport_relpath = "data/processed/%s_%s_combined.csv" % (file, airline)
script_dir = os.path.dirname(os.getcwd())
temp_file = os.path.join(script_dir, airport_relpath)
fields = ["Origin_Airport_Code", "Destination_Airport_Code"]
df_airports = pd.read_csv(temp_file, usecols=fields)

# the origin/destination IATA codes for the airports I need
origin = df_airports.Origin_Airport_Code.values
dest = df_airports.Destination_Airport_Code.values

# query the database for the lat/long coords of the airports I need
sql = ('SELECT lat, long FROM airportCoords WHERE iata IN %s' %(origin))
indexcols = ['lat', 'long']

df_origin = pd.read_sql(sql, engine)
# testing the origin coordinates
print(df_origin)

This is the error I get:

sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such 
table: 'JFK' 'JFK' 'JFK' ... 'MIA' 'JFK' 'MIA' [SQL: "SELECT lat, long 
FROM airportCoords WHERE iata IN ['JFK' 'JFK' 'JFK' ... 'MIA' 'JFK' 
'MIA']"] (Background on this error at: http://sqlalche.me/e/e3q8)

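This is definitely because I'm not querying it correctly (since it thinks my query is supposed to name a table).

I tried looping over the list to query each element individually, but the list contains more than 604,885 elements and my computer was unable to come up with any output.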

Answer 1

Your mistake is in using string interpolation:

sql = ('SELECT lat, long FROM airportCoords WHERE iata IN %s' %(origin))

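Because origin is a NumPy array, this produces the [....] SQL identifier syntax in the query; see the SQLite documentation:

If you want to use a keyword as a name, you need to quote it. There are four ways of quoting keywords in SQLite:

[...]
   [keyword] A keyword enclosed in square brackets is an identifier. [...]

You asked SQLite to check whether iata is in a table named ['JFK' 'JFK' 'JFK' ... 'MIA' 'JFK' 'MIA'], because that is the string representation of the NumPy array.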
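The failure can be reproduced in a minimal sketch that uses only the standard-library sqlite3 module and an in-memory database. The string `['JFK' 'MIA']` is a hand-written stand-in for what `%s` produces from a NumPy array, and the helper name `interpolated_query_error` is hypothetical:

```python
import sqlite3


def interpolated_query_error(codes_repr):
    """Run the interpolated query and return SQLite's error message, if any."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE airportCoords (iata TEXT, lat REAL, long REAL)")
        # Same interpolation as in the question: the brackets end up in the SQL text
        sql = "SELECT lat, long FROM airportCoords WHERE iata IN %s" % codes_repr
        conn.execute(sql)
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        conn.close()
    return None


# Hypothetical stand-in for str() of a NumPy array of IATA codes
message = interpolated_query_error("['JFK' 'MIA']")
print(message)
```

SQLite reads the bracketed text as one identifier and looks for a table with that name, which is why the error complains about a missing table rather than about bad syntax.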

You are already using SQLAlchemy; it would be easier to have that library generate all of the SQL for you, including the IN (....) membership test:

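# explicit imports instead of a wildcard; select() with positional
# columns is the SQLAlchemy 1.4+ calling style
from sqlalchemy import Column, Float, MetaData, String, Table, literal_column, select, table

filter = literal_column('iata', String).in_(origin)
sql = select(
    literal_column('lat', Float),
    literal_column('long', Float),
).select_from(table('airportCoords')).where(filter)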

Then pass sql to pd.read_sql() as the query.
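To make the idea of a generated IN (....) test concrete, here is a rough sketch of the equivalent hand-written query with bound parameters, using the standard-library sqlite3 module rather than SQLAlchemy. The helper name `coords_for_codes`, the deduplication step and the sample coordinates are assumptions for illustration only:

```python
import sqlite3


def coords_for_codes(conn, codes):
    """Return (iata, lat, long) rows for the given IATA codes via bound parameters."""
    # Assumption: duplicate codes add nothing to a membership test, so drop them
    unique_codes = sorted(set(codes))
    if not unique_codes:
        return []
    # One "?" placeholder per value; the values never become part of the SQL text
    placeholders = ", ".join("?" for _ in unique_codes)
    sql = (
        "SELECT iata, lat, long FROM airportCoords "
        "WHERE iata IN (%s) ORDER BY iata" % placeholders
    )
    return conn.execute(sql, unique_codes).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airportCoords (iata TEXT, lat REAL, long REAL)")
# Approximate placeholder coordinates, not authoritative data
conn.executemany(
    "INSERT INTO airportCoords VALUES (?, ?, ?)",
    [("JFK", 40.64, -73.78), ("MIA", 25.79, -80.29), ("LAX", 33.94, -118.41)],
)
rows = coords_for_codes(conn, ["JFK", "JFK", "MIA", "JFK"])
print(rows)
```

SQLite caps the number of bound parameters per statement, so this form is only a reasonable fit when the number of distinct codes is modest.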

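I used the literal_column() and table() objects here as a shortcut to name the objects directly, but you could also ask SQLAlchemy to reflect your database table from the engine object you already created, then use the resulting table definition to generate the query:

metadata = MetaData()
# autoload_with alone triggers reflection of the existing table
airport_coords = Table('airportCoords', metadata, autoload_with=engine)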

At that point the query would be defined as:

filter = airport_coords.c.iata.in_(origin)
sql = select(airport_coords.c.lat, airport_coords.c.long).where(filter)

I'd also include the iata code in the output, otherwise you have no way to connect the IATA codes back to the matching coordinates:

sql = select(airport_coords.c.lat, airport_coords.c.long, airport_coords.c.iata).where(filter)
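With the iata code in each result row, the coordinates can be mapped back onto the original list of codes. A small sketch in plain Python, assuming rows arrive as (lat, long, iata) tuples in the column order selected above; the helper name `attach_coords` and the sample values are hypothetical:

```python
def attach_coords(codes, rows):
    """Map each code to its (lat, long) pair; unknown codes map to None."""
    lookup = {iata: (lat, lng) for lat, lng, iata in rows}
    return [lookup.get(code) for code in codes]


# Approximate placeholder coordinates, not authoritative data
rows = [(40.64, -73.78, "JFK"), (25.79, -80.29, "MIA")]
coords = attach_coords(["JFK", "MIA", "JFK", "XXX"], rows)
print(coords)
```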

Next, since you say the list has 604,885 elements, you probably want to load that CSV data into a temporary table to keep the query efficient:

engine = create_engine('sqlite:///airport_coordinates.db')

# code to read CSV file
# ...
df_airports = pd.read_csv(temp_file, usecols=fields)

# SQLAlchemy table wrangling
metadata = MetaData()
airport_coords = Table('airportCoords', metadata, autoload_with=engine)
temp = Table(
    "airports_temp",
    metadata,
    *(Column(field, String) for field in fields),
    prefixes=['TEMPORARY']
)

# Join the airport coords against the temporary table
joined = airport_coords.join(temp, airport_coords.c.iata == temp.c.Origin_Airport_Code)

# select coordinates per airport, include the iata code
sql = select(airport_coords.c.lat, airport_coords.c.long, airport_coords.c.iata).select_from(joined)

# A SQLite temporary table is only visible on the connection that created it,
# so create, fill and query it on the same connection
with engine.begin() as conn:
    # insert CSV values into a temporary table in SQLite
    temp.create(conn, checkfirst=True)
    # index=False: the temporary table has no column for the DataFrame index
    df_airports.to_sql(temp.name, conn, if_exists='append', index=False)
    df_origin = pd.read_sql(sql, conn)
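The same temporary-table idea can be sketched with nothing but the standard-library sqlite3 module, which may make the generated SQL easier to picture. The sample rows are approximate placeholders, and ordering by the temporary table's rowid is an assumption made here so that the output follows the CSV row order:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airportCoords (iata TEXT, lat REAL, long REAL)")
# Approximate placeholder coordinates, not authoritative data
conn.executemany(
    "INSERT INTO airportCoords VALUES (?, ?, ?)",
    [("JFK", 40.64, -73.78), ("MIA", 25.79, -80.29), ("LAX", 33.94, -118.41)],
)

# Stand-in for the Origin_Airport_Code column of the CSV file
origins = ["JFK", "JFK", "MIA"]

# The temporary table exists only on this connection
conn.execute("CREATE TEMPORARY TABLE airports_temp (Origin_Airport_Code TEXT)")
conn.executemany(
    "INSERT INTO airports_temp VALUES (?)", [(code,) for code in origins]
)

# One output row per CSV row, in insertion order
joined_rows = conn.execute(
    "SELECT c.lat, c.long, c.iata "
    "FROM airports_temp AS t "
    "JOIN airportCoords AS c ON c.iata = t.Origin_Airport_Code "
    "ORDER BY t.rowid"
).fetchall()
print(joined_rows)
```

Because the join runs inside the database, no long IN list has to be built or bound, and every CSV row gets its own coordinate row.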
