Ad hoc "join" in Python

Time: 2019-01-21 22:02:11

Tags: python relational-database

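I need to merge some of my data with data produced by a colleague. It reminds me of a JOIN in an SQL database, but we are not using a database: it is just a few dozen entries in an Excel or .csv file, each with a handful of columns.

Is there a Python library that lets me treat these data structures as an ad hoc in-memory database and merge them with an OUTER JOIN?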

A sample of my data:

Atomic symbol   Atomic number
H               1
He              2
Be              4
Si              14
Fe              26
U               92
Pu              94

His data:

Atomic symbol   Name       Hazard
H               Hydrogen   ignition, combustion
Be              Beryllium  dust is toxic
As              Arsenic    toxic
Pu              Plutonium  dust is toxic

2 answers:

Answer 0 (score: 2)

If you have pandas, DataFrame.merge is the most convenient way:

import pandas as pd
from io import StringIO

my_data = '''\
Atomic symbol   Atomic number
H               1
He              2
Be              4
Si              14
Fe              26
U               92
Pu              94'''

his_data = '''\
Atomic symbol   Name       Hazard
H               Hydrogen   ignition, combustion
Be              Beryllium  dust is toxic
As              Arsenic    toxic
Pu              Plutonium  dust is toxic'''

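# Columns are separated by two or more spaces. A regex separator needs a raw
# string and the Python parsing engine.
my_df = pd.read_csv(StringIO(my_data), sep=r'\s{2,}', engine='python')
his_df = pd.read_csv(StringIO(his_data), sep=r'\s{2,}', engine='python')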
joined_df = pd.merge(my_df, his_df, on=['Atomic symbol'], how='outer')
print(joined_df)

yields

  Atomic symbol  Atomic number       Name                Hazard
0             H            1.0   Hydrogen  ignition, combustion
1            He            2.0        NaN                   NaN
2            Be            4.0  Beryllium         dust is toxic
3            Si           14.0        NaN                   NaN
4            Fe           26.0        NaN                   NaN
5             U           92.0        NaN                   NaN
6            Pu           94.0  Plutonium         dust is toxic
7            As            NaN    Arsenic                 toxic
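To see which side each row of the outer join came from, merge accepts indicator=True. The sketch below rebuilds the same sample data as DataFrames directly (an assumption made to keep it self-contained) and also casts the atomic numbers to the nullable 'Int64' dtype, so they are not shown as floats. Row order may differ between pandas versions.

```python
import pandas as pd

# Same sample data as above, built directly as DataFrames.
my_df = pd.DataFrame({
    'Atomic symbol': ['H', 'He', 'Be', 'Si', 'Fe', 'U', 'Pu'],
    'Atomic number': [1, 2, 4, 14, 26, 92, 94],
})
his_df = pd.DataFrame({
    'Atomic symbol': ['H', 'Be', 'As', 'Pu'],
    'Name': ['Hydrogen', 'Beryllium', 'Arsenic', 'Plutonium'],
    'Hazard': ['ignition, combustion', 'dust is toxic', 'toxic', 'dust is toxic'],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
joined_df = my_df.merge(his_df, on='Atomic symbol', how='outer', indicator=True)

# The outer join introduces NaN, which turns the integer column into floats.
# The nullable 'Int64' dtype keeps whole numbers and shows gaps as <NA>.
joined_df['Atomic number'] = joined_df['Atomic number'].astype('Int64')

print(joined_df)
```

Filtering on joined_df['_merge'] == 'right_only' then lists the entries that exist only in the colleague's file.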

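Alternatively, you can use sqlite3, which is part of the Python standard library. SQLite versions before 3.39.0 do not support FULL OUTER JOIN, however, so with those you have to construct the outer join yourself using LEFT JOIN and UNION: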

import sqlite3
import csv
from io import StringIO

my_data = '''\
"Atomic symbol","Atomic number"
"H","1"
"He","2"
"Be","4"
"Si","14"
"Fe","26"
"U","92"
"Pu","94"'''

his_data = '''\
"Atomic symbol","Name","Hazard"
"H","Hydrogen","ignition, combustion"
"Be","Beryllium","dust is toxic"
"As","Arsenic","toxic"
"Pu","Plutonium","dust is toxic"'''


with sqlite3.connect(':memory:') as conn:
    cursor = conn.cursor()
    sql = '''CREATE TABLE my_data
             (my_data_id INTEGER PRIMARY KEY AUTOINCREMENT,
              Atomic_symbol TEXT,
              Atomic_number INTEGER)'''
    cursor.execute(sql)

    my_data = csv.reader(StringIO(my_data), delimiter=',', quotechar='"')
    next(my_data)
    sql = '''INSERT INTO my_data (Atomic_symbol, Atomic_number) VALUES (?, ?)'''
    cursor.executemany(sql, my_data)

    sql = '''CREATE TABLE his_data
             (his_data_id INTEGER PRIMARY KEY AUTOINCREMENT,
              Atomic_symbol TEXT,
              Name TEXT,
              Hazard TEXT)'''
    cursor.execute(sql)
    his_data = csv.reader(StringIO(his_data), delimiter=',', quotechar='"')
    next(his_data)
    sql = '''INSERT INTO his_data (Atomic_symbol, Name, Hazard) VALUES (?, ?, ?)'''    
    cursor.executemany(sql, his_data)

    sql = '''SELECT m.Atomic_symbol, m.Atomic_number, h.Name, h.Hazard 
             FROM my_data m
             LEFT JOIN his_data h
             USING (Atomic_symbol)
             UNION ALL
             SELECT h.Atomic_symbol, m.Atomic_number, h.Name, h.Hazard 
             FROM his_data h
             LEFT JOIN my_data m
             USING (Atomic_symbol)
             WHERE m.Atomic_symbol is NULL'''
    cursor.execute(sql)
    result = cursor.fetchall()
    print('\n'.join([' '.join(map('{:10}'.format, map(str, row))) for row in result]))

yields

H          1          Hydrogen   ignition, combustion
He         2          None       None      
Be         4          Beryllium  dust is toxic
Si         14         None       None      
Fe         26         None       None      
U          92         None       None      
Pu         94         Plutonium  dust is toxic
As         None       Arsenic    toxic     
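SQLite 3.39.0 and later do accept FULL OUTER JOIN directly. A minimal sketch, assuming a shortened version of the sample tables, that checks the SQLite library version bundled with Python and falls back to the LEFT JOIN plus UNION ALL construction on older versions:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE my_data (Atomic_symbol TEXT, Atomic_number INTEGER)')
conn.execute('CREATE TABLE his_data (Atomic_symbol TEXT, Name TEXT, Hazard TEXT)')
conn.executemany('INSERT INTO my_data VALUES (?, ?)',
                 [('H', 1), ('He', 2), ('Be', 4), ('Pu', 94)])
conn.executemany('INSERT INTO his_data VALUES (?, ?, ?)',
                 [('H', 'Hydrogen', 'ignition, combustion'),
                  ('As', 'Arsenic', 'toxic')])

if sqlite3.sqlite_version_info >= (3, 39, 0):
    # Native full outer join; COALESCE picks the symbol from whichever side has it.
    sql = '''SELECT COALESCE(m.Atomic_symbol, h.Atomic_symbol),
                    m.Atomic_number, h.Name, h.Hazard
             FROM my_data m
             FULL OUTER JOIN his_data h
             ON m.Atomic_symbol = h.Atomic_symbol'''
else:
    # Emulation: all rows of my_data, plus the his_data rows without a match.
    sql = '''SELECT m.Atomic_symbol, m.Atomic_number, h.Name, h.Hazard
             FROM my_data m
             LEFT JOIN his_data h ON m.Atomic_symbol = h.Atomic_symbol
             UNION ALL
             SELECT h.Atomic_symbol, NULL, h.Name, h.Hazard
             FROM his_data h
             LEFT JOIN my_data m ON m.Atomic_symbol = h.Atomic_symbol
             WHERE m.Atomic_symbol IS NULL'''

rows = conn.execute(sql).fetchall()
for row in rows:
    print(row)
```

Both branches should return the same set of rows, though not necessarily in the same order.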

Answer 1 (score: 1)

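You can load the data into an in-memory SQL database, or you can use pandas.

Suppose the two data sets defined above are stored as two CSV files: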

/tmp/x.csv

"Atomic symbol","Atomic number"
"H",1
"He",2
"Be",4
"Si",14
"Fe",26
"U",92
"Pu",94

/tmp/y.csv

"Atomic symbol","Name","Hazard"
"H","Hydrogen","ignition, combustion"
"Be","Beryllium","dust is toxic"
"As","Arsenic","toxic"
"Pu","Plutonium","dust is toxic"

pandas:

import pandas as pd
pd.set_option('display.max_columns', 100)

x = pd.read_csv('/tmp/x.csv')
y = pd.read_csv('/tmp/y.csv')
result = pd.merge(x, y, on=['Atomic symbol'], how='outer')

print(x)
print(y)
print(result)

  Atomic symbol  Atomic number
0             H              1
1            He              2
...

  Atomic symbol       Name                Hazard
0             H   Hydrogen  ignition, combustion
1            Be  Beryllium         dust is toxic
2            As    Arsenic                 toxic
...

  Atomic symbol  Atomic number       Name                Hazard
0             H            1.0   Hydrogen  ignition, combustion
1            He            2.0        NaN                   NaN
2            Be            4.0  Beryllium         dust is toxic
...

In-memory SQL:

import csv, sqlite3

connection = sqlite3.connect(":memory:")

def load_into_table(con, table_name, file_name):
    # newline='' is what the csv module recommends when opening files.
    with open(file_name, newline='') as f:
        dr = csv.DictReader(f)

        # Backtick-quote the column names, since they contain spaces.
        fields = ', '.join(['`{}`'.format(name) for name in dr.fieldnames])
        values = ', '.join(['?' for _ in dr.fieldnames])

        query = "CREATE TABLE {table_name} ({fields});".format(table_name=table_name, fields=fields)

        con.execute(query)

        # One list of values per CSV row, in header order.
        to_db = [list(i.values()) for i in dr]

        insert_query = "INSERT INTO {table_name} VALUES ({values});".format(table_name=table_name, values=values)

        con.executemany(insert_query, to_db)
        con.commit()

load_into_table(con=connection, table_name='x', file_name='/tmp/x.csv')
load_into_table(con=connection, table_name='y', file_name='/tmp/y.csv')

print(connection.execute('SELECT * FROM x').fetchall())
print(connection.execute('SELECT * FROM y').fetchall())
print(connection.execute('SELECT * FROM x LEFT JOIN y ON x.`Atomic symbol` = y.`Atomic symbol`; ').fetchall())

[('H', '1'), ('He', '2'), ('Be', '4'), ...]
[('H', 'Hydrogen', 'ignition, combustion'), ('Be', 'Beryllium', 'dust is toxic'), ...]
[('H', '1', 'H', 'Hydrogen', 'ignition, combustion'), ('He', '2', None, None, None), ...]

Note: SQLite supports LEFT JOIN, but versions before 3.39.0 do not support FULL OUTER JOIN. You can emulate it: http://www.sqlitetutorial.net/sqlite-full-outer-join/
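One detail in the output above: the atomic numbers come back as strings ('1', '2', ...), because the tables are created without column types and the csv module reads every field as text. A small sketch, using hypothetical table names untyped and typed, of two ways to get integers back: declare the column as INTEGER, or CAST it in the query.

```python
import csv
import sqlite3
from io import StringIO

csv_text = '"Atomic symbol","Atomic number"\n"H",1\n"He",2\n"Be",4\n'

conn = sqlite3.connect(':memory:')
# No declared types, as in load_into_table above: inserted text stays text.
conn.execute('CREATE TABLE untyped ("Atomic symbol", "Atomic number")')
# INTEGER affinity: numeric-looking text is converted to an integer on insert.
conn.execute('CREATE TABLE typed ("Atomic symbol" TEXT, "Atomic number" INTEGER)')

rows = list(csv.reader(StringIO(csv_text)))[1:]  # skip the header row
conn.executemany('INSERT INTO untyped VALUES (?, ?)', rows)
conn.executemany('INSERT INTO typed VALUES (?, ?)', rows)

untyped_rows = conn.execute('SELECT * FROM untyped').fetchall()
typed_rows = conn.execute('SELECT * FROM typed').fetchall()
cast_rows = conn.execute(
    'SELECT "Atomic symbol", CAST("Atomic number" AS INTEGER) FROM untyped'
).fetchall()

print(untyped_rows)  # [('H', '1'), ('He', '2'), ('Be', '4')]
print(typed_rows)    # [('H', 1), ('He', 2), ('Be', 4)]
print(cast_rows)     # [('H', 1), ('He', 2), ('Be', 4)]
```

Declaring types is the tidier option when the columns are known in advance; CAST is handy when the table is built generically from the CSV header, as in the function above.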