我需要将一些数据与同事生成的数据合并。这让我想起了SQL数据库中的JOIN,但是我们没有使用数据库,只是一个Excel文件或.csv文件中的几十个条目,每个条目都有几列。
是否可以使用Python库在临时内存数据库中查看这些数据结构并使用OUTER JOIN进行合并?
我的数据示例:
Atomic symbol Atomic number
H 1
He 2
Be 4
Si 14
Fe 26
U 92
Pu 94
他的数据:
Atomic symbol Name Hazard
H Hydrogen ignition, combustion
Be Beryllium dust is toxic
As Arsenic toxic
Pu Plutonium dust is toxic
答案 0 :(得分:2)
如果您有熊猫,使用DataFrame.merge
是最方便的方法:
import pandas as pd
from io import StringIO
my_data = '''\
Atomic symbol Atomic number
H 1
He 2
Be 4
Si 14
Fe 26
U 92
Pu 94'''
his_data = '''\
Atomic symbol Name Hazard
H Hydrogen ignition, combustion
Be Beryllium dust is toxic
As Arsenic toxic
Pu Plutonium dust is toxic'''
my_df = pd.read_csv(StringIO(my_data), sep='\s{2,}')
his_df = pd.read_csv(StringIO(his_data), sep='\s{2,}')
joined_df = pd.merge(my_df, his_df, on=['Atomic symbol'], how='outer')
print(joined_df)
收益
Atomic symbol Atomic number Name Hazard
0 H 1.0 Hydrogen ignition, combustion
1 He 2.0 NaN NaN
2 Be 4.0 Beryllium dust is toxic
3 Si 14.0 NaN NaN
4 Fe 26.0 NaN NaN
5 U 92.0 NaN NaN
6 Pu 94.0 Plutonium dust is toxic
7 As NaN Arsenic toxic
或者您可以使用sqlite3,它是Python标准库的一部分。 sqlite
does not currently support FULL OUTER JOINs,但是,您将必须通过使用LEFT JOIN和UNION自己构造OUTER JOIN:
import sqlite3
import csv
from io import StringIO
my_data = '''\
"Atomic symbol","Atomic number"
"H","1"
"He","2"
"Be","4"
"Si","14"
"Fe","26"
"U","92"
"Pu","94"'''
his_data = '''\
"Atomic symbol","Name","Hazard"
"H","Hydrogen","ignition, combustion"
"Be","Beryllium","dust is toxic"
"As","Arsenic","toxic"
"Pu","Plutonium","dust is toxic"'''
with sqlite3.connect(':memory:') as conn:
cursor = conn.cursor()
sql = '''CREATE TABLE my_data
(my_data_id INTEGER PRIMARY KEY AUTOINCREMENT,
Atomic_symbol TEXT,
Atomic_number INTEGER)'''
cursor.execute(sql)
my_data = csv.reader(StringIO(my_data), delimiter=',', quotechar='"')
next(my_data)
sql = '''INSERT INTO my_data (Atomic_symbol, Atomic_number) VALUES (?, ?)'''
cursor.executemany(sql, my_data)
sql = '''CREATE TABLE his_data
(his_data_id INTEGER PRIMARY KEY AUTOINCREMENT,
Atomic_symbol TEXT,
Name TEXT,
Hazard TEXT)'''
cursor.execute(sql)
his_data = csv.reader(StringIO(his_data), delimiter=',', quotechar='"')
next(his_data)
sql = '''INSERT INTO his_data (Atomic_symbol, Name, Hazard) VALUES (?, ?, ?)'''
cursor.executemany(sql, his_data)
sql = '''SELECT m.Atomic_symbol, m.Atomic_number, h.Name, h.Hazard
FROM my_data m
LEFT JOIN his_data h
USING (Atomic_symbol)
UNION ALL
SELECT h.Atomic_symbol, m.Atomic_number, h.Name, h.Hazard
FROM his_data h
LEFT JOIN my_data m
USING (Atomic_symbol)
WHERE m.Atomic_symbol is NULL'''
cursor.execute(sql)
result = cursor.fetchall()
print('\n'.join([' '.join(map('{:10}'.format, map(str, row))) for row in result]))
收益
H 1 Hydrogen ignition, combustion
He 2 None None
Be 4 Beryllium dust is toxic
Si 14 None None
Fe 26 None None
U 92 None None
Pu 94 Plutonium dust is toxic
As None Arsenic toxic
答案 1 :(得分:1)
您可以将其作为内存SQL加载,也可以使用熊猫。
假设我们有两个上面定义的CSV文件:
/tmp/x.csv
"Atomic symbol","Atomic number"
"H",1
"He",2
"Be",4
"Si",14
"Fe",26
"U",92
"Pu",94
和/tmp/y.csv
"Atomic symbol","Name","Hazard"
"H","Hydrogen","ignition, combustion"
"Be","Beryllium","dust is toxic"
"As","Arsenic","toxic"
"Pu","Plutonium","dust is toxic"
import pandas as pd
pd.set_option('display.max_columns', 100)
x = pd.read_csv('/tmp/x.csv')
y = pd.read_csv('/tmp/y.csv')
result = pd.merge(x, y, on=['Atomic symbol'], how='outer')
print(x)
print(y)
print(result)
→
Atomic symbol Atomic number
0 H 1
1 He 2
...
Atomic symbol Name Hazard
0 H Hydrogen ignition, combustion
1 Be Beryllium dust is toxic
2 As Arsenic toxic
...
Atomic symbol Atomic number Name Hazard
0 H 1.0 Hydrogen ignition, combustion
1 He 2.0 NaN NaN
2 Be 4.0 Beryllium dust is toxic
...
import csv, sqlite3
connection = sqlite3.connect(":memory:")
def load_into_table(con, table_name, file_name):
with open(file_name) as f:
dr = csv.DictReader(f)
fields = ', '.join(['`{}`'.format(f) for f in dr.fieldnames])
values = ', '.join(['?' for _ in dr.fieldnames])
query = "CREATE TABLE {table_name} ({fields});".format(table_name=table_name, fields=fields)
con.execute(query)
to_db = [list(i.values()) for i in dr]
insert_query = "INSERT INTO {table_name} VALUES ({values});".format(table_name=table_name, fields=fields, values=values)
con.executemany(insert_query, to_db)
con.commit()
load_into_table(con=connection, table_name='x', file_name='/tmp/x.csv')
load_into_table(con=connection, table_name='y', file_name='/tmp/y.csv')
print(connection.execute('SELECT * FROM x').fetchall())
print(connection.execute('SELECT * FROM y').fetchall())
print(connection.execute('SELECT * FROM x LEFT JOIN y ON x.`Atomic symbol` = y.`Atomic symbol`; ').fetchall())
→
[('H', '1'), ('He', '2'), ('Be', '4'), ...]
[('H', 'Hydrogen', 'ignition, combustion'), ('Be', 'Beryllium', 'dust is toxic'), ...]
[('H', '1', 'H', 'Hydrogen', 'ignition, combustion'), ('He', '2', None, None, None), ...]
注意:SQLite不支持外部联接。 您可以模仿它:http://www.sqlitetutorial.net/sqlite-full-outer-join/