I have the following pandas DataFrame:
DB      Table   Column  Format
Retail  Orders  ID      INTEGER
Retail  Orders  Place   STRING
Dept    Sales   ID      INTEGER
Dept    Sales   Name    STRING
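To reproduce the examples, the frame above can be built directly. This is only a minimal sketch: the variable name `df` and the values of the `KEY_*` constants are assumptions chosen to match the names used in the code in this thread.

```python
import pandas as pd

# Column-name constants; the values are assumed to match the frame's headers
KEY_DB, KEY_TABLE, KEY_COLUMN, KEY_FORMAT = "DB", "Table", "Column", "Format"

# One row per table column, as in the listing above
df = pd.DataFrame(
    [
        ("Retail", "Orders", "ID", "INTEGER"),
        ("Retail", "Orders", "Place", "STRING"),
        ("Dept", "Sales", "ID", "INTEGER"),
        ("Dept", "Sales", "Name", "STRING"),
    ],
    columns=[KEY_DB, KEY_TABLE, KEY_COLUMN, KEY_FORMAT],
)
print(df)
```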
I want to loop over the tables and generate the SQL that creates each of them. For example:
create table Retail.Orders ( ID INTEGER, Place STRING)
create table Dept.Sales ( ID INTEGER, Name STRING)
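What I have done so far is get the distinct databases and tables using drop_duplicates, then apply a filter for each table and concatenate strings to build the SQL.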
def generate_tables(df_cols):
    tables = df_cols.drop_duplicates(subset=[KEY_DB, KEY_TABLE])[[KEY_DB, KEY_TABLE]]
    for index, row in tables.iterrows():
        db = row[KEY_DB]
        table = row[KEY_TABLE]
        print("DB: " + db)
        print("Table: " + table)
        sql = "CREATE TABLE " + db + "." + table + " ("
        cols = df_cols.loc[(df_cols[KEY_DB] == db) & (df_cols[KEY_TABLE] == table)]
        for index, col in cols.iterrows():
            sql += col[KEY_COLUMN] + " " + col[KEY_FORMAT] + ", "
        sql += ")"
        print(sql)
Is there a better way to iterate over the DataFrame?
Answer 0 (score: 1)
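If you want to loop, then yes, .iterrows() is the usual way to go through a pandas frame. Edit: based on the other answer and the linked question "Does iterrows have performance issues?", I believe .itertuples() is actually the better-performing generator.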
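To illustrate the difference between the two iterators, here is a small sketch. It assumes a frame shaped like the one in the question; the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "DB": ["Retail", "Retail", "Dept", "Dept"],
        "Table": ["Orders", "Orders", "Sales", "Sales"],
        "Column": ["ID", "Place", "ID", "Name"],
        "Format": ["INTEGER", "STRING", "INTEGER", "STRING"],
    }
)

# iterrows() yields (index, Series) pairs; a new Series is built for every row
pairs_iterrows = [(row["DB"], row["Table"]) for _, row in df.iterrows()]

# itertuples() yields namedtuples; index=False leaves the index out of each tuple
pairs_itertuples = [(row.DB, row.Table) for row in df.itertuples(index=False)]

# Both loops visit the same rows in the same order
print(pairs_iterrows == pairs_itertuples)
```

Attribute access such as `row.DB` works here because the column names are valid Python identifiers.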
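However, depending on the size of the DataFrame, you may be better off letting some of the pandas groupby functionality do the work.
Consider something like this: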
# Add a concatenation of the column name and format
df['col_format'] = df['Column'] + ' ' + df['Format']
# Now create a frame which is the groupby of the DB/Table rows and
# concatenates the tuples of col_format correctly
y1 = (df.groupby(by=['DB', 'Table'])['col_format']
        .apply(lambda x: '(' + ', '.join(x) + ')'))
# Reset the index to bring the keys/indexes back in as columns
y2 = y1.reset_index()
# Now create a Series of all of the SQL statements
all_outs = 'Create Table ' + y2['DB'] + '.' + y2['Table'] + ' ' + y2['col_format']
# Look at them!
all_outs.values
Out[44]:
array(['Create Table Dept.Sales (ID INTEGER, Name STRING)',
       'Create Table Retail.Orders (ID INTEGER, Place STRING)'], dtype=object)
Hope this helps!
Answer 1 (score: 1)
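This is how I would do it. First build a dictionary via df.itertuples (more efficient than df.iterrows), then use str.format to insert the values cleanly.
Using a set ensures uniqueness while the dictionary is built.
I have also written it as a generator, so you can iterate over it efficiently as needed; you can always exhaust the generator with list, as shown below.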
from collections import defaultdict

d = defaultdict(set)
for row in df.itertuples():
    d[(row[1], row[2])].add((row[3], row[4]))

def generate_tables_jp(d):
    for k, v in d.items():
        yield 'CREATE TABLE {0}.{1} ({2})'\
            .format(k[0], k[1], ', '.join([' '.join(i) for i in v]))

list(generate_tables_jp(d))
Result:
['CREATE TABLE Retail.Orders (ID INTEGER, Place STRING)',
'CREATE TABLE Dept.Sales (ID INTEGER, Name STRING)']
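One caveat: a set does not preserve insertion order, so the order of the columns inside each statement is not guaranteed to match the frame. A sketch of an order-preserving variant keeps a list per table instead. It assumes the same four-column layout, and the function name `generate_tables_ordered` is hypothetical. Unlike the set version, it does not remove duplicate rows.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame(
    {
        "DB": ["Retail", "Retail", "Dept", "Dept"],
        "Table": ["Orders", "Orders", "Sales", "Sales"],
        "Column": ["ID", "Place", "ID", "Name"],
        "Format": ["INTEGER", "STRING", "INTEGER", "STRING"],
    }
)

def generate_tables_ordered(frame):
    # Map (db, table) -> list of "column format" strings, kept in row order
    columns = defaultdict(list)
    for db, table, column, fmt in frame.itertuples(index=False):
        columns[(db, table)].append(f"{column} {fmt}")
    # Dicts keep insertion order, so tables come out in first-seen order
    for (db, table), cols in columns.items():
        yield "CREATE TABLE {0}.{1} ({2})".format(db, table, ", ".join(cols))

statements = list(generate_tables_ordered(df))
print(statements)
```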
Answer 2 (score: 0)
You can first assemble the per-row information in an extra Series, then combine it with groupby.sum:
queries = df[KEY_COLUMN] + ' ' + df[KEY_FORMAT] + ', '
queries.index = df.set_index(index_labels).index
DB      Table
Retail  Orders      ID INTEGER,
        Orders    Place STRING,
Dept    Sales       ID INTEGER,
        Sales      Name STRING,
dtype: object
queries = queries.groupby(index_labels).sum().str.strip(', ')
DB      Table
Dept    Sales      ID INTEGER, Name STRING
Retail  Orders    ID INTEGER, Place STRING
dtype: object
def format_queries(queries):
    query_pattern = 'CREATE TABLE %s.%s (%s)'
    for (db, table), text in queries.items():  # idx, table, text
        query = query_pattern % (db, table, text)
        yield query

list(format_queries(queries))
['CREATE TABLE Dept.Sales (ID INTEGER, Name STRING)', 'CREATE TABLE Retail.Orders (ID INTEGER, Place STRING)']
This way you don't need the lambda. I don't know whether this approach or itertuples will be the fastest.
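One way to settle this is to time both on your own data. Below is a rough sketch using timeit. The helper names `via_groupby` and `via_itertuples` and the enlarged test frame are assumptions, and the groupby helper uses `agg(", ".join)` as a swapped-in variant of the `sum` approach. Timings will vary with frame size and pandas version.

```python
import timeit
from collections import defaultdict

import pandas as pd

base = pd.DataFrame(
    {
        "DB": ["Retail", "Retail", "Dept", "Dept"],
        "Table": ["Orders", "Orders", "Sales", "Sales"],
        "Column": ["ID", "Place", "ID", "Name"],
        "Format": ["INTEGER", "STRING", "INTEGER", "STRING"],
    }
)
# Hypothetical larger frame: repeat the sample rows under distinct table names
df = pd.concat(
    [base.assign(Table=base["Table"] + str(i)) for i in range(200)],
    ignore_index=True,
)

def via_groupby(frame):
    # Join "column format" strings per (DB, Table) group
    parts = frame["Column"] + " " + frame["Format"]
    joined = parts.groupby([frame["DB"], frame["Table"]]).agg(", ".join)
    return ["CREATE TABLE %s.%s (%s)" % (db, table, text)
            for (db, table), text in joined.items()]

def via_itertuples(frame):
    # Collect "column format" strings per (DB, Table) key in a plain dict
    columns = defaultdict(list)
    for db, table, column, fmt in frame.itertuples(index=False):
        columns[(db, table)].append(column + " " + fmt)
    return ["CREATE TABLE %s.%s (%s)" % (db, table, ", ".join(cols))
            for (db, table), cols in columns.items()]

# Total seconds for 20 runs of each helper
t_groupby = timeit.timeit(lambda: via_groupby(df), number=20)
t_itertuples = timeit.timeit(lambda: via_itertuples(df), number=20)
print("groupby:    %.4f s" % t_groupby)
print("itertuples: %.4f s" % t_itertuples)
```

The two helpers return the same statements, although groupby sorts the keys while the dictionary keeps first-seen order.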