Question

I have the following pandas DataFrame:

DB      Table   Column  Format

Retail  Orders  ID      INTEGER
Retail  Orders  Place   STRING
Dept    Sales   ID      INTEGER
Dept    Sales   Name    STRING

I want to loop over the tables and generate the SQL that creates each one. For example:

create table Retail.Orders ( ID INTEGER, Place STRING)
create table Dept.Sales ( ID INTEGER, Name STRING)

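What I have done so far is get the distinct databases and tables using drop_duplicates, then apply a filter for each table and concatenate strings to build the SQL.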

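# Column-name constants (values assumed to match the frame above)
KEY_DB = "DB"
KEY_TABLE = "Table"
KEY_COLUMN = "Column"
KEY_FORMAT = "Format"

def generate_tables(df_cols):
    # One row per distinct (DB, Table) pair
    tables = df_cols.drop_duplicates(subset=[KEY_DB, KEY_TABLE])[[KEY_DB, KEY_TABLE]]

    for _, row in tables.iterrows():
        db = row[KEY_DB]
        table = row[KEY_TABLE]

        print("DB: " + db)
        print("Table: " + table)

        # Select the rows describing this table's columns
        cols = df_cols.loc[(df_cols[KEY_DB] == db) & (df_cols[KEY_TABLE] == table)]
        # Join the definitions so no trailing ", " is left before the closing parenthesis
        col_defs = [col[KEY_COLUMN] + " " + col[KEY_FORMAT] for _, col in cols.iterrows()]
        sql = "CREATE TABLE " + db + "." + table + " (" + ", ".join(col_defs) + ")"

        print(sql)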

Is there a better way to iterate over the DataFrame?

Answer 1

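If you want to loop, .iterrows() is the usual way to go through a pandas frame. Edit: going by the other answer, and the question linked here ("Does iterrows have performance issues?"), I believe .itertuples() is actually the better-performing generator.

However, depending on the size of the DataFrame, you may be better off letting pandas' groupby functions do the work.

Consider something like this: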

# Add a concatenation of the column name and format
df['col_format'] =  df['Column'] + ' ' + df['Format']

# Now create a frame which is the groupby of the DB/Table rows and 
# concatenates the tuples of col_format correctly
y1 = (df.groupby(by=['DB', 'Table'])['col_format']
        .apply(lambda x: '(' + ', '.join(x) + ')'))

# Reset the index to bring the keys/indexes back in as columns
y2 = y1.reset_index()

# Now create a Series of all of the SQL statements
all_outs = 'Create Table ' + y2['DB'] + '.' + y2['Table'] + ' ' + y2['col_format']

# Look at them!
all_outs.values
Out[44]: 
array(['Create Table Dept.Sales (ID INTEGER, Name STRING)',
       'Create Table Retail.Orders (ID INTEGER, Place STRING)'], dtype=object)
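
If you would rather keep an explicit loop, the same groupby can be iterated directly, so the loop body runs once per unique DB/Table pair. This is only a minimal sketch: the frame is rebuilt by hand from the question's sample, and `statements` is a name I made up for illustration.

```python
import pandas as pd

# Sample frame mirroring the one in the question
df = pd.DataFrame({
    "DB": ["Retail", "Retail", "Dept", "Dept"],
    "Table": ["Orders", "Orders", "Sales", "Sales"],
    "Column": ["ID", "Place", "ID", "Name"],
    "Format": ["INTEGER", "STRING", "INTEGER", "STRING"],
})

statements = []
# sort=False keeps the groups in order of first appearance in the frame
for (db, table), group in df.groupby(["DB", "Table"], sort=False):
    # Build "ID INTEGER, Place STRING" from the rows of this group
    col_defs = ", ".join(group["Column"] + " " + group["Format"])
    statements.append(f"CREATE TABLE {db}.{table} ({col_defs})")

for statement in statements:
    print(statement)
```

Each `group` is itself a small DataFrame, so anything else you need per table (row counts, validation) can go in the same loop.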

Hope this helps!

Answer 2

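This is how I would do it. First build a dictionary via df.itertuples [more efficient than df.iterrows], then use str.format to slot the values in cleanly.

Using a set ensures uniqueness while the dictionary is built.

I also made it a generator so you can iterate over it efficiently as needed; you can always exhaust the generator with list, as shown below.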

from collections import defaultdict

d = defaultdict(set)
for row in df.itertuples():
    d[(row[1], row[2])].add((row[3], row[4]))

def generate_tables_jp(d):
    for k, v in d.items():
        yield 'CREATE TABLE {0}.{1} ({2})'\
              .format(k[0], k[1], ', '.join([' '.join(i) for i in v]))

list(generate_tables_jp(d))

Result:

['CREATE TABLE Retail.Orders (ID INTEGER, Place STRING)',
 'CREATE TABLE Dept.Sales (ID INTEGER, Name STRING)']
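
One caveat with the set: it does not keep insertion order, so the column order inside each statement is not guaranteed to match the frame. If that matters, a plain dict can stand in as an ordered set. The following is only a sketch under that assumption; `generate_tables_ordered` and `tables` are hypothetical names, and the frame is rebuilt from the question's sample.

```python
from collections import defaultdict

import pandas as pd

# Sample frame mirroring the one in the question
df = pd.DataFrame({
    "DB": ["Retail", "Retail", "Dept", "Dept"],
    "Table": ["Orders", "Orders", "Sales", "Sales"],
    "Column": ["ID", "Place", "ID", "Name"],
    "Format": ["INTEGER", "STRING", "INTEGER", "STRING"],
})

# A dict used as an ordered set: keys are unique and keep insertion order
tables = defaultdict(dict)
for row in df.itertuples(index=False):
    # Named access works because the column names are valid identifiers
    tables[(row.DB, row.Table)][(row.Column, row.Format)] = None

def generate_tables_ordered(tables):
    """Yield one CREATE TABLE statement per (DB, Table) key."""
    for (db, table), cols in tables.items():
        col_defs = ", ".join(f"{name} {fmt}" for name, fmt in cols)
        yield f"CREATE TABLE {db}.{table} ({col_defs})"

result = list(generate_tables_ordered(tables))
print(result)
```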

Answer 3

You can first assemble the information for each row in an extra Series, then use groupby.sum:

# KEY_* are the column-name constants from the question
index_labels = [KEY_DB, KEY_TABLE]
queries = df[KEY_COLUMN] + ' ' + df[KEY_FORMAT] + ', '
queries.index = df.set_index(index_labels).index

DB      Table 
Retail  Orders      ID INTEGER, 
        Orders    Place STRING, 
Dept    Sales       ID INTEGER, 
        Sales      Name STRING, 
dtype: object

queries = queries.groupby(index_labels).sum().str.strip(', ')

DB      Table 
Dept    Sales      ID INTEGER, Name STRING
Retail  Orders    ID INTEGER, Place STRING
dtype: object

def format_queries(queries):
    query_pattern = 'CREATE TABLE %s.%s (%s)'
    for (db, table), text in queries.items():# idx, table, text
        query = query_pattern % (db, table, text)
        yield query
list(format_queries(queries))

['CREATE TABLE Dept.Sales (ID INTEGER, Name STRING)',
 'CREATE TABLE Retail.Orders (ID INTEGER, Place STRING)']

This way you don't need a lambda. I don't know whether this approach or itertuples would be faster.
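
To settle the speed question for your own data, you could time both styles with timeit. What follows is a rough sketch, not a benchmark result: `with_groupby` and `with_itertuples` are hypothetical helper names, the groupby variant uses agg with ', '.join rather than sum plus strip, and the four-row sample frame is far too small to be representative, so swap in a realistic frame before drawing conclusions.

```python
import timeit
from collections import defaultdict

import pandas as pd

# Sample frame mirroring the one in the question (replace with real data)
df = pd.DataFrame({
    "DB": ["Retail", "Retail", "Dept", "Dept"],
    "Table": ["Orders", "Orders", "Sales", "Sales"],
    "Column": ["ID", "Place", "ID", "Name"],
    "Format": ["INTEGER", "STRING", "INTEGER", "STRING"],
})

def with_groupby(frame):
    """Vectorised concatenation, then one join per (DB, Table) group."""
    defs = frame["Column"] + " " + frame["Format"]
    joined = defs.groupby([frame["DB"], frame["Table"]], sort=False).agg(", ".join)
    return [f"CREATE TABLE {db}.{table} ({cols})"
            for (db, table), cols in joined.items()]

def with_itertuples(frame):
    """Plain Python loop; assumes the frame has no duplicate rows."""
    tables = defaultdict(list)
    for row in frame.itertuples(index=False):
        tables[(row.DB, row.Table)].append(f"{row.Column} {row.Format}")
    return [f"CREATE TABLE {db}.{table} ({', '.join(cols)})"
            for (db, table), cols in tables.items()]

# Check that both approaches agree before comparing their speed
same_output = with_groupby(df) == with_itertuples(df)
print("same output:", same_output)

for fn in (with_groupby, with_itertuples):
    seconds = timeit.timeit(lambda: fn(df), number=100)
    print(f"{fn.__name__}: {seconds:.4f} s for 100 runs")
```

Which one wins is likely to depend on the number of rows and the number of distinct tables, so it is worth measuring on a frame shaped like yours.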

Loop over a pandas DataFrame on unique values only
