Question

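I have 3 queries, each of which extracts one table (see the script below). I want to join these tables into a new table without having to save the tables from the 3 original queries in the database (in memory only). Is this possible?

I want to do this for two reasons:

I cannot get CREATE TABLE my_table SELECT .. to work together with connection.commit() and so on in order to save the tables on the server.
It would be more efficient, because the tables are fairly large and I do not need to store them in the remote database (only locally, where I use pickle files for daily backups).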

Code

from mysql.connector import connect as sql_connect
import pickle  # cPickle exists only on Python 2; print(..., end=" ") below implies Python 3

def extract_values_with_columns(cursor, query, multi=False, verbose=False):
    cursor.execute(query, multi=multi)
    results = list(cursor.fetchall())
    field_names = [i[0] for i in cursor.description]
    if verbose:
        print("Variables: {}".format(field_names), end=" ")
    results.insert(0, field_names)
    return results

def save(dset_name, results):
    # pickle data is binary, so the file must be opened in 'wb' mode
    with open("{}.pickle".format(dset_name), mode='wb') as f:
        pickle.dump(results, f)

if __name__ == '__main__':
    connection = sql_connect(user=SSH_USERNAME, password=DATABASE_PASSWORD,
                             host='127.0.0.1', port=tunnel.local_bind_port,
                             database=DATABASE_NAME)

    print("Connection successful!")
    cursor = connection.cursor()                      # get the cursor
    cursor.execute("USE {}".format(DATABASE_NAME))    # select the database

    # combine ratings and tweet text
    query = "SELECT rt.tweet_id, rt.rating_id, rt.tweet_text, \
             {} \
             FROM contribute_ratedtweet rt \
             INNER JOIN contribute_rating ra ON rt.rating_id=ra.id".format(emotion_factors)
    results = extract_values_with_columns(cursor, query)
    save('agg_tweets_with_ratings', results)

    # combine profiles with demographics and technical data
    # joins should be done on the original variable name, not the renamed one
    demo_vars = "demo.gender, demo.age, demo.ethnicity, demo.education, demo.language, demo.done"
    tech_vars = "tech.entry_point, tech.ip_addr, tech.user_agent, tech.mobile, tech.referrer, tech.time_taken, tech.usage, tech.sharing_consent, tech.time_started"
    query =  "SELECT pro.username, pro.random_seed, \
             demo.id AS demographic_id, {}, \
             tech.id AS technical_data_id, {} \
             FROM contribute_profile pro \
             INNER JOIN contribute_demographic demo ON pro.demographic_id=demo.id \
             INNER JOIN contribute_technicaldata tech ON pro.technical_data_id=tech.id".format(demo_vars, tech_vars)
    results = extract_values_with_columns(cursor, query)
    save('agg_profiles_with_info', results)

    # add userID and tweet ID for convenience to rated tweets
    query = "SELECT pro_rt.profile_id, pro_rt.ratedtweet_id, pro.username, rt.tweet_id \
             FROM contribute_profile_rated_tweets pro_rt \
             INNER JOIN contribute_profile pro ON pro_rt.profile_id=pro.id \
             INNER JOIN contribute_ratedtweet rt ON pro_rt.ratedtweet_id=rt.id"
    results = extract_values_with_columns(cursor, query)
    save('agg_ratings_with_info', results)

Answer 1

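Since all three queries are related through the chain qry2 --> qry3 --> qry1, consider using derived tables (nested queries in the FROM or JOIN clauses). Below is a sketch in which each query is treated as its own table result set. Depending on the nature of the data, however, this may return duplicates, so de-duplicate either in each subquery or in the outer query.

Also, make sure to give the columns unique names so that no alias is repeated among the columns selected by the outer query and, importantly, use the correct ON clauses for the joins between t1, t2, and t3. Fill in the ... accordingly, renaming columns with AS where needed. If the result sets are not expected to match completely, use LEFT JOIN instead of INNER JOIN.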

SELECT t1.*, t2.*, t3.*
FROM
  (SELECT ...
    FROM contribute_profile pro 
    INNER JOIN contribute_demographic demo 
      ON pro.demographic_id=demo.id 
    INNER JOIN contribute_technicaldata tech 
      ON pro.technical_data_id=tech.id) t1

INNER JOIN
   (SELECT ...
    FROM contribute_profile_rated_tweets pro_rt
    INNER JOIN contribute_profile pro 
       ON pro_rt.profile_id=pro.id
    INNER JOIN contribute_ratedtweet rt 
       ON pro_rt.ratedtweet_id=rt.id) t2
ON t1.profile_id = t2.profile_id

INNER JOIN
    (SELECT ...
      FROM contribute_ratedtweet rt 
      INNER JOIN contribute_rating ra 
         ON rt.rating_id=ra.id) t3
ON t2.tweet_rating_id = t3.tweet_rating_id
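
To see how the derived-table pattern behaves, here is a minimal, self-contained sketch. It is not the asker's schema: it uses Python's built-in sqlite3 module with an in-memory database instead of MySQL, and the table names, column names (including joy), and rows are all hypothetical stand-ins. It only illustrates the three points above: joining subqueries as t1, t2, t3, removing duplicates with DISTINCT in the outer query, and switching to LEFT JOIN to keep unmatched rows.

```python
import sqlite3

# Toy stand-ins for the real tables; every name and row here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profile (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE ratedtweet (id INTEGER PRIMARY KEY, tweet_id TEXT, rating_id INTEGER);
    CREATE TABLE rating (id INTEGER PRIMARY KEY, joy INTEGER);
    CREATE TABLE profile_rated_tweets (profile_id INTEGER, ratedtweet_id INTEGER);

    INSERT INTO profile VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO ratedtweet VALUES (10, 'tw-a', 100), (11, 'tw-b', 101);
    INSERT INTO rating VALUES (100, 3), (101, 5);
    -- the (1, 10) link is stored twice to show how duplicates creep in
    INSERT INTO profile_rated_tweets VALUES (1, 10), (1, 10), (1, 11);
""")

# Each parenthesised SELECT is a derived table with its own alias (t1, t2, t3).
# Columns are renamed with AS so the outer query sees unique names.
QUERY = """
SELECT {distinct} t1.username, t2.tweet_id, t3.joy
FROM
  (SELECT pro.id AS profile_id, pro.username FROM profile pro) t1
{join}
  (SELECT pro_rt.profile_id, pro_rt.ratedtweet_id, rt.tweet_id
     FROM profile_rated_tweets pro_rt
     INNER JOIN ratedtweet rt ON pro_rt.ratedtweet_id = rt.id) t2
  ON t1.profile_id = t2.profile_id
{join}
  (SELECT rt.id AS ratedtweet_id, ra.joy
     FROM ratedtweet rt
     INNER JOIN rating ra ON rt.rating_id = ra.id) t3
  ON t2.ratedtweet_id = t3.ratedtweet_id
ORDER BY t1.username, t2.tweet_id
"""


def run(distinct=False, left=False):
    """Run the combined query, optionally de-duplicated and/or with LEFT JOINs."""
    sql = QUERY.format(distinct="DISTINCT" if distinct else "",
                       join="LEFT JOIN" if left else "INNER JOIN")
    return conn.execute(sql).fetchall()


raw = run()                                   # INNER JOIN, duplicates kept
deduped = run(distinct=True)                  # INNER JOIN, duplicates removed
with_unmatched = run(distinct=True, left=True)  # profiles without tweets kept

print(raw)
print(deduped)
print(with_unmatched)
```

With the real MySQL connection, the same kind of combined query string could presumably be passed to extract_values_with_columns and the single result saved with save, so that none of the three intermediate tables needs to exist on the server.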

Joining SQL tables in memory
