Concat数据库并更新重复的行

时间:2019-05-21 12:05:33

标签: python mysql pandas dataframe

我的问题是:我有多个相同的数据库,我想将它们合并为一个。但是我可能有重复的条目作为主键。我想做的是在将重复项放入Mysql之前先对其进行处理。

我的实际代码是:

df = pd.DataFrame() 
duplicates = pd.DataFrame()
size=[]
lignes=[]
chunksize = 100000


for db in dbtuple: #For each databases in the tuple given as entry
    engine = create_engine(URL(
            drivername="mysql",
            username="xxx",
            password="xxx",
            host="localhost",
            database=db
        ))
    conn = engine.connect()
    #Get the data from the table given as entry
    sql = "SELECT * FROM "+ tableName

    #Execution of the query above
    generator_df = pd.read_sql(sql=sql,con=conn,chunksize= chunksize)

    #Init of sizechunk value
    sizechunk = 0

    #Because the query can be very big number of rows there's a separation
    # every 100k rows so that dataframe size <= 100k 
    for dataframe in generator_df:
        df = pd.concat([df,dataframe],ignore_index = True, axis=0,sort=False)

        #We add the size of the chunk to know how many rows we have per database
        sizechunk+= dataframe.shape[0]

        size.append(sizechunk)


    if tableName == 'table1':
        duplicates = df.duplicated(subset='id')

        for i in range(0,len(df)):
            if duplicates[i]:

                df.id[i] = numligne + '_' + df.id[i]
    #same for all tables

但这根本不是pythonic方式,执行时间很长。您对如何改进代码以使其更快有任何建议吗?

这是我的数据库模式,可以更好地理解:

table1 = Table('table1', metadata,
    Column('id', VARCHAR(40), primary_key=True,nullable=False),
    mysql_engine='InnoDB'
    )

table2= Table('table2', metadata,
    Column('id', VARCHAR(40), primary_key=True,nullable=False),
    Column('id_of', VARCHAR(20),ForeignKey("table1.id"), nullable=False, index= True)
    )

table3= Table('table3', metadata,
    Column('index',BIGINT(10), primary_key=True,nullable=False,autoincrement=True),
    Column('id', VARCHAR(40),nullable=False),
    Column('id_produit', VARCHAR(40),ForeignKey("table2.id"), nullable=False, index= True),
    Column('id_produit_enfant', VARCHAR(40),ForeignKey("table2.id"), nullable=False, index= True)
    )

table4= Table('table4', metadata,
    Column('index',BIGINT(10), primary_key=True,nullable=False,autoincrement=True),
    Column('id', VARCHAR(40),nullable=False),
    Column('id_produit', VARCHAR(40),ForeignKey("table2.id"), nullable=False, index= True)
    )

table5= Table('table5', metadata,
    Column('index',BIGINT(10), primary_key=True,nullable=False,autoincrement=True),
    Column('id', VARCHAR(40),nullable=False),
    Column('id_produit', VARCHAR(40),ForeignKey("table2.id"), nullable=False, index= True)
    )

table6= Table('table6', metadata,
    Column('index',BIGINT(10), primary_key=True,nullable=False,autoincrement=True),
    Column('id', VARCHAR(40),nullable=False),
    Column('id_produit', VARCHAR(40),ForeignKey("table2.id"), nullable=False, index= True)
    )```

0 个答案:

没有答案