我的问题是:我有多个相同的数据库,我想将它们合并为一个。但是我可能有重复的条目作为主键。我想做的是在将重复项放入Mysql之前先对其进行处理。
我的实际代码是:
df = pd.DataFrame()
duplicates = pd.DataFrame()
size=[]
lignes=[]
chunksize = 100000
for db in dbtuple: #For each databases in the tuple given as entry
engine = create_engine(URL(
drivername="mysql",
username="xxx",
password="xxx",
host="localhost",
database=db
))
conn = engine.connect()
#Get the data from the table given as entry
sql = "SELECT * FROM "+ tableName
#Execution of the query above
generator_df = pd.read_sql(sql=sql,con=conn,chunksize= chunksize)
#Init of sizechunk value
sizechunk = 0
#Because the query can be very big number of rows there's a separation
# every 100k rows so that dataframe size <= 100k
for dataframe in generator_df:
df = pd.concat([df,dataframe],ignore_index = True, axis=0,sort=False)
#We add the size of the chunk to know how many rows we have per database
sizechunk+= dataframe.shape[0]
size.append(sizechunk)
if tableName == 'table1':
duplicates = df.duplicated(subset='id')
for i in range(0,len(df)):
if duplicates[i]:
df.id[i] = numligne + '_' + df.id[i]
#same for all tables
但这根本不是pythonic方式,执行时间很长。您对如何改进代码以使其更快有任何建议吗?
这是我的数据库模式,可以更好地理解:
table1 = Table('table1', metadata,
Column('id', VARCHAR(40), primary_key=True,nullable=False),
mysql_engine='InnoDB'
)
table2= Table('table2', metadata,
Column('id', VARCHAR(40), primary_key=True,nullable=False),
Column('id_of', VARCHAR(20),ForeignKey("table1.id"), nullable=False, index= True)
)
table3= Table('table3', metadata,
Column('index',BIGINT(10), primary_key=True,nullable=False,autoincrement=True),
Column('id', VARCHAR(40),nullable=False),
Column('id_produit', VARCHAR(40),ForeignKey("table2.id"), nullable=False, index= True),
Column('id_produit_enfant', VARCHAR(40),ForeignKey("table2.id"), nullable=False, index= True)
)
table4= Table('table4', metadata,
Column('index',BIGINT(10), primary_key=True,nullable=False,autoincrement=True),
Column('id', VARCHAR(40),nullable=False),
Column('id_produit', VARCHAR(40),ForeignKey("table2.id"), nullable=False, index= True)
)
table5= Table('table5', metadata,
Column('index',BIGINT(10), primary_key=True,nullable=False,autoincrement=True),
Column('id', VARCHAR(40),nullable=False),
Column('id_produit', VARCHAR(40),ForeignKey("table2.id"), nullable=False, index= True)
)
table6= Table('table6', metadata,
Column('index',BIGINT(10), primary_key=True,nullable=False,autoincrement=True),
Column('id', VARCHAR(40),nullable=False),
Column('id_produit', VARCHAR(40),ForeignKey("table2.id"), nullable=False, index= True)
)```