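The variable "data" in the code below holds the results of several hundred database queries. Each result is one day of data, roughly 7,000 rows with two columns (timestamp and value). I append the days to one another, which yields several million rows, and these hundreds of appends take a long time. Once I have the complete data set for one sensor, I store it as a column in the unitdf DataFrame, then repeat the process for each sensor and merge them all into unitdf.
I believe both the append and the merge are expensive operations. The only possible solution I could come up with is to split each column into a list, add all the data to the lists, and then combine the columns into a single DataFrame at the end. Any suggestions for speeding this up?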
    import pandas as pd
    from cassandra.concurrent import execute_concurrent

    # session, unit_id, sensors and dates are defined elsewhere
    i = 0
    for sensor_id in sensors:  # loop through each of the 20 sensors
        # prepared statement to query Cassandra
        session_data = session.prepare("select timestamp, value from measurements_by_sensor where unit_id = ? and sensor_id = ? and date = ? ORDER BY timestamp ASC")
        # Executing prepared statement over a range of dates
        data = execute_concurrent(session, ((session_data, (unit_id, sensor_id, date)) for date in dates), concurrency=150, raise_on_first_error=False)
        sensordf = pd.DataFrame()
        # Loops through the execution results and appends all successful executions that contain data
        for (success, result) in data:
            if success:
                sensordf = sensordf.append(pd.DataFrame(result.current_rows))
        sensordf.rename(columns={'value': sensor_id}, inplace=True)
        sensordf['timestamp'] = pd.to_datetime(sensordf['timestamp'], format="%Y-%m-%d %H:%M:%S", errors='coerce')
        if i == 0:
            i += 1
            unitdf = sensordf
        else:
            unitdf = unitdf.merge(sensordf, how='outer')
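
For reference, the list-based idea from the question could look roughly like the sketch below. It is not tested against Cassandra: the query results are replaced by plain Python data, and each result is assumed to be a (success, rows) pair where rows is a list of dicts with 'timestamp' and 'value' keys. The function names build_sensor_series and build_unit_frame and the sample data are made up for illustration. The sketch also assumes timestamps are unique within one sensor, and it drops duplicate or unparseable timestamps so the column-wise concat can align on the index.

```python
import pandas as pd


def build_sensor_series(sensor_id, results):
    """Collect one sensor's daily results into a timestamp-indexed Series.

    `results` is assumed to be an iterable of (success, rows) pairs, where
    `rows` is a list of dicts with 'timestamp' and 'value' keys.
    """
    # Gather rows in a plain list: list.extend is cheap, while appending to
    # a DataFrame copies all existing rows on every call.
    rows = []
    for success, day_rows in results:
        if success and day_rows:
            rows.extend(day_rows)

    if not rows:
        # No data for this sensor: return an empty, datetime-indexed Series.
        empty_index = pd.DatetimeIndex([], name="timestamp")
        return pd.Series(index=empty_index, name=sensor_id, dtype="float64")

    # Build the DataFrame once, after all rows have been collected.
    frame = pd.DataFrame(rows, columns=["timestamp", "value"])
    frame["timestamp"] = pd.to_datetime(
        frame["timestamp"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
    )
    # Assumption: one value per timestamp per sensor. Unparseable (NaT) and
    # duplicate timestamps are dropped so the index is unique.
    frame = frame.dropna(subset=["timestamp"]).drop_duplicates(subset="timestamp")
    return frame.set_index("timestamp")["value"].rename(sensor_id)


def build_unit_frame(results_by_sensor):
    """Combine all sensors with a single column-wise concat (outer join)."""
    series = [
        build_sensor_series(sensor_id, results)
        for sensor_id, results in results_by_sensor.items()
    ]
    return pd.concat(series, axis=1).sort_index()


# Made-up sample data standing in for the Cassandra results.
sample = {
    "sensor_a": [
        (True, [{"timestamp": "2020-01-01 00:00:00", "value": 1.0},
                {"timestamp": "2020-01-01 00:00:10", "value": 2.0}]),
        (False, None),  # a failed query is skipped
        (True, [{"timestamp": "2020-01-02 00:00:00", "value": 3.0}]),
    ],
    "sensor_b": [
        (True, [{"timestamp": "2020-01-01 00:00:10", "value": 20.0}]),
    ],
}
unitdf = build_unit_frame(sample)
print(unitdf)
```

Here the timestamp becomes the index rather than a column, so the repeated outer merge is replaced by one pd.concat(..., axis=1) that aligns all sensors on their timestamps. Calling unitdf.reset_index() afterwards should give back a 'timestamp' column if that layout is needed.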