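I am trying to filter a dataset in pandas so that I keep only the data that falls within a list of specific time periods. This is the dataset I want to include in my analysis:
The start and end times are stored as columns in the following .csv file:
I wrote the code below, but the list comprehension at the end is so computationally expensive that I run into a memory error. Does anyone know a better way to solve this?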
# -*- coding: utf-8 -*-
### Import python modules ###
import pandas as pd
import numpy as np
import os
import xlsxwriter
### Needed Variables ###
timestep = 0.001
### Get current path ###
dirname = os.path.dirname(__file__)
### import the csv data and time sections file ###
df_data = pd.read_csv(r"C:\Users\ricks\OneDrive\Development\Tools\CGDAT\input_data\input_data.csv", header=0, encoding='utf-8')
df_data.columns = df_data.columns.str.title() # Title-case the column names to prevent key errors
df_data_time = pd.read_csv(r"C:\Users\ricks\OneDrive\Development\Tools\CGDAT\input_data\time_data.csv", header=0, encoding="utf-8", sep=';')
df_data_time.columns = df_data_time.columns.str.title()
### Create extra time column ###
df_data['Time'] = df_data['Timestamp']*timestep
df_data.index = pd.to_datetime(df_data['Time'], unit='s')
### Convert start and end times to datetime.time objects ###
begin_times = pd.to_datetime(df_data_time['Start Time'], format='%H:%M:%S.%f').dt.time
end_times = pd.to_datetime(df_data_time['End Time'], format='%H:%M:%S.%f').dt.time
### Get data within specific time ranges ###
# Begin time: List containing begin times [00:02:30, 00:07:30, ...]
# End times: List containing end times [00:05:00, 00:10:00, ...]
df_sections = [df_data.between_time(i, j) for i in begin_times for j in end_times]
df_result = pd.concat(df_sections) # Concatenate all the df sections together
Answer 0 (score: 1)
I solved my problem. The out-of-memory error was caused by the following line:
df_sections = [df_data.between_time(i, j) for i in begin_times for j in end_times]
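The problem is that this code runs over every possible combination of the begin_times and end_times lists, whereas I only wanted to pair them row by row. The correct line is therefore: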
df_sections = [df_data.between_time(i, j) for (i,j) in zip(begin_times, end_times)]
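To make the difference concrete, here is a minimal sketch on synthetic data. The names df_demo, begin_demo and end_demo and the values in them are made up for illustration and are not from the original files. It compares the nested comprehension, which visits every (begin, end) combination, with the zip version, which visits one pair per row.

```python
import pandas as pd
from datetime import time

# Synthetic stand-in for df_data (assumption): one row per second for 10 minutes.
index = pd.to_datetime(list(range(600)), unit="s")
df_demo = pd.DataFrame({"Value": range(600)}, index=index)

# Hypothetical start/end times, two sections.
begin_demo = [time(0, 2, 30), time(0, 7, 30)]
end_demo = [time(0, 5, 0), time(0, 9, 0)]

# Nested loops: len(begin_demo) * len(end_demo) calls to between_time.
crossed = pd.concat(
    [df_demo.between_time(b, e) for b in begin_demo for e in end_demo]
)

# zip: one call per (begin, end) row, which is what was intended.
paired = pd.concat(
    [df_demo.between_time(b, e) for b, e in zip(begin_demo, end_demo)]
)

print("rows from all combinations:", len(crossed))
print("rows from row-wise pairs:  ", len(paired))
```

With n sections the nested version builds n * n slices instead of n. It also produces pairs whose begin time is later than the end time, and between_time then returns everything outside that window, so those slices can cover most of the dataset. That is likely why the memory use grew so quickly.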
# -*- coding: utf-8 -*-
### Import python modules ###
import pandas as pd
import numpy as np
import os
import xlsxwriter
### Needed Variables ###
timestep = 0.001
### Get current path ###
dirname = os.path.dirname(__file__)
### import the csv data and time sections file ###
df_data = pd.read_csv(r"C:\Users\ricks\OneDrive\Development\Tools\CGDAT\input_data\input_data.csv", header=0, encoding='utf-8')
df_data.columns = df_data.columns.str.title() # Title-case the column names to prevent key errors
df_data_time = pd.read_csv(r"C:\Users\ricks\OneDrive\Development\Tools\CGDAT\input_data\time_data.csv", header=0, encoding="utf-8", sep=';')
df_data_time.columns = df_data_time.columns.str.title()
### Create extra time column ###
df_data['Time'] = df_data['Timestamp']*timestep
df_data.index = pd.to_datetime(df_data['Time'], unit='s')
### Convert start and end times to datetime.time objects ###
begin_times = pd.to_datetime(df_data_time['Start Time'], format='%H:%M:%S.%f').dt.time
end_times = pd.to_datetime(df_data_time['End Time'], format='%H:%M:%S.%f').dt.time
### Get data within specific time ranges ###
# Begin time: List containing begin times [00:02:30, 00:07:30, ...]
# End times: List containing end times [00:05:00, 00:10:00, ...]
df_sections = [df_data.between_time(i, j) for (i,j) in zip(begin_times, end_times)]
df_result = pd.concat(df_sections) # Concatenate all the df sections together
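If memory is still tight, one possible alternative (not what I used above, just a sketch) is to build a single boolean mask and slice once, instead of creating one DataFrame per section and concatenating them. The helper name select_time_ranges is hypothetical. It relies on DatetimeIndex.indexer_between_time, which returns the integer positions of the matching rows, and it assumes the DataFrame has a DatetimeIndex like df_data in the script.

```python
import numpy as np
import pandas as pd
from datetime import time


def select_time_ranges(df, begin_times, end_times):
    """Return the rows whose time of day falls in any (begin, end) pair.

    Hypothetical helper; assumes df has a DatetimeIndex.
    """
    mask = np.zeros(len(df), dtype=bool)
    for begin, end in zip(begin_times, end_times):
        # Integer positions of rows between begin and end (both inclusive).
        positions = df.index.indexer_between_time(begin, end)
        mask[positions] = True
    # Slice once; rows matched by several sections appear only once.
    return df[mask]


# Small synthetic check (assumption): one row per second for 10 minutes.
index = pd.to_datetime(list(range(600)), unit="s")
df_demo = pd.DataFrame({"Value": range(600)}, index=index)

# Two deliberately overlapping sections.
result = select_time_ranges(
    df_demo,
    [time(0, 2, 30), time(0, 4, 0)],
    [time(0, 5, 0), time(0, 6, 0)],
)
print(len(result))
```

Unlike pd.concat, this keeps the rows in their original order and does not duplicate rows when sections overlap. Whether that is what you want depends on the analysis.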