I am struggling to move from spreadsheets to Python using Pandas DataFrames.
I have some raw data:
Date Temperature
12/4/2003 100
12/5/2003 101
12/8/2003 100
12/9/2003 102
12/10/2003 101
12/11/2003 100
12/12/2003 99
12/15/2003 98
12/16/2003 97
12/17/2003 96
12/18/2003 95
12/19/2003 96
12/22/2003 97
12/23/2003 98
12/24/2003 99
12/26/2003 100
12/29/2003 101
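In the spreadsheet, I am tracking a trend based on a percentage monitor. Think of it as a rolling average, but percentage-based.
Output from the spreadsheet: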
date temp monitor trend change_in_trend
12/4/2003 100 97.00 warming false
12/5/2003 101 97.97 warming false
12/8/2003 100 97.97 warming false
12/9/2003 102 98.94 warming false
12/10/2003 101 98.94 warming false
12/11/2003 100 98.94 warming false
12/12/2003 99 98.94 warming false
12/15/2003 98 98.94 cooling true
12/16/2003 97 98.94 cooling false
12/17/2003 96 98.88 cooling false
12/18/2003 95 97.85 cooling false
12/19/2003 96 97.85 cooling false
12/22/2003 97 97.85 cooling false
12/23/2003 98 97.85 warming true
12/24/2003 99 97.85 warming false
12/26/2003 100 97.85 warming false
12/29/2003 101 97.97 warming false
Assumptions:
percent_monitor = .03
warming_factor = 1 - percent_monitor
cooling_factor = 1 + percent_monitor
In the spreadsheet, I set the columns in the first row to:
monitor = temp * warming_factor
trending = warming
change_in_trend = false
All remaining rows are derived from the column values of the current row and the previous row.
Monitor column logic:
if temp > prev_monitor:
    if temp > prev_temp:
        if temp * warming_factor > prev_monitor:
            monitor = temp * warming_factor
        else:
            monitor = prev_monitor
    else:
        monitor = prev_monitor
else:
    if temp < prev_monitor:
        if temp * cooling_factor < prev_monitor:
            monitor = temp * cooling_factor
        else:
            monitor = prev_monitor
    else:
        monitor = prev_monitor
Trend column logic:
if temp > prev_monitor:
    trending = warming
else:
    trending = cooling
Change-in-trend column logic:
if current_trend == previous_trend:
    change_in_trend = false
else:
    change_in_trend = true
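I am able to iterate over the DataFrame and apply the logic without any problems. However, performance is terrible with thousands of rows.
I have been trying to do this in a more "pandas-like" way, but every attempt has failed.
Without embarrassing myself by pasting my code attempts, can anyone offer me some help?
Thanks!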
Answer 0 (score: 1)
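Since you are just looking to move this to Python and are not particularly set on Pandas, I went with a non-Pandas approach. Using your sample rows, I processed 47124 rows in 0.182 seconds.
Pandas is really nice and intuitive for some use cases, but it can be very slow for iteration. This page explains some of the slower ways to use Pandas, one of the main ones being index iteration. A more Pandas-like approach would be to take advantage of "5. Vectorization with NumPy arrays", but your use case seems simple enough that this could be overkill and not worth it (judging by your username, PythonNoob).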
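For reference, here is a rough sketch of what a partly vectorized version might look like. The monitor column depends on its own previous value, so it still needs a loop, but the trend and change_in_trend columns only compare whole columns and can be computed with NumPy once the monitor is known. The function name add_trend_columns and the tiny DataFrame are hypothetical, and the sketch assumes pandas and NumPy are installed and that the DataFrame has a non-empty "temp" column.

```python
import numpy as np
import pandas as pd

PERCENT_MONITOR = 0.03
WARMING_FACTOR = 1 - PERCENT_MONITOR
COOLING_FACTOR = 1 + PERCENT_MONITOR


def add_trend_columns(df):
    """Return a copy of df with monitor, trend and change_in_trend columns added."""
    # Plain Python floats iterate much faster than DataFrame rows.
    temps = df["temp"].astype(float).tolist()

    # The monitor depends on its own previous value, so it needs a loop.
    monitors = [temps[0] * WARMING_FACTOR]
    for prev_temp, temp in zip(temps, temps[1:]):
        prev_monitor = monitors[-1]
        if temp > prev_monitor and temp > prev_temp and temp * WARMING_FACTOR > prev_monitor:
            monitors.append(temp * WARMING_FACTOR)
        elif temp < prev_monitor and temp * COOLING_FACTOR < prev_monitor:
            monitors.append(temp * COOLING_FACTOR)
        else:
            monitors.append(prev_monitor)

    temp_arr = np.array(temps)
    monitor_arr = np.array(monitors)
    # Previous row's monitor; -inf makes the first row compare as warming.
    prev_monitor_arr = np.concatenate(([-np.inf], monitor_arr[:-1]))
    trend = np.where(temp_arr > prev_monitor_arr, "warming", "cooling")
    # A change is any row whose trend differs from the row before it.
    change = np.concatenate(([False], trend[1:] != trend[:-1]))

    out = df.copy()
    out["monitor"] = monitor_arr
    out["trend"] = trend
    out["change_in_trend"] = change
    return out


# Hypothetical usage with the first four temperatures from the question.
print(add_trend_columns(pd.DataFrame({"temp": [100, 101, 100, 102]})))
```

Whether this is worth the extra dependency is a judgment call; the loop over plain floats is where most of the time goes either way.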
For both clarity and speed, simply using more basic Python functions will get you the speed you want.
First, I set the constants:
percent_monitor = .03
warming_factor = 1 - percent_monitor
cooling_factor = 1 + percent_monitor
Then (there are more concise ways to do this, but this one is clear and easy to use), I set names for the column indices:
DATE = 0
TEMP = 1
MONITOR = 2
TRENDING = 3
CHANGE_IN_TREND = 4
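One possible more concise alternative is to group the indices in an enum.IntEnum. This is only a sketch of that idea; the class name Col and the sample row are hypothetical. IntEnum members behave like integers, so they can index a list directly.

```python
from enum import IntEnum


class Col(IntEnum):
    """Column positions in a processed row (hypothetical name)."""
    DATE = 0
    TEMP = 1
    MONITOR = 2
    TRENDING = 3
    CHANGE_IN_TREND = 4


# Sample processed row, shaped like the rows built later in this answer.
row = ["12/4/2003", "100", 97.0, "warming", False]
temp = float(row[Col.TEMP])   # IntEnum members work as list indices
trend = row[Col.TRENDING]
```

The rest of this answer keeps the plain module-level constants, which work the same way.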
Then I extracted your monitor code into its own function (and cleaned up the if statements slightly):
def calculate_monitor(prev_monitor, current_temp, prev_temp):
    # temp is above the monitor and still rising: pull the monitor up
    if (current_temp > prev_monitor) and (current_temp > prev_temp) and ((current_temp * warming_factor) > prev_monitor):
        return current_temp * warming_factor
    # temp is below the monitor: pull the monitor down
    elif (current_temp < prev_monitor) and ((current_temp * cooling_factor) < prev_monitor):
        return current_temp * cooling_factor
    else:
        return prev_monitor
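As a quick check, the function can be applied to a few transitions taken from the spreadsheet output in the question. This sketch repeats the constants and the function so that it runs on its own; the variable names rising, holding and falling are only illustrative, and results are rounded to two decimals to match the spreadsheet.

```python
percent_monitor = .03
warming_factor = 1 - percent_monitor
cooling_factor = 1 + percent_monitor


def calculate_monitor(prev_monitor, current_temp, prev_temp):
    if (current_temp > prev_monitor) and (current_temp > prev_temp) and ((current_temp * warming_factor) > prev_monitor):
        return current_temp * warming_factor
    elif (current_temp < prev_monitor) and ((current_temp * cooling_factor) < prev_monitor):
        return current_temp * cooling_factor
    else:
        return prev_monitor


# Arguments are (prev_monitor, current_temp, prev_temp) from the spreadsheet rows.
rising = calculate_monitor(97.00, 101, 100)    # 12/5/2003: monitor is pulled up
holding = calculate_monitor(97.97, 100, 101)   # 12/8/2003: temp fell but is still above the monitor
falling = calculate_monitor(98.94, 96, 97)     # 12/17/2003: monitor is pulled down
print(round(rising, 2), round(holding, 2), round(falling, 2))  # 97.97 97.97 98.88
```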
Finally, I read in the data and process it:
import csv

data = []  # every processed row is appended to this list
with open('weather_data.csv') as csv_file:
    previous_row = None
    # the sample file is space-delimited
    csv_reader = csv.reader(csv_file, delimiter=' ')
    line_count = 0
    for row in csv_reader:
        cleaned_row = list(filter(None, row))  # drop empty fields left by repeated spaces
        if line_count == 0:
            # the first row is the header; it is skipped here, but you can do whatever you want with it
            line_count += 1
        elif line_count == 1:  # first data row: seed the monitor, trend and change flag
            previous_row = cleaned_row + [float(cleaned_row[TEMP]) * warming_factor, "warming", False]
            data.append(previous_row)
            line_count += 1
        else:
            monitor = calculate_monitor(float(previous_row[MONITOR]), float(cleaned_row[TEMP]), float(previous_row[TEMP]))
            current_trend = 'warming' if float(cleaned_row[TEMP]) > float(previous_row[MONITOR]) else 'cooling'
            # the trend changed if it differs from the previous row's trend
            change_in_trend = current_trend != previous_row[TRENDING]
            previous_row = cleaned_row + [monitor, current_trend, change_in_trend]
            data.append(previous_row)
            line_count += 1
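This will give you the speed you need. If you want to convert the result to a Pandas DataFrame at the end, you can do the following:
import pandas as pd

df = pd.DataFrame(data, columns=['date', 'temp', 'monitor', 'current_trend', 'change_in_trend'])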