Is there a way to specify a DataFrame index (row) based on matching text inside the DataFrame?
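Every day I import a text file from the internet (http://oasis.pjm.com/doc/projload.txt) into a python pandas DataFrame. I parse out some of the data and run calculations to get the peak value for each day. The specific group of data I need to collect starts under the heading "RTO COMBINED HOUR ENDING INTEGRATED FORECAST LOAD MW".
I need to use only part of the data for my calculations. I can manually specify the index row to start from, but that number can change from day to day because the file's author adds text to the top of the file.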
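UPDATED: 05-05-2016 1700 Constrained operations are expected in the AEP, APS, BC, COMED, DOM, and PS zones on 05-06-2016. Constrained operations are expected in the AEP, APS, BC, COMED, DOM, and PS zones on 05-07-2016. The PS/ConEd 600/400 MW contract will be limited to 700 MW on 05-05-16.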
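Is there a way to match text in a pandas DataFrame and get the index of the match? Currently I specify the starting index by hand with the variable 'day' on line 6 below. I would like this variable to hold the index (row) of the DataFrame that contains the text I want to match.
The following code works, but it may stop working if the row number (index) changes: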
def forecastload():
    wb = load_workbook(filename='pjmactualload.xlsx')
    ws = wb['PJM Load']
    printRow = 13
    # put this in iteration to pull 2 rows of data at a time (one for each day) for 7 days max
    day = 239
    while day < 251:
        # pulls in first day only
        data = pd.read_csv("http://oasis.pjm.com/doc/projload.txt", skiprows=day, delim_whitespace=True, header=None, nrows=2)
        # sets data at HE 24 = to data that is in HE 13, so I can delete column 0 data to allow checking 'max'
        data.at[1, 13] = data.at[1, 1]
        # get date for printing it with max load later on
        newDate = str(data.at[0, 0])
        # now delete first column to get rid of date data. date already saved as newDate
        data = data.drop(0, axis=1)
        data = data.drop(1, axis=1)
        # pull out max value of day
        # add index to this for iteration ie dayMax[x] = data.values.max()
        dayMax = data.max().max()
        dayMin = data.min().min()
        # print date and max load for that date
        actualMax = "Forecast Max"
        actualMin = "Forecast Min"
        dayMax = int(dayMax)
        maxResults = [str(newDate), int(dayMax), actualMax, dayMin, actualMin]
        d = 1
        for items in maxResults:
            ws.cell(row=printRow, column=d).value = items
            d += 1
        printRow += 1
        # print maxResults
        # l.writerows(maxResults)
        day = day + 2
    wb.save('pjmactualload.xlsx')
Answer 0 (score: 0)
Here is what you can do:
Sample code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10,4), columns=list('abcd'))
df.loc[df['a'] < 0.5, 'a'] = 1
A picture showing how to access the index was added to the original answer.
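The boolean-mask idea above can also be used to find the row label of a text match, which is what the question asks for. The following is only a minimal sketch: the sample lines, the column name `text`, and the variable names `raw`, `mask`, and `header_row` are made up for illustration, and it assumes the file has been loaded one line per row.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded file: note lines whose count
# varies from day to day, then the heading, then the data rows.
lines = [
    "UPDATED: 05-05-2016 1700",
    "Constrained operations are expected on 05-06-2016.",
    "RTO COMBINED HOUR ENDING INTEGRATED FORECAST LOAD MW",
    "05/06/16 am 68640 66576 65295",
    "pm 85418 85285 84579",
]
raw = pd.DataFrame({"text": lines})

# Boolean mask: True for rows whose text contains the heading phrase.
mask = raw["text"].str.contains("RTO COMBINED HOUR ENDING", regex=False)

# Index label of the first matching row.
header_row = raw.index[mask][0]
print(header_row)

# Rows after the heading, selected by label.
data = raw.loc[header_row + 1:]
print(len(data))
```

With a default integer index, `header_row` could then take the place of a hard-coded starting row, so the code would keep working when extra text is added at the top of the file.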
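Answer 1 (score: 0)
In this case I would suggest using the command line to get a dataset that you can later read with pandas and process however you like.
To retrieve the data, you can use curl and grep: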
$ curl -s http://oasis.pjm.com/doc/projload.txt | grep -A 17 "RTO COMBINED HOUR ENDING INTEGRATED FORECAST" | tail -n +5
05/06/16 am 68640 66576 65295 65170 66106 70770 77926 83048 84949 85756 86131 86089
pm 85418 85285 84579 83762 83562 83289 82451 82460 84009 82771 78420 73258
05/07/16 am 66809 63994 62420 61640 61848 63403 65736 68489 71850 74183 75403 75529
pm 75186 74613 74072 73950 74386 74978 75135 75585 77414 76451 72529 67957
05/08/16 am 63583 60903 59317 58492 58421 59378 60780 62971 66289 68997 70436 71212
pm 71774 71841 71635 71831 72605 73876 74619 75848 78338 77121 72665 67763
05/09/16 am 63865 61729 60669 60651 62175 66796 74620 79930 81978 83140 84307 84778
pm 85112 85562 85568 85484 85766 85924 85487 85737 87366 84987 78666 72166
05/10/16 am 67581 64686 62968 62364 63400 67603 75311 80515 82655 84252 86078 87120
pm 88021 88990 89311 89477 89752 89860 89256 89327 90469 87730 81220 74449
05/11/16 am 70367 67044 65125 64265 65054 69060 76424 81785 84646 87097 89541 91276
pm 92646 93906 94593 94970 95321 95073 93897 93162 93615 90974 84335 77172
05/12/16 am 71345 67840 65837 64892 65600 69547 76853 82077 84796 87053 89135 90527
pm 91495 92351 92583 92473 92541 92053 90818 90241 90750 88135 81816 75042
Let's take the previous output (saved in the file rto.txt) and use awk and sed to get more readable data:
$ awk '/^ [0-9]/{d=$1;print $0;next}{print d,$0}' rto.txt | sed 's/^ //;s/\s\+/,/g'
05/06/16,am,68640,66576,65295,65170,66106,70770,77926,83048,84949,85756,86131,86089
05/06/16,pm,85418,85285,84579,83762,83562,83289,82451,82460,84009,82771,78420,73258
05/07/16,am,66809,63994,62420,61640,61848,63403,65736,68489,71850,74183,75403,75529
05/07/16,pm,75186,74613,74072,73950,74386,74978,75135,75585,77414,76451,72529,67957
05/08/16,am,63583,60903,59317,58492,58421,59378,60780,62971,66289,68997,70436,71212
05/08/16,pm,71774,71841,71635,71831,72605,73876,74619,75848,78338,77121,72665,67763
05/09/16,am,63865,61729,60669,60651,62175,66796,74620,79930,81978,83140,84307,84778
05/09/16,pm,85112,85562,85568,85484,85766,85924,85487,85737,87366,84987,78666,72166
05/10/16,am,67581,64686,62968,62364,63400,67603,75311,80515,82655,84252,86078,87120
05/10/16,pm,88021,88990,89311,89477,89752,89860,89256,89327,90469,87730,81220,74449
05/11/16,am,70367,67044,65125,64265,65054,69060,76424,81785,84646,87097,89541,91276
05/11/16,pm,92646,93906,94593,94970,95321,95073,93897,93162,93615,90974,84335,77172
05/12/16,am,71345,67840,65837,64892,65600,69547,76853,82077,84796,87053,89135,90527
05/12/16,pm,91495,92351,92583,92473,92541,92053,90818,90241,90750,88135,81816,75042
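If you would rather stay in Python, the same reshaping (copying the date onto each "pm" row and joining the fields with commas) could be sketched as below. This is an illustrative equivalent of the awk/sed step, not part of the original commands. The function name `fill_dates` is hypothetical, and the sample rows are copied from the output above.

```python
# Two days of sample rows copied from the output above; "pm" rows carry no date.
rows = [
    "05/06/16 am 68640 66576 65295 65170 66106 70770 77926 83048 84949 85756 86131 86089",
    "pm 85418 85285 84579 83762 83562 83289 82451 82460 84009 82771 78420 73258",
    "05/07/16 am 66809 63994 62420 61640 61848 63403 65736 68489 71850 74183 75403 75529",
    "pm 75186 74613 74072 73950 74386 74978 75135 75585 77414 76451 72529 67957",
]


def fill_dates(lines):
    """Prefix dateless rows with the most recent date and comma-join the fields.

    Assumes the first non-empty row starts with a date.
    """
    out = []
    current_date = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0][0].isdigit():
            # Row starts with a date such as 05/06/16: remember it.
            current_date = fields[0]
        else:
            # "pm" row: reuse the date of the preceding "am" row.
            fields.insert(0, current_date)
        out.append(",".join(fields))
    return out


csv_rows = fill_dates(rows)
for row in csv_rows:
    print(row)
```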
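Now, using pandas (this assumes the comma-separated output above was saved as rto2.txt, the file name the code reads):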
df = pd.read_csv("rto2.txt",names=["date","period"]+list(range(1,13)),index_col=[0,1])
df = df.stack().reset_index().rename(columns={"level_2":"hour",0:"value"})
df.index = pd.to_datetime(df.apply(lambda x: "{date} {hour}:00 {period}".format(**x),axis=1))
df.drop(["date", "hour", "period"], axis=1, inplace=True)
At this point you have a beautiful time series :)
In [10]: df.head()
Out[10]:
value
2016-05-06 01:00:00 68640
2016-05-06 02:00:00 66576
2016-05-06 03:00:00 65295
2016-05-06 04:00:00 65170
2016-05-06 05:00:00 66106
Get the statistics:
In[11]: df.groupby(df.index.date).agg([min,max])
Out[11]:
value
min max
2016-05-06 65170 86131
2016-05-07 61640 77414
2016-05-08 58421 78338
2016-05-09 60651 87366
2016-05-10 62364 90469
2016-05-11 64265 95321
2016-05-12 64892 92583
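If you also want to know when each daily peak occurs, one possible extension is to add idxmax to the per-day grouping. This is a small sketch under assumptions: the four sample values are taken from the data above (hours 10 and 11 on 05/06, hours 20 and 21 on 05/07), and the names `daily`, `summary`, and `peak_time` are made up for illustration.

```python
import pandas as pd

# Four hourly values taken from the data above, indexed by timestamp.
idx = pd.to_datetime([
    "2016-05-06 10:00", "2016-05-06 11:00",
    "2016-05-07 20:00", "2016-05-07 21:00",
])
df = pd.DataFrame({"value": [85756, 86131, 75585, 77414]}, index=idx)

# Group the hourly values by calendar date.
daily = df["value"].groupby(df.index.date)

# One row per day: min, max, and the timestamp at which the max occurs.
summary = pd.DataFrame({
    "min": daily.min(),
    "max": daily.max(),
    "peak_time": daily.idxmax(),
})
print(summary)
```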
I hope this helps.
Regards.