Question

Is there a way to specify a DataFrame index (row) based on matching text inside the DataFrame?

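Every day I import a text file from the internet into a Python pandas DataFrame. I parse out some of the data and run calculations to get the peak value for each day. The specific group of data I need starts with the section headed "RTO COMBINED HOUR ENDING INTEGRATED FORECAST LOAD MW".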

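I only need part of the data for my calculations. I can manually specify the index row to start from, but that number can change from day to day because the file's author adds text to the top of the file, for example: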

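UPDATED: 05-05-2016 1700 Constrained operations are expected in the AEP, APS, BC, COMED, DOM, and PS zones on 05-06-2016. Constrained operations are expected in the AEP, APS, BC, COMED, DOM, and PS zones on 05-07-2016. The PS/ConEd 600/400 MW contract will be limited to 700MW on 05-05-16.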

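Is there a way to match text in a pandas DataFrame and get the index of the match? At the moment I manually set the starting index with the variable 'day' on line 6 of the code below. I would like this variable to hold the index (row) of the DataFrame that contains the text I want to match.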

The following code works, but it may stop working if the row number (index) changes:

import pandas as pd
from openpyxl import load_workbook

def forecastload():
    wb = load_workbook(filename = 'pjmactualload.xlsx')
    ws = wb['PJM Load']    
    printRow = 13
    #put this in iteration to pull 2 rows of data at a time (one for each day) for 7 days max
    day = 239
    while day < 251:
        #pulls in first day only
        data = pd.read_csv("http://oasis.pjm.com/doc/projload.txt", skiprows=day, sep=r"\s+", header=None, nrows=2)

        #sets data at HE 24 = to data that is in HE 13- so I can delete column 0 data to allow checking 'max'
        data.at[1,13]= data.at[1,1]

        #get date for printing it with max load later on
        newDate = str(data.at[0,0])

        #now delete first column to get rid of date data.  date already saved as newDate
        data = data.drop(columns=[0, 1])

        #pull out max value of day
        #add index to this for iteration ie dayMax[x] = data.values.max()
        dayMax = data.max().max()
        dayMin = data.min().min()
        #print date and max load for that date
        actualMax = "Forecast Max"
        actualMin = "Forecast Min"
        dayMax = int(dayMax)
        maxResults = [str(newDate),int(dayMax),actualMax,dayMin,actualMin]
        d = 1
        for items in maxResults:
            ws.cell(row=printRow, column=d).value = items
            d += 1        
        printRow += 1        
        #print maxResults
        #l.writerows(maxResults)    
        day = day + 2
    wb.save('pjmactualload.xlsx')

Answer 1

Here is what you can do:

map()

Example code:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10,4), columns=list('abcd'))
df.loc[df['a'] < 0.5, 'a'] = 1

Added a picture showing how to access the index:

You can refer to this documentation
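
To connect the boolean-mask idea above to the question, here is a minimal sketch of finding the row whose text matches a header. It assumes the file has been loaded with one raw line per row in a column named `text`; that column name and the sample lines are illustrative and do not come from the real file.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded file: one raw text line per row.
lines = [
    "UPDATED: 05-05-2016 1700",
    "Constrained operations are expected in the AEP, APS, BC, COMED, DOM, and PS zones.",
    "RTO COMBINED HOUR ENDING INTEGRATED FORECAST LOAD MW",
    "05/06/16 am 68640 66576",
    "         pm 85418 85285",
]
df = pd.DataFrame({"text": lines})

# Boolean mask of rows containing the header; regex=False matches it literally.
mask = df["text"].str.contains("RTO COMBINED HOUR ENDING", regex=False)

# Index labels of all matching rows, and the first match as a plain int.
matches = df.index[mask]
header_row = int(matches[0])
print(header_row)
```

`header_row` could then replace the hard-coded starting value, adjusted by however many lines separate the header from the first data row.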

Answer 2

In this case, I would suggest using the command line to extract a dataset that you can then read with pandas and process however you like.

To retrieve the data, you can use curl and grep:

$ curl -s http://oasis.pjm.com/doc/projload.txt | grep -A 17 "RTO COMBINED HOUR ENDING INTEGRATED FORECAST" | tail -n +5
 05/06/16 am   68640   66576   65295   65170   66106   70770   77926   83048   84949   85756   86131   86089
          pm   85418   85285   84579   83762   83562   83289   82451   82460   84009   82771   78420   73258
 05/07/16 am   66809   63994   62420   61640   61848   63403   65736   68489   71850   74183   75403   75529
          pm   75186   74613   74072   73950   74386   74978   75135   75585   77414   76451   72529   67957
 05/08/16 am   63583   60903   59317   58492   58421   59378   60780   62971   66289   68997   70436   71212
          pm   71774   71841   71635   71831   72605   73876   74619   75848   78338   77121   72665   67763
 05/09/16 am   63865   61729   60669   60651   62175   66796   74620   79930   81978   83140   84307   84778
          pm   85112   85562   85568   85484   85766   85924   85487   85737   87366   84987   78666   72166
 05/10/16 am   67581   64686   62968   62364   63400   67603   75311   80515   82655   84252   86078   87120
          pm   88021   88990   89311   89477   89752   89860   89256   89327   90469   87730   81220   74449
 05/11/16 am   70367   67044   65125   64265   65054   69060   76424   81785   84646   87097   89541   91276
          pm   92646   93906   94593   94970   95321   95073   93897   93162   93615   90974   84335   77172
 05/12/16 am   71345   67840   65837   64892   65600   69547   76853   82077   84796   87053   89135   90527
          pm   91495   92351   92583   92473   92541   92053   90818   90241   90750   88135   81816   75042
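
The same lookup can be sketched in pure Python, which gives the starting row without hard-coding it. The helper `find_header_line` and the sample text are hypothetical; the offset of 4 mirrors the `tail -n +5` step above (the header line plus three more lines come before the data), and the download step is left out so the sketch runs offline.

```python
import io
import pandas as pd

HEADER = "RTO COMBINED HOUR ENDING INTEGRATED FORECAST"
OFFSET = 4  # header line plus three more lines precede the data (assumed from tail -n +5)
N_ROWS = 2  # one "am" row and one "pm" row per day

def find_header_line(text, header=HEADER):
    """Return the 0-based number of the first line containing `header`."""
    for i, line in enumerate(text.splitlines()):
        if header in line:
            return i
    raise ValueError(f"header not found: {header!r}")

# Hypothetical sample standing in for the downloaded file contents.
sample = "\n".join([
    "UPDATED: 05-05-2016 1700",
    "free-form notice text",
    "RTO COMBINED HOUR ENDING INTEGRATED FORECAST LOAD MW",
    "(placeholder line)",
    "(placeholder line)",
    "(placeholder line)",
    " 05/06/16 am   68640   66576   65295",
    "          pm   85418   85285   84579",
])

# Compute the starting row from the header position instead of hard-coding it.
day = find_header_line(sample) + OFFSET
data = pd.read_csv(io.StringIO(sample), skiprows=day, sep=r"\s+",
                   header=None, nrows=N_ROWS)
print(day)
print(data)
```

With the real file, the text would be downloaded once, searched for the header, and the resulting `day` passed to `skiprows` as in the question's loop.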

Let's take the previous output (saved in rto.txt) and use awk and sed to turn it into more readable data:

$ awk '/^ [0-9]/{d=$1;print $0;next}{print d,$0}' rto.txt | sed 's/^ //;s/\s\+/,/g'
05/06/16,am,68640,66576,65295,65170,66106,70770,77926,83048,84949,85756,86131,86089
05/06/16,pm,85418,85285,84579,83762,83562,83289,82451,82460,84009,82771,78420,73258
05/07/16,am,66809,63994,62420,61640,61848,63403,65736,68489,71850,74183,75403,75529
05/07/16,pm,75186,74613,74072,73950,74386,74978,75135,75585,77414,76451,72529,67957
05/08/16,am,63583,60903,59317,58492,58421,59378,60780,62971,66289,68997,70436,71212
05/08/16,pm,71774,71841,71635,71831,72605,73876,74619,75848,78338,77121,72665,67763
05/09/16,am,63865,61729,60669,60651,62175,66796,74620,79930,81978,83140,84307,84778
05/09/16,pm,85112,85562,85568,85484,85766,85924,85487,85737,87366,84987,78666,72166
05/10/16,am,67581,64686,62968,62364,63400,67603,75311,80515,82655,84252,86078,87120
05/10/16,pm,88021,88990,89311,89477,89752,89860,89256,89327,90469,87730,81220,74449
05/11/16,am,70367,67044,65125,64265,65054,69060,76424,81785,84646,87097,89541,91276
05/11/16,pm,92646,93906,94593,94970,95321,95073,93897,93162,93615,90974,84335,77172
05/12/16,am,71345,67840,65837,64892,65600,69547,76853,82077,84796,87053,89135,90527
05/12/16,pm,91495,92351,92583,92473,92541,92053,90818,90241,90750,88135,81816,75042
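
For readers who prefer to stay in Python, the awk/sed step can be approximated as below. This is only a sketch: it assumes every row either starts with a date followed by "am", or starts directly with "pm", and it uses a shortened sample with three values per row taken from the output above.

```python
import pandas as pd

# Shortened sample of the raw block (first three values of each row only).
raw = """ 05/06/16 am   68640   66576   65295
          pm   85418   85285   84579
 05/07/16 am   66809   63994   62420
          pm   75186   74613   74072
"""

rows = []
current_date = None
for line in raw.splitlines():
    fields = line.split()
    if not fields:
        continue  # skip blank lines
    if fields[0] not in ("am", "pm"):
        # The line starts with a date: remember it and remove it from the fields.
        current_date = fields.pop(0)
    period = fields[0]
    values = [int(v) for v in fields[1:]]
    # Rows without a date of their own reuse the most recent one.
    rows.append([current_date, period] + values)

tidy = pd.DataFrame(rows, columns=["date", "period", 1, 2, 3])
print(tidy)
```

Carrying the last seen date forward plays the same role as the `d=$1` variable in the awk one-liner.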

Now, using pandas:

Read and reshape the result above (saved here as rto2.txt):

import pandas as pd

df = pd.read_csv("rto2.txt", names=["date", "period"] + list(range(1, 13)), index_col=[0, 1])
df = df.stack().reset_index().rename(columns={"level_2":"hour",0:"value"})
df.index = pd.to_datetime(df.apply(lambda x: "{date} {hour}:00 {period}".format(**x),axis=1))
df.drop(["date", "hour", "period"], axis=1, inplace=True)

At this point you have a nice time series :)

In [10]: df.head()
Out[10]: 
                     value
2016-05-06 01:00:00  68640
2016-05-06 02:00:00  66576
2016-05-06 03:00:00  65295
2016-05-06 04:00:00  65170
2016-05-06 05:00:00  66106

Get the statistics:

In [11]: df.groupby(df.index.date).agg(["min", "max"])
Out[11]: 
            value       
              min    max
2016-05-06  65170  86131
2016-05-07  61640  77414
2016-05-08  58421  78338
2016-05-09  60651  87366
2016-05-10  62364  90469
2016-05-11  64265  95321
2016-05-12  64892  92583
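
If the time of each daily peak is also wanted, `idxmax` on the same grouping is one option. The sketch below rebuilds a small hypothetical slice of the series (three hours for two days, values taken from the data above) so that it runs on its own.

```python
import pandas as pd

# Small hypothetical slice of the time series built above.
idx = pd.to_datetime([
    "2016-05-06 01:00", "2016-05-06 02:00", "2016-05-06 03:00",
    "2016-05-07 01:00", "2016-05-07 02:00", "2016-05-07 03:00",
])
df = pd.DataFrame({"value": [68640, 66576, 65295, 66809, 63994, 62420]}, index=idx)

# Group the values by calendar date, then take the peak and the timestamp where it occurs.
by_day = df["value"].groupby(df.index.date)
peaks = pd.DataFrame({"peak": by_day.max(), "peak_time": by_day.idxmax()})
print(peaks)
```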

I hope this helps.

Regards.

Python Pandas: find the index based on a value in a DataFrame