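Pandas filtering / data reduction (1) is there a better way and 2) what am I doing wrong)

Time: 2017-09-05 18:23:31

Tags: python pandas

What I'm trying to do is read in a csv, eliminate duplicate values based on a specific column, and then reduce the data so that no two points are closer together than an increment of 15.

My code reads in the file fine, drop_duplicates then works as desired, and after that I sort by that column. To reduce the data, I create a new dataframe containing the first row of the existing data, then I go through each value in the relevant column and append its row to the new dataframe if it's at least 15 kg/hr higher than the comparison value.

My dataframes are not being combined correctly, and I end up with a resulting dataframe that looks like this: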

Unnamed: 0                                                                     0
TimeStamp (s)                                                              0.002
TC 01 (C)                                                                30.6689
TC 02 (C)                                                                28.6879
TC 03 (C)                                                                27.9779
TC 32 (C)                                                                22.6416
Product Back Pressure (kPa)                                             0.166353
Product Mass Flow (kg/hr)                                                107.427
Semtech Flow (kg/hr)                                                     28.2135
Mass Flow (kg/hr)                                                        28.2135
Voltage (V)                                                              1.63065
Angle (degrees)                                                                0
1                                 Unnamed: 0  TimeStamp (s)  TC 01 (C)  TC 02...
2                                  Unnamed: 0  TimeStamp (s)  TC 01 (C)  TC 0...
3                                 Unnamed: 0  TimeStamp (s)  TC 01 (C)  TC 02...
4                                 Unnamed: 0  TimeStamp (s)  TC 01 (C)  TC 02...

I'm obviously doing something wrong, but this is at least closer than what I got when I tried to do it in a "" statement.

def import_df():
    new_df = pd.read_csv(os.path.join(pathname, f), delimiter = ',')
    new_df = new_df.drop_duplicates(subset = 'Mass Flow (kg/hr)')
    new_df = new_df.sort_values('Mass Flow (kg/hr)')
    reduced_df = new_df.iloc[0]
    current_mass_flow = new_df['Mass Flow (kg/hr)'].iloc[0]
    i = 1
    for value in new_df['Mass Flow (kg/hr)']:
        if value < current_mass_flow + 15:
            reduced_df.loc[i] = new_df.loc[new_df['Mass Flow (kg/hr)'] == value]
            current_mass_flow = value
            i += 1
        else: next

    return reduced_df

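What should I do to correct this? It clearly isn't adding rows to the dataframe the way I expect. I must be missing some of the finer points of how to add rows to a dataframe.

Also, I can't help but feel that there is a simpler / more direct way to accomplish what I'm trying to do.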

Sample Source Data 

    TimeStamp (s)   TC 01 (C)   TC 02 (C)   TC 03 (C)   TC 32 (C)   Product Back Pressure (kPa) Product Mass Flow (kg/hr)   Semtech Flow (kg/hr)    Mass Flow (kg/hr)   Voltage (V) Angle (degrees)
0   0.004   493.2881108 296.1245877 255.8202916 26.3430426  0.297276487 147.4692621 30.21243527 30.21243527 1.634457337 0
1   0.178   493.2881108 296.1245877 255.8202916 26.3430426  0.283227103 147.4692621 30.21243527 30.21243527 1.634457337 0
2   1.178   493.1325481 296.155699  255.8514043 26.3430426  0.283227103 144.5363918 31.06903075 31.06903075 1.634457337 0
3   2.178   493.0703231 296.2490329 255.8825171 26.3430426  0.289335716 141.244467  31.06903075 31.06903075 1.634457337 0
4   3.178   492.4480726 296.373478  255.8825171 26.40525146 0.292389668 141.244467  29.73651711 29.73651711 1.634139868 0
5   4.178   493.2881108 296.373478  255.9136299 26.3430426  0.292389668 146.0926428 30.40291693 30.40291693 1.634457337 0
6   5.178   493.2881108 296.4357006 255.8825171 26.40525146 0.289742626 146.0926428 30.40291693 30.40291693 1.634457337 0
7   6.178   492.8836479 296.4045893 255.9136299 26.40525146 0.281191135 146.0926428 30.78359426 30.78359426 1.634139868 0
8   7.178   493.1325481 296.373478  255.9447427 26.40525146 0.281191135 146.2123624 30.02223961 30.02223961 1.634457337 0
90  959.629 442.3250036 300.5424521 264.6564452 27.77387677 0.593127726 203.9719224 44.39531112 44.39531112 1.635409746 0
91  960.629 442.231666  300.5424521 264.6564452 27.77387677 0.599643603 203.9719224 44.77598845 44.77598845 1.634457337 0
92  961.629 441.3605153 300.3557844 264.6564452 27.77387677 0.58966651  199.4828012 44.77598845 44.77598845 1.634457337 0
93  962.629 441.0493901 300.3557844 264.6253324 27.77387677 0.58966651  199.1237467 43.63367047 43.63367047 1.634774807 0
94  963.629 441.0493901 300.1691166 264.531994  27.77387677 0.58885198  199.1237467 43.63367047 43.63367047 1.635092276 0
95  964.629 441.2360652 300.3868956 264.531994  27.77387677 0.588444716 203.8522028 43.63367047 43.63367047 1.635092276 0
96  965.629 441.4849654 300.3557844 264.4697685 27.77387677 0.588444716 199.1237467 43.63367047 43.63367047 1.634139868 0
97  966.629 441.3916279 300.2935618 264.4697685 27.77387677 0.597403826 199.1237467 44.39531112 44.39531112 1.633823352 0
98  967.629 441.7338656 300.4802295 264.531994  27.77387677 0.592720461 203.8522028 44.39531112 44.39531112 1.634139868 0
99  968.629 441.2982903 300.6046747 264.6253324 27.77387677 0.592720461 203.9719224 43.63367047 43.63367047 1.634139868 0
100 969.629 441.578303  300.6980086 264.687558  27.77387677 0.606769845 203.9719224 45.06142494 45.06142494 1.634139868 0
101 970.629 441.8894282 300.5735634 264.687558  27.77387677 0.594145709 200.3806463 45.06142494 45.06142494 1.635092276 0

Desired output

TimeStamp (s)   TC 01 (C)   TC 02 (C)   TC 03 (C)   TC 32 (C)   Product Back Pressure (kPa) Product Mass Flow (kg/hr)   Semtech Flow (kg/hr)    Mass Flow (kg/hr)   Voltage (V) Angle (degrees)
13  12.178  493.008098  296.2490329 255.8825171 26.3430426  0.31682341  146.0327308 29.26059896 29.26059896 1.634139868 0
77  947.156 443.7872922 301.3202954 264.9986859 27.74277234 0.613081913 199.8419601 44.39531112 44.39531112 1.637947595 0.158889819

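3 Answers:

Answer 0: (score: 2)

Check this code and let me know if this is what you are looking for. pandas.DataFrame.iterrows is used to loop over the records and check the mass flow.

pandas.DataFrame.iterrows: a generator that iterates over the rows of the frame.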

import pandas as pd

# Raw string so the backslashes in the Windows path are not treated as escapes
new_df = pd.read_csv(r'C:\Shijo\Python\sample.txt', delimiter=',')
new_df = new_df.drop_duplicates(subset='Mass Flow (kg/hr)')
new_df = new_df.sort_values('Mass Flow (kg/hr)')
new_df = new_df.reset_index(drop=True)
current_mass_flow = new_df.iloc[0]['Mass Flow (kg/hr)']

# Positions of the rows to keep. The list is seeded with row 1, which is what
# produced the output shown below; seed it with 0 to keep the baseline row.
indexlst = [1]

for index, row in new_df.iterrows():
    # Keep a row once it is more than 15 kg/hr above the last kept value
    if row['Mass Flow (kg/hr)'] > current_mass_flow + 15:
        print("Matching index: ", index)
        indexlst.append(index)
        current_mass_flow = row['Mass Flow (kg/hr)']

reduced_df = new_df.iloc[indexlst]
print(reduced_df)

Output

   TimeStamp (s)   TC 01 (C)   TC 02 (C)   TC 03 (C)  TC 32 (C)  Product Back Pressure (kPa)  Product Mass Flow (kg/hr)    Semtech Flow (kg/hr)  Mass Flow (kg/hr)  Voltage (V)  Angle (degrees)   
1          7.178  493.132548  296.373478  255.944743  26.405251                     0.281191                 146.212362               30.022240          30.022240     1.634457                 0  
8        960.629  442.231666  300.542452  264.656445  27.773877                     0.599644                 203.971922               44.775988          44.775988     1.634457                 0  
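
To connect this back to the question's code: new_df.iloc[0] returns a Series rather than a one-row DataFrame, so each reduced_df.loc[i] = ... stores a whole DataFrame as a single element, which matches the odd result shown in the question. The test value < current_mass_flow + 15 also picks the rows that are too close instead of the ones far enough apart. Below is a minimal sketch of the same greedy idea wrapped in a function. The name thin_by_spacing and the demo values are made up for illustration, and it uses >= because the question says "at least 15", whereas the loop above uses a strict >.

```python
import pandas as pd


def thin_by_spacing(df, column, min_gap):
    """Keep the first row, then each row at least `min_gap` above the last kept one."""
    # Deduplicate and sort on the column of interest, as in the question.
    ordered = (
        df.drop_duplicates(subset=column)
        .sort_values(column)
        .reset_index(drop=True)
    )
    if ordered.empty:
        return ordered

    kept = [0]  # positional index of the baseline (smallest) row
    last_kept = ordered[column].iloc[0]
    for pos, value in enumerate(ordered[column]):
        if value >= last_kept + min_gap:
            kept.append(pos)
            last_kept = value

    # Select all kept rows in one step instead of appending row by row.
    return ordered.iloc[kept]


# Made-up demo values, not the asker's measurements.
demo = pd.DataFrame(
    {
        "Mass Flow (kg/hr)": [30.0, 31.0, 30.0, 46.0, 44.0, 62.0, 75.0],
        "Voltage (V)": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6],
    }
)
reduced = thin_by_spacing(demo, "Mass Flow (kg/hr)", 15)
print(reduced)
```

Collecting positions in a list and indexing once with iloc avoids growing a DataFrame inside the loop, which is where the original attempt went wrong.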

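Answer 1: (score: 1)

It looks like you are looking for pandas.cut, where you can specify the bins (for example with np.arange). For an example, see this question.

Edit: using groupby and cut: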

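import numpy as np
import pandas as pd

col = 'Mass Flow (kg/hr)'
# Extend the edges one full step past the max so the largest values get a bin
bins = np.arange(new_df[col].min(), new_df[col].max() + 15, 15)
# include_lowest=True keeps the minimum value inside the first bin
binned = pd.cut(new_df[col], bins, include_lowest=True)
new_df.groupby(binned, observed=True).apply(lambda x: x.loc[x[col].idxmin()])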
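
As a small illustration of how pandas.cut assigns values to bins, here is a sketch on a made-up Series (the numbers are hypothetical, not taken from the question's file). It assumes that the lowest value in each occupied bin is the representative you want.

```python
import numpy as np
import pandas as pd

flow = pd.Series([30.0, 31.0, 44.0, 46.0, 62.0, 75.0], name="Mass Flow (kg/hr)")

# Edges start at the minimum and extend one full step past the maximum
# so that every value falls inside some bin.
edges = np.arange(flow.min(), flow.max() + 15, 15)

# include_lowest=True closes the first interval on the left, so the
# minimum itself is binned instead of becoming NaN.
bins = pd.cut(flow, edges, include_lowest=True)

# One representative per occupied bin: the smallest value in it.
lowest_per_bin = flow.groupby(bins, observed=True).min()
print(lowest_per_bin.tolist())
```

Without include_lowest=True the default right-closed intervals would leave the minimum value outside every bin.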

Answer 2: (score: 0)

Apart from the duplicate removal, wouldn't it be interesting to split your data into "reproducible" bins?

df['bins'] = df['Mass Flow (kg/hr)'] // 15                        # Create bins from 0
df.groupby(['bins'], as_index=False).mean().drop('bins', axis=1)  # Get mean values
                                                                  # and remove "bin" column

Otherwise, you can apply the same process after normalizing against the minimum value of the series.

df['bins'] = (df['Mass Flow (kg/hr)'] - df['Mass Flow (kg/hr)'].min()) // 15  # Normalized bins
df.groupby(['bins'], as_index=False).mean().drop('bins', axis=1)

Note that this code has not been tested directly on your data (I could not use pandas.read_clipboard()), but it works on a "similar" dataframe.
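
One caveat worth checking with any binning approach: it gives one row per 15-wide band, which is not quite the same as guaranteeing that neighbouring rows are at least 15 apart. The sketch below uses made-up numbers and takes the lowest row of each bin instead of the mean, purely to make the effect easy to see.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Mass Flow (kg/hr)": [29.0, 31.0, 58.0, 59.0, 62.0],
        "Voltage (V)": [1.0, 1.1, 1.2, 1.3, 1.4],
    }
)
col = "Mass Flow (kg/hr)"

# Bin number relative to the series minimum, in steps of 15.
bin_id = ((df[col] - df[col].min()) // 15).rename("bin")

# First (lowest) row of each bin after sorting by the column.
representatives = df.sort_values(col).groupby(bin_id).head(1)
print(representatives)
```

Here 58.0 and 59.0 land in adjacent bins and are both kept even though they are only 1 kg/hr apart, so if the strict spacing matters, a sequential pass over the sorted values is the safer choice.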