Pandas: delete a row when a column contains a specific kind of value

Date: 2016-01-24 22:16:49

Tags: python pandas

I have a DF like this:

         UNIT  EXITSn_hourly           Interval
1867     R081            104  00:00:00-04:00:00
1868     R081              0  04:00:00-04:00:00
1869     R081            129  04:00:00-08:00:00
1870     R081            521  08:00:00-12:00:00
1871     R081           1048  12:00:00-16:00:00
2838     R032             38  00:00:00-04:00:00
2839     R032              0  04:00:00-04:00:00
2840     R032             89  04:00:00-08:00:00
2841     R032            470  08:00:00-12:00:00

I need to delete the entire row whenever Interval has this particular format:

04:00:00-04:00:00

I want to delete not only row 1868 (R081, 0, 04:00:00-04:00:00) but also rows with similar values such as:

01:00:00-01:00:00

This is actually my original df, which I used to create the Interval column with my own code:

    C/A  UNIT       SCP     DATEn     TIMEn    DESCn  ENTRIESn   EXITSn
0  A002  R051  02-00-00  06-29-13  00:00:00  REGULAR   4174592  1433672
1  A002  R051  02-00-00  06-29-13  04:00:00  REGULAR   4174628  1433675
2  A002  R051  02-00-00  06-29-13  08:00:00  REGULAR   4174641  1433706
3  A002  R051  02-00-00  06-29-13  12:00:00  REGULAR   4174741  1433775
4  A002  R051  02-00-00  06-29-13  16:00:00  REGULAR   4174936  1433826
5  A002  R051  02-00-00  06-29-13  20:00:00  REGULAR   4175270  1433877
6  A002  R051  02-00-00  06-30-13  00:00:00  REGULAR   4175403  1433908
7  A002  R051  02-00-00  06-30-13  04:00:00  REGULAR   4175441  1433914
8  A002  R051  02-00-00  06-30-13  08:00:00  REGULAR   4175457  1433928
9  A002  R051  02-00-00  06-30-13  12:00:00  REGULAR   4175520  1433981

2 answers:

Answer 0 (score: 0)

You may want to split Interval into Interval_start and Interval_end and check whether they are equal:

df['Interval_start'] = df['Interval'].map(lambda s: s.split('-')[0])
df['Interval_end'] = df['Interval'].map(lambda s: s.split('-')[1])
df.query("Interval_start != Interval_end")

      UNIT  EXITSn_hourly           Interval Interval_start Interval_end
1867  R081            104  00:00:00-04:00:00       00:00:00     04:00:00
1869  R081            129  04:00:00-08:00:00       04:00:00     08:00:00
1870  R081            521  08:00:00-12:00:00       08:00:00     12:00:00
1871  R081           1048  12:00:00-16:00:00       12:00:00     16:00:00
2838  R032             38  00:00:00-04:00:00       00:00:00     04:00:00
2840  R032             89  04:00:00-08:00:00       04:00:00     08:00:00
2841  R032            470  08:00:00-12:00:00       08:00:00     12:00:00
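
The same filter can probably be written without adding helper columns. The following is only a minimal sketch: the sample frame is retyped from the question, and the names `parts` and `filtered` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample data retyped from the question (original index values kept).
df = pd.DataFrame(
    {
        "UNIT": ["R081"] * 5 + ["R032"] * 4,
        "EXITSn_hourly": [104, 0, 129, 521, 1048, 38, 0, 89, 470],
        "Interval": [
            "00:00:00-04:00:00", "04:00:00-04:00:00", "04:00:00-08:00:00",
            "08:00:00-12:00:00", "12:00:00-16:00:00", "00:00:00-04:00:00",
            "04:00:00-04:00:00", "04:00:00-08:00:00", "08:00:00-12:00:00",
        ],
    },
    index=[1867, 1868, 1869, 1870, 1871, 2838, 2839, 2840, 2841],
)

# Split "start-end" once into two columns: 0 holds the start, 1 the end.
parts = df["Interval"].str.split("-", expand=True)

# Keep only the rows whose start time differs from the end time.
filtered = df[parts[0] != parts[1]]
print(filtered)
```

Boolean indexing returns a new frame, so `df` itself is unchanged; assign the result back if the rows should be dropped for good.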

Answer 1 (score: 0)

You can compare parts of the string and then drop the rows by subsetting:

print(df.Interval.str[0:2])
1867    00
1868    04
1869    04
1870    08
1871    12
2838    00
2839    04
2840    04
2841    08
Name: Interval, dtype: object

print(df.Interval.str[0:2] != df.Interval.str[9:11])
1867     True
1868    False
1869     True
1870     True
1871     True
2838     True
2839    False
2840     True
2841     True
Name: Interval, dtype: bool

print(df[df.Interval.str[0:2] != df.Interval.str[9:11]])
      UNIT  EXITSn_hourly           Interval
1867  R081            104  00:00:00-04:00:00
1869  R081            129  04:00:00-08:00:00
1870  R081            521  08:00:00-12:00:00
1871  R081           1048  12:00:00-16:00:00
2838  R032             38  00:00:00-04:00:00
2840  R032             89  04:00:00-08:00:00
2841  R032            470  08:00:00-12:00:00

Edit:

I checked your code; perhaps you can drop copy.deepcopy and use the DataFrame's own copy method instead:

df = turnstile_data.copy(deep=True)

df['ENTRIESn_hourly'] = (df['ENTRIESn'] - df['ENTRIESn'].shift(periods=1)).fillna(0)
df['EXITSn_hourly'] = (df['EXITSn'] - df['EXITSn'].shift(periods=1)).fillna(0)
df['Interval'] = (df['TIMEn'].shift(periods=1)+'-'+ df['TIMEn']).fillna(0)

df.loc[(df['ENTRIESn'] == 0), 'ENTRIESn_hourly'] = 0
df.loc[(df['EXITSn'] == 0), 'EXITSn_hourly'] = 0
df.loc[(df['C/A'] != df['C/A'].shift(periods=1)) |
       (df['UNIT'] != df['UNIT'].shift(periods=1)) |
       (df['SCP'] != df['SCP'].shift(periods=1)),
       ['ENTRIESn_hourly', 'EXITSn_hourly', 'Interval']] = 0

print(df.head(5))
   ENTRIESn_hourly  EXITSn_hourly           Interval  
0                0              0                  0  
1               36              3  00:00:00-04:00:00  
2               13             31  04:00:00-08:00:00  
3              100             69  08:00:00-12:00:00  
4              195             51  12:00:00-16:00:00  

required_df = df[['UNIT', 'EXITSn_hourly', 'Interval']].groupby(df.UNIT)

print(required_df.head(5))
   UNIT  EXITSn_hourly           Interval
0  R051              0                  0
1  R051              3  00:00:00-04:00:00
2  R051             31  04:00:00-08:00:00
3  R051             69  08:00:00-12:00:00
4  R051             51  12:00:00-16:00:00
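
An Interval column built this way mixes strings with the integer placeholder 0, so a filter that calls `s.split('-')` on every value would fail on those placeholders. Below is a hedged sketch of a filter that tolerates both kinds of value; the helper `is_zero_length` and the small sample frame are assumptions made for illustration, not part of the original code.

```python
import pandas as pd

# Hypothetical slice of the frame built above: 0 marks the first reading of a
# turnstile, every other value is a "start-end" string.
df = pd.DataFrame(
    {
        "EXITSn_hourly": [0, 3, 0, 31],
        "Interval": [0, "00:00:00-04:00:00", "04:00:00-04:00:00", "04:00:00-08:00:00"],
    }
)


def is_zero_length(value):
    """Return True only for 'start-end' strings whose two halves are equal."""
    if not isinstance(value, str):
        return False  # placeholder such as 0: not an interval at all
    start, sep, end = value.partition("-")
    return sep == "-" and start == end


# Boolean mask of the rows to drop, then keep everything else.
zero_length = df["Interval"].map(is_zero_length)
cleaned = df[~zero_length]
print(cleaned)
```

This sketch keeps the placeholder rows; whether they should be dropped as well depends on how the hourly counts are used afterwards.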