Why does Z-score normalization in a Pandas DataFrame produce NaN columns?

Time: 2018-05-05 23:10:47

Tags: python python-3.x pandas statistics

I am using scipy's Z-score to normalize my dataset, as follows:

import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import zscore

df = pd.DataFrame(pd.read_csv('dataset.csv', sep=','))
df = df.dropna(how='any') # drop nan entries
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)] # remove outliers

print(df.describe())
df = df.apply(zscore) # Normalization
print(df.describe())

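However, some columns turn into NaN, in particular mta_tax and trip_type, as shown below, even though they were numeric before I applied the Z-score normalization. Is this a bug in my code, or can Z-score produce NaN?

Before normalization: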

           VendorID    RatecodeID  PULocationID  DOLocationID  \
count  1.055286e+07  1.055286e+07  1.055286e+07  1.055286e+07   
mean   1.794324e+00  1.000000e+00  1.106734e+02  1.285285e+02   
std    4.041947e-01  4.353414e-04  7.541486e+01  7.729142e+01   
min    1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00   
25%    2.000000e+00  1.000000e+00  4.900000e+01  6.100000e+01   
50%    2.000000e+00  1.000000e+00  8.200000e+01  1.290000e+02   
75%    2.000000e+00  1.000000e+00  1.660000e+02  1.930000e+02   
max    2.000000e+00  2.000000e+00  2.650000e+02  2.650000e+02   

       passenger_count  trip_distance   fare_amount         extra     mta_tax  \
count     1.055286e+07   1.055286e+07  1.055286e+07  1.055286e+07  10552857.0   
mean      1.140647e+00   2.399851e+00  1.082419e+01  3.607218e-01         0.5   
std       4.436568e-01   2.014673e+00  6.464638e+00  3.797668e-01         0.0   
min       0.000000e+00   0.000000e+00  0.000000e+00 -6.700000e-01         0.5   
25%       1.000000e+00   1.000000e+00  6.000000e+00  0.000000e+00         0.5   
50%       1.000000e+00   1.700000e+00  9.000000e+00  5.000000e-01         0.5   
75%       1.000000e+00   3.120000e+00  1.350000e+01  5.000000e-01         0.5   
max       4.000000e+00   1.117000e+01  4.100000e+01  1.000000e+00         0.5   

         tip_amount  tolls_amount  improvement_surcharge  total_amount  \
count  1.055286e+07  1.055286e+07           1.055286e+07  1.055286e+07   
mean   1.028691e+00  5.512108e-02           3.000000e-01  1.312541e+01   
std    1.510206e+00  5.524008e-01           4.110357e-11  7.370554e+00   
min    0.000000e+00  0.000000e+00           3.000000e-01  0.000000e+00   
25%    0.000000e+00  0.000000e+00           3.000000e-01  7.800000e+00   
50%    0.000000e+00  0.000000e+00           3.000000e-01  1.080000e+01   
75%    1.860000e+00  0.000000e+00           3.000000e-01  1.630000e+01   
max    7.660000e+00  8.000000e+00           3.000000e-01  4.877000e+01   

       payment_type   trip_type  
count  1.055286e+07  10552857.0  
mean   1.501672e+00         1.0  
std    5.061254e-01         0.0  
min    1.000000e+00         1.0  
25%    1.000000e+00         1.0  
50%    1.000000e+00         1.0  
75%    2.000000e+00         1.0  
max    3.000000e+00         1.0 

After normalization:

           VendorID    RatecodeID  PULocationID  DOLocationID  \
count  1.055286e+07  1.055286e+07  1.055286e+07  1.055286e+07   
mean  -1.235870e-12  1.006184e-13 -3.819625e-14 -1.004818e-14   
std    1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00   
min   -1.965201e+00 -4.353414e-04 -1.454268e+00 -1.649970e+00   
25%    5.088537e-01 -4.353414e-04 -8.177886e-01 -8.736870e-01   
50%    5.088537e-01 -4.353414e-04 -3.802090e-01  6.100220e-03   
75%    5.088537e-01 -4.353414e-04  7.336298e-01  8.341353e-01   
max    5.088537e-01  2.297048e+03  2.046369e+00  1.765675e+00   

       passenger_count  trip_distance   fare_amount         extra  mta_tax  \
count     1.055286e+07   1.055286e+07  1.055286e+07  1.055286e+07      0.0   
mean     -3.942620e-14  -3.206434e-14 -4.744100e-13 -1.042732e-12      NaN   
std       1.000000e+00   1.000000e+00  1.000000e+00  1.000000e+00      NaN   
min      -2.571013e+00  -1.191187e+00 -1.674370e+00 -2.714092e+00      NaN   
25%      -3.170185e-01  -6.948283e-01 -7.462435e-01 -9.498508e-01      NaN   
50%      -3.170185e-01  -3.473773e-01 -2.821804e-01  3.667467e-01      NaN   
75%      -3.170185e-01   3.574519e-01  4.139144e-01  3.667467e-01      NaN   
max       6.444966e+00   4.353138e+00  4.667827e+00  1.683344e+00      NaN   

         tip_amount  tolls_amount  improvement_surcharge  total_amount  \
count  1.055286e+07  1.055286e+07             10552857.0  1.055286e+07   
mean   3.152945e-13 -2.877092e-14                   -1.0  2.081611e-14   
std    1.000000e+00  1.000000e+00                    0.0  1.000000e+00   
min   -6.811593e-01 -9.978459e-02                   -1.0 -1.780791e+00   
25%   -6.811593e-01 -9.978459e-02                   -1.0 -7.225258e-01   
50%   -6.811593e-01 -9.978459e-02                   -1.0 -3.155007e-01   
75%    5.504607e-01 -9.978459e-02                   -1.0  4.307119e-01   
max    4.390996e+00  1.438246e+01                   -1.0  4.836080e+00   

       payment_type  trip_type  
count  1.055286e+07        0.0  
mean   1.387184e-12        NaN  
std    1.000000e+00        NaN  
min   -9.912012e-01        NaN  
25%   -9.912012e-01        NaN  
50%   -9.912012e-01        NaN  
75%    9.845937e-01        NaN  
max    2.960389e+00        NaN 

Thanks

1 answer:

Answer 0: (score: 5)

To follow up on the comments, consider the following:

df = pd.DataFrame({'a': [1,2,3], 'b': [2,2,2], 'c': [5,6,7], 'd':[8,8,8] })

Note that 'b' and 'd' are constant, which means their standard deviation is 0. Applying the zscore function means subtracting the mean and dividing by the standard deviation. In a constant column every value minus the mean is 0, so the result is 0 divided by 0, which is undefined, and Pandas shows NaN.

df.apply(zscore)
Out[8]: 
          a   b         c   d
0 -1.224745 NaN -1.224745 NaN
1  0.000000 NaN  0.000000 NaN
2  1.224745 NaN  1.224745 NaN

You need to either apply the zscore function only to selected columns, or change the function so that it skips constant columns. To apply the function only to selected columns, you can do the following:

df[['a','c']] = df[['a','c']].apply(zscore)

df
Out[9]: 
          a  b         c  d
0 -1.224745  2 -1.224745  8
1  0.000000  2  0.000000  8
2  1.224745  2  1.224745  8

To have the function check whether a column is constant, wrap zscore in a function that returns the column unchanged if its standard deviation is 0, and otherwise returns the standardized column.
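
A minimal sketch of such a wrapper. It makes two assumptions. First, the helper name `safe_zscore` is hypothetical. Second, the check uses `np.isclose` instead of an exact comparison with 0, so that near-constant columns are also left alone. One such column is `improvement_surcharge` in the question, whose standard deviation is about 4e-11. Whether near-constant columns should be skipped is a judgment call for your data.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore


def safe_zscore(col):
    """Standardize a column, or return it unchanged if it is (nearly) constant.

    Hypothetical helper: the name and the tolerance-based check are assumptions.
    """
    # ddof=0 is the population standard deviation, which is also
    # the default that scipy.stats.zscore divides by.
    if np.isclose(col.std(ddof=0), 0):
        return col
    # Wrap the result in a Series so every column comes back with the same index.
    return pd.Series(zscore(col), index=col.index)


df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 2, 2], 'c': [5, 6, 7], 'd': [8, 8, 8]})
result = df.apply(safe_zscore)
print(result)
```

With this sample frame, 'a' and 'c' are standardized as before, while 'b' and 'd' keep their original values instead of becoming NaN. If the constant columns carry no information for your analysis, dropping them before normalizing may be the simpler option.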