我使用scipy
的Z分数来规范化我的数据集,如下所示:
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import zscore
df = pd.DataFrame(pd.read_csv('dataset.csv', sep=','))
df = df.dropna(how='any') # drop nan entries
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)] # remove outliers
print(df.describe())
df = df.apply(zscore) # Normalization
print(df.describe())
但是,我将某些列更改为NaN
,特别是mta_tax
和trip_type
,如下所示,但在应用Z-score规范化之前它们是数字。这是我的代码中的错误还是Z-score可以生成NaN
?
VendorID RatecodeID PULocationID DOLocationID \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07
mean 1.794324e+00 1.000000e+00 1.106734e+02 1.285285e+02
std 4.041947e-01 4.353414e-04 7.541486e+01 7.729142e+01
min 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
25% 2.000000e+00 1.000000e+00 4.900000e+01 6.100000e+01
50% 2.000000e+00 1.000000e+00 8.200000e+01 1.290000e+02
75% 2.000000e+00 1.000000e+00 1.660000e+02 1.930000e+02
max 2.000000e+00 2.000000e+00 2.650000e+02 2.650000e+02
passenger_count trip_distance fare_amount extra mta_tax \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07 10552857.0
mean 1.140647e+00 2.399851e+00 1.082419e+01 3.607218e-01 0.5
std 4.436568e-01 2.014673e+00 6.464638e+00 3.797668e-01 0.0
min 0.000000e+00 0.000000e+00 0.000000e+00 -6.700000e-01 0.5
25% 1.000000e+00 1.000000e+00 6.000000e+00 0.000000e+00 0.5
50% 1.000000e+00 1.700000e+00 9.000000e+00 5.000000e-01 0.5
75% 1.000000e+00 3.120000e+00 1.350000e+01 5.000000e-01 0.5
max 4.000000e+00 1.117000e+01 4.100000e+01 1.000000e+00 0.5
tip_amount tolls_amount improvement_surcharge total_amount \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07
mean 1.028691e+00 5.512108e-02 3.000000e-01 1.312541e+01
std 1.510206e+00 5.524008e-01 4.110357e-11 7.370554e+00
min 0.000000e+00 0.000000e+00 3.000000e-01 0.000000e+00
25% 0.000000e+00 0.000000e+00 3.000000e-01 7.800000e+00
50% 0.000000e+00 0.000000e+00 3.000000e-01 1.080000e+01
75% 1.860000e+00 0.000000e+00 3.000000e-01 1.630000e+01
max 7.660000e+00 8.000000e+00 3.000000e-01 4.877000e+01
payment_type trip_type
count 1.055286e+07 10552857.0
mean 1.501672e+00 1.0
std 5.061254e-01 0.0
min 1.000000e+00 1.0
25% 1.000000e+00 1.0
50% 1.000000e+00 1.0
75% 2.000000e+00 1.0
max 3.000000e+00 1.0
VendorID RatecodeID PULocationID DOLocationID \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07
mean -1.235870e-12 1.006184e-13 -3.819625e-14 -1.004818e-14
std 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
min -1.965201e+00 -4.353414e-04 -1.454268e+00 -1.649970e+00
25% 5.088537e-01 -4.353414e-04 -8.177886e-01 -8.736870e-01
50% 5.088537e-01 -4.353414e-04 -3.802090e-01 6.100220e-03
75% 5.088537e-01 -4.353414e-04 7.336298e-01 8.341353e-01
max 5.088537e-01 2.297048e+03 2.046369e+00 1.765675e+00
passenger_count trip_distance fare_amount extra mta_tax \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07 0.0
mean -3.942620e-14 -3.206434e-14 -4.744100e-13 -1.042732e-12 NaN
std 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 NaN
min -2.571013e+00 -1.191187e+00 -1.674370e+00 -2.714092e+00 NaN
25% -3.170185e-01 -6.948283e-01 -7.462435e-01 -9.498508e-01 NaN
50% -3.170185e-01 -3.473773e-01 -2.821804e-01 3.667467e-01 NaN
75% -3.170185e-01 3.574519e-01 4.139144e-01 3.667467e-01 NaN
max 6.444966e+00 4.353138e+00 4.667827e+00 1.683344e+00 NaN
tip_amount tolls_amount improvement_surcharge total_amount \
count 1.055286e+07 1.055286e+07 10552857.0 1.055286e+07
mean 3.152945e-13 -2.877092e-14 -1.0 2.081611e-14
std 1.000000e+00 1.000000e+00 0.0 1.000000e+00
min -6.811593e-01 -9.978459e-02 -1.0 -1.780791e+00
25% -6.811593e-01 -9.978459e-02 -1.0 -7.225258e-01
50% -6.811593e-01 -9.978459e-02 -1.0 -3.155007e-01
75% 5.504607e-01 -9.978459e-02 -1.0 4.307119e-01
max 4.390996e+00 1.438246e+01 -1.0 4.836080e+00
payment_type trip_type
count 1.055286e+07 0.0
mean 1.387184e-12 NaN
std 1.000000e+00 NaN
min -9.912012e-01 NaN
25% -9.912012e-01 NaN
50% -9.912012e-01 NaN
75% 9.845937e-01 NaN
max 2.960389e+00 NaN
谢谢
答案 0 :(得分:5)
要跟进评论,请查看以下内容:
@SpringBootTest(
properties = {
"spring.jpa.properties.hibernate.jdbc.time_zone=UTC",
"logging.level.org.springframework.data.auditing=DEBUG"
})
@RunWith(SpringRunner.class)
@DataJpaTest
@Import(JpaConfig.class)
public class InteractionRepoTest {
@Autowired
InteractionRepo dao;
@Test
public void testInsertSingle() {
ConsumerInteraction i = new ConsumerInteraction();
i.setCustomId(290050);
i = dao.save(i);
dao.flush();
...
请注意&#39; b&#39;并且&#39; d&#39;是常数,这意味着标准偏差为0.应用2018/05/06 00:36:01 DEBUG | main | o.s.d.a.AuditingHandler:161 - Touched ConsumerInteraction{uuid=null, id=290050} - Last modification at java.util.GregorianCalendar[time=1525563361086,areFieldsSet=true,areAllFieldsSet=true,lenient=true,zone=sun.util.calendar.ZoneInfo[id="Europe/London",offset=0,dstSavings=3600000,useDaylight=true,transitions=242,lastRule=java.util.SimpleTimeZone[id=Europe/London,offset=0,dstSavings=3600000,useDaylight=true,startYear=0,startMode=2,startMonth=2,startDay=-1,startDayOfWeek=1,startTime=3600000,startTimeMode=2,endMode=2,endMonth=9,endDay=-1,endDayOfWeek=1,endTime=3600000,endTimeMode=2]],firstDayOfWeek=2,minimalDaysInFirstWeek=4,ERA=1,YEAR=2018,MONTH=4,WEEK_OF_YEAR=18,WEEK_OF_MONTH=1,DAY_OF_MONTH=6,DAY_OF_YEAR=126,DAY_OF_WEEK=1,DAY_OF_WEEK_IN_MONTH=1,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=36,SECOND=1,MILLISECOND=86,ZONE_OFFSET=0,DST_OFFSET=3600000] by xxx
2018/05/06 00:36:05 DEBUG | main | o.s.d.a.AuditingHandler:161 - Touched ConsumerInteraction{uuid=4e38645b-aa13-435d-8416-4bb3792a2482, id=290050} - Last modification at java.util.GregorianCalendar[time=1525563365716,areFieldsSet=true,areAllFieldsSet=true,lenient=true,zone=sun.util.calendar.ZoneInfo[id="Europe/London",offset=0,dstSavings=3600000,useDaylight=true,transitions=242,lastRule=java.util.SimpleTimeZone[id=Europe/London,offset=0,dstSavings=3600000,useDaylight=true,startYear=0,startMode=2,startMonth=2,startDay=-1,startDayOfWeek=1,startTime=3600000,startTimeMode=2,endMode=2,endMonth=9,endDay=-1,endDayOfWeek=1,endTime=3600000,endTimeMode=2]],firstDayOfWeek=2,minimalDaysInFirstWeek=4,ERA=1,YEAR=2018,MONTH=4,WEEK_OF_YEAR=18,WEEK_OF_MONTH=1,DAY_OF_MONTH=6,DAY_OF_YEAR=126,DAY_OF_WEEK=1,DAY_OF_WEEK_IN_MONTH=1,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=36,SECOND=5,MILLISECOND=716,ZONE_OFFSET=0,DST_OFFSET=3600000] by xxx
Hibernate: insert into tInteraction (audit_user_tx, audit_create_ts, audit_upd_user_tx, audit_update_ts, average_value, company_names, consumer, contract_id, id, description, desk, duration, ext_party_id, include_in_vote, interaction_date, interaction_id, interaction_type, interaction_value, override_value, party_name, payer_allocation_name, regions, sectors, sell_side_contacts, status, updated_by, updated_date, user_comment, user_rating, period_id, region, uuid) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
Hibernate: update tInteraction set audit_upd_user_tx=?, audit_update_ts=?, average_value=?, company_names=?, consumer=?, contract_id=?, description=?, desk=?, duration=?, ext_party_id=?, include_in_vote=?, interaction_date=?, interaction_id=?, interaction_type=?, interaction_value=?, override_value=?, party_name=?, payer_allocation_name=?, regions=?, sectors=?, sell_side_contacts=?, status=?, updated_by=?, updated_date=?, user_comment=?, user_rating=? where uuid=?
- 函数意味着减去平均值并除以标准误差。如果将数字除以0,则结果未定义,并且Pandas将显示NaN。
df = pd.DataFrame({'a': [1,2,3], 'b': [2,2,2], 'c': [5,6,7], 'd':[8,8,8] })
您需要将zscore
函数应用于所选列,或更改函数以使其省略常量列。要仅将功能应用于所选列,您可以执行以下操作:
df.apply(zscore)
Out[8]:
a b c d
0 -1.224745 NaN -1.224745 NaN
1 0.000000 NaN 0.000000 NaN
2 1.224745 NaN 1.224745 NaN
要让函数检查列是否为常量,让我们使用zscore
函数,如果列的标准偏差为0,则返回列不变,否则使用标准列。
df[['a','c']] = df[['a','c']].apply(zscore)
df
Out[9]:
a b c d
0 -1.224745 2 -1.224745 8
1 0.000000 2 0.000000 8
2 1.224745 2 1.224745 8