I have a dataframe that looks like this:
+---+-------------+---------+---------------+---------------+---------+------+--------------------------+-----+----------+
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
+---+-------------+---------+---------------+---------------+---------+------+--------------------------+-----+----------+
| 0 | 6 | 148.0 | 72.0 | 35.0 | 125.0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | 125.0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | 29.0 | 125.0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
+---+-------------+---------+---------------+---------------+---------+------+--------------------------+-----+----------+
After looking at a boxplot of each variable, I found that they contain outliers.
So, in every column other than Outcome, I want to replace values greater than the 95th percentile with the value at the 75th percentile, and values less than the 5th percentile with the value at the 25th percentile, of that particular column.
For example, in the Glucose column, values above the 95th percentile should be replaced with the 75th percentile value of the Glucose column.
How can I do this using pandas filtering and a percentile function?
Any help would be appreciated.
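To make the example reproducible, the five rows shown in the table can be built as a DataFrame roughly like this. This is only a sketch: the values are copied from the table, and the final line simply prints the per-column percentiles the question refers to.

```python
import pandas as pd

# Sample rows copied from the table in the question
df = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148.0, 85.0, 183.0, 89.0, 137.0],
    "BloodPressure": [72.0, 66.0, 64.0, 66.0, 40.0],
    "SkinThickness": [35.0, 29.0, 29.0, 23.0, 35.0],
    "Insulin": [125.0, 125.0, 125.0, 94.0, 168.0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# 5th, 25th, 75th and 95th percentiles of every column except Outcome
print(df.drop(columns="Outcome").quantile([0.05, 0.25, 0.75, 0.95]))
```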
Answer 0 (score: 3)
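You can use apply on all columns other than Outcome, together with the np.clip and np.percentile functions. This clips every value to the range between the column's 25th and 75th percentiles: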
import numpy as np
percentile_df = df.set_index('Outcome').apply(lambda x: np.clip(x, *np.percentile(x, [25,75]))).reset_index()
>>> percentile_df
   Outcome  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0        1          6.0    148.0           66.0           35.0    125.0  33.6
1        0          1.0     89.0           66.0           29.0    125.0  26.6
2        1          6.0    148.0           64.0           29.0    125.0  26.6
3        0          1.0     89.0           66.0           29.0    125.0  28.1
4        1          1.0    137.0           64.0           35.0    125.0  33.6

   DiabetesPedigreeFunction   Age
0                     0.627  33.0
1                     0.351  31.0
2                     0.672  32.0
3                     0.351  31.0
4                     0.672  33.0
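To see what the clipping step does to a single column, here is a small sketch using the Glucose values from the question (the variable names are only for illustration). np.percentile returns the two bounds, and np.clip pulls every value outside them back to the nearest bound.

```python
import numpy as np

# Glucose values from the sample rows in the question
glucose = np.array([148.0, 85.0, 183.0, 89.0, 137.0])

# 25th and 75th percentiles, used as lower and upper bounds
low, high = np.percentile(glucose, [25, 75])
print(low, high)  # 89.0 148.0

# Values below `low` become `low`, values above `high` become `high`
clipped = np.clip(glucose, low, high)
print(clipped.tolist())  # [148.0, 89.0, 148.0, 89.0, 137.0]
```

Note that on a larger dataset this changes every value outside the 25th to 75th percentile range, not only those beyond the 5th and 95th percentiles.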
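[EDIT] I misunderstood the question at first. Here is a way to use np.select to replace values below the 5th percentile and above the 95th percentile with the 25th and 75th percentile values, respectively: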
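def cut(column):
    # Conditions are checked in order: upper tail first, then lower tail
    conds = [column > np.percentile(column, 95),
             column < np.percentile(column, 5)]
    # Replacement values: 75th percentile for the upper tail, 25th for the lower
    choices = [np.percentile(column, 75),
               np.percentile(column, 25)]
    # Values matching neither condition are kept unchanged
    return np.select(conds, choices, column)

df.set_index('Outcome', inplace=True)
df = df.apply(cut).reset_index()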
>>> df
   Outcome  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0        1          6.0    148.0           66.0           35.0    125.0  33.6
1        0          1.0     89.0           66.0           29.0    125.0  26.6
2        1          6.0    148.0           64.0           29.0    125.0  26.6
3        0          1.0     89.0           66.0           29.0    125.0  28.1
4        1          1.0    137.0           64.0           35.0    125.0  33.6

   DiabetesPedigreeFunction   Age
0                     0.627  33.0
1                     0.351  31.0
2                     0.672  32.0
3                     0.351  31.0
4                     0.672  33.0
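As a possible alternative, the same replacement can be sketched with pandas methods only, using Series.quantile and Series.mask instead of np.select. The helper name replace_tails and the two-column sample frame are assumptions made for illustration; this is not part of the answer above.

```python
import pandas as pd

def replace_tails(col: pd.Series) -> pd.Series:
    """Replace values beyond the 5th/95th percentiles with the 25th/75th percentile values."""
    q05, q25, q75, q95 = col.quantile([0.05, 0.25, 0.75, 0.95])
    # Both masks are computed from the original column before anything is replaced
    too_low = col < q05
    too_high = col > q95
    return col.mask(too_low, q25).mask(too_high, q75)

# Two feature columns from the question's sample, plus the Outcome label
df = pd.DataFrame({
    "Glucose": [148.0, 85.0, 183.0, 89.0, 137.0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

# Apply the replacement to every column except Outcome
feature_cols = df.columns.drop("Outcome")
df[feature_cols] = df[feature_cols].apply(replace_tails)
print(df)
```

Selecting the feature columns explicitly leaves Outcome in place, so no set_index/reset_index round trip is needed.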