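My df holds sales data for different customers, but it contains some outliers. I want to replace the outliers (values more than 2 SD above or below the mean, i.e. outside μ±2σ) with the mean of their respective customer_id.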
Could someone help me do this with dplyr? Note: all "0" values and all sales outside μ±2σ need to be replaced with the mean for their customer_id.
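Answer 0 (score: 0)
Another way with dplyr :)
I am not entirely sure whether you want the cutoff based on the global mean or grouped by customer, so there are 2 versions.
Edit: to also check for values < mean - 2sd and for zeros, change the first argument of ifelse to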
sales > mean(sales) + 2*sd(sales) | sales < mean(sales) - 2*sd(sales) | sales == 0
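The combined condition could also be wrapped in a small helper so that it reads more easily inside ifelse. This is only a minimal base-R sketch: the name is_outlier and the parameter k are hypothetical and not part of the original answer, and the sample values are customer 9000A's sales from the data given further down.

```r
# Hypothetical helper (not from the original answer): flags values that are
# zero or lie outside mean +/- k * sd of the vector they belong to.
is_outlier <- function(x, k = 2) {
  m <- mean(x)
  s <- sd(x)
  x > m + k * s | x < m - k * s | x == 0
}

# Sales of customer 9000A, copied from the sample data
sales_9000A <- c(20, 40, 0, 42, 56, 90, 500, 23, 60)

flags <- is_outlier(sales_9000A)

# The flagged values: the zero and the 500 spike
sales_9000A[flags]
```

Inside a grouped mutate, the call would then presumably look like `mutate(sales = ifelse(is_outlier(sales), mean(sales), sales))`.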
Code
# version to check for > global mean + 2 * global sd
# if sales-value > global cutoff sales-value gets replaced by customer mean
test_data2 =
test_data %>% group_by(customer_id) %>%
mutate(sales = ifelse(sales > mean(test_data$sales) + 2*sd(test_data$sales), mean(sales), sales))
# version to check for mean per customer + 2 * sd per customer
# if sales-value > customer cutoff sales-value gets replaced by customer mean
test_data2 =
test_data %>% group_by(customer_id) %>%
mutate(sales = ifelse(sales > mean(sales) + 2*sd(sales), mean(sales), sales))
### check if this is what we want
# calc global mean + global sd + cutoff global
mean(test_data$sales)
sd(test_data$sales)
mean(test_data$sales) + 2*sd(test_data$sales)
# calc mean, sd, cutoff for each customer
test_data %>% group_by(customer_id) %>% summarise(mean = mean(sales), sd = sd(sales), cutoff = mean + 2*sd(sales))
test_data$sales2 = test_data2$sales
test_data %>% filter(customer_id == "80A09")
test_data %>% filter(customer_id == "9000A")
test_data %>% filter(customer_id == "Y90BC")
With separate check code, so that the two versions above do not interfere with each other:
library(dplyr)

df = structure(list(Date = c("6/29/2014", "7/6/2014", "7/13/2014",
"7/20/2014", "7/27/2014", "8/3/2014", "8/10/2014", "8/17/2014",
"8/24/2014", "6/29/2014", "7/6/2014", "7/13/2014", "7/20/2014",
"7/27/2014", "8/3/2014", "8/10/2014", "8/17/2014", "8/24/2014",
"7/6/2014", "7/13/2014", "7/20/2014", "7/27/2014", "8/3/2014",
"8/10/2014", "8/17/2014", "8/24/2014"), customer_id = c("9000A",
"9000A", "9000A", "9000A", "9000A", "9000A", "9000A", "9000A",
"9000A", "80A09", "80A09", "80A09", "80A09", "80A09", "80A09",
"80A09", "80A09", "80A09", "Y90BC", "Y90BC", "Y90BC", "Y90BC",
"Y90BC", "Y90BC", "Y90BC", "Y90BC"), sales = c(20L, 40L, 0L,
42L, 56L, 90L, 500L, 23L, 60L, 200L, 234L, 500L, 450L, 0L, 900L,
459L, 347L, 895L, 380L, 390L, 432L, 320L, 400L, 10L, 0L, 1000L
)), class = "data.frame", row.names = c(NA, -26L))
# replace zeros and values outside the per-customer mean +/- 2 sd with the customer mean
test_data = df %>%
  group_by(customer_id) %>%
  mutate(sales = ifelse(sales > mean(sales) + 2*sd(sales) |
                          sales < mean(sales) - 2*sd(sales) |
                          sales == 0,
                        mean(sales), sales))
test_data$sales_old = df$sales
df %>% group_by(customer_id) %>% summarise(mean = mean(sales), sd = sd(sales), cutoff = mean + 2*sd(sales))
test_data %>% filter(customer_id == "80A09" & sales != sales_old)
test_data %>% filter(customer_id == "9000A" & sales != sales_old)
test_data %>% filter(customer_id == "Y90BC" & sales != sales_old)
Output:
> df %>% group_by(customer_id) %>% summarise(mean = mean(sales), sd = sd(sales), cutoff = mean + 2*sd(sales))
# A tibble: 3 x 4
  customer_id  mean    sd cutoff
  <chr>       <dbl> <dbl>  <dbl>
1 80A09       443.   301.  1045.
2 9000A        92.3  155.   402.
3 Y90BC       366.   310.   986.
> test_data %>% filter(customer_id == "80A09" & sales != sales_old)
# A tibble: 1 x 4
# Groups:   customer_id [1]
  Date      customer_id sales sales_old
  <chr>     <chr>       <dbl>     <int>
1 7/27/2014 80A09        443.         0
> test_data %>% filter(customer_id == "9000A" & sales != sales_old)
# A tibble: 2 x 4
# Groups:   customer_id [1]
  Date      customer_id sales sales_old
  <chr>     <chr>       <dbl>     <int>
1 7/13/2014 9000A        92.3         0
2 8/10/2014 9000A        92.3       500
> test_data %>% filter(customer_id == "Y90BC" & sales != sales_old)
# A tibble: 2 x 4
# Groups:   customer_id [1]
  Date      customer_id sales sales_old
  <chr>     <chr>       <dbl>     <int>
1 8/17/2014 Y90BC        366.         0
2 8/24/2014 Y90BC        366.      1000
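As a cross-check without dplyr, the same per-customer replacement can be sketched in base R with ave(), which applies a function within each group and returns a vector aligned with the input. This is an illustrative sketch rather than part of the original answer: the variable names (grp_mean, grp_sd, flag, sales_new) are made up here, and only the customer_id and sales columns of the sample data are reproduced.

```r
# Sample data from the answer, reduced to the two columns that matter here
customer_id <- rep(c("9000A", "80A09", "Y90BC"), times = c(9, 9, 8))
sales <- c(20, 40, 0, 42, 56, 90, 500, 23, 60,
           200, 234, 500, 450, 0, 900, 459, 347, 895,
           380, 390, 432, 320, 400, 10, 0, 1000)

# Per-customer mean and sd, repeated for every row of that customer
# (roughly what group_by() + mutate() does in the dplyr version)
grp_mean <- ave(sales, customer_id, FUN = mean)
grp_sd   <- ave(sales, customer_id, FUN = sd)

# Same condition as in the answer: outside mean +/- 2 sd, or zero
flag <- sales > grp_mean + 2 * grp_sd |
        sales < grp_mean - 2 * grp_sd |
        sales == 0

# Flagged values get the customer mean, all others stay unchanged
sales_new <- ifelse(flag, grp_mean, sales)

# Rows that changed, for comparison with the dplyr output above
data.frame(customer_id, sales_old = sales, sales = sales_new)[flag, ]
```

On this data the sketch flags the same five rows as the dplyr output above: the three zeros, the 500 for 9000A and the 1000 for Y90BC. As in the answer, the customer mean and sd are computed with the zeros and outliers still included.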