Question

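I have a data frame with customers in rows and periods (months) in columns. I use this layout for clustering. I want to scale the values within each row. The code below does this, but it has some problems: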

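The code is too complicated for what should be a simple operation.
The scale function returns NaN in some cases.
Listing the customer names explicitly (vars = c("A", "B", ...)) will not work, because the real data has thousands of customers.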

Here are my sample data and code:

mydata 
  cust P1  P2 P3  P4 P5  P6 P7  P8 P9 P10 P11 P12 P13 P14 P15 P16 P17 P18 P19 P20
1    A  1 1.0  1 1.0  1 1.0  1 1.0  1 1.0   1 1.0   1 1.0   1 1.0   1 1.0   1 1.0
2    B  5 5.0  5 5.0  5 5.0  5 5.0  5 5.0   5 5.0   5 5.0   5 5.0   5 5.0   5 5.0
3    C  9 9.0  9 9.0  9 9.0  9 9.0  9 9.0   9 9.0   9 9.0   9 9.0   9 9.0   9 9.0
4    D  0 1.0  2 1.0  0 1.0  2 1.0  0 1.0   2 1.0   0 1.0   2 1.0   0 1.0   2 1.0
5    E  4 5.0  6 5.0  4 5.0  6 5.0  4 5.0   6 5.0   4 5.0   6 5.0   4 5.0   6 5.0
6    F  8 9.0 10 9.0  8 9.0 10 9.0  8 9.0  10 9.0   8 9.0  10 9.0   8 9.0  10 9.0
7    G  2 1.5  1 0.5  0 0.5  1 1.5  2 1.5   1 0.5   0 0.5   1 1.5   2 1.5   1 0.5
8    H  6 5.5  5 4.5  4 4.5  5 5.5  6 5.5   5 4.5   4 4.5   5 5.5   6 5.5   5 4.5
9    I 10 9.5  9 8.5  8 8.5  9 9.5 10 9.5   9 8.5   8 8.5   9 9.5  10 9.5   9 8.5

The code I am using:

library(dplyr)
library(tidyr)
# first transpose the data
g_mydata = mydata %>% gather(period,value,-cust)
spr_mydata = g_mydata %>% spread(cust,value)
# then scale the values for each period
sc_mydata = spr_mydata %>% 
      mutate_each_(funs(scale),vars = c("A","B","C","D","E","F","G","H","I") )   
# then transpose again back to original format
g_scdata = sc_mydata %>% gather(cust,value,-period)
scaled_data = g_scdata %>% spread(period,value)

Thanks for any help or suggestions.

Answer 1

You can always try apply():

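# apply() over the rows of the original data scales each customer;
# t() puts the result back into the customers-in-rows layout
sc_mydata = t(apply(mydata[, -1], 1, scale))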

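If the NaN values are causing trouble, you can also work from the transposed data, spr_mydata, and run scale() directly on its customer columns:

# drop the non-numeric period column; scale() works column by column
scale(spr_mydata[, -1])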
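
The NaN values come from customers whose values never change: their standard deviation is 0, so scale() ends up computing 0/0. A minimal base R sketch of one way around this, assuming it is acceptable to set such constant rows to 0 (the same choice Answer 2 makes). The helper name scale_rows and its id_col argument are made up for illustration, and the data is a cut-down version of the example:

```r
# Small subset of the example data: three customers, four periods.
mydata <- data.frame(
  cust = c("A", "D", "G"),
  P1 = c(1, 0, 2),
  P2 = c(1, 1, 1.5),
  P3 = c(1, 2, 1),
  P4 = c(1, 1, 0.5)
)

# Hypothetical helper: z-score every row of the numeric columns.
# Rows with zero standard deviation become 0 instead of NaN.
scale_rows <- function(df, id_col = "cust") {
  num <- as.matrix(df[, setdiff(names(df), id_col), drop = FALSE])
  centered <- num - rowMeans(num)        # subtract each row's mean
  sds <- apply(num, 1, sd)               # one standard deviation per row
  scaled <- centered / ifelse(sds == 0, 1, sds)
  data.frame(df[id_col], scaled, check.names = FALSE)
}

scaled_data <- scale_rows(mydata)
```

No customer names are typed out, so the same call should work unchanged with thousands of rows.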

Answer 2

Here is a dplyr way.

long_data = 
  mydata %>% 
  gather(period, value,-cust)

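# customers whose values vary (sd != 0) and can therefore be scaled
to_scale = 
  long_data %>%
  group_by(cust) %>%
  summarize(sd = sd(value)) %>%
  filter(sd != 0) %>%
  select(-sd)

# constant customers would give NaN, so set their scaled value to 0
flat = 
  long_data %>%
  anti_join(to_scale, by = "cust") %>%
  mutate(value = 0)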

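wide_scale = 
  long_data %>%
  right_join(to_scale, by = "cust") %>%
  group_by(cust) %>%
  # as.numeric() drops the matrix that scale() returns;
  # signif() rounds so that identical patterns compare equal later
  mutate(value = signif(as.numeric(scale(value)), 7)) %>%
  ungroup() %>%
  bind_rows(flat) %>%
  spread(period, value)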

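# one row per distinct scaled pattern, each with its own ID
type = 
  wide_scale %>%
  ungroup() %>%
  select(-cust) %>%
  distinct %>%
  mutate(type_ID = 1:n())

# look up which pattern each customer follows
customer__type = 
  type %>%
  left_join(wide_scale) %>%
  select(type_ID, cust)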
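
In recent tidyr versions, gather() and spread() are superseded by pivot_longer() and pivot_wider(). If only the scaled table is needed, the same idea can be sketched more compactly with them. This is a rough sketch, not a drop-in replacement for the code above: it assumes dplyr and a tidyr version that provides the pivot functions, uses a cut-down copy of the example data, and sets constant customers to 0 as above.

```r
library(dplyr)
library(tidyr)

# Small subset of the example data: three customers, four periods.
mydata <- data.frame(
  cust = c("A", "D", "G"),
  P1 = c(1, 0, 2),
  P2 = c(1, 1, 1.5),
  P3 = c(1, 2, 1),
  P4 = c(1, 1, 0.5)
)

scaled_data <- mydata %>%
  # long format: one row per customer and period
  pivot_longer(-cust, names_to = "period", values_to = "value") %>%
  group_by(cust) %>%
  # constant customers get 0; everyone else is z-scored within the group
  mutate(value = if (sd(value) == 0) 0 else as.numeric(scale(value))) %>%
  ungroup() %>%
  # back to customers in rows, periods in columns
  pivot_wider(names_from = period, values_from = value)
```

The grouping by cust replaces the explicit list of customer names, so nothing has to be changed when the number of customers grows.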

Scaling rows of data
