Subset a data frame and apply a function that converts values into factor decile ranges

Asked: 2018-05-25 18:16:04

Tags: r dataframe lapply quantile r-factor

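I am unable to group by the column "short_desc" in my data frame and apply code that creates the decile range corresponding to each value. The decile ranges are generated in the column "Value.fc".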

The code I use to create the decile ranges is:

library(dplyr)    # %>% and mutate()
library(Hmisc)    # cut2()
library(stringr)  # str_extract_all()

df <- df %>%
  mutate(Value.fc = cut2(Value, g = 10),
         Value.fc = factor(sapply(str_extract_all(Value.fc, "\\d+"),
                                  function(x) paste(x, collapse = "-"))),
         Value.fc = reorder(Value.fc, Value))

The code works when "short_desc" has only one level. However, when I apply it together with group_by(), the decile ranges are wrong. Here is sample data showing what "Value.fc" looks like when I use group_by():

dput(head(df))
structure(list(
   state = c("Iowa", "Iowa", "Illinois"),
   short_desc = c("Corn, grain - yield, measured in bu / acre", "Corn, silage - yield, measured in tons / acre", "Corn, grain - yield, measured in bu / acre"), 
   Value = c(137.8, 13.5, 153.3), 
   FIPS = c("19001", "19001", "17001"), 
   Value.fc = c("135-0-150", "13-0-14-5", "150-4-157"))

The first value of "Value.fc" should look like "135-150", not "135-0-150". Any help would be greatly appreciated.

1 answer:

Answer 0 (score: 0)

This is not really an answer, but I cannot post longer code in a comment. The following works with data.table.

CODE

library(data.table)
library(Hmisc)
df <- data.table(state = state.name[1:50], 
                 value = unname(state.x77[, 1]))
df[, value.fc.inter := cut2(value, g = 10)]
df[, value.fc := unlist(lapply(stringr::str_extract_all(value.fc.inter, "\\d+"), 
                        function(z) paste(z, collapse = "-")))]

I keep value.fc.inter as an intermediate variable to illustrate the process.

OUTPUT

> head(df)
        state value value.fc.inter    value.fc
1:    Alabama  3615  [ 2861, 3806)   2861-3806
2:     Alaska   365  [  365,  637)     365-637
3:    Arizona  2212  [ 1544, 2284)   1544-2284
4:   Arkansas  2110  [ 1544, 2284)   1544-2284
5: California 21198  [11197,21198] 11197-21198
6:   Colorado  2541  [ 2284, 2861)   2284-2861
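
The integer-only output above may hide what is probably going wrong in the question: the pattern "\\d+" matches runs of digits only, so a decimal endpoint such as 135.0 is extracted as two pieces, 135 and 0. Below is a minimal base R sketch of that effect. The label strings are made-up stand-ins shaped like interval labels for non-integer data, and join_matches is a hypothetical helper, not a function from any package.

```r
# Hypothetical interval labels with decimal endpoints (not real cut2() output)
labels <- c("[135.0,150.4)", "[ 13.0, 14.5)", "[150.4,157.0]")

# Illustrative helper: extract every match of `pattern` and join the pieces with "-"
join_matches <- function(x, pattern) {
  vapply(regmatches(x, gregexpr(pattern, x, perl = TRUE)),
         function(m) paste(m, collapse = "-"),
         character(1))
}

# "\\d+" stops at the decimal point, so every endpoint is split in two
split_labels <- join_matches(labels, "\\d+")
print(split_labels)

# "\\d+\\.?\\d*" keeps an optional decimal part attached to its integer part
whole_labels <- join_matches(labels, "\\d+\\.?\\d*")
print(whole_labels)
```

If ranges such as 135-150 are wanted rather than 135.0-150.4, one option would be to round Value before cutting, or to round the extracted endpoints before pasting them together.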

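If you want to sort by value.fc, you can use the code below. Note that value.fc is a character column here, so it sorts lexicographically (as the output shows); I think the levels of value.fc would need to be ordered for this to work as expected.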

> df[order(value.fc, decreasing = F)]
             state value value.fc.inter    value.fc
 1:     California 21198  [11197,21198] 11197-21198
 2:       Illinois 11197  [11197,21198] 11197-21198
 3:       New York 18076  [11197,21198] 11197-21198
 4:   Pennsylvania 11860  [11197,21198] 11197-21198
 5:          Texas 12237  [11197,21198] 11197-21198
 6:        Arizona  2212  [ 1544, 2284)   1544-2284
 7:       Arkansas  2110  [ 1544, 2284)   1544-2284
 8:         Kansas  2280  [ 1544, 2284)   1544-2284
 9:       Nebraska  1544  [ 1544, 2284)   1544-2284
10:  West Virginia  1799  [ 1544, 2284)   1544-2284
11:       Colorado  2541  [ 2284, 2861)   2284-2861
12:    Mississippi  2341  [ 2284, 2861)   2284-2861
13:       Oklahoma  2715  [ 2284, 2861)   2284-2861
14:         Oregon  2284  [ 2284, 2861)   2284-2861
15: South Carolina  2816  [ 2284, 2861)   2284-2861
16:        Alabama  3615  [ 2861, 3806)   2861-3806
17:    Connecticut  3100  [ 2861, 3806)   2861-3806
18:           Iowa  2861  [ 2861, 3806)   2861-3806
19:       Kentucky  3387  [ 2861, 3806)   2861-3806
20:     Washington  3559  [ 2861, 3806)   2861-3806
21:         Alaska   365  [  365,  637)     365-637
22:       Delaware   579  [  365,  637)     365-637
23:         Nevada   590  [  365,  637)     365-637
24:        Vermont   472  [  365,  637)     365-637
25:        Wyoming   376  [  365,  637)     365-637
26:      Louisiana  3806  [ 3806, 4767)   3806-4767
27:       Maryland  4122  [ 3806, 4767)   3806-4767
28:      Minnesota  3921  [ 3806, 4767)   3806-4767
29:      Tennessee  4173  [ 3806, 4767)   3806-4767
30:      Wisconsin  4589  [ 3806, 4767)   3806-4767
31:        Georgia  4931  [ 4767, 5814)   4767-5814
32:        Indiana  5313  [ 4767, 5814)   4767-5814
33:       Missouri  4767  [ 4767, 5814)   4767-5814
34: North Carolina  5441  [ 4767, 5814)   4767-5814
35:       Virginia  4981  [ 4767, 5814)   4767-5814
36:        Florida  8277  [ 5814,11197)  5814-11197
37:  Massachusetts  5814  [ 5814,11197)  5814-11197
38:       Michigan  9111  [ 5814,11197)  5814-11197
39:     New Jersey  7333  [ 5814,11197)  5814-11197
40:           Ohio 10735  [ 5814,11197)  5814-11197
41:          Idaho   813  [  637,  868)     637-868
42:        Montana   746  [  637,  868)     637-868
43:  New Hampshire   812  [  637,  868)     637-868
44:   North Dakota   637  [  637,  868)     637-868
45:   South Dakota   681  [  637,  868)     637-868
46:         Hawaii   868  [  868, 1544)    868-1544
47:          Maine  1058  [  868, 1544)    868-1544
48:     New Mexico  1144  [  868, 1544)    868-1544
49:   Rhode Island   931  [  868, 1544)    868-1544
50:           Utah  1203  [  868, 1544)    868-1544
             state value value.fc.inter    value.fc
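
The table above is sorted as text, which is why 11197-21198 comes first and 365-637 lands in the middle. One way to get a numeric order, sketched here in base R, is to turn the labels into a factor whose levels are sorted by the lower bound of each range. The short label vector is just a sample taken from the output above, and the variable names are illustrative.

```r
# A few of the range labels from the output above
labels <- c("2861-3806", "365-637", "11197-21198", "1544-2284", "365-637")

# Lower bound of each range: everything before the first "-", as a number
lower <- as.numeric(sub("-.*", "", labels))

# Factor whose levels follow the numeric lower bound instead of text order
value_fc <- factor(labels, levels = unique(labels[order(lower)]))

print(levels(value_fc))
print(sort(value_fc))  # sorting a factor follows its level order
```

With the levels set this way, sorting by the factor should give the ranges in ascending numeric order.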

If you only want to group the rows, you can also use the by argument instead of order:

> df[, .(state, value) , by = .(value.fc)][1:10]

     value.fc       state value
 1: 2861-3806     Alabama  3615
 2: 2861-3806 Connecticut  3100
 3: 2861-3806        Iowa  2861
 4: 2861-3806    Kentucky  3387
 5: 2861-3806  Washington  3559
 6:   365-637      Alaska   365
 7:   365-637    Delaware   579
 8:   365-637      Nevada   590
 9:   365-637     Vermont   472
10:   365-637     Wyoming   376
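
Since the question is about computing the decile ranges separately for each level of short_desc, here is a rough base R sketch of that idea. It is not the asker's data and it does not use cut2(): the data frame is made up, decile_label is a hypothetical helper, and the breakpoints come from quantile(), which may differ slightly from the cut points cut2() chooses.

```r
# Hypothetical data: two crops on very different scales (not the asker's data)
df <- data.frame(
  short_desc = rep(c("grain", "silage"), each = 20),
  Value = c(seq(100, 195, by = 5), seq(10, 29, by = 1))
)

# Illustrative helper: label each value with the decile range of its own group.
# Assumes the group contains at least two distinct values.
decile_label <- function(v) {
  # Decile breakpoints computed from this group's values only
  br <- unique(quantile(v, probs = seq(0, 1, by = 0.1), names = FALSE))
  # Bin index per value; rightmost.closed puts the group maximum in the last bin
  idx <- findInterval(v, br, rightmost.closed = TRUE)
  # Rounded endpoints, so labels look like "lower-upper"
  paste(round(br[idx]), round(br[idx + 1]), sep = "-")
}

# Split by group, label within each group, then restore the original row order
by_group <- split(df$Value, df$short_desc)
df$Value.fc <- unsplit(lapply(by_group, decile_label), df$short_desc)

print(head(df))
```

Because the labels are built from numeric endpoints rather than parsed out of a string, decimal breakpoints do not get split into extra pieces.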