Question

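I have been trying to use a custom function that I found here to recalculate median household income for census tracts aggregated to neighborhoods. My data looks like this: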

> inc_df[, 1:5]
          San Francisco Bayview Hunters Point Bernal Heights Castro/Upper Market Chinatown
2500-9999             22457                  1057            287                 329      1059
10000-14999           20708                   920            288                 463      1327
15000-19999           12701                   626            145                 148       867
20000-24999           12106                   491            285                 160       689
25000-29999           10129                   554            238                 328       167
30000-34999           10310                   338            257                 179       289
35000-39999            9028                   383            184                 163       326
40000-44999            9532                   472            334                 173       264
45000-49999            8406                   394            345                 241       193
50000-59999           17317                   727            367                 353       251
60000-74999           25947                  1037            674                 794       236
75000-99999           36378                  1185            980                 954       289
100000-124999         33890                   990            640                1208       199
125000-149999         24935                   522            666                 957       234
150000-199999         37190                   814           1310                1535       150
200000-250001         65763                   796           2122                3175       302

The function is as follows:

GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
  # If "sep" is specified, the function will try to create the 
  #   required "intervals" matrix. "trim" removes any unwanted 
  #   characters before attempting to convert the ranges to numeric.
  if (!is.null(sep)) {
    if (is.null(trim)) pattern <- ""
    else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
    else pattern <- trim
    intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
  }

  Midpoints <- rowMeans(intervals)
  cf <- cumsum(frequencies)
  Midrow <- findInterval(max(cf)/2, cf) + 1
  L <- intervals[1, Midrow]      # lower class boundary of median class
  h <- diff(intervals[, Midrow]) # size of median class
  f <- frequencies[Midrow]       # frequency of median class
  cf2 <- cf[Midrow - 1]          # cumulative frequency class before median class
  n_2 <- max(cf)/2               # total observations divided by 2

  unname(L + (n_2 - cf2)/f * h)
}

The code that applies the function looks like this:

GroupedMedian(inc_df[, "Bernal Heights"], rownames(inc_df), sep="-", trim="cut")

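This all works fine, but I cannot figure out how to apply it to every column of the matrix instead of typing each column name and running it over and over. I tried this: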

> minc_hood <- data.frame(apply(inc_df, 2, function(x) GroupedMedian(inc_df[, x], 
rownames(inc_df), sep="-", trim="cut")))

But I get this error message:

Error in inc_df[, x] : subscript out of bounds

Answer 1

There are a couple of things at play here:

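Advice: never use apply with a data.frame (unless you are absolutely sure you do not mind the overhead of the conversion to matrix ^1 and can accept the potential data loss ^2).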
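Even if you were to use apply, you are doing it slightly "off": when you call apply(df, 2, func), it takes the first column of df and passes it as the argument, then does the same for each remaining column. For example,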
```
apply(mtcars, 2, mean)
```
will make the calls
```
mean(c(21, 21, 22.8, 21.4, 18.7, ...)) # mpg
mean(c(6, 6, 4, 6, 8, ...))            # cyl
mean(c(160, 160, 108, 258, 360, ...))  # disp
# ... etc
```
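In this case, your use of apply(inc_df, 2, function(x) GroupedMedian(inc_df[, x], ...)) is wrong, because x is replaced with all the values of the first column of inc_df (then all the values of the second column, and so on), not with a column name or index.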
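
To make the failure concrete, here is a minimal sketch on a made-up matrix. The object `toy` and its values are hypothetical stand-ins for `inc_df`, and `sum` stands in for `GroupedMedian`. A matrix is used because "subscript out of bounds" is the message matrix indexing produces; that `inc_df` really is a matrix is an assumption.

```r
# Hypothetical stand-in for inc_df: counts per column, income ranges as row names.
toy <- matrix(c(10, 20, 30, 5, 15, 25), ncol = 2,
              dimnames = list(c("0-9", "10-19", "20-29"), c("a", "b")))

# apply() hands the function each column's *values*, not its name or position.
seen <- apply(toy, 2, function(x) paste(x, collapse = ","))
print(seen)

# Using those values as a column index asks for columns 10, 20 and 30,
# which a two-column matrix does not have, so the call fails.
err <- tryCatch(
  apply(toy, 2, function(x) sum(toy[, x])),
  error = function(e) conditionMessage(e)
)
print(err)

# Passing x itself is enough. With the real function this would read
# function(x) GroupedMedian(x, rownames(inc_df), sep = "-", trim = "cut").
totals <- apply(toy, 2, function(x) sum(x))
print(totals)
```

If `inc_df` is a purely numeric matrix, `apply` involves no type conversion, so this small change may be all that is needed.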

Since your function looks like it accepts a vector of values (plus some other arguments), I suggest you try something like:

inc_df[] <- lapply(inc_df, GroupedMedian, rownames(inc_df), sep="-", trim="cut")
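
As a variation, if one median per neighborhood is what is wanted, `sapply` returns a named vector directly. The sketch below repeats `GroupedMedian` from the question (minus its unused `Midpoints` line) so that it runs on its own; the data frame `toy` and its column names are invented for illustration.

```r
# GroupedMedian as posted in the question, repeated so the sketch is self-contained.
GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
  if (!is.null(sep)) {
    if (is.null(trim)) pattern <- ""
    else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
    else pattern <- trim
    intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
  }
  cf <- cumsum(frequencies)
  Midrow <- findInterval(max(cf)/2, cf) + 1
  L <- intervals[1, Midrow]      # lower class boundary of median class
  h <- diff(intervals[, Midrow]) # size of median class
  f <- frequencies[Midrow]       # frequency of median class
  cf2 <- cf[Midrow - 1]          # cumulative frequency before median class
  n_2 <- max(cf)/2               # total observations divided by 2
  unname(L + (n_2 - cf2)/f * h)
}

# Hypothetical data: two made-up neighborhoods, four income classes.
toy <- data.frame(
  hood_a = c(5, 10, 20, 5),
  hood_b = c(10, 10, 10, 10),
  row.names = c("0-10", "10-20", "20-30", "30-40")
)

# One call per column; the result is a named numeric vector.
medians <- sapply(toy, GroupedMedian, rownames(toy), sep = "-", trim = "cut")
print(medians)  # hood_a = 22.5, hood_b = 20
```

Because each call returns a single number, the `inc_df[] <-` form repeats that number down every row of its column, while `sapply` keeps just one value per column.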

If you want to apply this function to only a subset of the columns, something like this works well:

ind <- c(1,3,7)
inc_df[ind] <- lapply(inc_df[ind], GroupedMedian, rownames(inc_df), sep="-", trim="cut")

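Using inc_df[] <- ... (when not subsetting columns) ensures that we replace the values of the columns without losing the data.frame attribute. It is effectively the same as inc_df <- as.data.frame(...), apart from some subtle differences.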
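
A small sketch of that difference, using a made-up data frame (`toy` and the doubling function are purely illustrative):

```r
# Hypothetical two-column data frame.
toy <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# Plain assignment: the result of lapply() is an ordinary list.
as_list <- lapply(toy, function(col) col * 2)

# Assigning into toy[] keeps the data.frame shell and only swaps the columns.
kept <- toy
kept[] <- lapply(toy, function(col) col * 2)

print(class(as_list))  # "list"
print(class(kept))     # "data.frame"
```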

Notes:

^1: apply will always convert a data.frame to a matrix. That may be fine, but it takes a non-zero amount of time with larger data. It can also have consequences, see below.

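^2: A matrix, unlike a data.frame, can hold only one class. This means all columns will be up-converted to the highest common type, in the order logical < integer < numeric < POSIXct < character. So if you have all numeric columns and one character column, the function you pass to apply will see all character data. This can be mitigated by selecting only the columns with the types you expect, perhaps with: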

isnum <- sapply(inc_df, is.numeric)
inc_df[isnum] <- apply(inc_df[isnum], 2, GroupedMedian, ...)

In that case, the worst conversion you will get is integer-to-numeric, which is likely an acceptable (and reversible) conversion.
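
The coercion described in note ^2 can be seen on a tiny made-up example; the data frames `mixed` and `ints` are hypothetical and exist only to show the classes each approach sees.

```r
# Hypothetical data frame with one numeric and one character column.
mixed <- data.frame(n = c(1.5, 2.5), s = c("x", "y"), stringsAsFactors = FALSE)

# apply() converts to a matrix first, so every column arrives as character.
via_apply <- apply(mixed, 2, class)
print(via_apply)   # n and s both "character"

# sapply()/lapply() walk the columns as they are, so types are preserved.
via_sapply <- sapply(mixed, class)
print(via_sapply)  # n "numeric", s "character"

# Restricting to numeric columns before apply() avoids the character coercion.
isnum <- sapply(mixed, is.numeric)
numeric_only <- apply(mixed[isnum], 2, class)
print(numeric_only)  # n "numeric"

# With integer and double columns, the integers are promoted to numeric.
ints <- data.frame(i = 1:2, d = c(0.5, 1.5))
promoted <- apply(ints, 2, class)
print(promoted)    # i and d both "numeric"
```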

How to apply a custom function on each column of a matrix?
