Question

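I've been using the "hist" function to bin my data in R. What I'd like now is a hist function that takes not just a list of values to bin, but a value and a count for each one. I've written one in R that does this for me, but it's 10-50 times slower (very rough estimate) than the built-in hist.

Is there any way to do this 'natively'?

For example, the input might be a list (or vector) of the form (1,200) (2,30) (3,50)

The first value is the value and the second is the number of instances of that value (I can move the data into other forms, this is just an example).

Thanks!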

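Update: I'm (basically) mapping a continuous domain onto an arbitrary discrete domain. Say I have a hundred values between 0 and 10, and I want to output how many of them fall between 0 and 1, 1 and 2, and so on (or between 0 and 2, 2 and 4, or whatever). The hist function does this fine (I tell it where to divide the "buckets") and outputs the discretized counts (I can pass in a flag so it doesn't plot the graph).

But what I have now is not just a set of values between 0 and 10, but a set of values together with how many instances of each there are. So instead of 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.5 as 7 separate values, I have the values and counts in the form (0.1,4), (0.2,2), (0.5,1). I'd like to be able to run the 'hist' function (or something like it) on that data and get the same output as for the "expanded" form.

So I wrote a function to do this, but it runs a lot slower than the original hist. "Expanding" the data would take too much memory to be feasible.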

Answer 1

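I'm not sure what you mean by "binning the data", but if I understand correctly, you can get the categories/bins that hist would compute and store the result.

This can be done easily without invoking graphics, for example: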

> table(cut(data, 5))
(-0.000908,0.198]     (0.198,0.397]     (0.397,0.595]     (0.595,0.794] 
               19                20                17                21 
    (0.794,0.993] 
               23

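The data was generated with data <- runif(100) for demonstration purposes.

In the command above, cut does the main work: it cuts a continuous variable into the specified number of intervals (5 above). I call table to display the frequencies.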
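To apply the same idea to the value/count form from the question without expanding it, one option is to let cut assign each distinct value to a bin and then sum the counts per bin. This is only a sketch: the names values, counts and breaks are made up for illustration, and include.lowest = TRUE is set on the assumption that you want the same edge handling as hist's defaults.

```r
# Hypothetical value/count pairs, taken from the question's example:
# (0.1, 4), (0.2, 2), (0.5, 1)
values <- c(0.1, 0.2, 0.5)
counts <- c(4, 2, 1)

# Bin edges chosen for illustration only
breaks <- c(0, 0.15, 0.3, 0.6)

# Assign each distinct value to a bin (right-closed intervals, as in hist)
bins <- cut(values, breaks = breaks, include.lowest = TRUE)

# Sum the counts within each bin; empty bins are kept with a total of 0
binned <- xtabs(counts ~ bins)
binned
```

Since cut only sees one entry per distinct value, the memory cost depends on the number of distinct values rather than on the total count.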

Answer 2

I may be missing something, but I think this might help:

#Generate the data
x <- c(rep(1, 200), rep(2, 30), rep(3, 50))

#Since the midpoints of each bucket will be used and the desired bucket width
#is 1, start the bucket breaks at -0.5
buc <- seq(-0.5, 5, 1)

#Get a histogram using the above bucket breaks
res <- hist(x, breaks=buc)

#Build a data frame with the results
df <- data.frame(mids=res$mids, counts=res$counts)
df

  mids counts
1    0      0
2    1    200
3    2     30
4    3     50
5    4      0

Use names to see the variables available in the hist result:

names(res)

[1] "breaks"      "counts"      "intensities" "density"     "mids"        "xname"       "equidist"
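Since the question mentions wanting the counts without a graph, here is the same computation with plotting switched off. plot = FALSE is a documented argument of hist; everything else just repeats the example above.

```r
# Same data and bucket breaks as in the example above
x <- c(rep(1, 200), rep(2, 30), rep(3, 50))
buc <- seq(-0.5, 5, 1)

# plot = FALSE returns the histogram object without drawing anything
res <- hist(x, breaks = buc, plot = FALSE)

# Keep only the bucket midpoints and counts
data.frame(mids = res$mids, counts = res$counts)
```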

Answer 3

Do you mean something like this?

barplot(height=c(200,30,50),names.arg=1:3,space=0,ylab="Count")

You can also do this by hacking the data into the format returned by hist and calling graphics:::plot.histogram, i.e.

## must specify counts, mids, breaks, and specify that the bars are equidistant
## (the element is named mids, as in the object returned by hist)
h <- list(counts=c(200,30,50),mids=1:3,breaks=seq(0.5,3.5,by=1),equidist=TRUE)
graphics:::plot.histogram(h,freq=TRUE)

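Edit: it depends on the form of your data and how flexible you want to be about rebinning.

A crude, simple version, if you want to take an existing set of breakpoints, midpoints and counts and lump every agg bins together (agg=2 in your example):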

mids <- seq(0.1,0.6,by=0.1)
breaks <- seq(0.05,0.65,by=0.1)
counts <- c(3,7,6,9,6,7)

agg <- 2
bnames <- apply(matrix(mids,byrow=TRUE,ncol=agg),1,
                      function(x) paste(head(x,1),tail(x,1),sep="-"))
bmids <- rowMeans(matrix(mids,byrow=TRUE,ncol=agg))
bbreaks <- breaks[seq(1,length(breaks),by=agg)]
bcount <- rowSums(matrix(counts,byrow=TRUE,ncol=agg))

h <- list(counts=bcount,mids=bmids,breaks=bbreaks,equidist=TRUE)
graphics:::plot.histogram(h,freq=TRUE)
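If you need to do this for several values of agg, the aggregation could be wrapped in a small function. The sketch below is hypothetical: rebin is not an existing R function, and it assumes that the number of bins is a multiple of agg and that the original bins are equally wide (hence equidist=TRUE).

```r
# Hypothetical helper: merge every `agg` adjacent bins into one.
# Assumes length(counts) is a multiple of agg, that
# length(breaks) == length(counts) + 1, and that the bins are equally wide.
rebin <- function(counts, breaks, agg) {
  stopifnot(length(counts) %% agg == 0,
            length(breaks) == length(counts) + 1)
  # Keep every agg-th break point, starting with the first
  new_breaks <- breaks[seq(1, length(breaks), by = agg)]
  # Each matrix row holds the counts of the bins that get merged
  new_counts <- rowSums(matrix(counts, byrow = TRUE, ncol = agg))
  list(counts = new_counts,
       mids = (head(new_breaks, -1) + tail(new_breaks, -1)) / 2,
       breaks = new_breaks,
       equidist = TRUE)
}

# Same numbers as above: six bins merged pairwise into three
h2 <- rebin(c(3, 7, 6, 9, 6, 7), seq(0.05, 0.65, by = 0.1), agg = 2)
h2$counts
```

The total count is preserved by construction, which makes for a quick sanity check after rebinning.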

Answer 4

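Like the other responders, I'm not entirely sure what you want, but I'm guessing you want to expand a tabular description into a larger vector: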

unlist( mapply("rep", x=c(1,2,3), times=c(200,30,50) ) )

  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [34] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [67] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[100] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[133] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[166] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[199] 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3
[232] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[265] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
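For what it's worth, rep is already vectorised over its times argument, so the mapply wrapper should not be needed. A small sketch with the same numbers (the names values, counts and expanded are just for illustration); note that this still builds the full expanded vector, which the question says is too large for the real data.

```r
values <- c(1, 2, 3)
counts <- c(200, 30, 50)

# rep repeats values[i] counts[i] times when times has the same length as x
expanded <- rep(values, times = counts)

length(expanded)
table(expanded)
```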

Rebinning data in R
