Question

My problem is simple: I have an array of 20 million floats. Each float in that array has a probability p of being randomly changed.

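The simplest approach is to walk through the array and, for each element, test rand(0,1) < p and change the element if the test succeeds.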

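However, even when parallelized this is slow, and I was wondering whether there is a faster way to randomly pick the indices to modify.

My first idea was to pick p * n random indices, where n is the total number of floats in the array. That does not exactly reproduce the probability distribution, though, because in the first approach nothing guarantees that exactly p * n floats will be modified.

Ideas?

P.S.: I am implementing this in Python. Someone may have run into this problem before and implemented something in a library, but I can't find it.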

Answer 1

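First, if p is high, i.e. >= 0.5, nothing you do will save much time, because you will probably still visit most of the elements. If p is low, however, you can draw from a binomial distribution with n = 20M and your probability p to determine how many elements to touch.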

In [23]: np.random.binomial(20*10**6, 0.1)
Out[23]: 1999582

In [24]: np.random.binomial(20*10**6, 0.99999)
Out[24]: 19999801

In [25]: np.random.binomial(20*10**6, 0.5)
Out[25]: 10001202

In [26]: np.random.binomial(20*10**6, 0.0001)
Out[26]: 1986
[...]
In [30]: np.random.binomial(20*10**6, 0.0001)
Out[30]: 1989

In [31]: np.random.binomial(20*10**6, 0.0001)
Out[31]: 1988

This number is the count of successes in n trials, each of which succeeds with probability p, which is exactly your situation.
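
The binomial draw only gives the number of elements to touch; the positions still have to be chosen. A minimal sketch of one way to finish the idea, assuming NumPy's `Generator` API (`np.random.default_rng`) is available. The function name `modify_random_subset` and the placeholder update are illustrative, not part of the answer above. Drawing the count from the binomial distribution and then picking that many distinct positions uniformly gives the same distribution as testing every element independently.

```python
import numpy as np

def modify_random_subset(data, p, rng):
    """Change each element of `data` with probability p, in place.

    Hypothetical helper: returns the indices that were changed.
    """
    n = data.size
    # How many elements change: successes in n independent trials.
    k = rng.binomial(n, p)
    # Which elements change: k distinct positions, chosen uniformly.
    idx = rng.choice(n, size=k, replace=False)
    # Placeholder update; substitute whatever change is needed.
    data[idx] = rng.random(k)
    return idx

rng = np.random.default_rng(0)
data = np.zeros(1_000_000)
changed = modify_random_subset(data, 0.001, rng)
print(changed.size, "elements changed")
```

For small p this touches only about p * n positions instead of all n.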

Answer 2

You can generate a random array with values in [0,1) that has the same size n as your data vector:

rnd = np.random.rand(n)

Now check at which indices these random values are smaller than p:

mask = rnd < p

Now change the data at all indices selected by the mask, for example:

data[mask]=np.random.rand(data[mask].size)

or use whatever method you want to change the data.
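
The three steps above can be put together as one function. This is only a sketch: the name `resample_with_mask` is made up for illustration, it uses the newer `Generator` API rather than `np.random.rand`, and the uniform redraw is a placeholder for the real update.

```python
import numpy as np

def resample_with_mask(data, p, rng):
    """Redraw each element of `data` with probability p, in place.

    Hypothetical helper: returns how many elements were changed.
    """
    # One uniform draw per element; True where the element is selected.
    mask = rng.random(data.shape) < p
    k = np.count_nonzero(mask)
    # Placeholder update: replace the selected elements with new draws.
    data[mask] = rng.random(k)
    return k

rng = np.random.default_rng(0)
data = np.zeros(1_000_000)
k = resample_with_mask(data, 0.1, rng)
print(k, "of", data.size, "elements changed")
```

This still generates one random number per element, so the cost stays proportional to n whatever the value of p, but the work happens inside NumPy instead of a Python loop.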

Answer 3

Your array:

array = np.random.random(size=100) # Whatever

A random array of 0s and 1s:

p = 0.05 # Could be an array itself
markers = np.random.binomial(1, p, array.shape[0])

An array of the indices of the values to modify:

locations = np.where(markers)[0]
# Something like array([19, 29, 32, 67, 68, 71])

You can use these indices to loop over the original array, or use array[locations] = ... to modify all the values at once.
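
The comment in the code above notes that p could itself be an array. A small sketch of that case, assuming the `Generator` API. The per-element probabilities here (1% for the first half, 20% for the second half) and the value -1.0 are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
array = rng.random(100_000)

# Hypothetical per-element probabilities: 1% for the first half,
# 20% for the second half.
half = array.size // 2
p = np.where(np.arange(array.size) < half, 0.01, 0.20)

# One Bernoulli trial per element; the output shape follows p.
markers = rng.binomial(1, p)

# Indices of the elements whose trial succeeded.
locations = np.flatnonzero(markers)

# Placeholder update applied to all selected elements at once.
array[locations] = -1.0
```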

Answer 4

This takes about 4 seconds per round on my machine:

import random

rand = random.random
p = 0.1
TOTAL_ROUND = 10

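x = [rand() for _ in range(20000000)]

for i in range(TOTAL_ROUND):
    print("round", i)
    # Redraws every value that is itself below p; to redraw each element
    # with probability p regardless of its value, test rand() < p instead.
    x = [rand() if val < p else val for val in x]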

Answer 5

If p is small, you can save a lot of time by using numpy.random.geometric to sample the distances between the elements to change.

A simple walk through the array ary:

from numpy.random import geometric, random

p = 0.01
index = -1
while True:
  index += geometric(p)  # distance to the next element to change
  if index >= len(ary):
    break
  ary[index] = random()  # placeholder: compute the new value here

NumPy's distribution functions can return an array of values, so, as long as p is small, it is probably even faster to generate all the step sizes at once. When doing so, the number of samples drawn is padded by a fudge factor of 1.1 to make sure enough samples are taken from the geometric distribution. For large arrays that should be fine, but it is not guaranteed. A better (though more complicated) solution is to generate the samples in chunks of 10000 and keep going until you reach the end of the array.
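
Both variants can be sketched as follows. This is an illustration under assumptions, not the answer's own code: the function names, the `Generator` API and the exact way the fudge factor is applied (to the expected count n * p) are guesses at what was meant.

```python
import numpy as np

def indices_one_shot(n, p, rng, fudge=1.1):
    """All-at-once variant: draw about fudge * n * p gaps in one call."""
    m = int(fudge * n * p) + 1
    # Cumulative gaps give positions; -1 because a gap of 1 means index 0.
    idx = np.cumsum(rng.geometric(p, size=m)) - 1
    # If the last position is still below n, the draw fell short and the
    # tail of the array was never considered (the "not guaranteed" case).
    return idx[idx < n]

def indices_chunked(n, p, rng, chunk=10_000):
    """Chunked variant: keep drawing gaps until the positions pass n."""
    parts = []
    last = -1
    while last < n:
        idx = last + np.cumsum(rng.geometric(p, size=chunk))
        parts.append(idx)
        last = int(idx[-1])
    out = np.concatenate(parts)
    return out[out < n]

rng = np.random.default_rng(0)
ary = np.zeros(1_000_000)
idx = indices_chunked(ary.size, 0.001, rng)
ary[idx] = rng.random(idx.size)  # placeholder update
```

The chunked version never falls short, at the cost of possibly drawing a partial chunk of gaps that is thrown away.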

Efficient random sampling
