生成由N = 100个二维样本组成的数据集

时间:2013-02-17 10:45:49

标签: python matrix dataset covariance gaussian

如何生成由N = 100二维样本x = (x1,x2)T ∈ R2组成的数据集,该样本来自二维高斯分布,均值为

µ = (1,1)T

和协方差矩阵

Σ = (0.3 0.2 
     0.2 0.2)

我被告知您可以使用 Matlab 函数randn,但不知道如何在Python中实现它?

3 个答案:

答案 0 :(得分:8)

仅详细说明@ EamonNerbonne的答案:以下使用协方差矩阵的Cholesky decomposition来生成来自不相关的正态分布随机变量的相关变量。

import numpy as np
import matplotlib.pyplot as plt
linalg = np.linalg

N = 1000
mean = [1,1]
cov = [[0.3, 0.2],[0.2, 0.2]]
data = np.random.multivariate_normal(mean, cov, N)
L = linalg.cholesky(cov)
# print(L.shape)
# (2, 2)
uncorrelated = np.random.standard_normal((2,N))
data2 = np.dot(L,uncorrelated) + np.array(mean).reshape(2,1)
# print(data2.shape)
# (2, 1000)
plt.scatter(data2[0,:], data2[1,:], c='green')    
plt.scatter(data[:,0], data[:,1], c='yellow')
plt.show()

enter image description here

黄色点由np.random.multivariate_normal生成。通过将正态分布点乘以Cholesky分解矩阵L生成绿点。

答案 1 :(得分:5)

您正在寻找numpy.random.multivariate_normal

<强>代码

>>> import numpy
>>> print numpy.random.multivariate_normal([1,1], [[0.3, 0.2],[0.2, 0.2]], 100)
[[ 0.02999043  0.09590078]
 [ 1.35743021  1.08199363]
 [ 1.15721179  0.87750625]
 [ 0.96879114  0.94503228]
 [ 1.23989167  1.13473083]
 [ 1.55917608  0.81530847]
 [ 0.89985651  0.7071519 ]
 [ 0.37494324  0.739433  ]
 [ 1.45121732  1.17168444]
 [ 0.69680785  1.2727178 ]
 [ 0.35600769  0.46569276]
 [ 2.14187488  1.8758589 ]
 [ 1.59276393  1.54971412]
 [ 1.71227009  1.63429704]
 [ 1.05013136  1.1669758 ]
 [ 1.34344004  1.37369725]
 [ 1.82975724  1.49866636]
 [ 0.80553877  1.26753018]
 [ 1.74331784  1.27211784]
 [ 1.23044292  1.18110192]
 [ 1.07675493  1.05940509]
 [ 0.15495771  0.64536509]
 [ 0.77409745  1.0174171 ]
 [ 1.20062726  1.3870498 ]
 [ 0.39619719  0.77919884]
 [ 0.87209168  1.00248145]
 [ 1.32273339  1.54428262]
 [ 2.11848535  1.44338789]
 [ 1.45226461  1.42061198]
 [ 0.33775737  0.24968543]
 [ 1.06982557  0.64674411]
 [ 0.92113229  1.0583153 ]
 [ 0.54987592  0.73198037]
 [ 1.06559727  0.77891362]
 [ 0.84371805  0.72957046]
 [ 1.83614557  1.40582746]
 [ 0.53146009  0.72294094]
 [ 0.98927818  0.73732053]
 [ 1.03984002  0.89426628]
 [ 0.38142362  0.32471126]
 [ 1.44464929  1.15407227]
 [-0.22601279  0.21045592]
 [-0.01995875  0.45051782]
 [ 0.58779449  0.44486237]
 [ 1.31335981  0.92875936]
 [ 0.42200098  0.6942829 ]
 [ 0.10714426  0.11083002]
 [ 1.44997839  1.19052704]
 [ 0.78630506  0.45877582]
 [ 1.63432202  1.95066539]
 [ 0.56680926  0.92203111]
 [ 0.08841491  0.62890576]
 [ 1.4703602   1.4924649 ]
 [ 1.01118864  1.44749407]
 [ 1.19936276  1.02534702]
 [ 0.67893239  0.8482461 ]
 [ 0.71537211  0.53279103]
 [ 1.08031573  1.00779064]
 [ 0.66412568  0.57121041]
 [ 0.96098528  0.72318386]
 [ 0.7690299   0.76058713]
 [ 0.77466896  0.77559282]
 [ 0.47906664  0.58602633]
 [ 0.52481326  0.78486453]
 [-0.40240438  0.17374116]
 [ 0.75730444  0.22365892]
 [ 0.67811008  1.17730408]
 [ 1.62245699  1.71775386]
 [ 1.12317847  1.04252136]
 [-0.06461117  0.23557416]
 [ 0.46299482  0.51585414]
 [ 0.88125676  1.23284201]
 [ 0.57920534  0.63765861]
 [ 0.88239858  1.32092112]
 [ 0.63500551  0.94788141]
 [ 1.76588148  1.63856465]
 [ 0.65026599  0.6899672 ]
 [ 0.06854287  0.29712499]
 [ 0.61575737  0.87526625]
 [ 0.30057552  0.54475194]
 [ 0.66578769  0.21034844]
 [ 0.94670438  0.7699764 ]
 [ 0.39870371  0.91681577]
 [ 1.37531351  1.62337899]
 [ 1.92350877  1.34382017]
 [ 0.56631877  0.77456137]
 [ 1.18702642  0.63700271]
 [ 0.74002244  1.04535471]
 [ 0.3272063   0.75097037]
 [ 1.57583435  1.55809705]
 [ 0.44325124  0.39620769]
 [ 0.59762516  0.58304621]
 [ 0.72253698  0.68302097]
 [ 0.93459597  1.01101948]
 [ 0.50139577  0.52500942]
 [ 0.84696441  0.68679341]
 [ 0.63483432  0.22205385]
 [ 1.43642478  1.34724612]
 [ 1.58663111  1.49941374]
 [ 0.73832806  0.95690866]]
>>> 

答案 2 :(得分:2)

尽管numpy具有方便的实用程序功能,但您始终可以“重新缩放”多个独立的正态分布变量以匹配给定的协方差矩阵。因此,如果您可以生成列向量x(或在矩阵中分组的许多向量),其中每个元素都是正态分布的,并且您按矩阵M进行缩放,则结果将具有协方差{{1 }}。相反,如果将协方差M M^T分解为C形式,那么即使没有numpy提供的实用函数,生成这样的分布也非常简单(只需将正常分布的向量乘以{{1} })。

这可能不是你直接寻找的答案,但记住这一点很有用:

  • 如果您发现自己缩放随机生成的结果,则可以将缩放与初始协方差相结合
  • 如果您需要将代码移植到不直接支持这种实用方法的库,那么您自己就很容易实现。