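Here is my problem: I have to generate some mutually correlated synthetic data (about 7/8 columns), with the correlation measured by the Pearson coefficient. I can do that easily, but then I have to insert a certain percentage of duplicates into each column (yes, the Pearson coefficient will be lower), and the percentage differs for each column. The problem is that I don't want to insert the duplicates by hand, because in my case that would amount to cheating.
Does anyone know how to generate correlated data that already contains duplicates? I have searched, but the questions I find are usually about removing or avoiding duplicates.
Language: python3. To generate the correlated data I used this simple code: Generating correlated data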
Answer 0: (score: 0)
Try something like this:
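import numpy as np
# Pick random row indices (with replacement); percentage is a fraction between 0 and 1
indices = np.random.randint(0, array.shape[0], size=int(np.ceil(percentage * array.shape[0])))
# An ndarray has no append method, so stack the selected rows below the original ones
array = np.vstack([array, array[indices]])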
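Here I assume your data is stored in array, an ndarray in which each row holds your 7/8 columns of data.
The code above creates an array of random indices, selects the rows at those indices and appends them to the array again.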
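A minimal, self-contained sketch of the same idea, assuming the data is a 2-D ndarray and `percentage` is a fraction of the original row count. The helper name `append_duplicate_rows` and the use of `np.random.default_rng` are my own choices for illustration, not part of the answer above.

```python
import numpy as np


def append_duplicate_rows(array, percentage, rng=None):
    """Return a new array with randomly chosen rows of `array` appended again.

    Hypothetical helper: `percentage` is a fraction of the original row count.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_rows = array.shape[0]
    n_dup = int(np.ceil(percentage * n_rows))
    # Sample row indices with replacement from the original rows.
    indices = rng.integers(0, n_rows, size=n_dup)
    # vstack builds a new array; the input is left unchanged.
    return np.vstack([array, array[indices]])


rng = np.random.default_rng(0)
data = rng.random((100, 8))
augmented = append_duplicate_rows(data, 0.25, rng)
print(augmented.shape)  # (125, 8)
```

Because whole rows are repeated, every column receives the same share of duplicates, so this alone does not give a different percentage per column.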
Answer 1: (score: 0)
I found a solution. I'm posting the code in case it helps someone.
In the end, each column gets its own percentage of duplicates; the code is:
import numpy as np
import pandas as pd
from scipy.linalg import cholesky
from statsmodels.stats.correlation_tools import cov_nearest

# This is the data, generated randomly with a given shape
rnd = np.random.random(size=(10**7, 8))
# This array represents one column of the covariance matrix (I want correlated
# data, so I randomly choose numbers between 0.8 and 0.95)
attr1 = np.random.uniform(0.8, .95, size=(8, 1))
# attr2 ... attr8 are built like attr1, with varying ranges of values (all above 0.7);
# the single range used on the next lines is only a placeholder
attr2, attr3, attr4, attr5, attr6, attr7, attr8 = (
    np.random.uniform(0.7, .95, size=(8, 1)) for _ in range(7))
# corr_mat is the matrix formed by joining the columns
corr_mat = np.column_stack((attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8))
# Using this function I find the nearest covariance matrix to my matrix,
# to be sure that it is positive definite
a = cov_nearest(corr_mat)
upper_chol = cholesky(a)
# Finally, compute the matrix product of rnd and upper_chol
ans = rnd @ upper_chol
# ans now holds randomly correlated data (high correlation, but customizable)
# Next I create a pandas DataFrame with the values of ans
df = pd.DataFrame(ans, columns=['att1', 'att2', 'att3', 'att4',
                                'att5', 'att6', 'att7', 'att8'])
# The last step is to round the float values of ans to a different number of
# decimals in each column, so that I get duplicates in varying percentages
a = df.values
for i in range(8):
    # The float values of ans have 16 decimals, so I randomly choose an int
    # from 5 to 11 and round every value of the column to that many decimals
    trunc = np.random.randint(5, 12)
    print(trunc)
    a.T[i] = a.T[i].round(decimals=trunc)
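To check the result, the share of duplicates in each column and the Pearson coefficients after rounding can be measured directly. The following is a small-scale sketch rather than the code above: the 4x4 target matrix with 0.9 off the diagonal, the sample size and the `decimals` list are all assumed values chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Assumed target correlation matrix: 1 on the diagonal, 0.9 elsewhere.
n_cols = 4
target = np.full((n_cols, n_cols), 0.9)
np.fill_diagonal(target, 1.0)

# Correlate independent uniform columns with the Cholesky factor.
# np.linalg.cholesky returns the lower factor L with L @ L.T == target,
# so the raw data is multiplied by L.T.
raw = rng.random((20_000, n_cols))
data = raw @ np.linalg.cholesky(target).T

# Round each column to a different number of decimals (assumed values).
decimals = [1, 2, 3, 4]
rounded = pd.DataFrame(
    {f"att{i + 1}": np.round(data[:, i], d) for i, d in enumerate(decimals)}
)

# Percentage of values in each column that repeat an earlier value.
dup_pct = rounded.apply(lambda col: col.duplicated().mean() * 100)
print(dup_pct.round(2))

# Pearson correlations after rounding.
print(rounded.corr().round(3))
```

Rounding to fewer decimals leaves fewer distinct values and therefore more duplicates, so the number of decimals is the knob that sets the duplicate percentage of a column. Coarser rounding also tends to pull the Pearson coefficient slightly below the target.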