Question

This is a fairly complicated question, so be prepared! I want to generate some test data in Excel for my EAV table. My columns are:

user_id，attribute，value

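Each user_id will be repeated a random number of times, between 1 and 4. For each entry I want to pick a random attribute from a list, and then a random value that this attribute can take. Finally, I want the attributes to be unique for each id, i.e. I don't want more than one entry with the same id and attribute. Here is an example of what I mean: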

user_id attribute   value
100001  gender      male
100001  religion    jewish
100001  university  imperial
100002  gender      female
100002  course      physics

Possible values:

attribute   value
gender      male
            female
course      maths
            physics
            chemistry
university  imperial
            cambridge
            oxford
            ucl
religion    jewish
            hindu
            christian
            muslim

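Sorry that the tables above are messed up. I don't know how to paste them here while keeping the structure! Hopefully you can see what I mean; otherwise I can provide screenshots.

How would I go about this? In the past I have generated random data using random number generators and VLOOKUP, but this is a bit out of my league.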

Answer 1

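My approach is to create a table with all four attributes for each ID, and then randomly filter that table to get between one and four filtered rows per ID. I assign a random value to each attribute. The basic setup looks like this: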

[Screenshot: randomized EAV table with lookup table]

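On the left is the randomized EAV table, and on the right is the lookup table used for the random values. Here are the formulas. Enter them in row 2 and copy them down: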

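Column A - Sets up one random number from 1 to 4 per block of four rows (that is, per ID). This determines which attribute must be selected, so every ID keeps at least one row: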

=IF(COUNTIF(C$2:C2,C2)=1,RANDBETWEEN(1,4),A1)

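Column B - Uses the number in column A to decide whether the row is included. The row whose position within the ID matches A is always TRUE; every other row is a coin flip: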

=IF(COUNTIF(C$2:C2,C2)=A2,TRUE,RANDBETWEEN(0,1)=1)

Column C - Creates the IDs, starting at 100001:

=(INT((ROW()-2)/4)+100000)+1

Column D - Repeats the four attributes:

=CHOOSE(MOD(ROW()-2,4)+1,"gender","course","university","religion")

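Column E - Finds the first occurrence of the column D attribute in the lookup table and picks a value at a random offset from it. The COUNTIF part counts how many lookup rows carry that attribute, so the attribute name has to be repeated on every row of G2:G14: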

=INDEX($H$2:$H$14,(MATCH(D2,$G$2:$G$14,0))+RANDBETWEEN(0,COUNTIF($G$2:$G$14,D2)-1))
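For readers more comfortable with code than with nested worksheet functions, here is a rough Python sketch of what the formulas in columns C, D and E compute. The names `id_and_attribute`, `random_value` and `LOOKUP` are made up for illustration and do not appear in the spreadsheet. The sketch assumes the lookup table repeats the attribute name on every row and keeps the rows of one attribute next to each other, which is what the MATCH plus random offset trick relies on.

```python
import random

# Lookup table as it is assumed to sit in G2:H14: attribute repeated on
# every row, rows of the same attribute kept together.
LOOKUP = [
    ("gender", "male"), ("gender", "female"),
    ("course", "maths"), ("course", "physics"), ("course", "chemistry"),
    ("university", "imperial"), ("university", "cambridge"),
    ("university", "oxford"), ("university", "ucl"),
    ("religion", "jewish"), ("religion", "hindu"),
    ("religion", "christian"), ("religion", "muslim"),
]
ATTRIBUTES = ["gender", "course", "university", "religion"]


def id_and_attribute(row, first_row=2, first_id=100001):
    """Mirror columns C and D for a 1-based worksheet row number."""
    offset = row - first_row                           # ROW()-2
    user_id = offset // len(ATTRIBUTES) + first_id     # INT(offset/4)+100000+1
    attribute = ATTRIBUTES[offset % len(ATTRIBUTES)]   # CHOOSE(MOD(offset,4)+1, ...)
    return user_id, attribute


def random_value(attribute, rng=random):
    """Mirror column E: first match plus a random offset within the group."""
    names = [name for name, _ in LOOKUP]
    first = names.index(attribute)                     # MATCH(..., 0), 0-based here
    count = names.count(attribute)                     # COUNTIF
    return LOOKUP[first + rng.randint(0, count - 1)][1]  # INDEX


# Worksheet rows 2 to 7 cover all of ID 100001 and the start of ID 100002.
for row in range(2, 8):
    user_id, attribute = id_and_attribute(row)
    print(user_id, attribute, random_value(attribute))
```

The IDs and attributes printed are fixed by the row number; only the value in the last column changes from run to run.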

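When you filter on TRUE in column B, you get a list of one to four attributes per ID. Frustratingly, filtering forces a recalculation, so the filtered list will no longer show TRUE for every cell in column B.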

If this were mine, I would automate it a bit more, perhaps by putting the "magic number" 4 (the number of attributes) in its own cell.
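The selection logic of columns A and B, with the "magic number" 4 turned into a parameter, can be sketched in Python roughly as follows. The function name `pick_rows` and its parameters are hypothetical and only serve the illustration. The point is that one row per ID is forced to TRUE and the rest are coin flips, which is why every ID ends up with between one and four rows.

```python
import random


def pick_rows(n_attributes=4, rng=random):
    """Return one include/exclude flag per attribute row of a single ID."""
    # Column A: one random position per ID, shared by all rows of that ID.
    forced = rng.randint(1, n_attributes)
    flags = []
    for position in range(1, n_attributes + 1):
        # Column B: the forced row is always kept,
        # every other row is kept with probability 1/2.
        flags.append(position == forced or rng.randint(0, 1) == 1)
    return flags


flags = pick_rows()
print(flags, "->", sum(flags), "row(s) kept")
```

Unlike the worksheet, the flags here stay fixed once computed, so there is no recalculation surprise when the result is filtered afterwards.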

Answer 2

There are many ways to do this. You could use Perl or Python; both have modules for working with spreadsheets. In this case I used Python and the openpyxl module.

# File:  datagen.py
# Usage: python datagen.py <excel (.xlsx) filename to store data>
# Example:  python datagen.py myfile.xlsx
# Requires Python 3 and the openpyxl package.

import sys
import random
from openpyxl import Workbook

# verify that user specified an argument
if len(sys.argv) < 2:
    print("Specify an excel filename to save the data, e.g. myfile.xlsx")
    sys.exit(1)

# get the excel workbook and worksheet objects
wb = Workbook()
ws = wb.active

# Modify this line to specify the range of user ids (the end value is excluded)
ids = range(100001, 100100)

# data structure for the attributes and values
data = {'gender':     ['male',     'female'],
        'course':     ['maths',    'physics',   'chemistry'],
        'university': ['imperial', 'cambridge', 'oxford',    'ucl'],
        'religion':   ['jewish',   'hindu',     'christian', 'muslim']}

# Write column headers in the spreadsheet
ws.cell(row=1, column=1).value = 'user_id'
ws.cell(row=1, column=2).value = 'attribute'
ws.cell(row=1, column=3).value = 'value'

row = 1

# Loop through each user id
for user_id in ids:
    # randomly select how many attributes to use
    attr_cnt = random.randint(1, len(data))
    # fresh list of attribute names for this user id
    attributes = list(data.keys())
    for idx in range(attr_cnt):
        # randomly select attribute
        attr = random.choice(attributes)
        # remove the selected attribute from further selection for this user id
        attributes.remove(attr)
        # randomly select a value for the attribute
        value = random.choice(data[attr])
        row = row + 1
        # write the values for the current row in the spreadsheet
        ws.cell(row=row, column=1).value = user_id
        ws.cell(row=row, column=2).value = attr
        ws.cell(row=row, column=3).value = value

# save the spreadsheet using the filename specified on the cmd line
wb.save(sys.argv[1])
print("Done!")
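
As a possible variation on the script above, the sketch below builds the rows in memory first, so the "no duplicate id and attribute" requirement can be checked before anything is written to a file. It swaps the choose-and-remove loop for `random.sample`, which picks distinct attributes in one call. The function name `generate_rows` and the `seed` parameter are assumptions added for illustration; seeding just makes a test run repeatable.

```python
import random

DATA = {
    "gender": ["male", "female"],
    "course": ["maths", "physics", "chemistry"],
    "university": ["imperial", "cambridge", "oxford", "ucl"],
    "religion": ["jewish", "hindu", "christian", "muslim"],
}


def generate_rows(ids, data=DATA, seed=None):
    """Return (user_id, attribute, value) tuples, 1 to len(data) per id."""
    rng = random.Random(seed)
    names = sorted(data)  # fixed order so a given seed reproduces the same rows
    rows = []
    for user_id in ids:
        count = rng.randint(1, len(names))
        # sample() never returns the same attribute twice for one id
        for attr in rng.sample(names, count):
            rows.append((user_id, attr, rng.choice(data[attr])))
    return rows


rows = generate_rows(range(100001, 100100), seed=42)

# Check the uniqueness requirement from the question.
pairs = [(user_id, attr) for user_id, attr, _ in rows]
assert len(pairs) == len(set(pairs))
print(len(rows), "rows generated")
```

Each tuple can then be written out with the same openpyxl calls as in the script, or passed to `ws.append(...)` one row at a time.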

Generating test data for an EAV table in Excel
