Creating fake data similar to a given template

Date: 2018-07-31 11:14:20

Tags: python pandas faker

I want to produce several fake Excel files containing data like the following:

DATE       CAR           Cost  Outlet  Code      
2012/01/01 BMW           100   AA      2187 
2012/01/01 Mercedes Benz 200   AA      2187    
2012/01/01 BMW           100   AA      2187 
2012/01/02 Volvo         100   AA      2187  
2012/01/02 BMW           50    AA      2187
2012/01/03 Mercedes Benz 75    AA      2187
...
2012/09/01 BMW           200   AA      2187
2012/09/02 Volvo         100   AA      2187  

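The idea is to be able to create fake data that follows the same layout as the template above. The values themselves can be random.

What is the best way to create fake tabular data for data analysis?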

3 answers:

Answer 0: (score: 1)

Here is a suggestion that creates random records in new worksheets; it could be adapted to create workbooks instead.

Sub createRandom()

    Dim aCar(3)          ' indexes 0-3; only 0-2 are filled below
    Dim aOutlet(1)
    Dim aCode(1)
    Dim startDate
    Dim i%, sheet%
    Dim sh As Worksheet  ' Sheets.Add returns a sheet, not a workbook

    aCar(0) = "BMW"
    aCar(1) = "Mercedes Benz"
    aCar(2) = "Volvo"

    aOutlet(0) = "AA"
    aCode(0) = 2187


    Randomize            ' seed Rnd so each run produces different data
    startDate = DateSerial(2012, 1, 1)

    For sheet = 1 To 5
        Set sh = ActiveWorkbook.Sheets.Add()
        sh.Cells(1, 1) = "Date"
        sh.Cells(1, 2) = "CAR"
        sh.Cells(1, 3) = "Cost"
        sh.Cells(1, 4) = "Outlet"
        sh.Cells(1, 5) = "Code"
        For i = 2 To 100
            sh.Cells(i, 1) = DateAdd("d", Rnd * 28 + 1, startDate) 'Random date
            sh.Cells(i, 2) = aCar(Int(UBound(aCar) * Rnd)) ' UBound(aCar) is 3, so this picks index 0-2
            sh.Cells(i, 3) = Int((100) * Rnd) ' 0-99
            sh.Cells(i, 4) = aOutlet(0)
            sh.Cells(i, 5) = aCode(0)
        Next
    Next
End Sub
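
For readers working in Python rather than VBA, roughly the same logic can be sketched with only the standard library. This is an illustrative counterpart to the macro above, not part of the original answer: the helper name `make_rows` is hypothetical, the date window only approximates the macro's, and the rows are sorted by date to resemble the template.

```python
import random
from datetime import date, timedelta

CARS = ["BMW", "Mercedes Benz", "Volvo"]
HEADER = ["DATE", "CAR", "Cost", "Outlet", "Code"]


def make_rows(n_rows):
    """Return n_rows random records shaped like the template (hypothetical helper)."""
    start = date(2012, 1, 1)
    rows = []
    for _ in range(n_rows):
        # Pick a day 1 to 28 days after the start date
        day = start + timedelta(days=random.randint(1, 28))
        rows.append([
            day.strftime("%Y/%m/%d"),
            random.choice(CARS),
            random.randrange(0, 100),  # cost from 0 to 99
            "AA",
            2187,
        ])
    # The template lists dates in ascending order; YYYY/MM/DD strings sort correctly
    rows.sort(key=lambda r: r[0])
    return rows


rows = make_rows(5)
for row in [HEADER] + rows:
    print(row)
```

The resulting list of rows could then be handed to whatever writer is preferred, for example `pandas.DataFrame(rows, columns=HEADER)`.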

Answer 1: (score: 1)

You can try the following:

import pandas as pd
import random
from datetime import datetime
from faker import Faker
from faker.providers import BaseProvider

fake = Faker()

# This custom Provider inherits from the BaseProvider
class Provider(BaseProvider):

    # You can change these values as needed.
    start_date = datetime(2012, 1, 1)
    end_date = datetime(2012, 12, 1)    
    cars = ['BMW', 'Mercedes Benz', 'Volvo']    
    cost_start = 50
    cost_end = 200    
    outlets = ['AA', 'BB', 'CC']    
    code_start = 2000
    code_end = 2200


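    def date(self):
        """Return a random date between the start and end dates as 'YYYY/MM/DD'."""

        # Return the value directly; assigning it to self.date would
        # replace this method with a string on the provider instance.
        return self.generator.date_between_dates(
            date_start=self.start_date, date_end=self.end_date).strftime('%Y/%m/%d')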

    def car(self):
        """Return a random car from cars."""        

        return random.choice(self.cars)

    def cost(self):
        """Return a random cost from cost_start up to, but not including, cost_end."""

        return random.randrange(self.cost_start, self.cost_end)

    def outlet(self):
        """Return a random outlet."""        

        return random.choice(self.outlets)

    def code(self):
        """Return a random code from code_start up to, but not including, code_end."""

        return random.randrange(self.code_start, self.code_end)        


# Add the Provider to our faker object
fake.add_provider(Provider)

def create_fake_data(fake, no_of_rows):

    columns = ['date', 'car', 'cost', 'outlet', 'code']
    data = {column: [getattr(fake, column)() for _ in range(no_of_rows)] for column in columns}
    df = pd.DataFrame(data=data)
    df = df[columns]

    return df

print(create_fake_data(fake, 10))

The DataFrame that is printed:

         date            car  cost outlet  code
0  2012/07/01            BMW   173     BB  2059
1  2012/11/14            BMW   120     BB  2026
2  2012/11/23          Volvo    81     AA  2078
3  2012/04/01          Volvo    98     CC  2040
4  2012/01/03          Volvo   171     BB  2173
5  2012/08/29  Mercedes Benz   193     BB  2086
6  2012/08/25          Volvo   156     CC  2018
7  2012/07/13          Volvo    92     CC  2065
8  2012/04/15          Volvo    75     CC  2096
9  2012/07/04            BMW    87     AA  2145
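
Because the provider above draws the car, cost, outlet, and code values from Python's `random` module, seeding that module should make those columns repeatable between runs. Below is a minimal standard-library sketch of the effect; `sample_row` is a hypothetical stand-in for the provider methods, and the date column, which comes from Faker, is not covered here.

```python
import random

cars = ['BMW', 'Mercedes Benz', 'Volvo']
outlets = ['AA', 'BB', 'CC']


def sample_row():
    """Hypothetical stand-in for Provider.car/cost/outlet/code."""
    return (
        random.choice(cars),
        random.randrange(50, 200),      # upper bound is exclusive
        random.choice(outlets),
        random.randrange(2000, 2200),   # upper bound is exclusive
    )


random.seed(42)
first = [sample_row() for _ in range(3)]

random.seed(42)                         # same seed, same sequence
second = [sample_row() for _ in range(3)]

print(first == second)  # True
```

This is handy when a test or an analysis notebook needs the same fake data every time it runs.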

You can change any or all of the values stored in the class variables:

Provider.start_date = datetime(2018, 1, 1)
Provider.end_date = datetime(2018, 9, 1)    
Provider.cars.append('Tesla')
Provider.cost_start = 100
Provider.cost_end = 300    
Provider.outlets.append('DD')
Provider.code_start = 3000
Provider.code_end = 4300

print(create_fake_data(fake, 5))

New output:

         date            car  cost outlet  code
0  2018/01/29          Volvo   246     DD  3447
1  2018/05/18            BMW   282     AA  3800
2  2018/04/08  Mercedes Benz   175     AA  3547
3  2018/01/07          Tesla   215     CC  3652
4  2018/03/11          Tesla   267     CC  3480
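
One thing to keep in mind with this approach: `Provider.cars.append('Tesla')` mutates the list object itself, so every piece of code that reads `Provider.cars` sees the change, whereas `Provider.cost_start = 100` simply rebinds the attribute. If two configurations are wanted side by side, a subclass with its own values is one option. The sketch below uses plain classes (the hypothetical `CarConfig` and `Config2018`) rather than Faker so that it runs on its own.

```python
class CarConfig:
    cars = ['BMW', 'Mercedes Benz', 'Volvo']
    cost_start = 50
    cost_end = 200


class Config2018(CarConfig):
    # Build a new list instead of appending to the inherited one
    cars = CarConfig.cars + ['Tesla']
    cost_start = 100
    cost_end = 300


print(CarConfig.cars)    # ['BMW', 'Mercedes Benz', 'Volvo']
print(Config2018.cars)   # ['BMW', 'Mercedes Benz', 'Volvo', 'Tesla']
print(CarConfig.cost_start, Config2018.cost_start)  # 50 100
```

The same pattern should carry over to a subclass of the `Provider` class, though that is an assumption and was not tested in the original answer.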

To write to Excel, with different data in each spreadsheet:

for i in range(5):
    df = create_fake_data(fake, 10) 
    df.to_excel('data_' + str(i) + '.xlsx', index=False) # Stored in your current folder

Answer 2: (score: 1)

You can use the pydbgen package to create random data and get it back as a pandas DataFrame:

from pydbgen import pydbgen
myDB=pydbgen.pydb()
myDB.gen_dataframe(5,['name','city','phone','date'])

This returns a pandas DataFrame with five rows and the columns name, city, phone, and date.