How do I create dummy data with different data types in Python?

Time: 2015-05-26 09:12:14

Tags: python

I am trying to create a dummy logistics dataset so that I can run some analysis, and possibly predictions, on the data.

Assumed variables are as follows:

 VARIABLES                     RANGES
 awb                           random number eg:235533
 destination_city              random cities
 product                       different products
 product_category              different categories
 origin_city                   random metro cities
 logistics_provider_id         id's eg:1,20,28,27
 dispatch_date                 datetime between mar01-2015 to mar15-2015
 final_delivery_status         created,delivered,returned
 actual_delivery_date          datetime between mar16-2015 to mar30-2015
 promised_delivery_date        datetime between mar25-2015 to Apr6-2015

So, given the variables above, I want to create dummy data within the ranges mentioned. How can I create this dummy data using Python?
Expected output:

example_dummy_data:

  awb        destination_city   product               product_category
1 104842891  Byatarayanapura    Wrangler Denim Jeans  Men's Clothing
2 104842938  Bareilly           Sky Blue Denim        Men's Clothing
3 104842942  Saharanpur         puma shoes            Men's Footwear
4 104842943  Saharanpur         classic puma shoes    Men's Footwear
5 104843066  Mumbai             Elegant black belt    Fashion Accessories

  origin_city  log_prov_id   dispatch date          final_del_status
1 Gurgaon      18            2014-09-02 00:26:11       DEL
2 Bangalore    19            2014-09-01 23:34:30       RTN
3 New Delhi    18            2014-09-01 18:59:41       RTC
4 New Delhi    15            2014-09-02 00:05:33       DEL
5 Hyderabad    16            2014-09-01 22:09:14       UDL

  Actual_del_date        promised_del_date
1 2014-09-03 00:00:00   2014-09-05 20:00:00
2 2014-09-04 00:00:00   2014-09-06 20:00:00
3 2014-09-04 00:00:00   2014-09-06 20:00:00
4 2014-09-04 00:00:00   2014-09-07 20:00:00
5 2014-09-02 00:00:00   2014-09-06 20:00:00

I want to create data like the above with 10000 rows. What is the best way to create it within the ranges mentioned?

Tried:

import random
a = [int(10000*random.random()) for i in xrange(10000)]

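I found how to generate random numbers, but not how to generate them within the ranges and cities I want. So please help me create 10000 rows of dummy data like the example, within the ranges I mentioned.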

2 answers:

Answer 0 (score: 4)

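The faker package is built for this kind of use case. It already handles names, integers and dates, but you will probably want to add your own products and product categories.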

import random
from datetime import datetime  # needed by date_between below

import pandas
from faker import Factory

faker = Factory.create()
status = 'created,delivered,returned'.split(',')

def date_between(d1, d2):
    f = '%b%d-%Y'
    return faker.date_time_between_dates(datetime.strptime(d1, f), datetime.strptime(d2, f))

def fakerecord():
    return {'awb': faker.numerify('######'),  # random number eg:235533
            'destination_city': faker.city(),  # random cities
            'product': 'random_product',  # different products
            'product_category': 'random_category',  # different categories
            'origin_city': faker.city(),  # random metro cities
            'logistics_provider_id': faker.numerify('##'),  # id's eg:1,20,28,27
            'dispatch_date': date_between('mar01-2015', 'mar15-2015'),  # datetime between mar01-2015 to mar15-2015
            'final_delivery_status': random.choice(status),  # created,delivered,returned
            'actual_delivery_date': date_between('mar16-2015', 'mar30-2015'),  # datetime between mar16-2015 to mar30-2015
            'promised_delivery_date': date_between('mar25-2015', 'apr06-2015'),  # datetime between mar25-2015 to Apr6-2015
            }

example_dummy_data = pandas.DataFrame([fakerecord() for _ in range(10000)])  # 10000 rows, as asked
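
The 'random_product' and 'random_category' placeholders can be replaced with values drawn from your own lists. Below is a minimal sketch using only the standard library. CATALOG and random_product are hypothetical names, and the entries are simply the products shown in the question's example output. Picking the category first and then a product within it keeps the two columns consistent with each other.

```python
import random

# Hypothetical catalogue: category -> products.
# Entries are taken from the question's example output; extend as needed.
CATALOG = {
    "Men's Clothing": ["Wrangler Denim Jeans", "Sky Blue Denim"],
    "Men's Footwear": ["puma shoes", "classic puma shoes"],
    "Fashion Accessories": ["Elegant black belt"],
}


def random_product(rng=random):
    """Return a (product, product_category) pair that belong together."""
    category = rng.choice(list(CATALOG))
    product = rng.choice(CATALOG[category])
    return product, category


product, category = random_product()
print(product, "|", category)
```

Inside fakerecord() the pair could then be unpacked once and assigned to the 'product' and 'product_category' keys.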

Answer 1 (score: 3)

  

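"I found how to generate random numbers, but not how to generate them within the ranges and cities I want. So please help me create 10000 rows of dummy data like the example, within the ranges I mentioned."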

Random range:

from random import randint

xs = randint(0, 1000)  # random int between 0 and 1000

Random choice:

from random import choice

cities = ["Brisbane", "Sydney", "Melbourne"]

random_city = choice(cities)  # A randomly selected city from cities
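
To fill a whole column at once, random.choices (Python 3.6+) draws k values with replacement, and a seeded random.Random instance makes the dummy data repeatable between runs. This is only a small sketch: the seed and the weights are arbitrary assumptions, not something given in the question.

```python
from random import Random

rng = Random(42)  # assumed seed; any fixed value gives repeatable output

statuses = ["created", "delivered", "returned"]  # values from the question
# Assumed weights (most shipments delivered); adjust or drop as needed.
status_column = rng.choices(statuses, weights=[1, 7, 2], k=10000)

provider_ids = [1, 20, 28, 27]  # example ids from the question
provider_column = rng.choices(provider_ids, k=10000)

print(len(status_column), status_column[:5])
```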

Random date (thanks to ngRepeat):

from random import randrange
from datetime import datetime, timedelta  # datetime is used in the example call

def random_date(start, end):
    """Return a random date between two datetime objects start and end"""

    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    random_second = randrange(int_delta)

    return start + timedelta(seconds=random_second)

Output:

>>> random_date(datetime(2015, 6, 1), datetime(2015, 9, 1))
datetime.datetime(2015, 7, 19, 11, 59, 46)

See:

The rest, how you actually build the dataset, is up to you.
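
One possible way to put these pieces together, sketched with the standard library only. The city lists are placeholders (the question does not give them), make_row is an illustrative helper name, the product columns are left out for brevity, and the date bounds from the question are treated as half-open intervals, which is a simplifying assumption.

```python
import csv
import io
from datetime import datetime, timedelta
from random import choice, randint, randrange

# Placeholder value lists; replace with your own.
DEST_CITIES = ["Bareilly", "Saharanpur", "Mumbai"]
ORIGIN_CITIES = ["Gurgaon", "Bangalore", "New Delhi", "Hyderabad"]
PROVIDER_IDS = [1, 20, 28, 27]
STATUSES = ["created", "delivered", "returned"]


def random_date(start, end):
    """Return a random datetime in [start, end), at one-second resolution."""
    seconds = int((end - start).total_seconds())
    return start + timedelta(seconds=randrange(seconds))


def make_row():
    """Build one dummy shipment record as a dict."""
    return {
        "awb": randint(100000, 999999),  # six-digit number
        "destination_city": choice(DEST_CITIES),
        "origin_city": choice(ORIGIN_CITIES),
        "logistics_provider_id": choice(PROVIDER_IDS),
        "dispatch_date": random_date(datetime(2015, 3, 1), datetime(2015, 3, 15)),
        "final_delivery_status": choice(STATUSES),
        "actual_delivery_date": random_date(datetime(2015, 3, 16), datetime(2015, 3, 30)),
        "promised_delivery_date": random_date(datetime(2015, 3, 25), datetime(2015, 4, 6)),
    }


rows = [make_row() for _ in range(10000)]

# Write to an in-memory CSV; use open("dummy.csv", "w", newline="")
# instead of io.StringIO() to save to a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue().splitlines()[0])  # header line
```

The same list of dicts can also be passed straight to pandas.DataFrame, as in the faker answer.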