I am trying to create a dummy logistics dataset so that I can run some analysis, and possibly predictions, on the data.
The assumed variables are as follows:
VARIABLES                RANGES
awb                      random number, e.g. 235533
destination_city         random cities
product                  different products
product_category         different categories
origin_city              random metro cities
logistics_provider_id    IDs, e.g. 1, 20, 28, 27
dispatch_date            datetime between mar01-2015 and mar15-2015
final_delivery_status    created, delivered, returned
actual_delivery_date     datetime between mar16-2015 and mar30-2015
promised_delivery_date   datetime between mar25-2015 and apr06-2015
So, given the variables assumed above, I want to create dummy data within the ranges mentioned. How can I create this dummy data using Python?
Expected output:
example_dummy_data:
   awb        destination_city  product               product_category
1  104842891  Byatarayanapura   Wrangler Denim Jeans  Men's Clothing
2  104842938  Bareilly          Sky Blue Denim        Men's Clothing
3  104842942  Saharanpur        puma shoes            Men's Footwear
4  104842943  Saharanpur        classic puma shoes    Men's Footwear
5  104843066  Mumbai            Elegant black belt    Fashion Accessories
   origin_city  log_prov_id  dispatch date        final_del_status
1  Gurgaon      18           2014-09-02 00:26:11  DEL
2  Bangalore    19           2014-09-01 23:34:30  RTN
3  New Delhi    18           2014-09-01 18:59:41  RTC
4  New Delhi    15           2014-09-02 00:05:33  DEL
5  Hyderabad    16           2014-09-01 22:09:14  UDL
   Actual_del_date      promised_del_date
1  2014-09-03 00:00:00  2014-09-05 20:00:00
2  2014-09-04 00:00:00  2014-09-06 20:00:00
3  2014-09-04 00:00:00  2014-09-06 20:00:00
4  2014-09-04 00:00:00  2014-09-07 20:00:00
5  2014-09-02 00:00:00  2014-09-06 20:00:00
I want to create 10,000 rows of data like the above. What is the best way to generate it within the ranges mentioned?
Tried:
import random
a = [int(10000*random.random()) for i in range(10000)]
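I found out how to generate random numbers, but not how to generate values within the ranges and cities I want. So please help me create 10,000 rows of dummy data like the example, within the ranges I mentioned.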
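Answer 0 (score: 4)
The faker package is built for this kind of use case. It already handles names, integers and dates, but you will probably want to add your own products and product categories.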
import random
from datetime import datetime

import pandas
from faker import Factory

faker = Factory.create()
status = 'created,delivered,returned'.split(',')

def date_between(d1, d2):
    # Parse strings such as 'mar01-2015' and pick a random datetime between them
    f = '%b%d-%Y'
    return faker.date_time_between_dates(datetime.strptime(d1, f), datetime.strptime(d2, f))

def fakerecord():
    return {'awb': faker.numerify('######'),  # random number eg:235533
            'destination_city': faker.city(),  # random cities
            'product': 'random_product',  # different products
            'product_category': 'random_category',  # different categories
            'origin_city': faker.city(),  # random metro cities
            'logistics_provider_id': faker.numerify('##'),  # id's eg:1,20,28,27
            'dispatch_date': date_between('mar01-2015', 'mar15-2015'),  # datetime between mar01-2015 to mar15-2015
            'final_delivery_status': random.choice(status),  # created,delivered,returned
            'actual_delivery_date': date_between('mar16-2015', 'mar30-2015'),  # datetime between mar16-2015 to mar30-2015
            'promised_delivery_date': date_between('mar25-2015', 'apr06-2015'),  # datetime between mar25-2015 to Apr6-2015
            }

# 10,000 rows, as asked in the question
example_dummy_data = pandas.DataFrame([fakerecord() for _ in range(10000)])
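The 'random_product' and 'random_category' placeholders could be filled from your own lists. Here is a minimal sketch using only the standard library. The CATALOG dictionary and the random_product() helper are hypothetical names, and the entries are simply borrowed from the sample rows in the question:

```python
import random

# Hypothetical catalog: category -> products (entries taken from the question's sample rows)
CATALOG = {
    "Men's Clothing": ["Wrangler Denim Jeans", "Sky Blue Denim"],
    "Men's Footwear": ["puma shoes", "classic puma shoes"],
    "Fashion Accessories": ["Elegant black belt"],
}

def random_product():
    """Pick a category first, then a product from it, so the pair stays consistent."""
    category = random.choice(list(CATALOG))
    product = random.choice(CATALOG[category])
    return product, category

product, category = random_product()
print(product, '|', category)
```

Inside fakerecord() you would call random_product() once per record and use the two values for 'product' and 'product_category'. Picking the category first is just one design choice; it keeps a shoe from ending up in a clothing category.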
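Answer 1 (score: 3)
"I found out how to generate random numbers, but not how to generate values within the ranges and cities I want. So please help me create 10,000 rows of dummy data like the example, within the ranges I mentioned."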
Random range:
from random import randint
xs = randint(0, 1000)  # random int between 0 and 1000, both ends inclusive
Random choice:
from random import choice
cities = ["Brisbane", "Sydney", "Melbourne"]
random_city = choice(cities)  # a randomly selected city from cities
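One thing randint does not guarantee is uniqueness, which may matter for a field like awb if it is meant to identify a single shipment. A small sketch, assuming six-digit awb numbers as in the question's example (235533); random.sample draws without replacement, so no value repeats:

```python
import random

N_ROWS = 10000  # number of rows asked for in the question

# Draw 10,000 distinct six-digit numbers (100000..999999) without replacement
awbs = random.sample(range(100000, 1000000), N_ROWS)

print(awbs[:5])
```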
Random date: (thanks to ngRepeat)
from random import randrange
from datetime import datetime, timedelta
def random_date(start, end):
    """Return a random datetime between two datetime objects start and end"""
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    random_second = randrange(int_delta)
    return start + timedelta(seconds=random_second)
Output:
>>> random_date(datetime(2015, 6, 1), datetime(2015, 9, 1))
datetime.datetime(2015, 7, 19, 11, 59, 46)
See:
random.randint()
random.choice()
random.randrange()
datetime.datetime
The rest, namely how you assemble these pieces into a dataset, is up to you.
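For what it is worth, one way the pieces might fit together is sketched below, using only the standard library. The city, provider and product lists are assumptions for illustration (partly borrowed from the question's sample rows), and make_row() is a hypothetical helper; the date ranges are the ones given in the question.

```python
import random
from datetime import datetime, timedelta

N_ROWS = 10000

# Assumed value lists; replace them with your own
METRO_CITIES = ["New Delhi", "Mumbai", "Bangalore", "Hyderabad", "Gurgaon"]
DEST_CITIES = ["Byatarayanapura", "Bareilly", "Saharanpur", "Mumbai"]
PROVIDER_IDS = [1, 20, 28, 27]
STATUSES = ["created", "delivered", "returned"]
PRODUCTS = [  # (product, product_category) pairs
    ("Wrangler Denim Jeans", "Men's Clothing"),
    ("puma shoes", "Men's Footwear"),
    ("Elegant black belt", "Fashion Accessories"),
]

def random_date(start, end):
    """Return a random datetime in the half-open interval [start, end)."""
    seconds = int((end - start).total_seconds())
    return start + timedelta(seconds=random.randrange(seconds))

def make_row(awb):
    """Build one record as a dict, using the ranges from the question."""
    product, category = random.choice(PRODUCTS)
    return {
        "awb": awb,
        "destination_city": random.choice(DEST_CITIES),
        "product": product,
        "product_category": category,
        "origin_city": random.choice(METRO_CITIES),
        "logistics_provider_id": random.choice(PROVIDER_IDS),
        "dispatch_date": random_date(datetime(2015, 3, 1), datetime(2015, 3, 15)),
        "final_delivery_status": random.choice(STATUSES),
        "actual_delivery_date": random_date(datetime(2015, 3, 16), datetime(2015, 3, 30)),
        "promised_delivery_date": random_date(datetime(2015, 3, 25), datetime(2015, 4, 6)),
    }

# Distinct six-digit awb numbers, one per row
awbs = random.sample(range(100000, 1000000), N_ROWS)
rows = [make_row(awb) for awb in awbs]

print(len(rows))
print(rows[0])
```

The resulting list of dicts can be passed to pandas.DataFrame(rows) or written out with csv.DictWriter. Note that every row gets an actual_delivery_date here, even when the status is "created"; whether that is acceptable depends on the analysis you have in mind.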