Generating dummy data for analysis in Python

Time: 2019-09-16 08:33:54

Tags: python pandas numpy

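I'm new to data analysis and still learning. This may be a naive question, but I think this is the best platform to ask it.

For my analysis, I need to generate dummy data for a number of columns: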

Source 
Destination 
Iti
Iti_S 
N_hops
S_Date
F_Date
N_Days
w_Booked
w_left 
isBooked 
Arrive_by
p_type
p_value 

For each column, here are the constraints I need to respect:

Source - Destination: 67 tuple values in total, held in a dataframe
Iti: 4-5 itineraries for each source-destination pair (1, 2, 3, 4, 5)
Iti_S: the itinerary selected out of those 5
S_Date: within 6 months (Jan - June)
F_Date: within 30 days of the booking date (random)
N_hops: each itinerary has a number of hops associated with it, for example Iti 1 has 0 hops, Iti 2 has 1 hop, and so on
w_Booked: each itinerary has 6000 kg of weight to book; each booking is random (10-2000), and the total must not exceed 6000 for a particular date
Arrive_by: within 4 days of F_Date (randomly assigned)
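The date and hop constraints above could be derived column by column rather than row by row. The following is only a minimal sketch under several assumptions: "within 30 days" is read as 1 to 30 days after S_Date, "within 4 days" as 0 to 4 days after F_Date, N_hops as Iti minus 1, and S_Date is drawn uniformly here just to have something to build on.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000  # small sample; the same code works for 100000 rows

# S_Date: uniform over 2019-01-01 .. 2019-06-30 (181 days) as a starting point
start = pd.Timestamp("2019-01-01")
df = pd.DataFrame(
    {"S_Date": start + pd.to_timedelta(rng.integers(0, 181, size=n), unit="D")}
)

# N_Days: assumed to be 1..30 days between S_Date and F_Date
df["N_Days"] = rng.integers(1, 31, size=n)
df["F_Date"] = df["S_Date"] + pd.to_timedelta(df["N_Days"], unit="D")

# Arrive_by: assumed to be 0..4 days after F_Date
arrive_gap = rng.integers(0, 5, size=n)
df["Arrive_by"] = df["F_Date"] + pd.to_timedelta(arrive_gap, unit="D")

# Iti 1 has 0 hops, Iti 2 has 1 hop, and so on
df["Iti"] = rng.integers(1, 6, size=n)
df["N_hops"] = df["Iti"] - 1
```

Because every derived column is computed from the column it depends on, the links between columns hold by construction.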

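Earlier I generated random data for each column row by row, but since these columns are linked to each other, I want to generate 100000 entries with an upward trend in S_Date (fewer in January, more in later months).

I tried the following: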
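One way an upward trend in S_Date might be produced is weighted sampling of calendar days, with later days given larger weights. The linear weights below are an assumption chosen for illustration; any increasing weight vector would serve.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows = 100_000

# Every calendar day from January through June 2019
days = pd.date_range("2019-01-01", "2019-06-30", freq="D")

# Linearly increasing weights (an assumption): day k is k times as likely as day 1
weights = np.arange(1, len(days) + 1, dtype=float)
weights /= weights.sum()

# Draw day positions according to the weights, then map them back to dates
idx = rng.choice(len(days), size=n_rows, p=weights)
s_date = pd.Series(days[idx], name="S_Date")

# Row counts per month, to check that the trend goes up
per_month = s_date.dt.month.value_counts().sort_index()
print(per_month)
```

A steeper or flatter trend can be had by changing only the `weights` line, for example by squaring the weights before normalising.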

import random
from datetime import datetime

import pandas
from faker import Factory

faker = Factory.create()

# source and destination are arrays defined elsewhere (not shown here)
itinerary = [1, 2, 3, 4, 5]  # itinerary_list(1-5)

def date_between(d1, d2):
    # Random datetime between two dates written like 'jan01-2019'
    f = '%b%d-%Y'
    return faker.date_time_between_dates(datetime.strptime(d1, f), datetime.strptime(d2, f))

def fakerecord():
    # Fields I don't know how to compute yet are left as None
    return {'ID': faker.numerify('######'),
            'source': random.choice(source),  # random (source is an array of source objects)
            'destination': random.choice(destination),  # random (destination is an array of destination objects)
            'Iti': random.choice(itinerary),  # itinerary_list(1-5)
            'Iti_S': random.choice(itinerary),  # selected itinerary
            'S_date': date_between('jan01-2019', 'jun30-2019'),  # ids, e.g. 1, 20, 28, 27
            'f_date': None,  # needs to be within 30 days of S_date
            'num_Days': None,  # f_date - S_date
            'w_booked': None,  # random value (2, 2000); should not exceed 6000 for a particular date and itinerary
            'w_left': None,  # for a particular itinerary and date (6000 - w_booked); I have a hunch that cumsum() can be used here
            'arrive_by': None,  # random, within 4 days of f_date
            'p_type': None,  # random.choice(p_type)
            'p_value': None,  # random decimal value within a range
            }

dummy_data = pandas.DataFrame([fakerecord() for _ in range(100000)])

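My questions are below:

  1. The sources and destinations are in a separate dataframe with 67 tuple values in total. I want to use those instead of choosing from arrays. How can I achieve that here?
  2. How do I compute the variables w_booked and w_left?

The end goal is to create a dataframe with 100000 rows containing the variables above.

Thanks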
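Both questions might be approached along the following lines; this is a sketch under stated assumptions, not a definitive answer. The `pairs` dataframe is a hypothetical three-row stand-in for the real 67-row one, the flight dates are placeholders, and `w_request` is a helper column introduced here. The post does not say which date the 6000 kg cap applies to, so the sketch groups by F_Date; swap the key if S_Date is meant. It also assumes that a booking which would overshoot the cap is truncated to the remaining weight, and that later bookings in that group get 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n_rows = 5_000  # use 100_000 for the full dataset
CAP = 6000      # kg that can be booked per group

# Hypothetical stand-in for the real 67-row source-destination dataframe
pairs = pd.DataFrame({"Source": [1, 20, 28], "Destination": [27, 1, 20]})

# Question 1: sample whole rows, so each Source keeps its own Destination
df = pairs.sample(n=n_rows, replace=True, random_state=7).reset_index(drop=True)

df["Iti"] = rng.integers(1, 6, size=n_rows)
flight_days = pd.date_range("2019-01-02", periods=10, freq="D")  # placeholder dates
df["F_Date"] = flight_days[rng.integers(0, len(flight_days), size=n_rows)]

# Question 2: the cap is applied per pair, itinerary and date (an assumption)
keys = ["Source", "Destination", "Iti", "F_Date"]
df["w_request"] = rng.integers(10, 2001, size=n_rows)  # requested weight, 10..2000

# Weight requested by earlier rows of the same group
before = df.groupby(keys)["w_request"].cumsum() - df["w_request"]

# Book the request, or whatever capacity is left if that is smaller
df["w_Booked"] = np.minimum(df["w_request"], (CAP - before).clip(lower=0))
df["w_left"] = CAP - df.groupby(keys)["w_Booked"].cumsum()
df["isBooked"] = df["w_Booked"] > 0
```

The grouped `cumsum()` is what keeps the running total per group, so `w_left` never drops below zero. If truncated bookings are unwanted, rows where `w_Booked` is smaller than `w_request` could be dropped or treated as not booked instead.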

0 answers:

No answers