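I am new to data analysis and still learning. This may be a naive question, but I think this is the best platform to ask it.
For an analysis, I need to generate dummy data for a number of columns: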
Source
Destination
Iti
Iti_S
N_hops
S_Date
F_Date
N_Days
w_Booked
w_left
isBooked
Arrive_by
p_type
p_value
For each column, here are the constraints I need to respect:
Source, Destination - 67 (source, destination) tuples in total; I already have them in a dataframe
Iti - 4 to 5 itineraries for each source-destination pair (1, 2, 3, 4, 5)
Iti_S - the itinerary selected out of those 5
S_Date - spread over 6 months (January to June)
F_Date - a random date within 30 days of the booking date
N_hops - each itinerary has a number of hops associated with it; for example, Iti 1 has 0 hops, Iti 2 has 1 hop, and so on
w_Booked - each itinerary has 6000 kg of capacity; each booking is a random weight (10-2000), provided the total does not exceed 6000 for a particular date
Arrive_by - randomly assigned, within 4 days of F_Date
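To make the date and hop constraints concrete, here is a minimal sketch of how the linked fields could be derived from S_Date and Iti using only the standard library. The helper name `linked_fields` is hypothetical, and it assumes that F_Date falls 0 to 30 days after S_Date and that Arrive_by falls 0 to 4 days after F_Date; the list above does not say whether either window could also extend backwards.

```python
import random
from datetime import date, timedelta


def linked_fields(s_date, iti, rng=random):
    """Derive the columns that depend on S_Date and Iti (hypothetical helper)."""
    # Assumption: F_Date is 0-30 days after the booking date S_Date.
    n_days = rng.randint(0, 30)
    f_date = s_date + timedelta(days=n_days)
    # Assumption: Arrive_by is 0-4 days after F_Date.
    arrive_by = f_date + timedelta(days=rng.randint(0, 4))
    return {
        "S_Date": s_date,
        "F_Date": f_date,
        "N_Days": n_days,       # F_Date - S_Date, in days
        "Arrive_by": arrive_by,
        "Iti": iti,
        "N_hops": iti - 1,      # Iti 1 -> 0 hops, Iti 2 -> 1 hop, ...
    }


row = linked_fields(date(2019, 3, 15), iti=random.randint(1, 5))
```

Deriving each dependent value from the value it depends on, rather than drawing every column independently, is what keeps the columns consistent with one another.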
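Earlier I took the approach of generating each column of random data row by row, but because these columns are linked to one another, I want to generate 100000 entries with an upward trend in S_Date (fewer in January, and more as the months go on).
I tried the following: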
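One possible way to get such a trend is weighted sampling over the calendar days. The sketch below is only an illustration: the function name `trending_dates` is hypothetical, and the linearly increasing weights are an assumption, since nothing here specifies how steep the trend should be.

```python
import random
from datetime import date, timedelta


def trending_dates(n, start=date(2019, 1, 1), end=date(2019, 6, 30), seed=None):
    """Sample n dates in [start, end], with later days more likely (hypothetical helper)."""
    rng = random.Random(seed)
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    # Assumption: linear trend, so day i is (i + 1) times as likely as the first day.
    weights = [i + 1 for i in range(n_days)]
    return rng.choices(days, weights=weights, k=n)


s_dates = trending_dates(100_000, seed=42)
```

Changing the `weights` list (for example to squared values) would give a steeper or flatter trend without touching the rest of the function.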
import random
from datetime import datetime

import pandas
from faker import Factory

faker = Factory.create()

# NOTE: source, destination, itinerary and p_type are lists defined elsewhere.

def date_between(d1, d2):
    # Random datetime between two dates written like 'jan01-2019'.
    f = '%b%d-%Y'
    return faker.date_time_between_dates(datetime.strptime(d1, f), datetime.strptime(d2, f))

def fakerecord():
    return {
        'ID': faker.numerify('######'),
        'source': random.choice(source),            # random (source is an array of source objects)
        'destination': random.choice(destination),  # random (destination is an array of destination objects)
        'Iti': random.choice(itinerary),            # itinerary list (1-5)
        'Iti_S': random.choice(itinerary),          # selected itinerary
        'S_date': date_between('jan01-2019', 'jun30-2019'),  # booking date, Jan-Jun 2019
        # The fields below are placeholders; I do not know how to fill them yet.
        'f_date': None,     # needs to be within 30 days of S_date
        'num_Days': None,   # f_date - S_date
        'w_booked': None,   # random value (2, 2000); should not exceed 6000 for a particular date and itinerary
        'w_left': None,     # for a particular itinerary and date (6000 - w_booked); I have a hunch that cumsum() can be used here
        'arrive_by': None,  # random, within 4 days of f_date
        'p_type': None,     # random.choice(p_type)
        'p_value': None,    # random decimal value within a range
    }

dummy_data = pandas.DataFrame([fakerecord() for _ in range(100000)])
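The cumsum() hunch for the weight columns could be sketched as follows. This is a rough illustration under several assumptions, not a confirmed solution: the function name `add_weights` is hypothetical, the grouping keys (itinerary plus one date column) are a guess at what "a particular date and itinerary" means, and the booking that crosses the limit is trimmed to whatever capacity is left, so it may end up below the minimum draw.

```python
import numpy as np
import pandas as pd

CAPACITY = 6000  # kg per itinerary per date, from the constraints above


def add_weights(df, keys, seed=None):
    """Add w_Booked, w_left and isBooked per group (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    # Requested weight for each row: integers from 10 to 2000 inclusive.
    draw = pd.Series(rng.integers(10, 2001, size=len(out)), index=out.index)
    # Weight requested by earlier rows of the same group.
    before = draw.groupby([out[k] for k in keys]).cumsum() - draw
    # Capacity still free when this row is processed, never negative.
    free = (CAPACITY - before).clip(lower=0)
    # Book the request, trimmed to the free capacity (0 once the group is full).
    out["w_Booked"] = np.minimum(draw, free)
    out["w_left"] = free - out["w_Booked"]
    out["isBooked"] = out["w_Booked"] > 0
    return out


demo = pd.DataFrame({"Iti": [1] * 5 + [2] * 2, "F_Date": ["2019-01-05"] * 7})
result = add_weights(demo, keys=["Iti", "F_Date"], seed=0)
```

Because the requested weights are always positive, the running total within a group only grows, so every row after the one that reaches 6000 gets a w_Booked of 0 and isBooked set to False.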
Here is my question:
The end goal is to create a dataframe of 100000 rows with the variables listed above.
Thanks