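I want to create 3 different datasets, each with a column of dates (dd/mm/yyyy). The dates must fall within a 3-month range, for example January 2019 to April 2019. The number of times each date occurs represents the number of searches on that date. Each dataset should contain 2000 entries, and dates may repeat. The three datasets should be built so that one shows an upward trend in counts, one shows a downward trend, and one follows a normal distribution.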
Upward trend over time, i.e. the number of entries increases with time (lower counts at the beginning, rising as time moves forward).
Declining trend over time, i.e. the number of entries decreases with time (higher counts at the beginning, falling as time moves forward).
I was able to generate the normal distribution using the datagenerator plugin from www.genicata.com
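I am now interested in the other two use cases, the upward and downward trends. Can anyone tell me how to do this? For a random distribution, I was also able to get a result using the faker library.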
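from datetime import datetime

from faker import Factory
import random
import numpy as np

faker = Factory.create()

def date_between(d1, d2):
    # Parse strings such as 'jan01-2019' and pick a random datetime between them
    f = '%b%d-%Y'
    return faker.date_time_between_dates(datetime.strptime(d1, f), datetime.strptime(d2, f))

def fakerecord():
    # One fake record: a six-digit ID and a random search date
    return {'ID': faker.numerify('######'),
            'S_date': date_between('jan01-2019', 'apr01-2019')
            }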
Can anyone suggest how to incorporate the trends into the datasets?
Thanks
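Answer 0 (score: 1)
I edited my first answer to make it clearer.
With the function below, you can set the relative probability of a search occurring on the start date and on the end date you choose; the probabilities for the days in between are interpolated linearly.
For example, if starting_prob = 0.1 and ending_prob = 1.0, the probability of seeing a search on the start date is 1/10 of the probability of seeing a search on the end date.
If starting_prob = 1.0 and ending_prob = 0.1, the probability of seeing a search on the end date is 1/10 of the probability of seeing a search on the start date.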
import datetime
import numpy as np

def random_dates(start, end, starting_prob = 0.1, ending_prob = 1.0, num_samples = 2000):
    """
    Generate increasing or decreasing counts of datetimes between `start` and `end`
    Parameters:
        start: string in format '%b%d-%Y' (e.g. 'Sep19-2019')
        end : string in format '%b%d-%Y'. must be after start
        starting_prob: (float) relative probability of seeing a search on the first day
        ending_prob: (float) relative probability of seeing a search on the last day
        num_samples: number of dates in the list
    """
    start_date = datetime.datetime.strptime(start, '%b%d-%Y')
    end_date = datetime.datetime.strptime(end, '%b%d-%Y')
    # Number of days from `start` to `end`, counting both endpoints
    # (the +1 makes `end` itself a possible result)
    num_days = (end_date - start_date).days + 1
    linear_probabilities = np.linspace(starting_prob, ending_prob, num_days)
    # normalize probabilities so they add up to 1
    linear_probabilities /= np.sum(linear_probabilities)
    # draw day offsets (0 = start date) with replacement, weighted by the probabilities
    rand_days = np.random.choice(num_days, size = num_samples, replace = True,
                                 p = linear_probabilities)
    rand_date = [(start_date + datetime.timedelta(days = int(d))).strftime('%b%d-%Y')
                 for d in rand_days]
    # return list of date strings
    return rand_date
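To make the effect of the two relative probabilities concrete, here is a minimal sketch of just the weighting step on a hypothetical 10-day range (the range length and the variable names are assumptions for illustration, not part of the answer's function). It shows that after normalization the first day keeps 1/10 of the last day's probability.

```python
import numpy as np

# Hypothetical 10-day range, purely for illustration
num_days = 10

# Same idea as in the function: linear ramp from starting_prob to ending_prob
weights = np.linspace(0.1, 1.0, num_days)

# Normalize so the per-day probabilities sum to 1
probs = weights / weights.sum()

# Probability of the first day relative to the last day
ratio = probs[0] / probs[-1]

print(np.round(probs, 3))
print("first/last ratio:", round(ratio, 3))
```

Normalizing rescales every weight by the same factor, so only the ratio between starting_prob and ending_prob matters, not their absolute values.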
You can use the function to generate different sets of dates (each set contains 20000 samples here):
rdates_decreasing = random_dates("Jan01-2019", "Apr30-2019",
                                 starting_prob = 1.0, ending_prob = 0.1,
                                 num_samples = 20000)
rdates_increasing = random_dates("Jan01-2019", "Apr30-2019",
                                 starting_prob = 0.1, ending_prob = 1.0,
                                 num_samples = 20000)
rdates_random = random_dates("Jan01-2019", "Apr30-2019",
                             starting_prob = 1.0, ending_prob = 1.0,
                             num_samples = 20000)
You can use pandas to save the result as a csv file. Each column will contain one list of dates.
import pandas as pd

# Raw string so that the backslashes in the Windows path are not read as escapes such as \t
pd.DataFrame({'dates_decreasing': rdates_decreasing,
              'dates_increasing': rdates_increasing,
              'dates_random': rdates_random,
              }).to_csv(r"path\to\datefile.csv", index = False)
You can convert the dates into counts in a dataframe like this:
from collections import Counter
import matplotlib.pyplot as plt

# create dataframe with counts
# (all columns must have the same length, so this assumes every day in the
# range occurs at least once in each of the three lists)
df1 = pd.DataFrame({"dates_decreasing": list(Counter(rdates_decreasing).keys()),
                    "counts_decreasing": list(Counter(rdates_decreasing).values()),
                    "dates_increasing": list(Counter(rdates_increasing).keys()),
                    "counts_increasing": list(Counter(rdates_increasing).values()),
                    "dates_random": list(Counter(rdates_random).keys()),
                    "counts_random": list(Counter(rdates_random).values()),
                    })

# convert to datetime, using the same format the date strings were written in
df1['dates_decreasing'] = pd.to_datetime(df1['dates_decreasing'], format = '%b%d-%Y')
df1['dates_increasing'] = pd.to_datetime(df1['dates_increasing'], format = '%b%d-%Y')
df1['dates_random'] = pd.to_datetime(df1['dates_random'], format = '%b%d-%Y')

# plot
fig, ax = plt.subplots()
ax.plot(df1.dates_decreasing, df1.counts_decreasing, "o", label = "decreasing")
ax.plot(df1.dates_increasing, df1.counts_increasing, "o", label = "increasing")
ax.plot(df1.dates_random, df1.counts_random, "o", label = "random")
ax.set_ylabel("count")
ax.legend()
fig.autofmt_xdate()
plt.show()
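As an alternative way to turn a list of date strings into per-date counts, here is a small sketch using pandas `value_counts`. The `sample_dates` list is made up for illustration and only mimics the '%b%d-%Y' strings produced above. Counting each list separately like this does not require the lists to cover the same set of days.

```python
import pandas as pd

# Hypothetical sample in the same '%b%d-%Y' format as the generated dates
sample_dates = ["Jan03-2019", "Jan01-2019", "Jan03-2019", "Jan02-2019", "Jan03-2019"]

# Parse the strings with an explicit format, then count occurrences per date
parsed = pd.to_datetime(pd.Series(sample_dates), format="%b%d-%Y")

# sort_index() orders the result chronologically instead of by count
counts = parsed.value_counts().sort_index()

print(counts)
```

The resulting Series is indexed by date, so it can be passed to a plot directly or written out with `to_csv`.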
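Answer 1 (score: 0)
You can do it as follows.
The trend function defines your trend: if the start weight is higher than the end weight you get a downward trend, and vice versa. You can also control how steep the trend is by changing the difference between the start and end weights.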
import numpy as np
import pandas as pd

# one candidate date per day in the range
dates = pd.date_range("2019-1-1", "2019-4-1", freq="D")

def trend(count, start_weight=1, end_weight=3):
    # linearly spaced weights, normalized so they sum to 1
    lin_sp = np.linspace(start_weight, end_weight, count)
    return lin_sp/sum(lin_sp)

# sample dates with replacement, weighted by the trend
date_trends = np.random.choice(dates, size=20000, p=trend(len(dates)))
print("Total dates", len(date_trends))
print("counts of each dates")
print(np.unique(date_trends, return_counts=True)[1])
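The same weighting idea can be sketched for the other two shapes mentioned in the question: a downward trend (start weight above end weight) and a bell-shaped distribution. The helper names `linear_weights` and `bell_weights`, the Gaussian width `rel_std`, and the fixed seed are all assumptions made for this illustration and are not part of either answer.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2019-01-01", "2019-04-01", freq="D")
n = len(dates)
rng = np.random.default_rng(0)  # fixed seed only so the sketch is repeatable

def linear_weights(count, start_weight, end_weight):
    # Linear ramp between the two weights, normalized to sum to 1
    w = np.linspace(start_weight, end_weight, count)
    return w / w.sum()

def bell_weights(count, rel_std=1/6):
    # Assumption: Gaussian bump centred on the middle day,
    # with standard deviation rel_std * count days
    x = np.arange(count)
    w = np.exp(-0.5 * ((x - (count - 1) / 2) / (rel_std * count)) ** 2)
    return w / w.sum()

# Draw day indices with replacement, weighted by each shape
down_idx = rng.choice(n, size=2000, p=linear_weights(n, 3, 1))
bell_idx = rng.choice(n, size=2000, p=bell_weights(n))

# Map indices back to dates, formatted as dd/mm/yyyy
down_dates = dates[down_idx].strftime("%d/%m/%Y")
bell_dates = dates[bell_idx].strftime("%d/%m/%Y")

# Per-day counts, in calendar order (days never drawn get a count of 0)
down_counts = np.bincount(down_idx, minlength=n)
bell_counts = np.bincount(bell_idx, minlength=n)

print(down_counts)
print(bell_counts)
```

Sampling indices rather than the dates themselves keeps the counting step simple, since `np.bincount` then gives one count per calendar day including days that were never drawn.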