Question

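I want to create 3 different datasets, each with a column of dates (dd/mm/yyyy). The dates must fall within a 3-month range, e.g. January 2019 to April 2019. The count of each date represents the number of searches. Each dataset should contain 2000 entries, and dates may repeat. The 3 datasets should be created so that one shows an upward trend in counts, one shows a downward trend, and one is normally distributed.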

Upward trend over time, i.e. increasing entries with time (lower counts at the beginning, increasing going forward).
Declining trend over time, i.e. decreasing entries with time (higher counts at the beginning, decreasing going forward).

I was able to generate the normal distribution using the datagenerator plugin from www.genicata.com.

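I am now interested in the other two use cases, the upward trend and the downward trend. Can anyone tell me how to do this? For a random distribution, I was also able to do it with the faker library: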

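from datetime import datetime  # needed for datetime.strptime below

from faker import Factory
import random
import numpy as np

faker = Factory.create()

def date_between(d1, d2):
    # Dates are given as strings such as 'jan01-2019'
    f = '%b%d-%Y'
    return faker.date_time_between_dates(datetime.strptime(d1, f), datetime.strptime(d2, f))

def fakerecord():
    # One fake record: a 6-digit ID and a uniformly random date in the range
    return {'ID': faker.numerify('######'),
            'S_date': date_between('jan01-2019', 'apr01-2019')
            }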

Can anyone suggest how I can incorporate the trends into the dataset?

Thanks

Answer 1

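I edited my first answer to make it clearer.

With the function below, you can set the relative probability of a search occurring on your chosen start date and end date.

For example, if starting_prob = 0.1 and ending_prob = 1.0, a search on the start date is 1/10 as likely as a search on the end date.

If starting_prob = 1.0 and ending_prob = 0.1, a search on the end date is 1/10 as likely as a search on the start date.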
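To make the 1/10 ratio concrete, here is a minimal sketch of the weighting on its own. The 5-day window is a made-up value chosen only to keep the output short; it is not part of the answer's function.

```python
import numpy as np

# Hypothetical 5-day window, purely for illustration
num_days = 5

# starting_prob = 0.1, ending_prob = 1.0: weights rise linearly day by day
weights = np.linspace(0.1, 1.0, num_days)

# Normalize so the daily probabilities sum to 1
probs = weights / weights.sum()

# Normalizing keeps the ratio: the first day is 1/10 as likely as the last
ratio = probs[0] / probs[-1]
print(probs)
print(ratio)
```

Swapping the two endpoints of `np.linspace` gives the mirrored, declining case.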

import datetime
import numpy  as np


def random_dates(start, end, starting_prob = 0.1, ending_prob = 1.0, num_samples = 2000):
    """
    Generate increasing or decreasing counts of datetimes between `start` and `end`

    Parameters:
    start: string in format'%b%d-%Y' (i.e. 'Sep19-2019')
    end : string in format'%b%d-%Y'. must be after start
    starting_prob: (float) relative probability of seeing a search on the first day
    ending_prob: (float) relative probability of seeing a search on the last day
    num_samples: number of dates in the list
    """
    start_date = datetime.datetime.strptime(start, '%b%d-%Y')
    end_date = datetime.datetime.strptime(end, '%b%d-%Y')

    # Number of days from `start` to `end`, counting both endpoints,
    # so that `ending_prob` really applies to the end date
    num_days = (end_date - start_date).days + 1

    linear_probabilities = np.linspace(starting_prob, ending_prob, num_days)

    # normalize probabilities so they add up to 1
    linear_probabilities /= np.sum(linear_probabilities)

    rand_days = np.random.choice(num_days, size = num_samples, replace = True,
             p = linear_probabilities)

    rand_date =  [(start_date + datetime.timedelta(int(rand_days[ii]))).strftime('%b%d-%Y') 
                  for ii in range(num_samples)]

    # return list of date strings
    return rand_date

You can use the function to generate different sets of dates (20000 samples each):

rdates_decreasing = random_dates("Jan01-2019", "Apr30-2019",
                      starting_prob = 1.0, ending_prob = 0.1, 
                      num_samples = 20000)

rdates_increasing = random_dates("Jan01-2019", "Apr30-2019",
                      starting_prob = 0.1, ending_prob = 1.0, 
                      num_samples = 20000)

rdates_random = random_dates("Jan01-2019", "Apr30-2019",
                      starting_prob = 1.0, ending_prob = 1.0, 
                      num_samples = 20000)

You can use pandas to save a csv file. Each column will contain a list of dates.

import pandas as pd

pd.DataFrame({'dates_decreasing': rdates_decreasing, 
              'dates_increasing': rdates_increasing, 
              'dates_random': rdates_random, 
             }).to_csv(r"path\to\datefile.csv", index = False)  # raw string so "\t" is not read as a tab

You can convert the dates into counts in a dataframe as follows:

from collections import Counter
import matplotlib.pyplot as plt

# create dataframe with counts
df1 = pd.DataFrame({"dates_decreasing": list(Counter(rdates_decreasing).keys()), 
                      "counts_decreasing": list(Counter(rdates_decreasing).values()),
                    "dates_increasing": list(Counter(rdates_increasing).keys()), 
                      "counts_increasing": list(Counter(rdates_increasing).values()),
                    "dates_random": list(Counter(rdates_random).keys()), 
                      "counts_random": list(Counter(rdates_random).values()),
                   }) 

# convert to datetime 
df1['dates_decreasing']= pd.to_datetime(df1['dates_decreasing'])
df1['dates_increasing']= pd.to_datetime(df1['dates_increasing'])
df1['dates_random']= pd.to_datetime(df1['dates_random'])


# plot
fig, ax = plt.subplots()
ax.plot(df1.dates_decreasing, df1.counts_decreasing, "o", label = "decreasing")
ax.plot(df1.dates_increasing, df1.counts_increasing, "o", label = "increasing")
ax.plot(df1.dates_random, df1.counts_random, "o", label = "random")
ax.set_ylabel("count")
ax.legend()
fig.autofmt_xdate()
plt.show()
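The question also asks for a dataset whose counts follow a normal distribution, which the linear weights above cannot produce. One possible approach, sketched here under assumptions, is to replace the linear weights with bell-shaped (Gaussian) weights centered on the middle of the date range. The helper name `bell_shaped_dates`, the `spread` parameter, and the seed are all hypothetical and chosen for illustration. Note that this sketch writes dates as dd/mm/yyyy, as requested in the question, rather than the '%b%d-%Y' format used above.

```python
import datetime
import numpy as np

def bell_shaped_dates(start, end, num_samples=2000, spread=0.15, seed=None):
    """Sample dates whose daily counts peak mid-range (hypothetical helper)."""
    start_date = datetime.datetime.strptime(start, '%b%d-%Y')
    end_date = datetime.datetime.strptime(end, '%b%d-%Y')
    num_days = (end_date - start_date).days + 1  # include the end date

    day_idx = np.arange(num_days)
    center = (num_days - 1) / 2.0
    sigma = spread * num_days  # assumed width of the bell; tune to taste

    # Gaussian weights over the day indices, normalized to sum to 1
    weights = np.exp(-0.5 * ((day_idx - center) / sigma) ** 2)
    probs = weights / weights.sum()

    rng = np.random.default_rng(seed)
    offsets = rng.choice(num_days, size=num_samples, replace=True, p=probs)

    # Return date strings in dd/mm/yyyy format
    return [(start_date + datetime.timedelta(days=int(d))).strftime('%d/%m/%Y')
            for d in offsets]

rdates_normal = bell_shaped_dates("Jan01-2019", "Apr01-2019",
                                  num_samples=2000, seed=0)
print(rdates_normal[:5])
```

Because the bell is truncated at the start and end dates, the result is only approximately normal; a smaller `spread` concentrates more of the searches near the middle of the range.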

Answer 2

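You can do it as follows.

The trend function defines your trend: if the start weight is higher than the end weight, the trend is declining, and vice versa. You can also control the steepness of the trend by changing the difference between the start and end weights.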

import numpy as np
import pandas as pd

dates = pd.date_range("2019-1-1", "2019-4-1", freq="D")

def trend(count, start_weight=1, end_weight=3):
    lin_sp = np.linspace(start_weight, end_weight, count)
    return lin_sp/sum(lin_sp)

date_trends = np.random.choice(dates,size=20000, p=trend(len(dates)))

print("Total dates", len(date_trends))

print("counts of each dates")
print(np.unique(date_trends, return_counts=True)[1])
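As a follow-up sketch, the same idea can be turned into the dataset described in the question: 2000 rows, dates formatted as dd/mm/yyyy, with a declining trend. The column name `S_date` is borrowed from the question; the seed and the weights 3 and 1 are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2019-01-01", "2019-04-01", freq="D")

def trend(count, start_weight=1, end_weight=3):
    # Linearly spaced weights, normalized into probabilities
    lin_sp = np.linspace(start_weight, end_weight, count)
    return lin_sp / lin_sp.sum()

rng = np.random.default_rng(42)  # assumed seed, only for reproducibility

# start_weight > end_weight gives a declining trend
idx = rng.choice(len(dates), size=2000,
                 p=trend(len(dates), start_weight=3, end_weight=1))
sampled = dates[idx]

# One row per search, with the date formatted as dd/mm/yyyy
df = pd.DataFrame({"S_date": sampled.strftime("%d/%m/%Y")})

# Searches per day in chronological order; days never sampled get a count of 0
counts = pd.Series(sampled).value_counts().reindex(dates, fill_value=0)

print(df.head())
print(counts.head())
```

Reindexing the counts against the full date range keeps days with zero searches in the result, which avoids gaps if the counts are plotted or exported later.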
