Expanding a dataset based on a column

Asked: 2019-10-23 05:29:48

Tags: python r pandas numpy

I have a dataframe:

I_Code  Date_1  Date_2
2   14/09/2019  16/08/2019
2   14/09/2019  17/08/2019
2   14/09/2019  19/08/2019
2   14/09/2019  20/08/2019
2   14/09/2019  21/08/2019
2   14/09/2019  21/08/2019
2   14/09/2019  21/08/2019
2   14/09/2019  22/08/2019
2   14/09/2019  23/08/2019
2   14/09/2019  23/08/2019
2   14/09/2019  24/08/2019
2   14/09/2019  27/08/2019
2   14/09/2019  28/08/2019
2   14/09/2019  28/08/2019
2   14/09/2019  29/08/2019
2   14/09/2019  04/09/2019
2   14/09/2019  04/09/2019
2   14/09/2019  04/09/2019
2   14/09/2019  05/09/2019
2   14/09/2019  08/09/2019
2   14/09/2019  10/09/2019
2   14/09/2019  10/09/2019
2   14/09/2019  12/09/2019
2   03/09/2019  04/08/2019
2   03/09/2019  05/08/2019
2   03/09/2019  06/08/2019
2   03/09/2019  07/08/2019
2   03/09/2019  07/08/2019
2   03/09/2019  08/08/2019
2   03/09/2019  08/08/2019
2   03/09/2019  09/08/2019
2   03/09/2019  13/08/2019
2   03/09/2019  13/08/2019

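I currently have 800 entries in the dataframe. I want to expand this dataset to 20k entries, subject to a constraint on Date_2: the number of entries per Date_2 value (aggregated by month) should follow a logarithmic growth trend, i.e. rise at first and then level off. (Image attached: Comparator)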

Note that the plot is only an example.
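To make the constraint concrete, here is a minimal sketch of how the monthly totals of Date_2 could be computed and inspected. The helper name `monthly_counts` is hypothetical, and the default date format assumes the `dd/mm/yyyy` strings shown in the dataframe above.

```python
import pandas as pd


def monthly_counts(dates, date_format="%d/%m/%Y"):
    """Count entries per calendar month for a sequence of date strings."""
    # Parse the strings, then bucket each date into its calendar month.
    parsed = pd.to_datetime(pd.Series(list(dates)), format=date_format)
    return parsed.dt.to_period("M").value_counts().sort_index()
```

Calling `monthly_counts(df["Date_2"])` on the expanded data and plotting the result would show whether the counts rise and then flatten.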

Previously, I was able to get the graph using the following function:

    import datetime

    import numpy as np
    from scipy.stats import expon


    def random_dates(start, end, starting_prob = 0.1, ending_prob = 1.0, date_format = '%d-%m-%y', num_samples = 20000):
        start_date = datetime.datetime.strptime(start, date_format)
        end_date = datetime.datetime.strptime(end, date_format)

        # Get days between `start` and `end` (the end date itself is never drawn)
        num_days = (end_date - start_date).days

        # Per-day weights from the exponential CDF: they rise quickly, then flatten
        day_probabilities = expon.cdf(np.linspace(starting_prob, ending_prob, num_days), scale = 0.3)

        # normalize probabilities so they add up to 1
        day_probabilities /= np.sum(day_probabilities)

        # Draw day offsets from `start_date` according to those weights
        rand_days = np.random.choice(num_days, size = num_samples, replace = True,
                 p = day_probabilities)

        rand_date = [(start_date + datetime.timedelta(int(day))).strftime(date_format)
                     for day in rand_days]

        # return list of date strings
        return rand_date

    start_date = '02-08-19'
    end_date = '29-09-19'
    date_format = '%d-%m-%y'
    sample_count = 20000

    date_2 = random_dates(start_date, end_date, starting_prob = 0.1, ending_prob = 1.0, date_format=date_format, num_samples=sample_count)

But now the other variables (i.e. Date_1 and I_Code) are also tied to Date_2. They have no such constraint themselves.
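One possible direction, offered as a minimal sketch rather than a confirmed solution: instead of generating Date_2 on its own, resample whole rows with replacement, weighting each row by where its Date_2 falls on the same `expon.cdf` curve. That way I_Code and Date_1 stay attached to their Date_2. The function name `expand_rows` and its parameters are hypothetical, the date format is assumed to be `dd/mm/yyyy` as in the table above, and only Date_2 values already present in the data can appear in the result.

```python
import numpy as np
import pandas as pd
from scipy.stats import expon


def expand_rows(df, n_samples=20000, starting_prob=0.1, ending_prob=1.0,
                scale=0.3, seed=0):
    """Resample whole rows so that Date_2 follows a saturating trend."""
    d2 = pd.to_datetime(df["Date_2"], format="%d/%m/%Y")

    # Map each row's Date_2 to a position in [starting_prob, ending_prob],
    # mirroring the np.linspace grid used for the daily weights.
    span_days = max((d2.max() - d2.min()).days, 1)
    frac = (d2 - d2.min()).dt.days.to_numpy() / span_days
    x = starting_prob + frac * (ending_prob - starting_prob)

    # Target weight of each date: rises quickly, then flattens.
    date_weight = expon.cdf(x, scale=scale)

    # Split a date's weight evenly across the rows that share that date,
    # so dates with many existing rows are not over-represented.
    rows_per_date = d2.map(d2.value_counts()).to_numpy()
    row_weight = pd.Series(date_weight / rows_per_date, index=df.index)

    # pandas normalizes the weights; whole rows are drawn with replacement.
    return df.sample(n=n_samples, replace=True, weights=row_weight,
                     random_state=seed).reset_index(drop=True)
```

With the dataframe above loaded as `df`, `expanded = expand_rows(df, n_samples=20000)` would return 20k rows, each one a copy of an original row.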

Can anyone help?

Thanks

0 answers:

No answers