Question

Consider data in the following format:

20180101,10
20180102,20
20180103,15
....

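The first field is the date and the second is the number of products sold. Rather than inserting all of this into a database and running a SELECT MAX ... SQL statement to find the maximum quantity within a period, is there any shorthand or useful library that can do this? Thanks.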

Answer 1

Pandas is the library you want.

Let me show you an example:

import numpy as np
import pandas as pd

# let's build a dummy dataset
index = pd.date_range(start="2015-01-01", end="2018-12-31")  # ISO dates avoid day/month ambiguity
df = pd.DataFrame(np.random.randint(100, size=len(index)),
                  columns=["sales"], index=index)

>>> df.head()
            sales
2015-01-01     32
2015-01-02      0
2015-01-03     12
2015-01-04     77
2015-01-05     86

Now, suppose you want to total the sales by month:

>>> df["sales"].groupby(pd.Grouper(freq="1M")).sum()

2015-01-31    1441
2015-02-28    1164
2015-03-31    1624
2015-04-30    1629
2015-05-31    1427
[...]

Or by semester (six-month periods):

>>> df["sales"].groupby(pd.Grouper(freq="6M", closed="left", label="right")).sum()
2015-06-30    8921
2015-12-31    9365
2016-06-30    9820
2016-12-31    8881
2017-06-30    8773
2017-12-31    8709
2018-06-30    9481
2018-12-31    9522
2019-06-30      51

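For some reason, Grouper with a six-month frequency has trouble binning the sales on 31/12 and places them in a new bin in 2019. I am looking into it and will post an update if I find anything; if anyone else knows why, please comment.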
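A likely explanation for the stray 2019-06-30 row, consistent with the output above: closed="left" makes each bin exclude its right edge, so a date that falls exactly on a bin label (such as 2018-12-31) lands in the following bin. The sketch below sidesteps bin edges entirely by grouping on explicit calendar keys. The key names `year` and `half` and the seeded random data are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Dummy daily sales, seeded so the run is repeatable.
index = pd.date_range(start="2015-01-01", end="2018-12-31")
rng = np.random.default_rng(0)
df = pd.DataFrame({"sales": rng.integers(0, 100, size=len(index))}, index=index)

# Label every row with its calendar year and half (1 = Jan-Jun, 2 = Jul-Dec).
year = df.index.year.rename("year")
half = ((df.index.month - 1) // 6 + 1).rename("half")

# One total per (year, half) pair; no date can spill into a neighbouring bin.
half_year_totals = df["sales"].groupby([year, half]).sum()
print(half_year_totals)
```

Four years of data yield exactly eight groups, and 31 December stays in the second half of its own year.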

Or perhaps you want to know which semester was the best:

>>> df["sales"].groupby(pd.Grouper(freq="6M")).sum().idxmax()
Timestamp('2016-06-30 00:00:00', freq='6M')

Answer 2

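This may be a biased answer, but pandas is very well suited to this kind of data. You can do this sort of thing with tuples, lists and so on, but pandas offers much more functionality. For example: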

import pandas as pd
data = [[20180101,15], [20180102,10], [20180103,12],[20180104,10]]
df = pd.DataFrame(data=data, columns=['date', 'products'])
# if your data is in csv, excel, database... whatever... you can easily pull
# df = pd.read_csv('name') || pd.read_excel() || pd.read_sql()
df
Out[2]: 
       date  products
0  20180101        15
1  20180102        10
2  20180103        12
3  20180104        10

# It helps to use datetime format to perform operations on the data
# Operations make reference to an "index" in the dataframe
df.index = pd.to_datetime(df['date'], format="%Y%m%d")  #strftime format
df
Out[3]: 
                date  products
date                          
2018-01-01  20180101        15
2018-01-02  20180102        10
2018-01-03  20180103        12
2018-01-04  20180104        10

# Now we can drop that date column...
df.drop(columns='date', inplace=True)
df
Out[4]: 
            products
date                
2018-01-01        15
2018-01-02        10
2018-01-03        12
2018-01-04        10

# Yes, there are shorthand ways to do the above... there is lots of info on pandas on SO
# I want you to see the individual steps we are taking, to keep it simple

# Now is when the fun begins
df.rolling(2).sum()  # prints a rolling 2-day sum
Out[5]: 
            products
date                
2018-01-01       NaN
2018-01-02      25.0
2018-01-03      22.0
2018-01-04      22.0

df.rolling(3).mean()  # prints a rolling 3-day average
Out[6]: 
             products
date                 
2018-01-01        NaN
2018-01-02        NaN
2018-01-03  12.333333
2018-01-04  10.666667

df.resample('W').sum()  # Resamples the data so you can look on a weekly basis
Out[7]: 
            products
date                
2018-01-07        47

df.rolling(2).max() # max number of products over a rolling two-day period
Out[9]: 
            products
date                
2018-01-01       NaN
2018-01-02      15.0
2018-01-03      12.0
2018-01-04      12.0
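To answer the original question directly (the peak and lowest value within a chosen period), a minimal sketch along the same lines is shown here. It uses the three sample rows from the question; the period boundaries and the variable names are assumptions chosen for illustration.

```python
import pandas as pd

# Sample rows from the question: date as YYYYMMDD, then units sold.
data = [[20180101, 10], [20180102, 20], [20180103, 15]]
df = pd.DataFrame(data, columns=["date", "products"])
df["date"] = pd.to_datetime(df["date"], format="%Y%m%d")
df = df.set_index("date")

# Label-based slicing on a DatetimeIndex includes both endpoints.
period = df.loc["2018-01-02":"2018-01-03", "products"]

peak_value = period.max()    # largest quantity sold in the period
peak_day = period.idxmax()   # date on which the peak occurred
low_value = period.min()     # smallest quantity sold in the period
low_day = period.idxmin()    # date on which the low occurred

print(peak_day.date(), peak_value)
print(low_day.date(), low_value)
```

This plays the role of the SELECT MAX query: the slice is the WHERE clause, and max/min (or idxmax/idxmin) are the aggregates.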

Answer 3

You should use pandas.

Assuming your date column is called 'date' and has a datetime dtype:

import pandas as pd
# `data` stands for your own records, including a datetime 'date' column
df = pd.DataFrame(data)
df = df.set_index('date')
df.groupby(pd.Grouper(freq='1M')).max()

This gives you the maximum for each month. The frequency can be changed to whatever you like.
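The snippet above leaves `data` undefined, so here is one hedged way to build it from text shaped like the question's sample. The column names `date` and `sold` are assumptions. The sketch uses the month-start alias "MS", which labels each month by its first day; note that recent pandas releases spell the month-end alias "ME" rather than "M".

```python
import io

import pandas as pd

# Sample rows from the question; in practice this would be a file path.
raw = "20180101,10\n20180102,20\n20180103,15\n"

df = pd.read_csv(io.StringIO(raw), header=None, names=["date", "sold"])
df["date"] = pd.to_datetime(df["date"], format="%Y%m%d")
df = df.set_index("date")

# One row per calendar month, with both the peak and the lowest value.
monthly = df["sold"].resample("MS").agg(["max", "min"])
print(monthly)
```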

Answer 4

I tried the approach from @Patrick Artner's comment:

a = (20180101,10)
b = (20180102,20)
c = (20180103,15)
d = (a,b,c)
maximum = max( d, key = lambda x:x[1])
minimum = min(d, key= lambda x:x[1])
print(maximum)  # (20180102, 20)
print(minimum)  # (20180101, 10)

Maybe this gives you some ideas.
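The same idea extends to the raw text lines from the question without any third-party library. This is a minimal sketch; the `lines` list stands in for whatever source the data actually comes from, such as an open file.

```python
# Sample lines in the question's "date,quantity" format.
lines = ["20180101,10", "20180102,20", "20180103,15"]

records = []
for line in lines:
    date_text, sold_text = line.split(",")
    records.append((date_text, int(sold_text)))  # keep the date, convert the count

# key= tells max/min to compare on the quantity, not on the date.
peak = max(records, key=lambda record: record[1])
low = min(records, key=lambda record: record[1])

print(peak)  # ('20180102', 20)
print(low)   # ('20180101', 10)
```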

Answer 5

If this is the desired result:

data = [{'date':1, 'products_sold': 2}, {'date':2, 'products_sold': 5},{'date':5, 'products_sold': 2}]
start_date = 1
end_date = 2
max_value_in_period = max(x['products_sold'] for x in data if start_date <= x['date'] <= end_date)
print(max_value_in_period)
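With the question's YYYYMMDD dates the same filter works unchanged, because such values sort chronologically when compared as integers. The sketch below wraps it in a hypothetical helper, `max_in_period` (a name invented here), and passes `default=None` so that an empty period does not raise ValueError.

```python
def max_in_period(rows, start, end):
    """Return the largest quantity with start <= date <= end, or None if no row matches."""
    in_period = [sold for day, sold in rows if start <= day <= end]
    return max(in_period, default=None)


# Sample rows from the question as (YYYYMMDD, quantity) pairs.
rows = [(20180101, 10), (20180102, 20), (20180103, 15)]

print(max_in_period(rows, 20180101, 20180102))  # 20
print(max_in_period(rows, 20190101, 20190131))  # None
```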

Is there an easy way to get the peak and lowest values in Python?
