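I have a DataFrame in Python with a column that holds the difference between two dates. I want to create a new column, or overwrite the existing one, that converts the number into a categorical variable according to the following rules: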
difference 0 days Level 0
difference 2 days Level 1
difference 2-6 days Level 2
difference 6-15 days Level 3
difference 15-69 days Level 4
difference NaT Level 5
How can I achieve this?
Answer 0 (score: 1)
Say the column is named 'difference'.
You can define a function that maps each difference to its level and apply it to that column.
Reference links: https://github.com/vi3k6i5/pandas_basics/blob/master/2_b_apply_a_function_row_wise.ipynb
https://github.com/vi3k6i5/pandas_basics/blob/master/2_c_apply_a_function_to_a_column.ipynb
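A minimal sketch of that approach follows. The function name `to_level`, the column name `difference`, and the sample values are assumptions made for illustration and are not taken from the answer. The question does not say what happens above 69 days, so this sketch simply treats everything above 15 days as Level 4.

```python
import pandas as pd

def to_level(days):
    """Map a day difference to a level label (hypothetical helper)."""
    if pd.isna(days):
        return 'Level 5'      # missing difference (NaT/NaN)
    if days <= 0:
        return 'Level 0'
    if days <= 2:
        return 'Level 1'
    if days <= 6:
        return 'Level 2'
    if days <= 15:
        return 'Level 3'
    return 'Level 4'          # assumption: everything above 15 days

# Illustrative data; 'difference' holds the gap in whole days.
df = pd.DataFrame({'difference': [0, 2, 5, 10, 40, None]})
df['Level'] = df['difference'].apply(to_level)
print(df)
```

A plain function like this is easy to read and adjust, though it is likely slower than the vectorized options on large frames.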
Answer 1 (score: 1)
Use pd.cut:
import numpy as np
data['Severity'] = pd.cut((data['End'] - data['Start']).dt.days,[-np.inf,-1,0,2,6,15,69],labels=['Level 5', 'Level 0','Level 1','Level 2','Level 3','Level 4']).fillna('Level 5')
Example:
df.head(10)
Start End
0 2017-01-01 2017-01-11
1 2017-01-02 2017-03-09
2 2017-01-03 2017-03-16
3 2017-01-04 2017-01-10
4 2017-01-05 2017-01-25
5 2017-01-06 2017-01-25
6 2017-01-07 2017-01-26
7 2017-01-08 2017-01-15
8 2017-01-09 2017-02-16
9 2017-01-10 2017-01-24
df['Severity'] = pd.cut((df['End'] - df['Start']).dt.days,[-np.inf,0,2,6,15,69,np.inf],labels=['Level 0','Level 1','Level 2','Level 3','Level 4','Level 5'])
Output:
End Start Severity
0 2017-01-11 2017-01-01 Level 3
1 2017-03-09 2017-01-02 Level 4
2 2017-03-16 2017-01-03 Level 5
3 2017-01-10 2017-01-04 Level 2
4 2017-01-25 2017-01-05 Level 4
5 2017-01-25 2017-01-06 Level 4
6 2017-01-26 2017-01-07 Level 4
7 2017-01-15 2017-01-08 Level 3
8 2017-02-16 2017-01-09 Level 4
9 2017-01-24 2017-01-10 Level 3
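The sample data above contains no missing dates. As a small sketch of what happens when one is missing (the three rows below are made up for illustration): `.dt.days` yields NaN for a NaT difference, `pd.cut` leaves that row unbinned, and `fillna` can then assign the level the question asks for.

```python
import numpy as np
import pandas as pd

# Made-up rows; the last End date is missing.
df = pd.DataFrame({
    'Start': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-03']),
    'End': pd.to_datetime(['2017-01-01', '2017-01-12', None]),
})

days = (df['End'] - df['Start']).dt.days   # NaN where either date is NaT
df['Severity'] = pd.cut(
    days,
    [-np.inf, 0, 2, 6, 15, 69, np.inf],
    labels=['Level 0', 'Level 1', 'Level 2', 'Level 3', 'Level 4', 'Level 5'],
).fillna('Level 5')                        # NaT difference -> 'Level 5'
print(df)
```

Filling works here because 'Level 5' is already one of the categories produced by `pd.cut`.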
Answer 2 (score: 1)
You can use np.searchsorted to find where each time difference falls in an array of breakpoints. I assign Level 6 to NaT differences.
td = pd.to_timedelta(['0 days', '2 days', '6 days', '15 days', '69 days'])
difs = df.End.values - df.Start.values
vals = np.searchsorted(td.values, difs)
vals[pd.isnull(difs)] = 6
df = df.assign(
    # np.char.add is the public name for np.core.defchararray.add
    Level=np.char.add(
        'Level ', vals.astype(str)
    )
)
df
Start End Level
0 2017-01-01 2017-01-11 Level 3
1 2017-01-02 2017-03-09 Level 4
2 2017-01-03 2017-03-16 Level 5
3 2017-01-04 2017-01-10 Level 2
4 2017-01-05 2017-01-05 Level 0
5 2017-01-06 2017-01-08 Level 1
6 2017-01-07 2017-01-26 Level 4
7 2017-01-08 2017-01-15 Level 3
8 2017-01-09 2017-02-16 Level 4
9 2017-01-10 2017-01-24 Level 3
Setup
import numpy as np
import pandas as pd
from io import StringIO
txt = """ Start End
0 2017-01-01 2017-01-11
1 2017-01-02 2017-03-09
2 2017-01-03 2017-03-16
3 2017-01-04 2017-01-10
4 2017-01-05 2017-01-05
5 2017-01-06 2017-01-08
6 2017-01-07 2017-01-26
7 2017-01-08 2017-01-15
8 2017-01-09 2017-02-16
9 2017-01-10 2017-01-24"""
df = pd.read_csv(StringIO(txt), sep=r'\s+').apply(pd.to_datetime)
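To see why the breakpoints line up with the levels, here is a small sketch of `np.searchsorted` on plain integers instead of timedeltas (the sample values are made up). With the default `side='left'`, a value equal to a breakpoint receives that breakpoint's index, so each bin is closed on the right.

```python
import numpy as np

breaks = np.array([0, 2, 6, 15, 69])          # same breakpoints, in days
days = np.array([0, 1, 2, 3, 6, 10, 69, 72])  # illustrative differences

# For each value, the index of the first breakpoint that is >= the value.
levels = np.searchsorted(breaks, days)
print(levels.tolist())
```

Values above the last breakpoint get index 5, which is how differences longer than 69 days end up as Level 5 in this answer.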
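Answer 3 (score: 0)
I added a bar chart to analyze the distribution. I also used the DataFrame and a lambda function to get the difference in days. The visuals help you understand the data: the histogram gives insight into how the categories are distributed, while the distribution plot shows how the day intervals are spread.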
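from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
column1=['2017-01-01','2017-01-01','2017-01-02','2017-01-03','2017-01-04','2017-01-05','2017-01-06','2017-01-07','2017-01-08','2017-01-09','2017-01-10']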
column2=['2017-01-01','2017-01-11','2017-03-09','2017-03-16','2017-01-10','2017-01-25','2017-01-25','2017-01-26','2017-01-15','2017-02-16','2017-01-24' ]
index=range(0,len(column1))
data={'column1':column1,'column2':column2}
df=pd.DataFrame(data, columns=['column1','column2'],index=index)
print(df.head())
differences=df.apply(lambda x: datetime.strptime(x['column2'],'%Y-%m-%d')- datetime.strptime(x['column1'],'%Y-%m-%d'),axis=1)
differences=differences.dt.days.astype('int')
years_bins=[-1,0,2,6,15,69,np.inf]
output_labels=['level 0','level 1','level 2','level 3','level 4','level 5']
out=pd.cut(differences,bins=years_bins,labels=output_labels)
df['differences']=differences
df['classification']=out
print(df.head())
fig, ax = plt.subplots(figsize=(6,4))
out.value_counts(sort=False).plot.barh(ax=ax, rot=0, color="b")
ax.set_yticklabels(output_labels)
plt.show()
plt.hist(df['classification'], bins=6)
plt.show()
sns.histplot(df['differences'], kde=True)
plt.show()
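The row-wise lambda with `datetime.strptime` can also be written without `apply`. A sketch, assuming the same string columns `column1` and `column2` (only the first three rows are repeated here):

```python
import pandas as pd

df = pd.DataFrame({
    'column1': ['2017-01-01', '2017-01-01', '2017-01-02'],
    'column2': ['2017-01-01', '2017-01-11', '2017-03-09'],
})

# Parse both columns at once and subtract; the result is a timedelta Series,
# and .dt.days extracts the whole-day component.
differences = (pd.to_datetime(df['column2']) - pd.to_datetime(df['column1'])).dt.days
print(differences.tolist())
```

This keeps the parsing inside pandas, which tends to be faster and handles missing values as NaT rather than raising.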