I have data like this:
import pandas as pd

year = ['2010', '2011-2014', '2013', '2012-2016', '2018-present', '2019', '2015-present', '2015']
products = ['A', 'B', 'C', 'D', 'B', 'E', 'F', 'A']
rating = [4, 2, 2, 3, 1, 1, 2, 2]
data = pd.DataFrame({'Products': products, 'Year': year, 'Rating': rating})
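In my analysis I want to convert the year ranges into single-year values (for example ['2010', '2011', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020']), and for the other columns, add the counts that fall within each year range. For the example above, I want: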
{'2010':'A','2011':'B','2013':'B','2014':'B','2013':'C','2012':'D',
'2013':'D','2014':'D','2015':'D','2016':'D',...}
I believe I need pandas.cut to do the binning, but I don't know how to do the binning in pandas.
Answer 0 (score: 3)
Use explode:
# Extract the range information from the Year column
y = data['Year'].str.extract(r'(?P<From>\d+)-?(?P<To>\d+|present)?')
y['To'] = y['To'].combine_first(y['From']).replace({'present': '2020'})
y = y.astype('int')
y['Range'] = y.apply(lambda row: range(row['From'], row['To']+1), axis=1)
# The explosion
data['Range'] = y['Range']
data = data.explode('Range')
Result:
Products  Year          Rating  Range
A         2010          4       2010
B         2011-2014     2       2011
B         2011-2014     2       2012
B         2011-2014     2       2013
B         2011-2014     2       2014
C         2013          2       2013
D         2012-2016     3       2012
D         2012-2016     3       2013
D         2012-2016     3       2014
D         2012-2016     3       2015
D         2012-2016     3       2016
B         2018-present  1       2018
B         2018-present  1       2019
B         2018-present  1       2020
E         2019          1       2019
F         2015-present  2       2015
F         2015-present  2       2016
F         2015-present  2       2017
F         2015-present  2       2018
F         2015-present  2       2019
F         2015-present  2       2020
A         2015          2       2015
Rename the columns as needed.
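For the renaming step, here is a minimal sketch on a small made-up frame (the three rows and the names `exploded` and `tidy` are only for illustration, not from the answer). It assumes you want to discard the original range strings and keep the exploded column under the name Year:

```python
import pandas as pd

# Small illustrative frame; "Range" already holds the list of years per row.
data = pd.DataFrame({
    "Products": ["A", "B", "C"],
    "Year": ["2010", "2011-2012", "2013"],
    "Rating": [4, 2, 2],
    "Range": [[2010], [2011, 2012], [2013]],
})
exploded = data.explode("Range")

# Drop the original range strings and let the exploded column take the "Year" name.
tidy = (
    exploded.drop(columns="Year")
    .rename(columns={"Range": "Year"})
    .astype({"Year": int})      # explode leaves an object column, so cast back to int
    .reset_index(drop=True)     # the exploded rows share index labels otherwise
)
print(tidy)
```

The reset_index call is optional; skip it if you want to keep the repeated labels that point back to the source rows.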
Answer 1 (score: 3)
IIUC, you can str.split the Year column and then use a list comprehension with a few conditions:
# df is the question's DataFrame (called `data` there); "present" is treated as 2020
df["Year"] = [list(range(int(i[0]), int(i[1] if i[1] != "present" else "2020") + 1))
              if len(i) > 1 else list(range(int(i[0]), int(i[0]) + 1))
              for i in df["Year"].str.split("-")]
print(df.explode("Year"))
   Products  Year  Rating
0  A         2010  4
1  B         2011  2
1  B         2012  2
1  B         2013  2
1  B         2014  2
2  C         2013  2
3  D         2012  3
3  D         2013  3
3  D         2014  3
3  D         2015  3
3  D         2016  3
4  B         2018  1
4  B         2019  1
4  B         2020  1
5  E         2019  1
6  F         2015  2
6  F         2016  2
6  F         2017  2
6  F         2018  2
6  F         2019  2
6  F         2020  2
7  A         2015  2
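If you would rather not hardcode 2020 inside the comprehension, the same split-then-explode idea can be wrapped in a small function. This is only a sketch: `expand_years`, `to_years` and the `present_year` parameter are names I made up, and the three-row frame is a cut-down version of the question's data.

```python
import pandas as pd

def expand_years(df, column="Year", present_year=2020):
    """Return a copy of df with one row per year covered by `column`."""
    def to_years(text):
        # "2010" -> ("2010", "", ""), "2011-2014" -> ("2011", "-", "2014")
        start, _, end = text.partition("-")
        end = start if end == "" else end
        last = present_year if end == "present" else int(end)
        return list(range(int(start), last + 1))

    out = df.copy()  # leave the caller's frame untouched
    out[column] = out[column].map(to_years)
    return out.explode(column).astype({column: int})

df = pd.DataFrame({
    "Products": ["A", "B", "F"],
    "Year": ["2010", "2011-2014", "2018-present"],
    "Rating": [4, 2, 2],
})
result = expand_years(df, present_year=2020)
print(result)
```

As in the answer, the exploded rows keep the index label of the row they came from, so you can still group by the index to get back to the original entries.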
Answer 2 (score: 0)
A simple solution is as follows :)
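# Split into start/end; for single years, forward-fill the start into "end"
data[["start", "end"]] = data["Year"].str.split('-', expand=True).ffill(axis=1)
# Treat "present" as the current calendar year
data["end"] = data["end"].replace({"present": pd.Timestamp("now").year})
data[["start", "end"]] = data[["start", "end"]].astype(int)
data = data.drop("Year", axis=1)
# Repeat each row once per year in its range
data = data.loc[data.index.repeat(data.end - data.start + 1)].reset_index(drop=True)
# Number the repeated rows within each (Products, start) group and offset from start
data["counter"] = data.groupby(["Products", "start"]).cumcount()
data["Year"] = data["start"] + data["counter"]
data = data.drop(["start", "end", "counter"], axis=1)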
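One caveat with the snippet above: the counter is grouped by Products and start, so two source rows that share both values would be numbered as one group. That does not happen in the question's data, but a variant that numbers the copies per original row avoids the issue. The sketch below is my own variation, not the answer's code; the sample frame and the names `bounds`, `repeated` and `offset` are made up for illustration, and "present" handling is left out to keep it short.

```python
import pandas as pd

# Two rows deliberately share the same product and start year.
data = pd.DataFrame({
    "Products": ["A", "A", "B"],
    "Year": ["2010-2011", "2010-2012", "2013"],
    "Rating": [4, 2, 3],
})

bounds = data["Year"].str.split("-", expand=True)
start = bounds[0].astype(int)
# Single years have no second part, so fall back to the start value.
end = bounds[1].fillna(bounds[0]).astype(int)

# Repeat each row once per covered year; the original index labels are repeated too.
repeated = data.loc[data.index.repeat(end - start + 1)].drop(columns="Year")
# Number the copies within each original row, then offset from that row's start year.
offset = repeated.groupby(level=0).cumcount()
repeated["Year"] = start.loc[repeated.index].to_numpy() + offset.to_numpy()
repeated = repeated.reset_index(drop=True)
print(repeated)
```

Grouping on the index level works here because the repeated rows still carry the label of the row they were copied from; the index is only reset at the very end.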
Answer 3 (score: 0)
import numpy as np
import pandas as pd

# Split "start-end" into two columns; Year2 is missing for single years
df1 = df['Year'].str.split("-", expand=True)\
    .rename(columns={0: 'Year1', 1: 'Year2'})
df2 = pd.concat([df, df1], axis=1)  # Merge the split columns back in

def a(b):
    if pd.isna(b['Year2']):  # Single year: the end equals the start
        return b['Year1']
    if b['Year2'] == 'present':
        return 2020
    return b['Year2']

df2['Year3'] = df2.apply(a, axis=1)  # Conditional replacement
df2['Year1'] = df2['Year1'].astype(int)  # String --> integer
df2['Year3'] = df2['Year3'].astype(int)  # String --> integer
# Build the array of years covered by each row
df2['Year4'] = [np.arange(f, t + 1) for f, t in zip(df2['Year1'], df2['Year3'])]
# Explode: one row per year, then drop the helper columns
df3 = df2.explode('Year4').drop(columns=['Year', 'Year2', 'Year3', 'Year1'])
df4 = df3[['Products', 'Year4', 'Rating']]  # Reorder the columns
print(df4)