I am trying to apply a transform to a groupby object in pandas.
The code is as follows:
import uuid

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['012', '013', '014', '014', '015', '015', '016', '016', '017', '017'],
    'date': pd.to_datetime(
        ['2008-11-05', 'NaT', 'NaT', '2008-11-05', 'NaT', '2008-11-05',
         'NaT', '2008-11-05', 'NaT', '2008-11-05']),
    'grade': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
              np.nan, np.nan],
    'length': [1, 2, 3, 4, 5, 6, 7, 8, np.nan, 10]})
df['uuid'] = np.nan
df
Out[7]:
    id       date  grade  length  uuid
0  012 2008-11-05    NaN     1.0   NaN
1  013        NaT    NaN     2.0   NaN
2  014        NaT    NaN     3.0   NaN
3  014 2008-11-05    NaN     4.0   NaN
4  015        NaT    NaN     5.0   NaN
5  015 2008-11-05    NaN     6.0   NaN
6  016        NaT    NaN     7.0   NaN
7  016 2008-11-05    NaN     8.0   NaN
8  017        NaT    NaN     NaN   NaN
9  017 2008-11-05    NaN    10.0   NaN
In[8]:
df.groupby(['id', 'date']).uuid.transform(lambda g: uuid.uuid4())
Out[9]:
...
...
ValueError: Length mismatch: Expected axis has 5 elements, new values have 10 elements
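As in this question, I think the problem is caused by the NaT values in the date column, so I tried df.fillna('nan'). Unfortunately, this raised the same error. Is that because the date column interprets the string 'nan' as np.nan?
I then tried filling with the string 'nullv', which gave me 'ValueError: could not convert string to Timestamp'.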
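The suspicion about NaT can be checked directly. The following is a minimal sketch, assuming pandas 1.1 or later (where groupby accepts the dropna parameter); the names n_default and n_keep_nat are introduced here purely for illustration. By default, groupby excludes rows whose key contains NaN or NaT, which would account for the 5 groups versus 10 rows in the error message.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['012', '013', '014', '014', '015', '015', '016', '016', '017', '017'],
    'date': pd.to_datetime(
        ['2008-11-05', 'NaT', 'NaT', '2008-11-05', 'NaT', '2008-11-05',
         'NaT', '2008-11-05', 'NaT', '2008-11-05']),
})

# Default behaviour: rows whose key contains NaT are left out of the groups.
n_default = len(df.groupby(['id', 'date']).size())

# Assumption: pandas >= 1.1, where dropna=False keeps NaT as a valid key.
n_keep_nat = len(df.groupby(['id', 'date'], dropna=False).size())

print(n_default, n_keep_nat)
```

If the two counts differ, the missing groups are exactly the (id, NaT) combinations.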
So my current workaround is as follows:
df['uuid'] = np.nan
df.date = df.date.astype('str')
df.uuid = df.groupby(['id', 'date']).uuid.transform(lambda g: uuid.uuid4())
df.date = pd.to_datetime(df.date)
df
Out[9]:
    id       date  grade  length                                  uuid
0  012 2008-11-05    NaN     1.0  267b9c5f-41d9-4a8c-91af-aaa2dbddc911
1  013        NaT    NaN     2.0  0e7ae8fa-cf64-4c3a-abd8-85d40b6253a4
2  014        NaT    NaN     3.0  d1de91d8-099e-492c-8434-94ebd269280f
3  014 2008-11-05    NaN     4.0  91b42203-1a31-4dfe-8566-abba3686734f
4  015        NaT    NaN     5.0  6a83b025-98c4-4196-8bfb-1ca88e426d8b
5  015 2008-11-05    NaN     6.0  d0ba9dfc-fa2b-4a1f-995b-66f798bd9259
6  016        NaT    NaN     7.0  67a26331-03de-440e-8958-89a375007535
7  016 2008-11-05    NaN     8.0  ca94c6f2-1520-4162-94cf-cf4536fb8828
8  017        NaT    NaN     NaN  133da892-a0ef-4fa3-9557-14049e8f3b66
9  017 2008-11-05    NaN    10.0  4a19db2b-0166-45e0-aff0-54f83b479507
Surely there must be another way besides converting to a string and back again?
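One possible direction, offered as a sketch rather than a definitive answer: assuming pandas 1.1 or later, groupby accepts dropna=False, which keeps NaT as a group key, so the date column never has to leave its datetime dtype. The example below numbers the groups with ngroup() and maps each number to a fresh UUID. The names keys, group_ids, and uuid_by_group are hypothetical and exist only for this illustration.

```python
import uuid

import pandas as pd

df = pd.DataFrame({
    'id': ['012', '013', '014', '014', '015', '015', '016', '016', '017', '017'],
    'date': pd.to_datetime(
        ['2008-11-05', 'NaT', 'NaT', '2008-11-05', 'NaT', '2008-11-05',
         'NaT', '2008-11-05', 'NaT', '2008-11-05']),
})

keys = ['id', 'date']

# Assumption: pandas >= 1.1. dropna=False keeps rows whose key contains NaT,
# and ngroup() labels every row with the integer number of its group.
group_ids = df.groupby(keys, dropna=False).ngroup()

# Draw one UUID per distinct group number, then broadcast it back to the rows.
uuid_by_group = {g: uuid.uuid4() for g in group_ids.unique()}
df['uuid'] = group_ids.map(uuid_by_group)

print(df)
```

Rows sharing the same (id, date) pair, including pairs where date is NaT, receive the same UUID. In this sample every pair is distinct, so each row ends up with its own value.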