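I have run into this problem: I need to find the first time a user clicks on an email (variable sending) and put a 1 in the corresponding row when that happens.
The dataset has thousands of users (hash) who clicked on some part of an email in a newsletter. I tried grouping by sending and hash and then finding the earliest date, but could not get it to work.
So I went for a nasty solution, which however returns something strange:
My dataset (relevant variables):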
>>> clicks[['datetime','hash','sending']].head()
datetime hash sending
0 2016-11-01 19:13:34 0b1f4745df5925dfb1c8f53a56c43995 5
1 2016-11-01 10:47:14 0a73d5953ebf5826fbb7f3935bad026d 5
2 2016-10-31 19:09:21 605cebbabe0ba1b4248b3c54c280b477 5
3 2016-10-31 13:42:36 d26d61fb10c834292803b247a05b6cb7 5
4 2016-10-31 10:46:30 48f8ab83e8790d80af628e391f3325ad 5
There are 6 rounds of sending, and datetime has dtype datetime64[ns].
The way I did it is as follows:
clicks['first'] = 0
for hash in clicks['hash'].unique():
    t = clicks.ix[clicks.hash==hash, ['hash','datetime','sending']]
    part = t['sending'].unique()
    for i in part:
        temp = t.ix[t.sending == i,'datetime']
        clicks.ix[t[t.datetime == np.min(temp)].index.values,'first']=1
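First of all, I don't think it is very pythonic, and it is very slow. But above all, it returns a strange type! There are 0.0 and 1.0 values, but I cannot work with them: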
>>> type(clicks.first)
<type 'instancemethod'>
>>> clicks.loc[clicks.first==1]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 1296, in __getitem__
return self._getitem_axis(key, axis=0)
File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 1467, in _getitem_axis
return self._get_label(key, axis=axis)
File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 93, in _get_label
return self.obj._xs(label, axis=axis)
File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 1749, in xs
loc = self.index.get_loc(key)
File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/indexes/base.py", line 1947, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)
File "pandas/index.pyx", line 156, in pandas.index.IndexEngine.get_loc (pandas/index.c:3977)
File "pandas/index.pyx", line 373, in pandas.index.Int64Engine._check_type (pandas/index.c:7634)
KeyError: False
-----UPDATE:------
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.18.1
Answer 0 (score: 4)
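I think you need groupby with apply, comparing each value with the minimum of its group. The result of the comparison is boolean, so it needs to be converted to int 0 and 1 with astype: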
import pandas as pd
clicks = pd.DataFrame({'hash': {0: '0b1f4745df5925dfb1c8f53a56c43995', 1: '0a73d5953ebf5826fbb7f3935bad026d', 2: '605cebbabe0ba1b4248b3c54c280b477', 3: '0b1f4745df5925dfb1c8f53a56c43995', 4: '0a73d5953ebf5826fbb7f3935bad026d', 5: '605cebbabe0ba1b4248b3c54c280b477', 6: 'd26d61fb10c834292803b247a05b6cb7', 7: '48f8ab83e8790d80af628e391f3325ad'}, 'sending': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5}, 'datetime': {0: pd.Timestamp('2016-11-01 19:13:34'), 1: pd.Timestamp('2016-11-01 10:47:14'), 2: pd.Timestamp('2016-10-31 19:09:21'), 3: pd.Timestamp('2016-11-01 19:13:34'), 4: pd.Timestamp('2016-11-01 11:47:14'), 5: pd.Timestamp('2016-10-31 19:09:20'), 6: pd.Timestamp('2016-10-31 13:42:36'), 7: pd.Timestamp('2016-10-31 10:46:30')}})
print (clicks)
datetime hash sending
0 2016-11-01 19:13:34 0b1f4745df5925dfb1c8f53a56c43995 5
1 2016-11-01 10:47:14 0a73d5953ebf5826fbb7f3935bad026d 5
2 2016-10-31 19:09:21 605cebbabe0ba1b4248b3c54c280b477 5
3 2016-11-01 19:13:34 0b1f4745df5925dfb1c8f53a56c43995 5
4 2016-11-01 11:47:14 0a73d5953ebf5826fbb7f3935bad026d 5
5 2016-10-31 19:09:20 605cebbabe0ba1b4248b3c54c280b477 5
6 2016-10-31 13:42:36 d26d61fb10c834292803b247a05b6cb7 5
7 2016-10-31 10:46:30 48f8ab83e8790d80af628e391f3325ad 5
# Only needed if the datetime column does not already have a datetime dtype (not necessary for this sample)
clicks.datetime = pd.to_datetime(clicks.datetime)
clicks['first'] = clicks.groupby(['hash','sending'])['datetime'] \
.apply(lambda x: x == x.min()) \
.astype(int)
print (clicks)
datetime hash sending first
0 2016-11-01 19:13:34 0b1f4745df5925dfb1c8f53a56c43995 5 1
1 2016-11-01 10:47:14 0a73d5953ebf5826fbb7f3935bad026d 5 1
2 2016-10-31 19:09:21 605cebbabe0ba1b4248b3c54c280b477 5 0
3 2016-11-01 19:13:34 0b1f4745df5925dfb1c8f53a56c43995 5 1
4 2016-11-01 11:47:14 0a73d5953ebf5826fbb7f3935bad026d 5 0
5 2016-10-31 19:09:20 605cebbabe0ba1b4248b3c54c280b477 5 1
6 2016-10-31 13:42:36 d26d61fb10c834292803b247a05b6cb7 5 1
7 2016-10-31 10:46:30 48f8ab83e8790d80af628e391f3325ad 5 1
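As a possible variation on the approach above, here is a minimal sketch that uses groupby with transform('min') instead of apply, assuming a reasonably recent pandas release. The small sample frame and the names earliest and first_clicks are made up for illustration. It also selects the flagged rows with clicks['first'] rather than clicks.first: the type output in the question shows that, in that pandas version, clicks.first resolves to the DataFrame.first method and not to the column, which would explain the KeyError: False.

```python
import pandas as pd

# Small hypothetical sample in the same shape as the question's data.
clicks = pd.DataFrame({
    'hash': ['a', 'a', 'b', 'b', 'c'],
    'sending': [5, 5, 5, 5, 5],
    'datetime': pd.to_datetime([
        '2016-11-01 19:13:34',
        '2016-11-01 10:47:14',
        '2016-10-31 19:09:21',
        '2016-10-31 19:09:20',
        '2016-10-31 13:42:36',
    ]),
})

# Earliest click per (hash, sending), broadcast back to every row of its group.
earliest = clicks.groupby(['hash', 'sending'])['datetime'].transform('min')

# 1 where the row is the earliest click of its group, 0 otherwise.
clicks['first'] = (clicks['datetime'] == earliest).astype(int)

# Bracket access always returns the column, even when the column name
# collides with a DataFrame attribute or method such as "first".
first_clicks = clicks[clicks['first'] == 1]
print(first_clicks)
```

Because transform returns a result aligned with the original index, the comparison can be assigned straight back as a column. Rows that tie for the earliest timestamp within a group are all flagged, as in the output above.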
Answer 1 (score: 1)
Note: I am not familiar with the pandas module, but I use Python a lot (for systems engineering).
Why not use the datetime module? You can easily sort the records by their timestamps. Writing a comparison function looks straightforward, and sorting by date then gives you the earliest event.
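A minimal sketch of that idea using only the standard library, under stated assumptions: the records list, its tuple layout, and the timestamp format are hypothetical and only mirror the columns shown in the question. It is not the code from the original answer.

```python
from datetime import datetime

# Hypothetical click records: (hash, sending, timestamp string).
records = [
    ('a', 5, '2016-11-01 19:13:34'),
    ('a', 5, '2016-11-01 10:47:14'),
    ('b', 5, '2016-10-31 19:09:21'),
    ('b', 5, '2016-10-31 19:09:20'),
]

# Track the earliest click seen so far for each (hash, sending) pair.
earliest = {}
for user_hash, sending, stamp in records:
    clicked_at = datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S')
    key = (user_hash, sending)
    if key not in earliest or clicked_at < earliest[key]:
        earliest[key] = clicked_at

for key in sorted(earliest):
    print(key, earliest[key])
```

datetime objects compare chronologically, so no custom comparison function is needed here. For thousands of users a vectorised pandas solution is likely to be faster than a Python loop like this one.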