I have a pandas data frame in the following format:
   url                                    visitId  refId  timestamp
0  https://google.com/kindle              1        0      2019-03-08 07:01:40
1  https://google.com/                    2        1      2019-03-08 07:01:48
2  https://google.com/subscribe-and-save  3        2      2019-03-08 07:01:52
3  https://docs.google.com/abc            4        3      2019-03-08 07:02:23
4  https://youtube.com/music              5        4      2019-03-08 07:03:01
5  https://google.com/kindle              6        0      2019-03-08 07:08:00
6  https://example.com/                   7        0      2019-03-08 10:10:11
7  https://example.com/                   8        0      2019-03-08 10:11:00
8  https://example.com/                   9        8      2019-03-08 10:11:00
9  https://example.com/                   10       9      2019-03-08 10:11:00
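refId = 0 means the visit was a direct visit, while a non-zero value is the visitId of the visit that led to this URL. A chain of visits linked this way is one journey, and the journey's domain is the most frequent domain within it. The data frame is made up of many such journeys, and for each domain I need to count the number of journeys and the total number of URLs in those journeys.
Expected output for the data frame above: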
   Domain       Journey  Count
0  google.com   1        5
1  example.com  1        3
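To make the definition concrete, here is a minimal sketch that follows the refId links of the sample back to their direct visits. The names `ref_of` and `journey_root` are hypothetical helpers used only for illustration, and the mapping is typed in by hand from the visitId and refId columns above.

```python
# visitId -> refId, copied by hand from the sample frame (illustration only).
ref_of = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 0, 7: 0, 8: 0, 9: 8, 10: 9}


def journey_root(visit_id):
    """Follow refId links back until a direct visit (refId == 0) is reached."""
    seen = set()  # guards against accidental cycles in the links
    while ref_of[visit_id] != 0 and visit_id not in seen:
        seen.add(visit_id)
        visit_id = ref_of[visit_id]
    return visit_id


# Group every visit under the direct visit that started its chain.
journeys = {}
for visit_id in ref_of:
    journeys.setdefault(journey_root(visit_id), []).append(visit_id)

print(journeys)
# {1: [1, 2, 3, 4, 5], 6: [6], 7: [7], 8: [8, 9, 10]}
```

Visits 6 and 7 are direct visits that nothing refers back to, so each forms a single-visit chain, and neither appears in the expected output.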
I have solved the problem by converting the pandas data frame into a list of dictionaries, as shown below.
import datetime

# URLs match the data frame shown above.
df = [{'url':'https://google.com/kindle','visitId':1,'refId':0,'timestamp':datetime.datetime(2019,3,8,7,1,40)},
      {'url':'https://google.com/','visitId':2,'refId':1,'timestamp':datetime.datetime(2019,3,8,7,1,48)},
      {'url':'https://google.com/subscribe-and-save','visitId':3,'refId':2,'timestamp':datetime.datetime(2019,3,8,7,1,52)},
      {'url':'https://docs.google.com/abc','visitId':4,'refId':3,'timestamp':datetime.datetime(2019,3,8,7,2,23)},
      {'url':'https://youtube.com/music','visitId':5,'refId':4,'timestamp':datetime.datetime(2019,3,8,7,3,1)},
      {'url':'https://google.com/kindle','visitId':6,'refId':0,'timestamp':datetime.datetime(2019,3,8,7,8,0)},
      {'url':'https://example.com/','visitId':7,'refId':0,'timestamp':datetime.datetime(2019,3,8,10,10,11)},
      {'url':'https://example.com/','visitId':8,'refId':0,'timestamp':datetime.datetime(2019,3,8,10,11,0)},
      {'url':'https://example.com/','visitId':9,'refId':8,'timestamp':datetime.datetime(2019,3,8,10,11,0)},
      {'url':'https://example.com/','visitId':10,'refId':9,'timestamp':datetime.datetime(2019,3,8,10,11,0)}]
Then I used the following approach.
import datetime
import tldextract

def Journey():
    # Referred visits only (refId != 0), sorted newest first.
    dat = [item for item in df if item['refId'] != 0]
    dat = sorted(dat, key=lambda k: k['timestamp'], reverse=True)
    repCheck = set()  # visitIds already covered by an earlier chain
    final = []
    visitIdToDocument = {}
    refIdToDocument = {}
    for item in df:
        visitIdToDocument[item['visitId']] = {'url': item['url'], 'refId': item['refId']}
    for item in dat:
        refIdToDocument[item['refId']] = {'url': item['url'], 'visitId': item['visitId']}
    for item in dat:
        lst = []
        rId = item['refId']
        flag = False
        try:
            vId = refIdToDocument[rId]['visitId']
            if vId not in repCheck:
                repCheck.add(vId)
                flag = True
        except KeyError:
            pass
        repCheck2 = set()  # guards against cycles within one chain
        if flag:
            # Walk back through the refId links until a direct visit is reached.
            while True:
                if vId in repCheck2:
                    break
                else:
                    repCheck2.add(vId)
                try:
                    lst.append(visitIdToDocument[vId])
                    vId = rId
                    repCheck.add(vId)
                    rId = visitIdToDocument[vId]['refId']
                except KeyError:
                    if vId != '0' and vId != 0 and vId in visitIdToDocument:
                        lst.append(visitIdToDocument[vId])
                    break
        # Only chains of more than two visits are counted as journeys.
        if len(lst) > 2:
            journeyDomainLst = []
            for i in lst[::-1]:
                ext = tldextract.extract(i['url'])
                jdla = "{}.{}".format(ext.domain, ext.suffix)
                if jdla[-1] == '.':
                    jdla = jdla[:-1]
                journeyDomainLst.append(jdla)
            # The journey's domain is the most frequent domain in the chain.
            appendDict = {'Journey': 1, 'Domain': max(set(journeyDomainLst), key=journeyDomainLst.count), 'Count': len(lst)}
            final.append(appendDict)
    # Sum the journey and URL counts per domain.
    domains = {d['Domain'] for d in final}
    final = [{'Domain': dom, 'Journey': sum(d['Journey'] for d in final if d['Domain'] == dom), 'Count': sum(d['Count'] for d in final if d['Domain'] == dom)} for dom in domains]
    return final
Result:
[{'Domain': 'google.com', 'Journey': 1, 'Count': 5},
{'Domain': 'example.com', 'Journey': 1, 'Count': 3}]
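Can the same result be obtained using the pandas data frame itself, without the list of dictionaries and the other dictionaries I created? If so, how should I go about it?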
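One possible direction, offered as a rough sketch rather than a verified solution: resolve every visit to the visitId of the direct visit that started its chain by repeatedly mapping refId, then aggregate with groupby. It assumes that each visit is referred to by at most one later visit and that every non-zero refId exists as a visitId. `naive_domain` is a hypothetical stand-in for tldextract that simply keeps the last two host labels (which is wrong for suffixes such as co.uk), and the `Count > 2` filter mirrors the `len(lst) > 2` check in the code above.

```python
from urllib.parse import urlparse

import pandas as pd

# Sample data from the question (timestamp omitted, it is not needed here).
df = pd.DataFrame({
    "url": [
        "https://google.com/kindle", "https://google.com/",
        "https://google.com/subscribe-and-save", "https://docs.google.com/abc",
        "https://youtube.com/music", "https://google.com/kindle",
        "https://example.com/", "https://example.com/",
        "https://example.com/", "https://example.com/",
    ],
    "visitId": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "refId": [0, 1, 2, 3, 4, 0, 0, 0, 8, 9],
})


def naive_domain(url):
    """Assumption: keep the last two host labels instead of using tldextract."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


# visitId -> refId lookup; start with every visit as its own root.
parent = df.set_index("visitId")["refId"]
root = df["visitId"].copy()
for _ in range(len(df)):            # a chain cannot be longer than the frame
    up = root.map(parent)           # refId of the current ancestor
    step = up.ne(0) & up.notna()    # rows that can still move one link up
    if not step.any():
        break
    root = up.where(step, root).astype(int)

df = df.assign(journey=root, domain=df["url"].map(naive_domain))

# One row per journey: most frequent domain and number of URLs.
journeys = df.groupby("journey").agg(
    Domain=("domain", lambda s: s.mode().iloc[0]),
    Count=("url", "size"),
)
journeys = journeys[journeys["Count"] > 2]

# One row per domain: number of journeys and total URLs.
result = (
    journeys.groupby("Domain", sort=False)
    .agg(Journey=("Count", "count"), Count=("Count", "sum"))
    .reset_index()
)
print(result)
```

On the sample data this yields one google.com journey with 5 URLs and one example.com journey with 3 URLs. If a visit can be referred to by several later visits, grouping by root merges those branches into one journey, which may differ from what the loop-based code does.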