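Pandas DataFrame: count the number of journeys per domain and the number of URLs in those journeys

Time: 2019-03-08 07:18:11

Tags: python pandas

I have a pandas DataFrame in the following format: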

    url                                     visitId      refId    timestamp
0   https://google.com/kindle               1            0        2019-03-08 07:01:40
1   https://google.com/                     2            1        2019-03-08 07:01:48
2   https://google.com/subscribe-and-save   3            2        2019-03-08 07:01:52
3   https://docs.google.com/abc             4            3        2019-03-08 07:02:23
4   https://youtube.com/music               5            4        2019-03-08 07:03:01
5   https://google.com/kindle               6            0        2019-03-08 07:08:00
6   https://example.com/                    7            0        2019-03-08 10:10:11
7   https://example.com/                    8            0        2019-03-08 10:11:00
8   https://example.com/                    9            8        2019-03-08 10:11:00
9   https://example.com/                    10           9        2019-03-08 10:11:00

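refId = 0 means the visit was direct, while a nonzero value is the visitId of the referring URL. Visits linked through refId in this way form one journey, and the journey's domain is the most frequent domain within it. The DataFrame contains many such journeys, and for each domain I need to count the number of journeys and the number of URLs in those journeys.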
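
To make the refId linkage concrete, here is a minimal sketch that follows refId pointers back from a visit to the direct visit that started its chain. The `parent` mapping and the helper name `chain_to_root` are illustrative only, and the sketch assumes that every nonzero refId points to an existing visitId and that there are no cycles.

```python
# visitId -> refId for the sample rows above (0 means a direct visit).
parent = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 0, 7: 0, 8: 0, 9: 8, 10: 9}

def chain_to_root(visit_id, parent):
    """Return the visitIds from the starting direct visit up to visit_id."""
    chain = []
    while visit_id != 0:
        chain.append(visit_id)
        visit_id = parent[visit_id]  # step back to the referring visit
    return chain[::-1]               # oldest visit first

print(chain_to_root(5, parent))   # [1, 2, 3, 4, 5]
print(chain_to_root(10, parent))  # [8, 9, 10]
print(chain_to_root(6, parent))   # [6]
```

Visit 6 is a direct visit that nothing refers to, so its chain contains only itself.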

Expected output for the DataFrame above:

    Domain            Journey            Count
0   google.com        1                  5
1   example.com       1                  3
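
The google.com row can be checked by hand: rows 0 to 4 of the sample table form one chain of five URLs, four of which fall under google.com. The sketch below tallies that. `naive_domain` is a hypothetical helper that keeps the last two labels of the host name. It is assumed here only to keep the example dependency-free, and it gives wrong answers for suffixes such as co.uk, where a suffix-aware library is needed.

```python
from collections import Counter
from urllib.parse import urlsplit

def naive_domain(url):
    """Last two labels of the host name; a rough stand-in for a suffix-aware parser."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])

# URLs of rows 0 to 4 in the sample table, in visit order.
journey_urls = [
    "https://google.com/kindle",
    "https://google.com/",
    "https://google.com/subscribe-and-save",
    "https://docs.google.com/abc",
    "https://youtube.com/music",
]

domains = [naive_domain(u) for u in journey_urls]
journey_domain, hits = Counter(domains).most_common(1)[0]
print(journey_domain, hits, len(journey_urls))  # google.com 4 5
```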

I have solved this by converting the pandas DataFrame into a list of dictionaries, as shown below.

import datetime

df = [{'url':'https://google.com/kindle','visitId':1,'refId':0,'timestamp':datetime.datetime(2019,3,8,7,1,40)},
        {'url':'https://google.com/','visitId':2,'refId':1,'timestamp':datetime.datetime(2019,3,8,7,1,48)},
        {'url':'https://google.com/subscribe-and-save','visitId':3,'refId':2,'timestamp':datetime.datetime(2019,3,8,7,1,52)},
        {'url':'https://docs.google.com/abc','visitId':4,'refId':3,'timestamp':datetime.datetime(2019,3,8,7,2,23)},
        {'url':'https://youtube.com/music','visitId':5,'refId':4,'timestamp':datetime.datetime(2019,3,8,7,3,1)},
        {'url':'https://google.com/kindle','visitId':6,'refId':0,'timestamp':datetime.datetime(2019,3,8,7,8,0)},
        {'url':'https://example.com/','visitId':7,'refId':0,'timestamp':datetime.datetime(2019,3,8,10,10,11)},
        {'url':'https://example.com/','visitId':8,'refId':0,'timestamp':datetime.datetime(2019,3,8,10,11,0)},
        {'url':'https://example.com/','visitId':9,'refId':8,'timestamp':datetime.datetime(2019,3,8,10,11,0)},
        {'url':'https://example.com/','visitId':10,'refId':9,'timestamp':datetime.datetime(2019,3,8,10,11,0)}]

Then I used the following approach.

import datetime
import tldextract

def Journey():
    # Works on the module-level list of dicts `df` defined above.
    # Only referred visits (refId != 0) can be the last step of a chain.
    dat = [item for item in df if item['refId'] != 0]

    # Most recent visits first, so a chain is walked from its latest visit.
    dat = sorted(dat, key=lambda k: k['timestamp'], reverse=True)

    repCheck = set()  # visitIds already covered by an earlier chain

    final = []

    visitIdToDocument = {}  # visitId -> {'url', 'refId'}
    refIdToDocument = {}    # refId -> {'url', 'visitId'} of a visit referred from it

    for item in df:
        visitIdToDocument[item['visitId']] = {'url': item['url'], 'refId': item['refId']}

    for item in dat:
        refIdToDocument[item['refId']] = {'url': item['url'], 'visitId': item['visitId']}

    for item in dat:
        lst = []
        rId = item['refId']
        flag = False
        try:
            vId = refIdToDocument[rId]['visitId']
            if vId not in repCheck:
                repCheck.add(vId)
                flag = True
        except KeyError:
            pass
        repCheck2 = set()  # guards against cycles within the current chain
        if flag:
            # Follow refId links backwards until the direct visit is reached.
            while True:
                if vId in repCheck2:
                    break
                else:
                    repCheck2.add(vId)
                try:
                    lst.append(visitIdToDocument[vId])
                    vId = rId
                    repCheck.add(vId)
                    rId = visitIdToDocument[vId]['refId']
                except KeyError:
                    if vId != '0' and vId != 0 and vId in visitIdToDocument:
                        lst.append(visitIdToDocument[vId])
                    break

            # Only chains with more than two URLs count as a journey.
            if len(lst) > 2:
                journeyDomainLst = []
                for i in lst[::-1]:
                    ext = tldextract.extract(i['url'])
                    jdla = "{}.{}".format(ext.domain, ext.suffix)
                    if jdla[-1] == '.':
                        jdla = jdla[:-1]
                    journeyDomainLst.append(jdla)
                # The journey domain is the most frequent domain in the chain.
                appendDict = {'Journey': 1, 'Domain': max(set(journeyDomainLst), key=journeyDomainLst.count), 'Count': len(lst)}
                final.append(appendDict)

    domains = {d['Domain'] for d in final}

    # Sum journeys and URL counts per domain.
    final = [{'Domain': dom, 'Journey': sum(d['Journey'] for d in final if d['Domain'] == dom), 'Count': sum(d['Count'] for d in final if d['Domain'] == dom)} for dom in domains]
    return final

Result

[{'Domain': 'google.com', 'Journey': 1, 'Count': 5},
{'Domain': 'example.com', 'Journey': 1, 'Count': 3}]

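Can the same result be obtained using the pandas DataFrame itself, without the list of dictionaries and the other helper dictionaries? If so, how should I go about it?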
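
One possible direction, sketched under stated assumptions rather than offered as a definitive answer: resolve every visit to the direct visit that started its chain by repeatedly mapping visitId to refId, then group by that root. The function name `journeys_per_domain`, the `min_urls` parameter, and the `naive_domain` helper (a rough stand-in for tldextract that keeps the last two host labels) are all hypothetical. The sketch assumes unique visitIds, no cycles, and that all visits sharing a root belong to one journey, which differs from the chain-walking code above when a visit is referred to by more than one later visit.

```python
import pandas as pd
from urllib.parse import urlsplit

df = pd.DataFrame({
    "url": [
        "https://google.com/kindle", "https://google.com/",
        "https://google.com/subscribe-and-save", "https://docs.google.com/abc",
        "https://youtube.com/music", "https://google.com/kindle",
        "https://example.com/", "https://example.com/",
        "https://example.com/", "https://example.com/",
    ],
    "visitId": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "refId":   [0, 1, 2, 3, 4, 0, 0, 0, 8, 9],
})

def naive_domain(url):
    """Last two host labels; a rough stand-in for tldextract."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def journeys_per_domain(df, min_urls=3):
    parent = df.set_index("visitId")["refId"]  # visitId -> refId lookup
    root = df["visitId"].copy()
    # Climb one refId step per pass until every row sits on a direct visit.
    # The pass count is bounded by len(df) so a cycle cannot loop forever.
    for _ in range(len(df)):
        up = root.map(parent).fillna(0).astype(int)  # dangling refId -> treat as root
        if up.eq(0).all():
            break
        root = root.where(up.eq(0), up)

    work = df.assign(root=root, domain=df["url"].map(naive_domain))
    per_journey = work.groupby("root").agg(
        Domain=("domain", lambda s: s.mode().iloc[0]),  # most frequent domain
        Count=("url", "size"),                          # URLs in the journey
    )
    # Mirror the `len(lst) > 2` rule: keep chains of at least min_urls URLs.
    per_journey = per_journey[per_journey["Count"] >= min_urls]

    out = per_journey.groupby("Domain").agg(
        Journey=("Count", "size"),  # number of journeys for the domain
        Count=("Count", "sum"),     # total URLs across those journeys
    )
    return out.sort_values("Count", ascending=False).reset_index()

result = journeys_per_domain(df)
print(result)
```

For the sample data this gives google.com with 1 journey and 5 URLs, and example.com with 1 journey and 3 URLs. When two domains tie within a journey, `mode().iloc[0]` picks the alphabetically first one, which may not match the tie-breaking of `max(set(...), key=...)` in the original function.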

0 answers:

No answers yet