Counting values for certain IDs in a table using Python

Time: 2016-05-11 08:44:59

Tags: python pandas

I have data like this:

ID,"url","used_at","active_seconds"
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/videos168693045?section=all",2016-03-01 10:18:45,4
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com",2016-03-01 10:18:49,2
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/feed",2016-03-01 10:18:51,2
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172",2016-03-01 10:18:53,2
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172?q=Тимур%20Гатиятуллин%20%7C%20Честный%20-%20Улетай%20полная%20версия",2016-03-01 10:18:55,6
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172?q=Тимур%20ГатиятуллинЧестный%20-%20Улетай%20полная%20версия",2016-03-01 10:19:01,2
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172?q=Тимур%20Гатиятуллин%20Честный%20-%20Улетай%20полная%20версия",2016-03-01 10:19:03,4
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios168693045?section=all",2016-03-01 10:19:07,2

For each ID, I need to count the URLs that contain audios followed by an id. Desired output:

d684cd5f0189ab49c391c2b7bcbac0cb: 291781172 - 4, 168693045 - 1, etc

I don't know how to get the id that follows audios and count it.

import pandas as pd

data = pd.read_csv("get_id.csv")
data_name = pd.read_excel("name.xlsx")
names_panel = data_name['Names']
urls = data['url']
ids = data['ID']
# So far this only prints the URLs that contain 'audios'.
for url in urls:
    if 'audios' in url:
        print(url)

3 answers:

Answer 0: (score: 1)

print(pd.concat([df['ID'], df['url'].str.extract(r'(?P<count>audios)(?P<digit>\d+)')], axis=1).groupby(['ID', 'digit']).count())

                                            count
ID                               digit           
d684cd5f0189ab49c391c2b7bcbac0cb 168693045      1
                                 291781172      4

Answer 1: (score: 1)

I think you need str.extract, then groupby on the columns ID and no with size:

df['no'] = df.url.str.extract(r'audios(\d+)?', expand=False)
print(df)
print(df.groupby(['ID', 'no']).size().reset_index(name='count'))
                                 ID         no  count
0  d684cd5f0189ab49c391c2b7bcbac0cb  168693045      1
1  d684cd5f0189ab49c391c2b7bcbac0cb  291781172      4

Or without creating a new column:

print(df.groupby([df.ID, df.url.str.extract(r'audios(\d+)?', expand=False)])
        .size().reset_index(name='count'))
                                 ID        url  count
0  d684cd5f0189ab49c391c2b7bcbac0cb  168693045      1
1  d684cd5f0189ab49c391c2b7bcbac0cb  291781172      4

Timings

I slightly improved the pd.concat answer (added as_index=False so it returns a DataFrame, and added expand=False to remove the warning), then compared the solutions:

In [152]: %timeit pd.concat([df['ID'], df['url'].str.extract('(?P<count>audios)(?P<digit>\d+)', expand=False)], axis=1).groupby(['ID', 'digit'], as_index=False).count()
100 loops, best of 3: 3.5 ms per loop

In [153]: %timeit df.groupby([df.ID, df.url.str.extract(r'audios(\d+)?', expand=False)]).size().reset_index(name='count')
1000 loops, best of 3: 1.92 ms per loop
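For anyone who wants to try the str.extract plus groupby idea without the original CSV, here is a minimal self-contained sketch. The inline sample rows are a shortened, partly made-up stand-in for the question's data, and the names csv_text, audio_id and counts are my own, not from the answers. It also formats the result roughly like the desired output.

```python
import io

import pandas as pd

# Shortened sample modelled on the question's data (assumption: same columns).
csv_text = '''ID,"url","used_at","active_seconds"
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/feed",2016-03-01 10:18:51,2
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172",2016-03-01 10:18:53,2
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172?q=test",2016-03-01 10:18:55,6
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios168693045?section=all",2016-03-01 10:19:07,2
'''
df = pd.read_csv(io.StringIO(csv_text))

# Pull the digits that follow "audios"; rows without a match become NaN
# and are dropped by groupby.
audio_id = df["url"].str.extract(r"audios(\d+)", expand=False).rename("audio_id")

counts = df.groupby([df["ID"], audio_id]).size().reset_index(name="count")
print(counts)

# One line per ID, roughly in the "ID: audio - n, audio - n" shape asked for.
for user_id, group in counts.groupby("ID"):
    parts = ", ".join(
        f"{audio} - {n}" for audio, n in zip(group["audio_id"], group["count"])
    )
    print(f"{user_id}: {parts}")
```

Using `r"audios(\d+)"` without the trailing `?` means URLs that contain audios with no digits after it are simply not matched, which seems to be what the question wants.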

Answer 2: (score: 0)

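Here is a non-Pythonic way (using loops).

First, IIUC the numbers you want to extract always have the same length, right? Then just turn each URL into a list of characters, slice out the part you want, and join it back into a string.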

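ids = df.ID.unique()
for identity in ids:
    # Only look at the URLs that belong to this ID.
    urls = df.loc[df.ID == identity, 'url']
    my_list = []
    for url in urls:
        if 'audios' in url:
            # Characters 13-21 hold the 9-digit number in "vk.com/audios<number>".
            my_list.append(url[13:22])
    for number in set(my_list):
        print(str(identity) + ': ' + number + ': ' + str(my_list.count(number)))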
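
If the numbers are not guaranteed to be nine digits long, or the URL prefix can vary, the fixed slice breaks. A hedged variant of the same loop idea, using only the standard library, could look like the sketch below. The function name count_audios and the sample rows are hypothetical, chosen just for illustration.

```python
import re
from collections import Counter, defaultdict

# Digits of any length that directly follow "audios".
AUDIO_RE = re.compile(r"audios(\d+)")


def count_audios(rows):
    """Take (user_id, url) pairs; return {user_id: Counter({audio_id: n})}."""
    counts = defaultdict(Counter)
    for user_id, url in rows:
        match = AUDIO_RE.search(url)
        if match:
            counts[user_id][match.group(1)] += 1
    return counts


# Sample pairs modelled on the question's data (shortened, partly made up).
rows = [
    ("d684cd5f0189ab49c391c2b7bcbac0cb", "vk.com/feed"),
    ("d684cd5f0189ab49c391c2b7bcbac0cb", "vk.com/audios291781172"),
    ("d684cd5f0189ab49c391c2b7bcbac0cb", "vk.com/audios291781172?q=test"),
    ("d684cd5f0189ab49c391c2b7bcbac0cb", "vk.com/audios168693045?section=all"),
]

result = count_audios(rows)
for user_id, counter in result.items():
    summary = ", ".join(f"{audio} - {n}" for audio, n in counter.most_common())
    print(f"{user_id}: {summary}")
```

With a DataFrame, the pairs could be fed in as `zip(df.ID, df.url)`.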