I have data like this:
ID,"url","used_at","active_seconds"
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/videos168693045?section=all",2016-03-01 10:18:45,4
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com",2016-03-01 10:18:49,2
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/feed",2016-03-01 10:18:51,2
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172",2016-03-01 10:18:53,2
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172?q=Тимур%20Гатиятуллин%20%7C%20Честный%20-%20Улетай%20полная%20версия",2016-03-01 10:18:55,6
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172?q=Тимур%20ГатиятуллинЧестный%20-%20Улетай%20полная%20версия",2016-03-01 10:19:01,2
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172?q=Тимур%20Гатиятуллин%20Честный%20-%20Улетай%20полная%20версия",2016-03-01 10:19:03,4
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios168693045?section=all",2016-03-01 10:19:07,2
I need to count the URLs that contain audios, broken down by the id that follows it.
Desired output:
d684cd5f0189ab49c391c2b7bcbac0cb: 291781172 - 4, 168693045 - 1, etc
I don't know how to get the id that follows audios and count it.
import pandas as pd

data = pd.read_csv("get_id.csv")
data_name = pd.read_excel("name.xlsx")
names_panel = data_name['Names']
urls = data['url']
ids = data['ID']
for url in urls:
    if 'audios' in url:
        print(url)
Answer 0 (score: 1)
print(pd.concat([df['ID'], df['url'].str.extract(r'(?P<count>audios)(?P<digit>\d+)')], axis=1).groupby(['ID', 'digit']).count())
                                            count
ID                               digit
d684cd5f0189ab49c391c2b7bcbac0cb 168693045      1
                                 291781172      4
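To see what the two named groups in that pattern capture, here is a small sketch that uses only the standard `re` module on URL shapes taken from the question. The names `pattern`, `sample_urls` and `captured` are illustrative and not part of the answer.

```python
import re

# Same pattern as in the answer: group "count" matches the literal text
# "audios", group "digit" matches the digits that follow it.
pattern = re.compile(r'(?P<count>audios)(?P<digit>\d+)')

sample_urls = [
    "vk.com/feed",
    "vk.com/audios291781172",
    "vk.com/audios168693045?section=all",
]

captured = []
for url in sample_urls:
    match = pattern.search(url)
    # URLs without "audios<digits>" give no match. pandas str.extract
    # reports NaN for such rows, and groupby drops NaN keys by default.
    captured.append(match.groupdict() if match else None)

for url, groups in zip(sample_urls, captured):
    print(url, '->', groups)
```

Because the `count` column holds the text `audios` for every matching row, calling `.count()` after the groupby simply counts the non-null rows in each `(ID, digit)` group.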
Answer 1 (score: 1)
I think you need str.extract. Then groupby by ID and no and aggregate with size:
df['no'] = df.url.str.extract(r'audios(\d+)?', expand=False)
print(df)
print(df.groupby(['ID', 'no']).size().reset_index(name='count'))
ID no count
0 d684cd5f0189ab49c391c2b7bcbac0cb 168693045 1
1 d684cd5f0189ab49c391c2b7bcbac0cb 291781172 4
Alternatively, group directly by the extracted values:
print(df.groupby([df.ID, df.url.str.extract(r'audios(\d+)?', expand=False)])
        .size().reset_index(name='count'))
ID url count
0 d684cd5f0189ab49c391c2b7bcbac0cb 168693045 1
1 d684cd5f0189ab49c391c2b7bcbac0cb 291781172 4
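Neither result is yet in the "ID: number - count, ..." shape that the question asks for. A minimal sketch of that last step follows, assuming Python 3 and a recent pandas. The sample rows come from the question, with the long query strings shortened to `?q=x`, `?q=y` and `?q=z`; `csv_text`, `counts` and `lines` are illustrative names.

```python
import io
import pandas as pd

# Rows from the question; query strings are shortened for readability.
csv_text = '''ID,"url","used_at","active_seconds"
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/videos168693045?section=all",2016-03-01 10:18:45,4
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com",2016-03-01 10:18:49,2
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/feed",2016-03-01 10:18:51,2
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172",2016-03-01 10:18:53,2
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172?q=x",2016-03-01 10:18:55,6
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172?q=y",2016-03-01 10:19:01,2
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172?q=z",2016-03-01 10:19:03,4
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios168693045?section=all",2016-03-01 10:19:07,2
'''
df = pd.read_csv(io.StringIO(csv_text))

# Pull the digits that follow "audios"; rows without a match become NaN.
df['no'] = df['url'].str.extract(r'audios(\d+)', expand=False)

# Count rows per (ID, no); rows whose 'no' is NaN are dropped by groupby.
counts = df.groupby(['ID', 'no']).size().reset_index(name='count')

# Build one "ID: number - count, ..." line per ID, most frequent first.
lines = []
ordered = counts.sort_values('count', ascending=False)
for user_id, group in ordered.groupby('ID', sort=False):
    parts = ', '.join(f"{no} - {n}" for no, n in zip(group['no'], group['count']))
    lines.append(f"{user_id}: {parts}")

print('\n'.join(lines))
```

Note that the `videos168693045` row is not counted, since the pattern only matches digits that come right after `audios`.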
I slightly improved the solution from Answer 0 (adding expand=False to remove the warning, and as_index=False so that a DataFrame is returned) and then compared the solutions.
Timings:
In [152]: %timeit pd.concat([df['ID'], df['url'].str.extract('(?P<count>audios)(?P<digit>\d+)', expand=False)], axis=1).groupby(['ID', 'digit'], as_index=False).count()
100 loops, best of 3: 3.5 ms per loop
In [153]: %timeit df.groupby([df.ID, df.url.str.extract(r'audios(\d+)?', expand=False)]).size().reset_index(name='count')
1000 loops, best of 3: 1.92 ms per loop
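The %timeit lines are IPython magics. Outside IPython, a comparison along the same lines could be sketched with the standard `timeit` module. The frame below is hypothetical stand-in data (the question's URL shapes repeated 1000 times), and `concat_count` and `groupby_size` are illustrative names, so the numbers it prints will not match the ones reported here.

```python
import timeit
import pandas as pd

# Hypothetical stand-in data: URL shapes from the question, repeated.
df = pd.DataFrame({
    'ID': ['d684cd5f0189ab49c391c2b7bcbac0cb'] * 4000,
    'url': ['vk.com/feed',
            'vk.com/audios291781172',
            'vk.com/audios291781172?q=x',
            'vk.com/audios168693045?section=all'] * 1000,
})

def concat_count():
    # Two named groups give a two-column frame, which is then counted.
    extracted = df['url'].str.extract(r'(?P<count>audios)(?P<digit>\d+)', expand=True)
    return (pd.concat([df['ID'], extracted], axis=1)
              .groupby(['ID', 'digit'], as_index=False).count())

def groupby_size():
    # One group with expand=False gives a Series that is used as a group key.
    digits = df['url'].str.extract(r'audios(\d+)', expand=False)
    return df.groupby([df['ID'], digits]).size().reset_index(name='count')

for func in (concat_count, groupby_size):
    # Best of 3 runs, 10 calls per run, reported per call.
    seconds = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{func.__name__}: {seconds * 1000:.2f} ms per call")
```

Timings depend on the pandas version, the machine and the data size, so it is worth rerunning such a comparison on the real data before choosing between the two.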
Answer 2 (score: 0)
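Here is a non-pythonic way (using loops).
First, IIUC the numbers you want to get always have the same length, right? Then just turn each URL into a list of characters, pick the ones you want, and build a string from them.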
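ids = df.ID.unique()
for identity in ids:
    my_list = []
    # Only look at the URLs that belong to this ID
    for url in df.loc[df.ID == identity, 'url']:
        if 'audios' in url:
            # Characters 13-21 are the nine digits right after "vk.com/audios"
            my_list.append(url[13:22])
    for number in set(my_list):
        print(str(identity) + ': ' + number + ': ' + str(my_list.count(number)))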
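The fixed slice only works while every URL starts with "vk.com/audios" and every number has nine digits. A loop-based sketch that drops that assumption, using only the standard library (`csv`, `re`, `collections.Counter`), could look like the following. The sample rows come from the question with query strings shortened, and the last row, with ID `hypothetical_user`, is invented here purely to show a second ID and a shorter number.

```python
import csv
import io
import re
from collections import Counter, defaultdict

# Rows from the question (queries shortened) plus one invented row.
csv_text = '''ID,"url","used_at","active_seconds"
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/videos168693045?section=all",2016-03-01 10:18:45,4
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/feed",2016-03-01 10:18:51,2
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172",2016-03-01 10:18:53,2
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172?q=x",2016-03-01 10:18:55,6
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172?q=y",2016-03-01 10:19:01,2
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios291781172?q=z",2016-03-01 10:19:03,4
d684cd5f0189ab49c391c2b7bcbac0cb,"vk.com/audios168693045?section=all",2016-03-01 10:19:07,2
hypothetical_user,"vk.com/audios12345",2016-03-01 10:20:00,1
'''

audio_re = re.compile(r'audios(\d+)')

# One Counter per ID, keyed by the number that follows "audios".
counts = defaultdict(Counter)
for row in csv.DictReader(io.StringIO(csv_text)):
    match = audio_re.search(row['url'])
    if match:
        counts[row['ID']][match.group(1)] += 1

lines = []
for user_id, counter in counts.items():
    # most_common() yields (number, count) pairs, highest count first.
    parts = ', '.join(f"{number} - {n}" for number, n in counter.most_common())
    lines.append(f"{user_id}: {parts}")

print('\n'.join(lines))
```

Keeping a separate `Counter` per ID also avoids rescanning the list with `my_list.count` for every distinct number.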