I have a string that looks like this:
rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036
What I want to do is:
extract timestamp: 1340496000
event: EP002960010145
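The issue is that there is a %3D after tmsid, and I don't even know what that is. Sometimes it is %3D%6D, and I think it could even be %16D, but I can't confirm that.
Is there a robust way to extract these two fields from the string above?
Thanks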
Answer 0 (score: 3)
You are looking at URL-quoted (percent-encoded) data:
>>> from urllib2 import unquote
>>> unquote('rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036')
'rand_id:?tmsid=1340496000_EP002960010145_11_0_10050_1_2_10036'
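To see what each escape in the question stands for, it can help to decode them one at a time. This is a small sketch that assumes Python 3, where unquote lives in urllib.parse rather than urllib2; the names escapes and decoded are just illustrative:

```python
from urllib.parse import unquote

# Each %XX escape is one byte written as two hexadecimal digits.
escapes = ['%3A', '%3F', '%3D', '%6D']
decoded = {esc: unquote(esc) for esc in escapes}

for esc, char in decoded.items():
    print(esc, '->', repr(char))
```

So %3D is simply a quoted =, and %3D%6D would decode to =m.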
You can split on the first =, then split on _:
>>> unquoted = unquote('rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036')
>>> unquoted.split('=', 1)[1].split('_')
['1340496000', 'EP002960010145', '11', '0', '10050', '1', '2', '10036']
>>> timestamp, event = unquoted.split('=', 1)[1].split('_')[:2]
>>> timestamp, event
('1340496000', 'EP002960010145')
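The same steps could be wrapped in a small helper. The sketch below assumes Python 3 (urllib.parse instead of urllib2), and the function name extract_timestamp_and_event is made up for illustration:

```python
from urllib.parse import unquote

def extract_timestamp_and_event(raw):
    """Return (timestamp, event) from a URL-quoted tmsid string.

    Hypothetical helper: assumes the value after the first '='
    starts with '<timestamp>_<event>_'.
    """
    unquoted = unquote(raw)
    # Everything after the first '=' is the underscore-separated value.
    value = unquoted.split('=', 1)[1]
    timestamp, event = value.split('_')[:2]
    return timestamp, event

raw = 'rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036'
result = extract_timestamp_and_event(raw)
print(result)
```

Note that this raises an IndexError when the string contains no = at all, so input that may be malformed would need an extra check.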
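If the data has more than one field and you find an & in there too, you are probably better off parsing everything after the question mark as a URL query string instead, using urlparse.parse_qs():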
>>> from urlparse import parse_qs
>>> parse_qs(unquoted.split('?', 1)[1])
{'tmsid': ['1340496000_EP002960010145_11_0_10050_1_2_10036']}
>>> parsed = parse_qs(unquoted.split('?', 1)[1])
>>> timestamp, event = parsed['tmsid'][0].split('_', 2)[:2]
>>> timestamp, event
('1340496000', 'EP002960010145')
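For the multi-field case, here is a sketch under two assumptions: Python 3, where parse_qs is in urllib.parse, and an input with a second field. The trailing %26other%3Dvalue part is invented for illustration and does not come from the question:

```python
from urllib.parse import parse_qs, unquote

# Hypothetical input: the original string plus a quoted '&other=value'.
raw = ('rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036'
       '%26other%3Dvalue')

unquoted = unquote(raw)
# Parse everything after the first '?' as a query string.
parsed = parse_qs(unquoted.split('?', 1)[1])

# parse_qs maps each key to a list of values; take the first tmsid.
timestamp, event = parsed['tmsid'][0].split('_', 2)[:2]
print(timestamp, event)
```

Because parse_qs splits on & and = itself, the extra field ends up under its own key and does not interfere with tmsid.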