KeyError on missing parameters when using urlparse

Time: 2017-12-12 18:42:23

Tags: python urllib urlparse

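I want to parse the URLs in my log file, but some URLs are missing parameters, so I get a missing-parameter error while iterating over the log lines. I need to append a blank or null value to the parsed list in those cases so that I can convert it to a DataFrame.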

My data file (a log file):

"GET /pixel.gif?e=heartbeat&creative_id=33548&in_view_time=290"
"GET /pixel.gif?e=heartbeat&creative_id=33548&in_view_time=23988"
"GET /pixel.gif?e=heartbeat&creative_id=33548&in_view_time=19183"
"GET /pixel.gif?e=ad_load&creative_id=33548"

I want the output to be:

   E         | Creative ID | IN VIEW TIME
   heartbeat | 33548       | 290
   heartbeat | 33548       | 23988
   ad_load   | 33548       | null

My code:

parselist = []
for eachline in log.readlines():
    ip_regex = re.findall(r'(\d{18})', eachline)
    date = re.findall(r'([0-9]{4}\-[0-9]{2}\-[0-9]{2})',eachline)
    url = eachline
    parsed = urlparse.urlparse(url)
    parselist.append(ip_regex)
    parselist.append(date)
    parselist.append(urlparse.parse_qs(parsed.query)['e'])
    parselist.append(urlparse.parse_qs(parsed.query)['account_id'])
    parselist.append(urlparse.parse_qs(parsed.query)['impression_id'])
    parselist.append(urlparse.parse_qs(parsed.query)['campaign_id'])
    parselist.append(urlparse.parse_qs(parsed.query)['creative_id'])
    parselist.append(urlparse.parse_qs(parsed.query)['in_view_time'])

The error I get, because the in_view_time parameter is missing from the ad_load line:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-405c1bfb329e> in <module>()
     12     parselist.append(urlparse.parse_qs(parsed.query)['campaign_id'])
     13     parselist.append(urlparse.parse_qs(parsed.query)['creative_id'])
---> 14     parselist.append(urlparse.parse_qs(parsed.query)['in_view_time'])

KeyError: 'in_view_time'

2 Answers:

Answer 0: (score: 0)

You can use try/except:

import re
import urlparse  # Python 2 module; in Python 3 use urllib.parse instead

parselist = []
# log is assumed to be an already-open file object for the log file
for eachline in log.readlines():
    ip_regex = re.findall(r'(\d{18})', eachline)
    date = re.findall(r'([0-9]{4}\-[0-9]{2}\-[0-9]{2})', eachline)
    url = eachline
    parsed = urlparse.urlparse(url)
    params = urlparse.parse_qs(parsed.query)  # parse the query string once per line
    parselist.append(ip_regex)
    parselist.append(date)
    # Append each parameter, falling back to 'Null' when the key is missing
    try:
        parselist.append(params['e'])
    except KeyError:
        parselist.append('Null')
    try:
        parselist.append(params['account_id'])
    except KeyError:
        parselist.append('Null')
    try:
        parselist.append(params['impression_id'])
    except KeyError:
        parselist.append('Null')
    try:
        parselist.append(params['campaign_id'])
    except KeyError:
        parselist.append('Null')
    try:
        parselist.append(params['creative_id'])
    except KeyError:
        parselist.append('Null')
    try:
        parselist.append(params['in_view_time'])
    except KeyError:
        parselist.append('Null')

Or, more compactly:

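import re
import urlparse  # Python 2 module; in Python 3 use urllib.parse instead

parselist = []
# log is assumed to be an already-open file object for the log file
for eachline in log.readlines():
    ip_regex = re.findall(r'(\d{18})', eachline)
    date = re.findall(r'([0-9]{4}\-[0-9]{2}\-[0-9]{2})', eachline)
    url = eachline
    parsed = urlparse.urlparse(url)
    params = urlparse.parse_qs(parsed.query)  # parse the query string once per line
    parselist.append(ip_regex)
    parselist.append(date)

    # Same fallback as above, applied to every expected key in turn
    for key in ['e', 'account_id', 'impression_id', 'campaign_id', 'creative_id', 'in_view_time']:
        try:
            parselist.append(params[key])
        except KeyError:
            parselist.append('Null')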

As a suggestion, you could append None instead of the string 'Null'.
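The try/except around each lookup can also be replaced with dict.get, which returns a default instead of raising KeyError. The following is only a minimal sketch under a few assumptions: it uses the Python 3 location of parse_qs (urllib.parse) rather than the Python 2 urlparse module used above, the helper name extract_fields is made up for illustration, and it keeps only the first value of each parameter instead of the whole list that parse_qs returns.

```python
from urllib.parse import parse_qs  # Python 3; in Python 2 this lives in the urlparse module

KEYS = ['e', 'account_id', 'impression_id', 'campaign_id', 'creative_id', 'in_view_time']


def extract_fields(query, keys=KEYS):
    """Return one value per key, with None where the key is absent (hypothetical helper)."""
    params = parse_qs(query)  # maps each key that is present to a list of values
    # dict.get returns the default [None] instead of raising KeyError
    return [params.get(key, [None])[0] for key in keys]


row = extract_fields('e=ad_load&creative_id=33548')
print(row)  # ['ad_load', None, None, None, '33548', None]
```

Because every call returns a list of the same length, each log line becomes one row with a fixed number of columns.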

Answer 1: (score: -1)

  1. Why are you creating a list (where you lose the keys and store only the values)?
  2. If you are only interested in the values, you can simply write the following:

for v in urlparse.parse_qs(parsed.query).values():
    parselist.append(v)
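If the keys should not be lost, one option is to keep each line as a dict keyed by parameter name. This is a rough sketch, not the answerer's code: parse_log_line and COLUMNS are hypothetical names, the import is the Python 3 one (urllib.parse), and it assumes every line looks like the quoted sample lines in the question, with the HTTP method in front of the path.

```python
from urllib.parse import urlparse, parse_qs  # Python 3 equivalent of the urlparse module

COLUMNS = ['e', 'creative_id', 'in_view_time']


def parse_log_line(line, columns=COLUMNS):
    """Turn one quoted request line into a dict keyed by parameter name (hypothetical helper)."""
    request = line.strip().strip('"')         # drop the newline and the surrounding quotes
    parts = request.split(None, 1)            # separate the HTTP method from the rest
    target = parts[-1] if parts else ''       # keep only the path and query string
    params = parse_qs(urlparse(target).query)
    # Missing parameters become None, so every record has the same keys
    return {col: params.get(col, [None])[0] for col in columns}


lines = [
    '"GET /pixel.gif?e=heartbeat&creative_id=33548&in_view_time=19183"\n',
    '"GET /pixel.gif?e=ad_load&creative_id=33548"\n',
]
records = [parse_log_line(line) for line in lines]
print(records)
```

A list of dicts like records can be passed directly to pandas.DataFrame, which uses the keys as column names.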