Trying to parse URLs for UTM codes in Python

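Time: 2015-08-05 12:39:00

Tags: python parsing csv url

I have a list of URLs that I want to parse to find the utm codes in each one. First I want to find the unique keys that follow utm, e.g. utm_source, and create a new column from each one. What I'm ultimately after looks something like this: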

sourceUrl: https://website.com/donate?utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en

source: site

medium: email

campaign: campaign1

uuid: 999124

lang: en

Right now I have the following:

import pandas as pd

email_list = pd.read_csv('/Users/rethompsoniii/Documents/Work-Related/Jeb 2016/email_list_20150804.csv', sep=',', header=0, error_bad_lines=False, index_col=False, dtype='unicode')

url = email_list['SourceUrl']

utms = url.split("utm",1)[1]

print(utms)

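However, the utms line currently fails. I'm not looking for someone to hand me all the code, just to point me in the right direction. Many thanks.

4 Answers:

Answer 0: (score: 3)

You can use the urlparse library (in Python 3 it lives in urllib.parse).

First, use the urlparse.urlparse() function to parse the URL into its components.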

>>> import urlparse
>>> url = "https://website.com/donate?utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en"
>>> parsed_url = urlparse.urlparse(url)
>>> parsed_url
ParseResult(scheme='https', netloc='website.com', path='/donate', params='', query='utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en', fragment='')
>>> parsed_url.query
'utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en'

From the parsed URL, you can then use another function, urlparse.parse_qs(), to parse the query string:
>>> parsed_query = urlparse.parse_qs(parsed_url.query)
>>> parsed_query
{'lang': ['en'], 'utm_campaign': ['campaign1'], 'utm_medium': ['email'], 'uuid': ['999124'], 'utm_source': ['site']}
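For readers on Python 3, the same two steps can be sketched with urllib.parse, where these functions now live. The helper name query_params is hypothetical and only for illustration; it also flattens the one-element lists that parse_qs returns, which assumes each parameter appears at most once per URL.

```python
from urllib.parse import urlparse, parse_qs


def query_params(url):
    """Return the query parameters of url as a flat dict of strings."""
    query = urlparse(url).query
    # parse_qs maps each key to a list of values; keep only the first value.
    # (Assumption: no parameter is repeated within a single URL.)
    return {key: values[0] for key, values in parse_qs(query).items()}


url = ("https://website.com/donate?utm_source=site&utm_medium=email"
       "&utm_campaign=campaign1&uuid=999124&lang=en")
params = query_params(url)
print(params["utm_source"])  # site
```

If a parameter can legitimately repeat, keeping the full lists from parse_qs is probably the safer choice.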

Answer 1: (score: 1)

You can use a regular expression.

import re
m = re.findall(r'utm_(\w+)=(\w+)', 'https://website.com/donate?utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en')

'm' is now a list of tuples:

[('source', 'site'), ('medium', 'email'), ('campaign', 'campaign1')]

But do consider urlparse, which Peter Wood mentioned in the comments.
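One reason a URL parser may be preferable is that \w+ stops at the first character that is not a letter, digit, or underscore. The comparison below is a small Python 3 sketch; the URL is a made-up example (not from the question) whose values contain a hyphen and a percent-encoded space.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL: one value has a percent-encoded space, one has a hyphen.
url = "https://website.com/donate?utm_source=news%20letter&utm_campaign=spring-sale"

# The regex truncates each value at the first non-word character.
regex_pairs = re.findall(r"utm_(\w+)=(\w+)", url)
print(regex_pairs)  # [('source', 'news'), ('campaign', 'spring')]

# parse_qs splits on & and =, then percent-decodes the values.
parsed = parse_qs(urlsplit(url).query)
print(parsed)  # {'utm_source': ['news letter'], 'utm_campaign': ['spring-sale']}
```

The regex is fine when the values are known to be plain word characters, as in the question's sample URL.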

Answer 2: (score: 1)

You can use the Python urlparse library.

Example:

import urlparse
url = 'https://website.com/donate?utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en'
params = dict(urlparse.parse_qsl(urlparse.urlsplit(url).query))
new_params = {key[4:] if key.startswith('utm_') else key:value for key, value in params.iteritems()}
print new_params

Output:

{'lang': 'en', 'source': 'site', 'medium': 'email', 'uuid': '999124', 'campaign': 'campaign1'}
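The snippet above is Python 2 (the urlparse module, iteritems, the print statement). A rough Python 3 equivalent, assuming the same URL, might look like this:

```python
from urllib.parse import urlsplit, parse_qsl

url = ("https://website.com/donate?utm_source=site&utm_medium=email"
       "&utm_campaign=campaign1&uuid=999124&lang=en")

# parse_qsl returns (key, value) pairs; dict() keeps the last value per key.
params = dict(parse_qsl(urlsplit(url).query))

# Strip the "utm_" prefix where present and leave other keys untouched.
new_params = {
    (key[len("utm_"):] if key.startswith("utm_") else key): value
    for key, value in params.items()
}
print(new_params)
```

On Python 3.7 and later the dict keeps the order in which the parameters appear in the URL, so the printed key order differs from the Python 2 output shown above.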

Answer 3: (score: 1)

You can use the built-in urlparse library.

First, parse the URL:

>>> from urlparse import urlparse, parse_qs
>>> url = ('https://website.com/donate?utm_source=site&'
           'utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en')

>>> parsed = urlparse(url)
>>> parsed.query
'utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en'

Then parse the query string with urlparse.parse_qs:

>>> parse_qs(parsed.query)
{'lang': ['en'],
 'utm_campaign': ['campaign1'],
 'utm_medium': ['email'],
 'utm_source': ['site'],
 'uuid': ['999124']}
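To get from one parsed URL to "one column per parameter" across many URLs, the same idea can be applied to each URL in turn. This is a minimal Python 3 sketch using only the standard library; the urls list is hypothetical sample data standing in for the SourceUrl column, and first_values is an illustrative helper name.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical sample data standing in for the SourceUrl column.
urls = [
    "https://website.com/donate?utm_source=site&utm_medium=email&uuid=999124",
    "https://website.com/donate?utm_source=partner&utm_campaign=campaign1",
    "https://website.com/donate",
]


def first_values(url):
    """Parse the query string and keep the first value of each parameter."""
    return {k: v[0] for k, v in parse_qs(urlparse(url).query).items()}


parsed = [first_values(u) for u in urls]

# Union of every parameter name seen, sorted for a stable column order.
columns = sorted({key for row in parsed for key in row})

# One row per URL, with None where a URL lacks that parameter.
rows = [{col: row.get(col) for col in columns} for row in parsed]

print(columns)  # ['utm_campaign', 'utm_medium', 'utm_source', 'uuid']
print(rows[1])
```

With pandas, a list of dicts like rows can be passed to pd.DataFrame to get those columns. Note also that the original url.split(...) call fails because url is a pandas Series, which has no split method of its own; string methods sit under the .str accessor, or a function such as first_values can be applied per element with Series.apply.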