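I am trying to determine the variance of the length of parameter values, and to print that variance after the set of corresponding parameter/value combinations.
For example, the variance for date across date=2007-04-14 and date=2007-08-19 would be 0. The variance for id_eve across id_eve=479989, id_eve=47 and id_eve=479 would be 2.88.
Following on from "Group values with common domain and page values", we have a set of URLs that are parsed to give the parameter/value pairs for each group of URLs.
Sample data set: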
www.domain.com/page?id_eve=479989&adm=no
www.domain.com/page?id_eve=47&adm=yes
www.domain.com/page?id_eve=479
domain.com/cal?view=month
domain.com/cal?view=day
ww2.domain.com/cal?date=2007-04-14
ww2.domain.com/cal?date=2007-08-19
www.domain.edu/some/folder/image.php?l=adm&y=5&id=2&page=http%3A//support.domain.com/downloads/index.asp&unique=12345
blog.news.org/news/calendar.php?view=day&date=2011-12-10
www.domain.edu/some/folder/image.php?l=adm&y=5&id=2&page=http%3A//.domain.com/downloads/index.asp&unique=12345
blog.news.org/news/calendar.php?view=month&date=2011-12-10
It is parsed by the following Python 2 code:
from collections import defaultdict
from urllib import quote
from urlparse import parse_qsl, urlparse

urls = defaultdict(list)
with open('links.txt') as f:
    for url in f:
        parsed_url = urlparse(url.strip())
        params = parse_qsl(parsed_url.query, keep_blank_values=True)
        for key, value in params:
            urls[parsed_url.path].append("%s=%s" % (key, quote(value)))

# printing results
for url, params in urls.iteritems():
    print url
    for param in params:
        print param
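The parsing code gives the listing below. The additional part required is to print the variance of the parameter value lengths for each parameter, taken across the matching parameters of the URLs that share a group in that listing (hopefully that reads clearly).
So the desired output is this listing with the variance value printed after each set of corresponding parameter/value combinations: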
ww2.domain.com/cal
date=2007-04-14
date=2007-08-19
www.domain.edu/some/folder/image.php
l=adm
y=5
id=2
page=http%3A//support.domain.com/downloads/index.asp
unique=12345
l=adm
y=5
id=2
page=http%3A//.domain.com/downloads/index.asp
unique=12345
domain.com/cal
view=month
view=day
www.domain.com/page
id_eve=479989
adm=no
id_eve=47
adm=yes
id_eve=479
blog.news.org/news/calendar.php
view=day
date=2011-12-10
view=month
date=2011-12-10
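A side note on the group headers in the listing: they contain the host name only because the sample URLs have no scheme, so `urlparse` puts everything before the `?` into `path` and leaves `netloc` empty. The Python 3 sketch below illustrates this. `group_key` is a hypothetical helper, not part of the original code, and it assumes you would want the same header whether or not a scheme is present.

```python
from urllib.parse import urlparse

def group_key(url):
    """Host plus path, whether or not the URL carries a scheme."""
    parsed = urlparse(url.strip())
    return parsed.netloc + parsed.path

# Without a scheme, the host ends up inside `path`.
bare = urlparse("www.domain.com/page?id_eve=47&adm=yes")
print(repr(bare.netloc), bare.path)    # '' www.domain.com/page

# With a scheme, the host moves to `netloc` and `path` is only '/page'.
full = urlparse("http://www.domain.com/page?id_eve=47&adm=yes")
print(repr(full.netloc), full.path)    # 'www.domain.com' /page

print(group_key("www.domain.com/page?id_eve=47&adm=yes"))
print(group_key("http://www.domain.com/page?id_eve=47&adm=yes"))
```

If links.txt ever mixes both forms, grouping on `parsed_url.path` alone would merge pages from different hosts, so a key along these lines may be worth considering.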
Answer 0 (score: 3)
from collections import defaultdict
from urllib import quote
from urlparse import parse_qsl, urlparse
We need to be able to calculate the variance:
def variance(values):
    mean = sum(values) / float(len(values))
    return sum((elem - mean)**2 for elem in values) / float(len(values))
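As a quick check against the figures quoted in the question, here is a small Python 3 sketch of the same population variance applied to the value lengths. The names `date_lengths` and `id_eve_lengths` are illustrative only, and the comparison with `statistics.pvariance` is an added sanity check rather than part of the original answer.

```python
from statistics import pvariance

def variance(values):
    """Population variance: mean squared deviation from the mean."""
    values = list(values)  # also accepts iterators such as map(len, ...)
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

date_lengths = [len(v) for v in ("2007-04-14", "2007-08-19")]  # [10, 10]
id_eve_lengths = [len(v) for v in ("479989", "47", "479")]     # [6, 2, 3]

print(variance(date_lengths))              # 0.0
print(round(variance(id_eve_lengths), 2))  # 2.89

# The standard library's population variance should agree.
assert abs(variance(id_eve_lengths) - pvariance(id_eve_lengths)) < 1e-9
```

The lengths 6, 2 and 3 have mean 11/3, giving a variance of 26/9, roughly 2.889. The question's 2.88 is that value truncated rather than rounded.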
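We want to group by key, so we no longer add the "%s=%s" string to the defaultdict. Instead, each path maps to a second defaultdict that collects the quoted values per key: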
urls = defaultdict(lambda: defaultdict(list))
with open('links.txt') as f:
    for url in f:
        parsed_url = urlparse(url.strip())
        params = parse_qsl(parsed_url.query, keep_blank_values=True)
        for key, value in params:
            urls[parsed_url.path][key].append(quote(value))
Then we can print everything with:
for domain, keys in urls.items():
    print domain
    for key, values in keys.items():
        for value in values:
            print "%s=%s" % (key, value)
        if len(values) > 1:
            print variance(map(len, values))
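The code in this answer targets Python 2. For anyone on Python 3, the following is a rough port, offered as a sketch rather than a definitive version. It assumes the URLs come from an in-memory list instead of links.txt, and the helpers `group_values` and `report` are names introduced here for illustration. Note that `map` returns an iterator in Python 3, so `variance` converts its argument to a list before calling `len`.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, quote, urlparse

def variance(values):
    values = list(values)  # map() is lazy in Python 3
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def group_values(lines):
    """Map each path to {key: [quoted values]}, as in the answer."""
    urls = defaultdict(lambda: defaultdict(list))
    for line in lines:
        parsed = urlparse(line.strip())
        for key, value in parse_qsl(parsed.query, keep_blank_values=True):
            urls[parsed.path][key].append(quote(value))
    return urls

def report(urls):
    """Build the output lines (returned rather than printed, to ease testing)."""
    lines = []
    for path, keys in urls.items():
        lines.append(path)
        for key, values in keys.items():
            lines.extend("%s=%s" % (key, value) for value in values)
            if len(values) > 1:
                lines.append(str(variance(map(len, values))))
    return lines

# A subset of the question's sample data.
sample = [
    "www.domain.com/page?id_eve=479989&adm=no",
    "www.domain.com/page?id_eve=47&adm=yes",
    "www.domain.com/page?id_eve=479",
    "ww2.domain.com/cal?date=2007-04-14",
    "ww2.domain.com/cal?date=2007-08-19",
]
print("\n".join(report(group_values(sample))))
```

As in the original, values are listed per key within each path, and a variance line appears only when a key has more than one value.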