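I have a reverse proxy access log file that I have already reduced to URL matches and user types. I need to count how many times each user type hit a given URL. Sample data: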
http://find.galegroup.com:80/ staff
http://www.transnational-dispute-management.com:80/ student
https://www.investorstatelawguide.com:443/ AdjunctVisiting
https://www.jstor.org:443/ faculty
https://bmo.bmiresearch.com:443/ mainlibrary
https://heinonline.org:443/ oncampus
http://find.galegroup.com:80/ student
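I was thinking of representing each URL as a tuple with a counter for each user type. As each line is read, it would be tested against the URLs seen so far. If there is no match, a new tuple is started. If there is a match, the appropriate counter is incremented and the tuple is saved again.
At the end, all of the tuples are written to a new file.
The problem is that I don't know how to implement this.
Pointers, general strategies, and answers are all greatly appreciated!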
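The strategy described above maps fairly directly onto a dictionary of counters. The following is a minimal sketch, assuming each input line holds one URL and one user type separated by whitespace; the names `count_hits` and `write_report` are hypothetical and not part of the original question.

```python
import sys
from collections import Counter, defaultdict


def count_hits(lines):
    """Return {url: Counter({user_type: hits})} built from 'URL user_type' lines."""
    hits = defaultdict(Counter)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines (assumed input format)
        url, user_type = parts
        # A URL seen for the first time gets a fresh Counter automatically.
        hits[url][user_type] += 1
    return hits


def write_report(hits, out):
    """Write one tab-separated 'url, user_type, count' row per pair to a file-like object."""
    for url in sorted(hits):
        for user_type, n in sorted(hits[url].items()):
            out.write(f"{url}\t{user_type}\t{n}\n")


sample = [
    "http://find.galegroup.com:80/ staff",
    "http://www.transnational-dispute-management.com:80/ student",
    "https://www.investorstatelawguide.com:443/ AdjunctVisiting",
    "https://www.jstor.org:443/ faculty",
    "https://bmo.bmiresearch.com:443/ mainlibrary",
    "https://heinonline.org:443/ oncampus",
    "http://find.galegroup.com:80/ student",
]

hits = count_hits(sample)
write_report(hits, sys.stdout)
```

To read from a log and write to a new file, an open file object can be passed in place of the list and of `sys.stdout`, since `count_hits` only iterates over lines and `write_report` only calls `write`.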
Answer 0 (score: 0)
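If you wish to use a regular expression for this task, you can define a simple expression with an alternation, such as
(galegroup\.com)|(jstor\.org)|(investorstatelawguide\.com)
to capture the domains of interest, which can then be counted: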
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(galegroup\.com)|(jstor\.org)|(investorstatelawguide\.com)"
test_str = ("http://find.galegroup.com:80/ staff http://www.transnational-dispute-management.com:80/ student https://www.investorstatelawguide.com:443/ AdjunctVisiting https://www.jstor.org:443/ faculty https://bmo.bmiresearch.com:443/ mainlibrary https://heinonline.org:443/ oncampus http://find.galegroup.com:80/ student\n"
"http://find.galegroup.com:80/ staff http://www.transnational-dispute-management.com:80/ student https://www.investorstatelawguide.com:443/ AdjunctVisiting https://www.jstor.org:443/ faculty https://bmo.bmiresearch.com:443/ mainlibrary https://heinonline.org:443/ oncampus http://find.galegroup.com:80/ student\n"
"http://find.galegroup.com:80/ staff http://www.transnational-dispute-management.com:80/ student https://www.investorstatelawguide.com:443/ AdjunctVisiting https://www.jstor.org:443/ faculty https://bmo.bmiresearch.com:443/ mainlibrary https://heinonline.org:443/ oncampus http://find.galegroup.com:80/ student")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
The regular expression can be visualized with jex.im.
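The script above only prints where each match occurs. One possible way to turn the same alternation into per-user-type counts is sketched below. It assumes the user type is the whitespace-delimited token right after the URL, and it extends the pattern with a port and a second capture group; that extension and the name `count_domain_hits` are assumptions, not part of the original answer.

```python
import re
from collections import Counter

# Same three domains as in the answer, folded into one group, followed by
# ":port/" and a second group for the user type (an assumed extension).
PATTERN = re.compile(
    r"(galegroup\.com|jstor\.org|investorstatelawguide\.com):\d+/\s+(\S+)"
)


def count_domain_hits(text):
    """Count (domain, user_type) pairs for the domains named in PATTERN."""
    # With two groups, findall returns a list of (domain, user_type) tuples.
    return Counter(PATTERN.findall(text))


sample = (
    "http://find.galegroup.com:80/ staff\n"
    "http://www.transnational-dispute-management.com:80/ student\n"
    "https://www.investorstatelawguide.com:443/ AdjunctVisiting\n"
    "https://www.jstor.org:443/ faculty\n"
    "https://bmo.bmiresearch.com:443/ mainlibrary\n"
    "https://heinonline.org:443/ oncampus\n"
    "http://find.galegroup.com:80/ student\n"
)

counts = count_domain_hits(sample)
for (domain, user_type), n in sorted(counts.items()):
    print(domain, user_type, n)
```

URLs whose domain is not listed in the pattern are simply ignored, so this variant fits best when only a known set of domains matters.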