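I wrote Python code to extract keys from a log. With the same log, it runs fine on a single machine, but it fails when I run it in Hadoop. I suspect something is wrong with how I use regex. Can anyone give me some comments? Is regex not supported in Hadoop?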
This Python code is meant to extract qry and rc, sum the values of rc, and print the result as qry query_count rc_count. When run in Hadoop, it reports java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1.
I searched Google, and the results say there may be an error in the mapper code. How can I fix it?
The log format looks like this:
NOTICE: 01-03 23:57:23: [a.cpp][b][222] show_ver=11 sid=ae1d esid=6WVj uid=D1 a=20 qry=cars qid0=293 loc_src=4 phn=0 mid=0 wvar=c op=0 qry_src=0 op_type=1 src=110|120|111 at=60942 rc=3|1|1 discount=20 indv_type=0 rep_query=
My Python code is this:
import sys
import re

for line in sys.stdin:
    count_result = 0
    line = line.strip()
    match=re.search('.*qry=(.*?)qid0.*rc=(.*?)discount',line).groups()
    if (len(match)<2):
        continue
    counts_tmp = match[1].strip()
    counts=counts_tmp.split('|')
    for count in counts:
        if count.isdigit():
            count_result += int(count)
    key_tmp = match[0].strip()
    if key_tmp.strip():
        key = key_tmp.split('\t')
        key = ' '.join(key)
        print '%s\t%s\t%s' %(key,1,count_result)
Answer 0 (score: 1)
Most likely your regex captures more than you expect. I suggest splitting it into simpler parts, such as:
(?<= qry=).*(?= qid0)
and
(?<= rc=).*(?= discount)
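A minimal sketch of this suggestion, assuming each field appears as key=value and that qry is followed by qid0 and rc by discount, as in the sample line. The helper name `extract` and the two pattern names are hypothetical. Note that `re.search` returns `None` when a line does not match, so calling `.groups()` directly on its result raises `AttributeError` and the script exits with a non-zero code. That is one plausible cause of the Hadoop failure if the full input contains lines without these fields.

```python
import re

# Two small patterns instead of one long one (assumed log layout:
# " qry=<value> qid0=..." and " rc=<value> discount=...").
QRY_RE = re.compile(r"(?<= qry=).*?(?= qid0=)")
RC_RE = re.compile(r"(?<= rc=).*?(?= discount=)")


def extract(line):
    """Return (qry, rc_sum) for a log line, or None if a field is missing."""
    qry_match = QRY_RE.search(line)
    rc_match = RC_RE.search(line)
    # re.search returns None on no match, so check before using the result.
    if qry_match is None or rc_match is None:
        return None
    qry = qry_match.group(0).strip()
    if not qry:
        return None
    # Sum only the numeric parts of the '|'-separated rc field.
    parts = rc_match.group(0).split("|")
    rc_sum = sum(int(p) for p in parts if p.strip().isdigit())
    return qry, rc_sum
```

In a streaming mapper this would be called for each line of `sys.stdin`, skipping lines for which it returns `None`.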
Answer 1 (score: 0)
Making a lot of assumptions and venturing an educated guess, you could parse your log like this:
from collections import defaultdict
input = """NOTICE: 01-03 23:57:23: [a.cpp][b][222] show_ver=11 sid=ae1d esid=6WVj uid=D1 a=20 qry=cars qid0=293 loc_src=4 phn=0 mid=0 wvar=c op=0 qry_src=0 op_type=1 src=110|120|111 at=60942 rc=3|1|1 discount=20 indv_type=0 rep_query=
NOTICE: 01-03 23:57:23: [a.cpp][b][222] show_ver=11 sid=ae1d esid=6WVj uid=D1 a=20 qry=boats qid0=293 loc_src=4 phn=0 mid=0 wvar=c op=0 qry_src=0 op_type=1 src=110|120|111 at=60942 rc=3|5|2 discount=20 indv_type=0 rep_query=
NOTICE: 01-03 23:57:23: [a.cpp][b][222] show_ver=11 sid=ae1d esid=6WVj uid=D1 a=20 qry=cars qid0=293 loc_src=4 phn=0 mid=0 wvar=c op=0 qry_src=0 op_type=1 src=110|120|111 at=60942 rc=3|somestring|12 discount=20 indv_type=0 rep_query="""
d = defaultdict(int)  # qry -> summed rc values
for line in input.split("\n"):
    tokens = line.split(" ")
    count = 0
    qry = None
    for token in tokens:
        pair = token.split("=")
        if len(pair) != 2:
            continue  # not a simple key=value token
        key, value = pair
        if key == "qry":
            qry = value
        if key == "rc":
            values = value.split("|")
            for value in values:
                try:
                    count += int(value)
                except ValueError:
                    pass  # skip non-numeric parts such as "somestring"
    if qry:
        d[qry] += count
print(d)
This assumes that (a) key-value pairs are separated by spaces and (b) neither keys nor values contain spaces.
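Under the same two assumptions, the token-splitting approach can be adapted to a streaming mapper that emits the qry, query_count, rc_count format from the question. This is only a sketch: the function names `parse_line` and `mapper_output` are hypothetical, and it uses `str.partition` rather than the `split("=")` length check above, so a token with more than one "=" is kept instead of skipped.

```python
def parse_line(line):
    """Return (qry, rc_sum) from space-separated key=value tokens, or None."""
    qry = None
    rc_sum = 0
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # not a key=value token
        if key == "qry":
            qry = value
        elif key == "rc":
            for part in value.split("|"):
                try:
                    rc_sum += int(part)
                except ValueError:
                    pass  # ignore non-numeric parts
    if not qry:
        return None  # no usable key on this line
    return qry, rc_sum


def mapper_output(lines):
    """Yield 'qry<TAB>1<TAB>rc_sum' records for every parsable line."""
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            yield "%s\t1\t%d\n" % parsed
```

As the script body, `sys.stdout.writelines(mapper_output(sys.stdin))` (after `import sys`) would write the records. Piping a sample of the log through the script locally before submitting the job is a quick way to surface an exception that Hadoop would otherwise report only as a failed subprocess.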