Question

I am trying to read log lines by forming key-value pairs, but I am getting an error. Here is my code:

logLine=sc.textFile("C:\TestLogs\testing.log").cache() 
lines = logLine.flatMap(lambda x: x.split('\n'))
rx = "(\\S+)=(\\S+)" 
line_collect = lines.collect() 
for line in line_collect :  
    d = dict([(x,y) for x,y in re.findall(rx,line)])    
    d = str(d)  
    print d

Error:

line_collect = lines.collect() ...... InvalidInputException: Input path does not exist: file:/C:/TestLogs\testing.log

I don't know how to fix this. I am new to Python and Spark.

Answer 1

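When the character sequence \t appears in an ordinary string literal, it is replaced with a TAB character, so the path Spark receives is no longer the one you typed. You can actually see this in the error message.

I suggest always using a forward slash / as the directory separator, even on Windows. Alternatively, prefix the string with r, like this: r"does not replace \t with <tab>.".

You may want to read up on string literals: https://docs.python.org/2.0/ref/strings.html.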
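The effect is easy to check in plain Python without Spark. The snippet below is only a minimal sketch: the variable names are made up for illustration, and the first literal is a shortened form of the path from the question.

```python
# Shortened form of the path from the question: "\t" is parsed as a TAB.
broken = "TestLogs\testing.log"

# Two spellings that keep the path intact (variable names are illustrative).
forward = "C:/TestLogs/testing.log"    # forward slashes, no escapes involved
raw = r"C:\TestLogs\testing.log"       # raw string, backslashes kept as-is

print(repr(broken))                    # repr makes the hidden TAB visible
print("\t" in broken)                  # True: the literal contains a TAB
print("\t" in forward, "\t" in raw)    # False False
```

Either of the last two spellings can be passed to sc.textFile in place of the original literal.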

Answer 2

Try replacing

logLine=sc.textFile("C:\TestLogs\testing.log").cache()

with

logLine=sc.textFile("C:\\TestLogs\\testing.log").cache()

Inside a string literal, a backslash character is written as "\\", not as '\'.
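As a quick sanity check, and only as a sketch with illustrative variable names, the doubled backslash can be shown to stand for a single character, so the escaped literal is the same string as the raw-string spelling:

```python
# "\\" in source code is one backslash character in the resulting string.
escaped = "C:\\TestLogs\\testing.log"
raw = r"C:\TestLogs\testing.log"

print(len("\\"))         # 1
print(escaped == raw)    # True: both spell the same path
print(escaped)           # shows the path with single backslashes
```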

