Extract fields from a log file where the data is stored half as JSON and half as plain text

Date: 2017-11-21 12:57:09

Tags: json pyspark

I am new to Spark and want to read a log file and create a dataframe out of it. My data is half JSON, and I cannot convert it into a dataframe properly. Here is the first row in the file;

[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}

Seeing that the first part is plain text and the last part between {} is JSON, I tried a few things: first converting it into an RDD, then mapping and splitting, then converting back to a DataFrame, but I cannot extract the values from the JSON part. Is there a trick to extract the fields in this context?

The final output will be;

[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"} 

2 answers:

Answer 0: (score: 0)

You just need to parse the pieces out into a tuple in a Python function and then tell Spark to convert the RDD into a dataframe. The easiest way to do that is probably a regular expression. For example:

import re
import json

def parse(row):
    # One named group per plain-text field; the trailing group captures the JSON blob.
    pattern = ' '.join([
        r'\[(?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\]',
        r'userid:(?P<userid>\d+)',
        r'(?P<ip>\d+\.\d+\.\d+\.\d+)',
        r'(?P<level>\w+)',
        r'(?P<json>.+$)'
    ])
    match = re.match(pattern, row)
    # Parse the JSON tail separately, then flatten everything into one tuple.
    parsed_json = json.loads(match.group('json'))
    return (match.group('ts'), match.group('userid'), match.group('ip'),
            match.group('level'), parsed_json['artist'], parsed_json['song'],
            parsed_json['service'])


lines = [
'[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}'
]


rdd = sc.parallelize(lines)
df = rdd.map(parse).toDF(['ts', 'userid', 'ip', 'level', 'artist', 'song', 'service'])

df.show()

This prints:

+-------------------+------+-----------+-----+---------------+--------------------+-------+
|                 ts|userid|         ip|level|         artist|                song|service|
+-------------------+------+-----------+-----+---------------+--------------------+-------+
|2017-01-06 07:00:01|444444|11.11.111.0| info|Tears For Fears|Everybody Wants T...|pandora|
+-------------------+------+-----------+-----+---------------+--------------------+-------+
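Not part of the original answer, but the same extraction can also stay entirely in the DataFrame API: regexp_extract pulls out the plain-text fields and get_json_object reads values from the JSON tail. A minimal sketch, assuming the log has been read with spark.read.text into a column named value (the file path here is hypothetical):

from pyspark.sql.functions import regexp_extract, get_json_object

raw = spark.read.text('logfile.txt')            # hypothetical path; one log line per row in column 'value'
json_blob = regexp_extract('value', r'(\{.+\})', 1)   # the JSON portion of the line

df = raw.select(
    regexp_extract('value', r'\[(.+?)\]', 1).alias('ts'),
    regexp_extract('value', r'userid:(\d+)', 1).alias('userid'),
    regexp_extract('value', r'(\d+\.\d+\.\d+\.\d+)', 1).alias('ip'),
    get_json_object(json_blob, '$.artist').alias('artist'),
    get_json_object(json_blob, '$.song').alias('song'),
    get_json_object(json_blob, '$.service').alias('service'),
)
df.show()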

Answer 1: (score: 0)

I used the following, just doing some of the parsing with pyspark power;

from pyspark.sql.types import StructField, StringType

# r1 is assumed to be the rows read from the log file (each element exposes .value);
# every separator in the line is rewritten to '###' so a single split() yields the fields.
parts = r1.map(lambda x: x.value.replace('[', '').replace('] ', '###')
     .replace(' userid:', '###').replace('null', '"null"').replace('""', '"NA"')
     .replace(' music_info {"artist":"', '###').replace('","album":"', '###')
     .replace('","song":"', '###').replace('","id":"', '###')
     .replace('","service":"', '###').replace('"}', '###').split('###'))
people = parts.map(lambda p: (p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7]))

schemaString = "timestamp mac userid_ip artist album song id service"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
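The snippet above stops short of actually building the DataFrame. Presumably the remaining step is to wrap fields into a schema and apply it; a minimal sketch, assuming a SparkSession named spark is available:

from pyspark.sql.types import StructType

schema = StructType(fields)                 # wrap the StructField list into a schema
df = spark.createDataFrame(people, schema)  # apply it to the tuples parsed above
df.show()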

With this I got almost everything I wanted, and the performance is very fast.

+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
|          timestamp|              mac|           userid_ip|              artist|               album|                song|                  id|service|
+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
|2017-01-01 00:00:00|00:00:00:00:00:00|111122  22.235.17...|The United States...|  This Is Christmas!|Do You Hear What ...|            S1112536|pandora|
|2017-01-01 00:00:00|00:11:11:11:11:11|123123 108.252.2...|                  NA|  Dinner Party Radio|                  NA|                null|pandora|