I am new to Spark and want to read a log file and create a dataframe from it. My data is semi-JSON, and I cannot get it converted into a dataframe correctly. Here is the first line from the file;
[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}

Seeing that the first part is plain text and the last part between {} is JSON, I tried a few things: first converting it to an RDD, then mapping and splitting it, and then converting it back to a DataFrame, but I could not extract the values from the JSON part. Is there a trick for extracting those fields in this context? The final output should be a dataframe with each of those fields as a column.
Answer 0 (score: 0)
You just need to parse each line into a tuple with a Python function and then tell Spark to convert the RDD into a dataframe. The easiest way to do the parsing is probably a regular expression. For example:
import re
import json

def parse(row):
    # One regex with a named group per field in the line.
    pattern = ' '.join([
        r'\[(?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\]',
        r'userid:(?P<userid>\d+)',
        r'(?P<ip>\d+\.\d+\.\d+\.\d+)',
        r'(?P<level>\w+)',
        r'(?P<json>.+$)'
    ])
    match = re.match(pattern, row)
    # The last group is plain JSON, so hand it to json.loads.
    parsed_json = json.loads(match.group('json'))
    return (match.group('ts'), match.group('userid'), match.group('ip'),
            match.group('level'), parsed_json['artist'], parsed_json['song'],
            parsed_json['service'])

lines = [
    '[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}'
]

rdd = sc.parallelize(lines)
df = rdd.map(parse).toDF(['ts', 'userid', 'ip', 'level', 'artist', 'song', 'service'])
df.show()
This prints:
+-------------------+------+-----------+-----+---------------+--------------------+-------+
|                 ts|userid|         ip|level|         artist|                song|service|
+-------------------+------+-----------+-----+---------------+--------------------+-------+
|2017-01-06 07:00:01|444444|11.11.111.0| info|Tears For Fears|Everybody Wants T...|pandora|
+-------------------+------+-----------+-----+---------------+--------------------+-------+
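One caveat worth flagging with this approach: re.match returns None when a line does not fit the pattern, so parse as written will raise AttributeError on any malformed row and fail the whole job. A minimal defensive sketch (parse_safe is an illustrative name; it rebuilds the same pattern and reuses the rdd from above):

import re
import json

# Same pattern as in parse, hoisted to module scope.
pattern = ' '.join([
    r'\[(?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\]',
    r'userid:(?P<userid>\d+)',
    r'(?P<ip>\d+\.\d+\.\d+\.\d+)',
    r'(?P<level>\w+)',
    r'(?P<json>.+$)'
])

def parse_safe(row):
    match = re.match(pattern, row)
    if match is None:
        return None  # malformed line; filtered out below instead of raising
    fields = json.loads(match.group('json'))
    return (match.group('ts'), match.group('userid'), match.group('ip'),
            match.group('level'), fields['artist'], fields['song'], fields['service'])

df = (rdd.map(parse_safe)
         .filter(lambda t: t is not None)
         .toDF(['ts', 'userid', 'ip', 'level', 'artist', 'song', 'service']))

Dropping the None rows before toDF keeps a single bad log line from killing the whole conversion.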
Answer 1 (score: 0)
I used the following, just doing some of the parsing with pyspark itself;
from pyspark.sql.types import StructType, StructField, StringType

# r1 holds the raw log lines, e.g. r1 = spark.read.text("logfile").rdd,
# so each record is a Row with a single 'value' field.
parts = r1.map(lambda x: x.value.replace('[', '').replace('] ', '###')
               .replace(' userid:', '###').replace('null', '"null"').replace('""', '"NA"')
               .replace(' music_info {"artist":"', '###').replace('","album":"', '###')
               .replace('","song":"', '###').replace('","id":"', '###')
               .replace('","service":"', '###').replace('"}', '###').split('###'))
people = parts.map(lambda p: (p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7]))

schemaString = "timestamp mac userid_ip artist album song id service"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
df = spark.createDataFrame(people, schema)
With this I got almost exactly what I wanted, and the performance was very fast.
+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
|          timestamp|              mac|           userid_ip|              artist|               album|                song|                  id|service|
+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
|2017-01-01 00:00:00|00:00:00:00:00:00| 111122 22.235.17...|The United States...|  This Is Christmas!|Do You Hear What ...|            S1112536|pandora|
|2017-01-01 00:00:00|00:11:11:11:11:11| 123123 108.252.2...|                  NA|  Dinner Party Radio|                  NA|                null|pandora|
+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
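For completeness, the JSON fields can also be pulled out with Spark's built-in JSON helpers instead of chained replace calls. This is only a sketch under two assumptions: the raw file is loaded with spark.read.text (giving a single string column named value), and get_json_object from pyspark.sql.functions is available (Spark 1.6+); the lines variable and the JSON paths are illustrative:

from pyspark.sql.functions import regexp_extract, get_json_object

# lines = spark.read.text("logfile")  -- one string column named 'value'
json_part = regexp_extract('value', r'(\{.*\})', 1)  # the {...} tail of each line

df = lines.select(
    regexp_extract('value', r'\[(.*?)\]', 1).alias('timestamp'),
    regexp_extract('value', r'userid:(\d+)', 1).alias('userid'),
    get_json_object(json_part, '$.artist').alias('artist'),
    get_json_object(json_part, '$.album').alias('album'),
    get_json_object(json_part, '$.song').alias('song'),
    get_json_object(json_part, '$.id').alias('id'),
    get_json_object(json_part, '$.service').alias('service'),
)
df.show()

Missing fields simply come back as null here, rather than needing the '""' to 'NA' rewriting done by hand above.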