Building on the previous answer:
It looks like flatMap can be used to "map and filter out the error cases" in a single operation.
Given the sample data:
spark.read.text("/mnt/seedx-ops-prod/genee-local-datasync/genee-3/genee/logs/genee_python-20190417T075453.005.log").show(4, False)
+---------------------------------------------------------------------------------------------+
|value |
+---------------------------------------------------------------------------------------------+
|2019-04-17 07:54:51.505: 2019-04-17 10:54:51 INFO [main.py:64] Read machine_conf.ini |
|2019-04-17 07:54:52.271: 2019-04-17 10:54:52 INFO [app.py:93] Running web server on port 9090|
|2019-04-17 08:05:10.720: 2019-04-17 11:05:10 INFO [app.py:166] Exiting event loop... |
|2019-04-17 08:05:10.720: <_WindowsSelectorEventLoop running=False closed=False debug=False> |
+---------------------------------------------------------------------------------------------+
I would like the first three lines to parse successfully, and the parse error on the fourth line to produce no result.
import re
from pyspark.sql import Row

def parseTheNonSuckingDaemonPythonLogs(row):
    try:
        parts = re.findall(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,3}): (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ([A-Za-z]{1,5}) (.*)', row.value)[0]
        return Row(os_ts=parts[0], log_ts=parts[1], log_level=parts[2], message=parts[3])
    except IndexError:
        return Row()
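As a quick sanity check outside Spark (a standalone sketch using one good line and the fourth line from the sample data above), the regex does match the normal log lines and comes back empty on the event-loop line, so indexing with [0] raises IndexError there:

```python
import re

PATTERN = r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,3}): (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ([A-Za-z]{1,5}) (.*)'

good = "2019-04-17 07:54:51.505: 2019-04-17 10:54:51 INFO [main.py:64] Read machine_conf.ini"
bad = "2019-04-17 08:05:10.720: <_WindowsSelectorEventLoop running=False closed=False debug=False>"

# A parseable line yields one 4-tuple: (os_ts, log_ts, log_level, message)
print(re.findall(PATTERN, good)[0])

# The event-loop line yields an empty list, so [0] would raise IndexError
print(re.findall(PATTERN, bad))
```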
The expected result is:
+-----------------------+-------------------+---------+-------------------------------------------+
|os_ts |log_ts |log_level|message |
+-----------------------+-------------------+---------+-------------------------------------------+
|2019-04-17 07:54:51.505|2019-04-17 10:54:51|INFO |[main.py:64] Read machine_conf.ini |
|2019-04-17 07:54:52.271|2019-04-17 10:54:52|INFO |[app.py:93] Running web server on port 9090|
|2019-04-17 08:05:10.720|2019-04-17 11:05:10|INFO |[app.py:166] Exiting event loop... |
+-----------------------+-------------------+---------+-------------------------------------------+
The actual result is as follows:
from pyspark.sql import Row

genee3_python_logs_text = spark.read.text("/mnt/seedx-ops-prod/genee-local-datasync/genee-3/genee/logs/genee_python-20190417T075453.005.log")
clean_genee3_python_logs = genee3_python_logs_text.rdd.flatMap(parseTheNonSuckingDaemonPythonLogs)
row = Row("val")
genee3_python_logs_df = clean_genee3_python_logs.map(row).toDF()
genee3_python_logs_df.select('*').show(truncate=False)
+-------------------------------------------+
|val |
+-------------------------------------------+
|INFO |
|2019-04-17 10:54:51 |
|[main.py:64] Read machine_conf.ini |
|2019-04-17 07:54:51.505 |
|INFO |
|2019-04-17 10:54:52 |
|[app.py:93] Running web server on port 9090|
|2019-04-17 07:54:52.271 |
|INFO |
|2019-04-17 11:05:10 |
|[app.py:166] Exiting event loop... |
|2019-04-17 08:05:10.720 |
+-------------------------------------------+
Answer 0 (score: 0)
I think I have managed to get this working, but I am still not sure whether there is a cleaner, more functional way to express the transformation.
import re
from pyspark.sql import Row

def parseTheNonSuckingDaemonPythonLogs(row):
    try:
        parts = re.findall(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,3}): (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ([A-Za-z]{1,5}) (.*)', row.value)[0]
        # Wrap the record in an outer one-element Row so flatMap emits the
        # whole record instead of iterating over its individual fields.
        return Row(Row(os_ts=parts[0], log_ts=parts[1], log_level=parts[2], message=parts[3]))
    except IndexError:
        # An empty Row iterates to nothing, so unparseable lines vanish.
        return Row()
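The reason the original version exploded into one output row per field is that pyspark's Row is a tuple subclass: flatMap iterates whatever the function returns, and iterating a Row yields its field values. Wrapping the inner Row in an outer one-element Row makes that iteration yield a single structured record instead. A minimal stand-in using namedtuple (an assumption standing in for Row here, but both are tuple subclasses and iterate the same way) shows the difference:

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row; both are tuple subclasses.
LogRow = namedtuple("LogRow", ["os_ts", "log_ts", "log_level", "message"])

rec = LogRow("2019-04-17 07:54:51.505", "2019-04-17 10:54:51",
             "INFO", "[main.py:64] Read machine_conf.ini")

# What flatMap sees when the function returns the record directly:
print(list(rec))        # 4 separate field values -> 4 output records

# What it sees when the record is wrapped in a one-element container:
print(list((rec,)))     # 1 value (the whole record) -> 1 output record
```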
genee3_python_logs_df = clean_genee3_python_logs.toDF()
genee3_python_logs_df.show(truncate=False)
+---------+-------------------+-------------------------------------------+-----------------------+
|log_level|log_ts |message |os_ts |
+---------+-------------------+-------------------------------------------+-----------------------+
|INFO |2019-04-17 10:54:51|[main.py:64] Read machine_conf.ini |2019-04-17 07:54:51.505|
|INFO |2019-04-17 10:54:52|[app.py:93] Running web server on port 9090|2019-04-17 07:54:52.271|
|INFO |2019-04-17 11:05:10|[app.py:166] Exiting event loop... |2019-04-17 08:05:10.720|
+---------+-------------------+-------------------------------------------+-----------------------+
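A more conventional way to express this (a sketch, assuming clean_genee3_python_logs is built with flatMap as above) is to have the parse function return a list: one element on success, empty on a parse failure. flatMap drops the empty lists, so bad lines simply disappear, without the empty-Row trick. The sketch below uses SimpleNamespace and dict as stand-ins for the input Row and the output Row so it runs outside Spark; in the real job you would accept the text Rows and return [Row(...)] instead:

```python
import re
from types import SimpleNamespace  # stands in for an input Row with a .value field

PATTERN = r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,3}): (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ([A-Za-z]{1,5}) (.*)'

def parse_log_line(row):
    # One-element list on success, empty list on failure;
    # flatMap flattens these, so failures produce no output row.
    m = re.match(PATTERN, row.value)
    if m is None:
        return []
    os_ts, log_ts, log_level, message = m.groups()
    return [dict(os_ts=os_ts, log_ts=log_ts, log_level=log_level, message=message)]

# With Spark this would be:
#   genee3_python_logs_text.rdd.flatMap(parse_log_line).toDF()

good = SimpleNamespace(value="2019-04-17 07:54:51.505: 2019-04-17 10:54:51 INFO [main.py:64] Read machine_conf.ini")
bad = SimpleNamespace(value="2019-04-17 08:05:10.720: <_WindowsSelectorEventLoop running=False closed=False debug=False>")
print(parse_log_line(good))  # one parsed record
print(parse_log_line(bad))   # empty list -> no output row
```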