flatMap creates multiple rows in the DataFrame

Time: 2019-07-20 12:10:05

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

Based on a previous answer, it seems that flatMap can be used to "map and filter out error cases" in a single operation.

Given the sample data:

spark.read.text("/mnt/seedx-ops-prod/genee-local-datasync/genee-3/genee/logs/genee_python-20190417T075453.005.log").show(4, False)

+---------------------------------------------------------------------------------------------+
|value                                                                                        |
+---------------------------------------------------------------------------------------------+
|2019-04-17 07:54:51.505: 2019-04-17 10:54:51 INFO [main.py:64] Read machine_conf.ini         |
|2019-04-17 07:54:52.271: 2019-04-17 10:54:52 INFO [app.py:93] Running web server on port 9090|
|2019-04-17 08:05:10.720: 2019-04-17 11:05:10 INFO [app.py:166] Exiting event loop...         |
|2019-04-17 08:05:10.720: <_WindowsSelectorEventLoop running=False closed=False debug=False>  |
+---------------------------------------------------------------------------------------------+

I expect the first three lines to parse successfully and the fourth line to hit a parse error and produce no result.

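import re
from pyspark.sql import Row

def parseTheNonSuckingDaemonPythonLogs(row):
  try:
    # findall returns one tuple of groups per match; [0] raises IndexError when the line does not match
    parts = re.findall(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,3}): (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ([A-Za-z]{1,5}) (.*)', row.value)[0]
    return Row(os_ts=parts[0], log_ts=parts[1], log_level=parts[2], message=parts[3])
  except IndexError:
    # Unparseable line: return an empty Row so that flatMap emits nothing for it
    return Row()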
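As a quick sanity check of the expectation stated above, the pattern can be run against the four sample lines outside Spark. This is only a sketch: `LOG_PATTERN`, `SAMPLE_LINES` and `matched` are names introduced here for illustration, and the dot before the milliseconds is escaped so that it matches a literal `.` only.

```python
import re

# Pattern from the parsing function, with the millisecond separator escaped.
LOG_PATTERN = re.compile(
    r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,3}): '
    r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ([A-Za-z]{1,5}) (.*)'
)

# The four lines shown in the sample data.
SAMPLE_LINES = [
    "2019-04-17 07:54:51.505: 2019-04-17 10:54:51 INFO [main.py:64] Read machine_conf.ini",
    "2019-04-17 07:54:52.271: 2019-04-17 10:54:52 INFO [app.py:93] Running web server on port 9090",
    "2019-04-17 08:05:10.720: 2019-04-17 11:05:10 INFO [app.py:166] Exiting event loop...",
    "2019-04-17 08:05:10.720: <_WindowsSelectorEventLoop running=False closed=False debug=False>",
]

# True where the line parses, False where it does not.
matched = [LOG_PATTERN.match(line) is not None for line in SAMPLE_LINES]
print(matched)  # [True, True, True, False]
```

The fourth line fails because the text after the first timestamp does not start with a second timestamp, so the regular expression itself behaves as intended.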

The expected result is:

+-----------------------+-------------------+---------+-------------------------------------------+
|os_ts                  |log_ts             |log_level|message                                    |
+-----------------------+-------------------+---------+-------------------------------------------+
|2019-04-17 07:54:51.505|2019-04-17 10:54:51|INFO     |[main.py:64] Read machine_conf.ini         |
|2019-04-17 07:54:52.271|2019-04-17 10:54:52|INFO     |[app.py:93] Running web server on port 9090|
|2019-04-17 08:05:10.720|2019-04-17 11:05:10|INFO     |[app.py:166] Exiting event loop...         |
+-----------------------+-------------------+---------+-------------------------------------------+

The actual result is:

genee3_python_logs_text = spark.read.text("/mnt/seedx-ops-prod/genee-local-datasync/genee-3/genee/logs/genee_python-20190417T075453.005.log")

clean_genee3_python_logs = genee3_python_logs_text.rdd.flatMap(parseTheNonSuckingDaemonPythonLogs)

from pyspark.sql import Row

row = Row("val")
genee3_python_logs_df = clean_genee3_python_logs.map(row).toDF()
genee3_python_logs_df.select('*').show(truncate=False)

+-------------------------------------------+
|val                                        |
+-------------------------------------------+
|INFO                                       |
|2019-04-17 10:54:51                        |
|[main.py:64] Read machine_conf.ini         |
|2019-04-17 07:54:51.505                    |
|INFO                                       |
|2019-04-17 10:54:52                        |
|[app.py:93] Running web server on port 9090|
|2019-04-17 07:54:52.271                    |
|INFO                                       |
|2019-04-17 11:05:10                        |
|[app.py:166] Exiting event loop...         |
|2019-04-17 08:05:10.720                    |
+-------------------------------------------+
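The output above is consistent with flatMap treating whatever the function returns as an iterable and flattening it by one level. A Row is a tuple, so returning a bare Row hands flatMap four separate field values. The following plain-Python sketch mimics that behaviour without Spark; `LogRow` (a namedtuple standing in for `pyspark.sql.Row`) and `flat_map` (standing in for `RDD.flatMap`) are hypothetical helpers, not part of PySpark.

```python
from collections import namedtuple
from itertools import chain

# Stand-in for pyspark.sql.Row: like Row, a namedtuple is a tuple subclass.
# Fields are listed alphabetically to mirror the order seen in the output above.
LogRow = namedtuple("LogRow", ["log_level", "log_ts", "message", "os_ts"])


def flat_map(func, items):
    """Plain-Python analogue of flatMap: apply func, then flatten one level."""
    return list(chain.from_iterable(func(item) for item in items))


record = LogRow(
    "INFO",
    "2019-04-17 10:54:51",
    "[main.py:64] Read machine_conf.ini",
    "2019-04-17 07:54:51.505",
)

# Returning the record itself: its four fields become four separate elements.
flattened = flat_map(lambda r: r, [record])

# Returning the record inside a one-element container: the record survives intact.
wrapped = flat_map(lambda r: [r], [record])

# Returning an empty container: nothing is emitted, which is the "filter" half.
dropped = flat_map(lambda r: (), [record])

print(len(flattened), len(wrapped), len(dropped))  # 4 1 0
```

Under this reading, the parser has to return a container holding one record, rather than the record itself, for each line that parses.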

1 Answer:

Answer 0 (score: 0)

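I think I have managed to get it working, but I am still not sure whether this counts as a proper functional transformation.

In the parsing logic, wrap the Row in another Row, so that flatMap unwraps the outer Row and emits the inner one as a single record: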

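import re
from pyspark.sql import Row

def parseTheNonSuckingDaemonPythonLogs(row):
  try:
    # [0] raises IndexError when the line does not match the pattern
    parts = re.findall(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,3}): (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ([A-Za-z]{1,5}) (.*)', row.value)[0]
    # The outer Row is the container that flatMap flattens; the inner Row is the record
    return Row(Row(os_ts=parts[0], log_ts=parts[1], log_level=parts[2], message=parts[3]))
  except IndexError:
    # Empty Row: flatMap emits nothing for an unparseable line
    return Row()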

Drop the Row("val") adaptation when building the DataFrame:

genee3_python_logs_df = clean_genee3_python_logs.toDF()

Result:

genee3_python_logs_df.show(truncate=False)

+---------+-------------------+-------------------------------------------+-----------------------+
|log_level|log_ts             |message                                    |os_ts                  |
+---------+-------------------+-------------------------------------------+-----------------------+
|INFO     |2019-04-17 10:54:51|[main.py:64] Read machine_conf.ini         |2019-04-17 07:54:51.505|
|INFO     |2019-04-17 10:54:52|[app.py:93] Running web server on port 9090|2019-04-17 07:54:52.271|
|INFO     |2019-04-17 11:05:10|[app.py:166] Exiting event loop...         |2019-04-17 08:05:10.720|
+---------+-------------------+-------------------------------------------+-----------------------+
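A variation that may read more naturally than nesting one Row inside another is to have the parser return a list with one record for a good line and an empty list for a bad one. The sketch below is plain Python so that it runs without a Spark session; `parse_line` and `LOG_PATTERN` are names made up for this example, and the list comprehension at the end only imitates what flatMap would do.

```python
import re

# Same pattern as in the answer, with the millisecond separator escaped.
LOG_PATTERN = re.compile(
    r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,3}): '
    r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ([A-Za-z]{1,5}) (.*)'
)


def parse_line(text):
    """Return a one-element list for a well-formed line and an empty list otherwise."""
    match = LOG_PATTERN.match(text)
    if match is None:
        return []
    # groups() yields (os_ts, log_ts, log_level, message) in pattern order.
    return [match.groups()]


lines = [
    "2019-04-17 07:54:51.505: 2019-04-17 10:54:51 INFO [main.py:64] Read machine_conf.ini",
    "2019-04-17 08:05:10.720: <_WindowsSelectorEventLoop running=False closed=False debug=False>",
]

# Imitates flatMap: every parsed record is kept, every empty list disappears.
parsed = [record for line in lines for record in parse_line(line)]
print(parsed)
```

With PySpark, and untested here, this could presumably be applied as `genee3_python_logs_text.rdd.flatMap(lambda row: parse_line(row.value)).toDF(["os_ts", "log_ts", "log_level", "message"])`. Passing the column names explicitly should also give the column order shown in the question's expected result.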