Question

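I'm trying to find the best (fastest) way to map parent IDs to child IDs in a DataFrame. Currently I'm using NumPy searchsorted to look up the parents for a file of 23,500,000 lines, and it takes a very long time.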

I have a log file I need to parse that contains 23,500,000 lines (about 930 MB). The file content looks like this:

2019-05-31 09:00:00.030 DEBUG [OpenAPIImpl.java:253] - com.opm.OpenAPIImpl
<SoapMap>
    <name>returnCode</name>
    <singleValue>0</singleValue>
</SoapMap>
2019-05-31 09:00:00.031 INFO [OtherClass.java:253] - com.opm.OtherClass
<SoapMap>
    <name>returnCode</name>
    <singleValue>1</singleValue>
</SoapMap>

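In the example, the first line is the log header, which contains the date, time, log type, log file, and log resource. The following lines are the log output, which continues until the next log header (with its own log output) is found.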

I need to tag every log output line with its corresponding header for analysis. The expected output should look like this:

2019-05-31|09:00:00.030|DEBUG|OpenAPIImpl.java:253|com.opm.OpenAPIImpl|<SoapMap>
2019-05-31|09:00:00.030|DEBUG|OpenAPIImpl.java:253|com.opm.OpenAPIImpl|<name>returnCode</name>
2019-05-31|09:00:00.030|DEBUG|OpenAPIImpl.java:253|com.opm.OpenAPIImpl|<singleValue>0</singleValue>
2019-05-31|09:00:00.030|DEBUG|OpenAPIImpl.java:253|com.opm.OpenAPIImpl|</SoapMap>
2019-05-31|09:00:00.031|INFO|OtherClass.java:253|com.opm.OtherClass|<SoapMap>
2019-05-31|09:00:00.031|INFO|OtherClass.java:253|com.opm.OtherClass|<name>returnCode</name>
2019-05-31|09:00:00.031|INFO|OtherClass.java:253|com.opm.OtherClass|<singleValue>1</singleValue>
2019-05-31|09:00:00.031|INFO|OtherClass.java:253|com.opm.OtherClass|</SoapMap>
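
To make the target transformation concrete, here is a minimal single-machine sketch in plain Python (not Spark) that carries the most recent header forward over the output lines that follow it. The regular expression and the names HEADER_RE and flatten_log are assumptions inferred from the two sample headers above; real logs may need a looser pattern.

```python
import re

# Assumed header layout, inferred from the sample lines only:
# "<date> <time> <LEVEL> [<file>:<line>] - <resource>"
HEADER_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}\.\d{3}) (\w+) \[([^\]]+)\] - (\S+)$"
)

def flatten_log(lines):
    """Yield one pipe-delimited record per output line, prefixed by its header fields."""
    current = None  # fields of the most recent header seen so far
    for line in lines:
        match = HEADER_RE.match(line)
        if match:
            current = match.groups()
        elif current is not None:
            # Output lines before the first header are skipped in this sketch.
            yield "|".join(current + (line.strip(),))

sample = [
    "2019-05-31 09:00:00.030 DEBUG [OpenAPIImpl.java:253] - com.opm.OpenAPIImpl",
    "<SoapMap>",
    "    <name>returnCode</name>",
    "    <singleValue>0</singleValue>",
    "</SoapMap>",
    "2019-05-31 09:00:00.031 INFO [OtherClass.java:253] - com.opm.OtherClass",
    "<SoapMap>",
    "    <name>returnCode</name>",
    "    <singleValue>1</singleValue>",
    "</SoapMap>",
]

for record in flatten_log(sample):
    print(record)
```

This only illustrates the desired result; it relies on reading the lines in order, which is exactly the property that has to be preserved in a distributed job.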

The pattern above repeats throughout the file. The log output can be small (2, 3, or 4 lines) or as large as 2,000 lines.

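So far, my logic has been:

Read the file with sc.textFile
Use zipWithIndex to preserve the order, using the index as the "unique ID" of each line
Detect parents with a regular expression and tag the matching line as PARENT
If the regular expression does not match, tag the line as CHILD.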

EXAMPLE OUTPUT:
1,PARENT,2019-05-31 09:00:00.030 DEBUG [OpenAPIImpl.java:253] - com.opm.OpenAPIImpl
2,CHILD,<SoapMap>
3,CHILD,<name>returnCode</name>
4,CHILD,<singleValue>0</singleValue>
5,CHILD,</SoapMap>
6,PARENT,2019-05-31 09:00:00.031 INFO [OtherClass.java:253] - com.opm.OtherClass
7,CHILD,<SoapMap>
8,CHILD,<name>returnCode</name>
9,CHILD,<singleValue>1</singleValue>
10,CHILD,</SoapMap>
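
The tagging steps above can be sketched in plain Python, with enumerate standing in for zipWithIndex. PARENT_RE and tag_lines are hypothetical names, and the pattern assumes every header starts with a "YYYY-MM-DD HH:MM:SS.mmm" timestamp. Note that Spark's zipWithIndex starts at 0; the sketch starts at 1 only to match the example output.

```python
import re

# Assumption: a header (PARENT) line starts with a timestamp such as
# "2019-05-31 09:00:00.030 "; anything else is treated as CHILD.
PARENT_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} ")

def tag_lines(lines):
    """Return (row_id, kind, text) tuples with 1-based row ids."""
    return [
        (row_id, "PARENT" if PARENT_RE.match(text) else "CHILD", text.strip())
        for row_id, text in enumerate(lines, start=1)
    ]

lines = [
    "2019-05-31 09:00:00.030 DEBUG [OpenAPIImpl.java:253] - com.opm.OpenAPIImpl",
    "<SoapMap>",
    "    <name>returnCode</name>",
    "    <singleValue>0</singleValue>",
    "</SoapMap>",
    "2019-05-31 09:00:00.031 INFO [OtherClass.java:253] - com.opm.OtherClass",
    "<SoapMap>",
]

tagged = tag_lines(lines)
for row in tagged:
    print(",".join(str(field) for field in row))
```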

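Filter by type == "PARENT" to create a second DataFrame containing only the parents
Create an array containing only the parent IDs (a sorted array). In the example above, only indexes 1 and 6 are parents -> OUTPUT: [1,6]. I create a broadcast variable with the IDs to avoid a shuffle when mapping the IDs to the children.
Filter the main DataFrame to extract only the CHILD logs and assign each one its corresponding parent ID, as follows:
- Map over the DataFrame of child logs and use the "searchsorted" function from the numpy library to find the parent ID, by checking where the child ID "would be inserted", with the side='left' parameter. Return a new Row with the same columns as the main DataFrame plus a new column for the parent ID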

EXAMPLE OUTPUT:
1,1,2019-05-31 09:00:00.030 DEBUG [OpenAPIImpl.java:253] - com.opm.OpenAPIImpl
2,1,<SoapMap>
3,1,<name>returnCode</name>
4,1,<singleValue>0</singleValue>
5,1,</SoapMap>
6,6,2019-05-31 09:00:00.031 INFO [OtherClass.java:253] - com.opm.OtherClass
7,6,<SoapMap>
8,6,<name>returnCode</name>
9,6,<singleValue>1</singleValue>
10,6,</SoapMap>

I have tried more executors, fewer executors, more cores per node, and so on, but this part of the process still takes the most time.

The code I use to map parents to children is the following:

import numpy
from pyspark.sql import Row

# Extract only the parent IDs from the parent DataFrame
list_parent_ids = df_parent_logs.collect()
list_parent_ids = [record["parent_id"] for record in list_parent_ids]
# Broadcast the (sorted) list of parent IDs to the executors
list_parent_ids_bcast = sc.broadcast(list_parent_ids)

# Map a row to the ID of its parent
def mapParents(child_record):
    list_parent_ids = list_parent_ids_bcast.value
    idx = numpy.searchsorted(list_parent_ids, child_record["row_id"], side='left')
    # The row ID is greater than every parent ID: searchsorted returns the array
    # length, so use the last parent to avoid an index out of range
    if idx >= len(list_parent_ids):
        idx = len(list_parent_ids) - 1
    # The row ID equals a parent ID: the row is itself a parent log, keep idx
    elif child_record["row_id"] == list_parent_ids[idx]:
        pass
    # The row ID falls between two parents: subtract 1, because the parent is
    # the previous entry, not the next one
    else:
        idx = idx - 1
    parent_id = list_parent_ids[idx]
    return Row(child_record["row_id"], parent_id, child_record["log_text"])

rdd_child_logs = df_logs.rdd.map(mapParents)
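
As an aside on the lookup itself: the three branches in mapParents amount to "find the largest parent ID that is less than or equal to the row ID", which a right-side binary search minus one expresses directly. The sketch below uses the standard-library bisect module as a stand-in; find_parent is a hypothetical name, and the numpy analogue would be numpy.searchsorted(..., side='right') - 1. It is also worth measuring whether the broadcast value should be a numpy array rather than a Python list, since numpy.searchsorted converts a plain list to an array on every call, which may account for a large share of the runtime with this many rows.

```python
from bisect import bisect_right

def find_parent(parent_ids, row_id):
    """Return the largest parent id <= row_id, or None if no header precedes the row.

    parent_ids must be sorted in ascending order.
    """
    idx = bisect_right(parent_ids, row_id) - 1
    return parent_ids[idx] if idx >= 0 else None

# Parent ids from the example above: rows 1 and 6 are headers.
parent_ids = [1, 6]
mapping = [(row_id, find_parent(parent_ids, row_id)) for row_id in range(1, 11)]
for row_id, parent_id in mapping:
    print(row_id, parent_id)
```

Unlike the idx - 1 branch in the original function, this version returns None for a row that comes before the first header instead of wrapping around to the last parent.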

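At the moment this logic takes a very long time. Processing the whole file (23,500,000 lines, 930 MB) takes about 18 hours, which I think is far too long for this amount of data.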

I'm looking for suggestions to optimize this. Any advice that helps me improve it would be appreciated.

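I could of course ask for the file to be split into smaller files, but I'd like to know whether there is any other way to improve this before I go back to the client and ask them to split it.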

Suggest the fastest strategy for mapping parent rows to child rows in a PySpark DataFrame

0 answers: