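I'm trying to find the best (fastest) way to map parent IDs to child IDs in a DataFrame. Currently I'm using NumPy searchsorted to look up the parents in a file of 23,500,000 lines, and it takes a very long time.
I have a log file with 23,500,000 lines (about 930 MB) that I need to parse. The file content looks like this: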
2019-05-31 09:00:00.030 DEBUG [OpenAPIImpl.java:253] - com.opm.OpenAPIImpl
<SoapMap>
<name>returnCode</name>
<singleValue>0</singleValue>
</SoapMap>
2019-05-31 09:00:00.031 INFO [OtherClass.java:253] - com.opm.OtherClass
<SoapMap>
<name>returnCode</name>
<singleValue>1</singleValue>
</SoapMap>
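In the example, the first line is the log header, which contains the following information: date, time, log type, log file, and log resource. The lines that follow are the log output, up to the next log header, which has its own log output.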
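To make the header layout concrete, here is a minimal sketch that splits one header line into its five fields. The regular expression and the name `parse_header` are illustrative assumptions based only on the two sample headers above; real log files may need a looser pattern.

```python
import re

# Assumed layout, inferred from the sample:
# "<date> <time> <TYPE> [<file>:<line>] - <resource>"
HEADER_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}\.\d{3})\s+(\w+)\s+\[([^\]]+)\]\s+-\s+(\S+)$"
)


def parse_header(line):
    """Return (date, time, log_type, log_file, resource), or None if the line is not a header."""
    match = HEADER_RE.match(line.strip())
    return match.groups() if match else None


print(parse_header("2019-05-31 09:00:00.030 DEBUG [OpenAPIImpl.java:253] - com.opm.OpenAPIImpl"))
print(parse_header("<SoapMap>"))
```

Anything that does not match the pattern is treated as a log output line.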
I need to tag each log output line with its corresponding header so it can be analyzed. The expected output should look like this:
2019-05-31|09:00:00.030|DEBUG|OpenAPIImpl.java:253|com.opm.OpenAPIImpl|<SoapMap>
2019-05-31|09:00:00.030|DEBUG|OpenAPIImpl.java:253|com.opm.OpenAPIImpl|<name>returnCode</name>
2019-05-31|09:00:00.030|DEBUG|OpenAPIImpl.java:253|com.opm.OpenAPIImpl|<singleValue>0</singleValue>
2019-05-31|09:00:00.030|DEBUG|OpenAPIImpl.java:253|com.opm.OpenAPIImpl|</SoapMap>
2019-05-31|09:00:00.031|INFO|OtherClass.java:253|com.opm.OtherClass|<SoapMap>
2019-05-31|09:00:00.031|INFO|OtherClass.java:253|com.opm.OtherClass|<name>returnCode</name>
2019-05-31|09:00:00.031|INFO|OtherClass.java:253|com.opm.OtherClass|<singleValue>1</singleValue>
2019-05-31|09:00:00.031|INFO|OtherClass.java:253|com.opm.OtherClass|</SoapMap>
The pattern above repeats throughout the file. A log output can be short (2, 3, or 4 lines) or very long (up to 2,000 lines).
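As a reference for the transformation being asked for, the sketch below produces the expected output from the sample in one sequential pass, remembering the most recent header. It is plain Python rather than Spark, it assumes the lines are read in file order, and the names `HEADER_RE` and `flatten` are hypothetical.

```python
import re

# Assumed header layout, inferred from the sample lines
HEADER_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}\.\d{3})\s+(\w+)\s+\[([^\]]+)\]\s+-\s+(\S+)$"
)


def flatten(lines):
    """Yield one pipe-delimited record per log output line, prefixed with its header fields."""
    current_header = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = HEADER_RE.match(line)
        if match:
            # A new header starts a new group; remember its fields
            current_header = "|".join(match.groups())
        elif current_header is not None:
            yield f"{current_header}|{line}"


sample = [
    "2019-05-31 09:00:00.030 DEBUG [OpenAPIImpl.java:253] - com.opm.OpenAPIImpl",
    "<SoapMap>",
    "<name>returnCode</name>",
    "<singleValue>0</singleValue>",
    "</SoapMap>",
    "2019-05-31 09:00:00.031 INFO [OtherClass.java:253] - com.opm.OtherClass",
    "<SoapMap>",
    "<name>returnCode</name>",
    "<singleValue>1</singleValue>",
    "</SoapMap>",
]

for record in flatten(sample):
    print(record)
```

Because it only keeps the current header in memory, the cost is one pass over the file regardless of how long each log output is.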
So far, my logic has been:
EXAMPLE OUTPUT:
1,PARENT,2019-05-31 09:00:00.030 DEBUG [OpenAPIImpl.java:253] - com.opm.OpenAPIImpl
2,CHILD,<SoapMap>
3,CHILD,<name>returnCode</name>
4,CHILD,<singleValue>0</singleValue>
5,CHILD,</SoapMap>
6,PARENT,2019-05-31 09:00:00.031 INFO [OtherClass.java:253] - com.opm.OtherClass
7,CHILD,<SoapMap>
8,CHILD,<name>returnCode</name>
9,CHILD,<singleValue>1</singleValue>
10,CHILD,</SoapMap>
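The output above appears to number every line and tag it as PARENT (a header) or CHILD (a log output line). A minimal local sketch of that tagging, assuming a header can be recognized by its leading timestamp; `HEADER_START` and `tag_lines` are hypothetical names:

```python
import re

# Assumption: only header lines start with "YYYY-MM-DD HH:MM:SS.mmm "
HEADER_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} ")


def tag_lines(lines):
    """Return (row_id, tag, text) tuples, with row IDs starting at 1."""
    return [
        (row_id, "PARENT" if HEADER_START.match(text) else "CHILD", text)
        for row_id, text in enumerate(lines, start=1)
    ]


lines = [
    "2019-05-31 09:00:00.030 DEBUG [OpenAPIImpl.java:253] - com.opm.OpenAPIImpl",
    "<SoapMap>",
    "<name>returnCode</name>",
    "<singleValue>0</singleValue>",
    "</SoapMap>",
    "2019-05-31 09:00:00.031 INFO [OtherClass.java:253] - com.opm.OtherClass",
    "<SoapMap>",
]

for row in tag_lines(lines):
    print(",".join(str(field) for field in row))
```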
EXAMPLE OUTPUT:
1,1,2019-05-31 09:00:00.030 DEBUG [OpenAPIImpl.java:253] - com.opm.OpenAPIImpl
2,1,<SoapMap>
3,1,<name>returnCode</name>
4,1,<singleValue>0</singleValue>
5,1,</SoapMap>
6,6,2019-05-31 09:00:00.031 INFO [OtherClass.java:253] - com.opm.OtherClass
7,6,<SoapMap>
8,6,<name>returnCode</name>
9,6,<singleValue>1</singleValue>
10,6,</SoapMap>
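In the output above, each row carries the ID of the nearest parent at or before it. The lookup can be checked locally with the standard library: `bisect.bisect_right` is used here as a stand-in for `numpy.searchsorted(..., side='right')`, and `find_parent` is a hypothetical helper, not the code from the job itself.

```python
from bisect import bisect_right


def find_parent(parent_ids, row_id):
    """Return the largest parent ID that is <= row_id; parent_ids must be sorted ascending."""
    idx = bisect_right(parent_ids, row_id) - 1
    if idx < 0:
        raise ValueError(f"row {row_id} comes before the first parent")
    return parent_ids[idx]


parent_ids = [1, 6]  # row IDs of the two header lines in the sample
mapping = [(row_id, find_parent(parent_ids, row_id)) for row_id in range(1, 11)]
print(mapping)
```

Searching from the right and subtracting 1 handles the three cases (row is a parent, row is between parents, row is after the last parent) with a single expression.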
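I have tried more executors, fewer executors, more core nodes, and so on, but this part of the process is still the one that takes the most time.
The code I use to map parents to children is as follows: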
import numpy
from pyspark.sql import Row

# Extract only the parent IDs from the parent DataFrame
list_parent_ids = df_parent_logs.collect()
list_parent_ids = list(map(lambda record: record["parent_id"], list_parent_ids))

# Broadcast the list of parent IDs to the executors
list_parent_ids_bcast = sc.broadcast(list_parent_ids)

# Map each row ID to the ID of its parent
def mapParents(child_record):
    list_parent_ids = list_parent_ids_bcast.value
    idx = numpy.searchsorted(list_parent_ids, child_record["row_id"], side='left')
    # searchsorted returned the array length (the row comes after the last parent):
    # subtract 1 to avoid an index out of range
    if idx >= len(list_parent_ids):
        idx = len(list_parent_ids) - 1
    # row_id equals the parent ID at idx, so this row is itself a parent log: keep idx
    elif child_record["row_id"] == list_parent_ids[idx]:
        pass
    # the row falls between two parents: subtract 1, because the parent is the
    # previous entry, not the next one
    else:
        idx = idx - 1
    parent_id = list_parent_ids[idx]
    return Row(child_record["row_id"], parent_id, child_record["log_text"])

rdd_child_logs = df_logs.rdd.map(lambda child_record: mapParents(child_record))
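At the moment this logic takes a very long time. Processing the whole file (23,500,000 lines, 930 MB) takes about 18 hours, which I think is far too long for this amount of data.
I'm looking for suggestions to optimize this. Any advice that helps me improve it would be greatly appreciated.
I could of course ask for the file to be split into smaller files, but before going back to the customer and asking him to split it, I'd like to know whether there is any other way to improve this.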