我正在为日志记录分析寻找文本分类模型。
挑战在于,每个音符都不能包含“自然语言”文字。例如,一些注释是带有符号的线程回溯输出,一些注释是来自源代码的日志信息。在这些说明中,有一些描述客户如何使用我们产品的说明是我们要分类的。
我可以将任何ML模型或方法用于此文本分类吗?
下面是一些不同注释的示例(我更改了一些内容,因此没有显示公司机密材料):
回溯信息开发人员已粘贴以进行错误分析:
func118 4563453 344 = SYSTEM_FUNC_1 0x00000efa34343 0x0000000009f333a0 0xffe3ebdfd700 <<<<<
Total of 1 API working thread(s)
(gdb) thread find 0x123456
Thread 670 has target id 'Thread 0x123456 (LWP 443)'
(gdb) t 670
[Switching to thread 670 (Thread 0x123456 (LWP 443))]
#0 0x35353453563abcd in __lock_func1_ ()
from /disks/folder1/xxx/xxx_folder1/info_folder/info2_dir/lib64/libpthread.so.0
(gdb) ebt
#0 __lock_func1_()
#1 _LOCK_F_10()
#2 func_mod_4()
#3 func_mod_5()
#4 ModCon::disconnect()
#5 ModCon::abort()
#6 ModServ::disconnect()
#7 ModServManager::disconnect()
#8 mod1::func1()
#9 mod1::func2()
用于问题分析的产品日志:
cpu/MOD/MOD2/log/
start_mod.log:
Thu Dec 24 00:01:12 UTC 2019 FUN: HG: FILE_A: stopping
Thu Dec 24 00:01:12 UTC 2019 FUN: FILE_A: stopping, timeout -22-
Thu Dec 24 00:01:12 UTC 2019 system-state: cleared FILE_A_start_complete
Thu Dec 24 00:01:12 UTC 2019 FUN: FILE_A: run thread still running: con_b.pl FUN_run 0
Thu Dec 24 00:01:12 UTC 2019 FUN: FILE_A: calling con_b.pl FUN_cleanup 0, time left: -160-
Thu Dec 24 00:01:12 2019 cli: con_a.pl: FUN_cleanup for FILE_A
Thu Dec 24 00:01:12 2019 cmd: con_a.pl: sp got xxx error, will try to act_xxx
Thu Dec 24 00:01:13 UTC 2019 FUN: FILE_A: action 1
Thu Dec 24 00:01:13 UTC 2019 FUN: FILE_A: action 1 complete
Thu Dec 24 00:01:13 UTC 2019 FUN: FILE_A: action 2
与客户有关的配置信息(这是我要分类并从所有注释中撤回的最感兴趣的注释):
Customer xxx has created func_xxx to protect their data,
they also perform daily backup of their data by using func_xxx2.
They totally created xxx3 objects in each node...