Parsing a badly formatted log file where records span multiple lines with no set line count

Asked: 2015-07-12 03:00:12

Tags: python csv text-parsing

I need to parse a bunch of huge text files, each 100 MB+. They are badly formatted log files in CSV format, but each record spans multiple lines, so I can't just read each line and split it on a delimiter. There is no fixed number of lines per record either: sometimes a line is skipped when a value is blank, and sometimes a line overflows onto the next one. The record separator can also change within the same file, from `""` to `" ******* "`, and sometimes a line like `"end of log #"` appears.

Example log:

"Date:","6/23/2015","","Location:","Kol","","Target Name:","ILO.sed.908"
"ID:","ke.lo.213"
"User:","EDU\namo"
"Done:","Edit File"
"Comment","File saved successfully"
""
"Date:","6/27/2015","","Location:","Los Angeles","","Target Name:","MAL.21.ol.lil"
"ID:","uf.903.124.56"
"Done:","dirt emptied and driven to locations without issue, yet to do anyt"
"hing with the steel pipes, no planks "
"Comment"," l"
""
"end of log 1"
"Date:","5/16/2015","","Location:","Springfield","","Target Name:","ile.s.ol.le"
"ID:","84l.df.345"
"User:","EDU\bob2"
"Done:","emptied successfully"
"Comment","File saved successfully"
" ******* "

How should I approach this? It needs to be efficient so I can process the files quickly, so fewer file operations would be good. Right now I just read each file into memory all at once:

with open('Path/to/file', 'r') as content_file:
    content = content_file.read()

I'm also somewhat new to Python. I know how to read multiple files and run the code on each of them, and I have a toString that writes the output to a new CSV file.

Another problem is that some of the log files are several GB in size, and I don't want to read those into memory all at once, but I don't know how to split them into chunks. I can't just read a fixed number of lines at a time, because there is no set number of lines per record.

The comment needs to be kept as a single string, with its overflowing pieces concatenated together.

So please help!

2 Answers:

Answer 0 (score: 0)

I notice that every log entry starts with a "Date" line and ends with a "Done" line followed by a "Comment" line. So instead of worrying about the separators, you can read from a "Date" line through the "Comment" line and treat that as one log block.

The "end of log" messages don't seem important, but if you really want to capture them, you can grab everything between two consecutive "Date" lines and treat that as one log block.

I posted a link above on how to load a file in chunks. The bigger the chunk, the less I/O you have to do, but it also means a bigger memory footprint for the chunk being loaded.

Answer 1 (score: 0)

To handle large files, you should use the fact that in Python a file object is an iterator that returns the file line by line:

with open('Path/to/file', 'r') as content_file:
    for line in content_file:
         # your code

The Python csv library also takes advantage of this; it may be useful for you here.
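A small sketch of that combination: `csv.reader` accepts any iterable of lines, including the open file object itself, so parsing stays lazy (the function name `parse_rows` is just an illustration):

```python
import csv

def parse_rows(path):
    """Yield the parsed fields of each physical line, lazily."""
    with open(path, newline='') as content_file:
        # csv.reader wraps the file object directly, so each line is
        # read and split into fields only when the caller asks for it
        for row in csv.reader(content_file):
            yield row
```

Since `parse_rows` is a generator, nothing is read until you iterate over it, which keeps memory use flat even for the multi-GB logs.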