Question

I'm looking for the standard, Pythonic way to load two common log-file patterns into a pandas DataFrame.

Records that span multiple lines:

=REPORT==== 26-Jun-2018::18:30:00 ===
    column_1: some data
    column_2: {'maybe': 'json or something'}

=REPORT==== 26-Jun-2018::19:30:00 ===
    column_1: some data
    column_2: {'maybe': 'json or something',
               'and': 'maybe spanning multiple lines'}

Records that may span multiple lines:

2018-01-09 20:12:38,020 INFO logname: Examining 6668121 database
2018-01-09 20:13:00,020 ERROR logname: Caught an Exception
    Traceback (most recent call last):
    File "test.py", line 1, in __main__
        None.do_the_thing()
    AttributeError: 'NoneType' object has no attribute 'getDatabase'

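For the first example, I'd like to end up with a DataFrame with columns such as ['timestamp', 'column_1', 'column_2'].

For the second, ['timestamp', 'log_level', 'logname', 'message text'].

I'm fairly sure there is a way to specify a delimiter between records other than the end of each line, as well as a delimiter for the fields within each record.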

Answer 1

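I don't think pandas really has an out-of-the-box way to do what you want.

Here are the available methods for reading a DataFrame, from the documentation on pandas I/O methods: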

Format Type   Data Description        Reader            Writer
text          CSV                     read_csv          to_csv
text          JSON                    read_json         to_json
text          HTML                    read_html         to_html
text          Local clipboard         read_clipboard    to_clipboard
binary        MS Excel                read_excel        to_excel
binary        HDF5 Format             read_hdf          to_hdf
binary        Feather Format          read_feather      to_feather
binary        Parquet Format          read_parquet      to_parquet
binary        Msgpack                 read_msgpack      to_msgpack
binary        Stata                   read_stata        to_stata
binary        SAS                     read_sas
binary        Python Pickle Format    read_pickle       to_pickle
SQL           SQL                     read_sql          to_sql
SQL           Google Big Query        read_gbq          to_gbq

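Neither of your examples follows the rules of a text format such as CSV, HTML, or JSON; they are a mash-up of several formats. To complicate matters further, both the element separators and the line separators vary from line to line.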

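From the read_csv documentation:

If you want to use a regular expression to handle complex column separators, pandas is forced to use the Python engine:

sep : str, default ','

Delimiter to use. [...] In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'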
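To make the regex `sep` behaviour concrete, here is a minimal sketch. The pipe-delimited sample and its column names are made up for illustration and are not the log formats from the question; the point is only that a multi-character regex separator works once `engine="python"` is passed explicitly.

```python
import io

import pandas as pd

# Hypothetical sample: fields separated by a pipe with uneven spacing around it.
sample = (
    "ts | level | msg\n"
    "2018-01-09 20:12:38 |INFO| Examining database\n"
    "2018-01-09 20:13:00 | ERROR |  Caught an Exception\n"
)

# A separator longer than one character is treated as a regular expression,
# so the Python parsing engine is requested explicitly.
df = pd.read_csv(io.StringIO(sample), sep=r"\s*\|\s*", engine="python")

print(df)
```

This still assumes one record per line, so it does not help with the tracebacks or the multi-line REPORT blocks.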

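The lineterminator argument works only with the C parser and cannot be a regular expression:

lineterminator : str (length 1), default None

Character to break file into lines. Only valid with C parser.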
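For comparison, this is roughly what `lineterminator` can do: split records on one fixed character. The sample string below is invented for illustration. Since the records in the question are delimited by patterns (a header line, or a leading timestamp) rather than by a single character, this option does not fit them.

```python
import io

import pandas as pd

# Hypothetical data in which "~" ends each record instead of a newline.
sample = "a,b,c~1,2,3~4,5,6"

# lineterminator accepts exactly one character and is handled by the C parser.
df = pd.read_csv(io.StringIO(sample), lineterminator="~")

print(df)
```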

You will probably end up writing your own parser, which is far from ideal, because this kind of code is easy to get wrong.
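If you do go the hand-written route, one possible shape is sketched below. It is not a pandas feature, just plain `re` plus `pd.DataFrame`. The function names `parse_reports` and `parse_log` and all three regular expressions are my own assumptions based on the two samples in the question: fields are indented by exactly four spaces, and any line that does not start a new record is treated as a continuation of the previous one. The last column is called `message` rather than `message text` because regex group names cannot contain spaces.

```python
import re

import pandas as pd

# Assumed patterns, derived only from the samples in the question.
REPORT_HEADER = re.compile(r"^=REPORT==== (?P<timestamp>\S+) ===$")
FIELD = re.compile(r"^ {4}(?P<key>\w+): (?P<value>.*)$")
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<log_level>[A-Z]+) (?P<logname>[^:]+): (?P<message>.*)$"
)


def parse_reports(lines):
    """Turn '=REPORT====' blocks into one dict per record."""
    records, current, last_key = [], None, None
    for line in lines:
        line = line.rstrip("\n")
        header = REPORT_HEADER.match(line)
        if header:
            current = {"timestamp": header["timestamp"]}
            records.append(current)
            last_key = None
            continue
        if current is None or not line.strip():
            continue  # skip blank lines and anything before the first header
        field = FIELD.match(line)
        if field:
            last_key = field["key"]
            current[last_key] = field["value"]
        elif last_key is not None:
            # Deeper indentation: continuation of the previous field's value.
            current[last_key] += "\n" + line.strip()
    return records


def parse_log(lines):
    """Turn timestamped log lines into dicts, folding tracebacks into 'message'."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        match = LOG_LINE.match(line)
        if match:
            records.append(match.groupdict())
        elif records and line.strip():
            records[-1]["message"] += "\n" + line
    return records


report_text = """\
=REPORT==== 26-Jun-2018::18:30:00 ===
    column_1: some data
    column_2: {'maybe': 'json or something'}

=REPORT==== 26-Jun-2018::19:30:00 ===
    column_1: some data
    column_2: {'maybe': 'json or something',
               'and': 'maybe spanning multiple lines'}
"""

log_text = """\
2018-01-09 20:12:38,020 INFO logname: Examining 6668121 database
2018-01-09 20:13:00,020 ERROR logname: Caught an Exception
    Traceback (most recent call last):
    File "test.py", line 1, in __main__
        None.do_the_thing()
    AttributeError: 'NoneType' object has no attribute 'getDatabase'
"""

reports = pd.DataFrame(parse_reports(report_text.splitlines()))
reports["timestamp"] = pd.to_datetime(
    reports["timestamp"], format="%d-%b-%Y::%H:%M:%S"
)

logs = pd.DataFrame(parse_log(log_text.splitlines()))
logs["timestamp"] = pd.to_datetime(logs["timestamp"], format="%Y-%m-%d %H:%M:%S,%f")

print(reports)
print(logs)
```

For a real file you would pass the open file object instead of `splitlines()`, since both functions only iterate over lines. Expect to adjust the patterns once you see lines that the samples do not cover, for example a message that itself begins with something that looks like a timestamp.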
