Question

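I'm very new to pandas. I have a log text file and I'm trying to pull a few data points out of it. The code below more or less gets the data I need, but not in the format I need. I want a pandas DataFrame with two columns.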

import os
from collections import Counter
import pandas as pd
#print(os.getcwd())
infile = "myfile.txt"

important = []
keep_phrases = ["Host",
              "User-Agent"
              ]

with open(infile) as f:
    lines = f.readlines()

for line in lines:
    for phrase in keep_phrases:
        if phrase in line:
            important.append(line)
            break
#print(type(important))
print(important)
#Counter(important)
pd.DataFrame(important)

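This does not give me two columns of output. I'm looking for Host and User-Agent together in one row.

A sample of the text file is shown below.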

   15 SessionOpen  c aa.bb.cc.ddd 62667 :8080
   15 SessionClose c pipe
   15 ReqStart     c aa.bb.cc.ddd 62667 442374415
   15 RxURL        c /61665002001003_001/CH4_08_02_24_61665002001003_001_16x9_1500000_Seg1-Frag666
   15 RxHeader     c Host: ll.abrstream.channel4.com
   15 RxHeader     c Connection: keep-alive
   15 RxHeader     c User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
   15 RxHeader     c X-Requested-With: ShockwaveFlash/21.0.0.216
   15 RxHeader     c Accept: */*
   15 RxHeader     c Referer: http://www.channel4.com/programmes/the-tiny-tots-talent-agency/on-demand/61665-002
   15 RxHeader     c Accept-Encoding: gzip, deflate, sdch
   15 RxHeader     c Accept-Language: en-US,en;q=0.8
   15 ReqEnd       c 442374415 1461870946.496117592 1461870947.112555504 0.000315428 0.001363039 0.615074873
   15 SessionOpen  c aa1.bb1.cc1.ddd1 59409 :8080
   15 SessionClose c pipe
   15 ReqStart     c aa1.bb1.cc1.ddd1 59409 442374416
   15 RxURL        c /gpsApi.php
   15 RxHeader     c Content-Length: 0
   15 RxHeader     c Host: map.yanue.net
   15 RxHeader     c Connection: Keep-Alive
   15 RxHeader     c User-Agent: Apache-HttpClient/UNAVAILABLE (java 1.4)
   15 ReqEnd       c 442374416 1461870950.580444574 1461870951.139206648 0.000064135 0.001196861 0.557565212
   15 SessionOpen  c aa1.bb1.cc1.ddd1 52179 :8080
   15 SessionClose c pipe
   15 ReqStart     c aa1.bb1.cc1.ddd1 52179 442374417
   15 RxURL        c /gpsApi.php
   15 RxHeader     c Content-Length: 0
   15 RxHeader     c Host: map.yanue.net
   15 RxHeader     c Connection: Keep-Alive
   15 RxHeader     c User-Agent: Apache-HttpClient/UNAVAILABLE (java 1.4)
   15 ReqEnd       c 442374417 1461870951.776547432 1461870952.448071241 0.000062943 0.001109123 0.670414686
   18 SessionOpen  c aa.bb.cc.ddd 62670 :8080
   18 SessionClose c pipe
   18 ReqStart     c aa.bb.cc.ddd 62670 442374418
   18 RxURL        c /61665002001003_001/CH4_08_02_24_61665002001003_001_16x9_1500000_Seg1-Frag667
   18 RxHeader     c Host: ll.abrstream.channel4.com
   18 RxHeader     c Connection: keep-alive
   18 RxHeader     c User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
   18 RxHeader     c X-Requested-With: ShockwaveFlash/21.0.0.216
   18 RxHeader     c Accept: */*
   18 RxHeader     c Referer: http://www.channel4.com/programmes/the-tiny-tots-talent-agency/on-demand/61665-002
   18 RxHeader     c Accept-Encoding: gzip, deflate, sdch
   18 RxHeader     c Accept-Language: en-US,en;q=0.8
   18 ReqEnd       c 442374418 1461870951.920178175 1461870952.507097483 0.001731873 0.001337051 0.585582256
   15 SessionOpen  c aa1.bb1.cc1.ddd1 48034 :8080
   15 SessionClose c pipe

Answer 1

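You can build the DataFrame by creating a list of lists and then passing it to the DataFrame constructor.

Loop over each line of the file, as you started to do, and split each line into its separate columns. You can use re.split to build the list of columns, limiting the maximum number of splits so that the last column is treated as a single element. Alternatively, if you know every element is always aligned the same way, you can use slicing to build that list.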

import re

df_list = []
with open(infile) as f:
    for line in f:
        # remove whitespace at the start and the newline at the end
        line = line.strip()
        # split on runs of whitespace; maxsplit=4 keeps everything after
        # the fourth split (e.g. the full User-Agent value) as one element
        columns = re.split(r'\s+', line, maxsplit=4)
        df_list.append(columns)
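
The slicing alternative mentioned above could look roughly like the sketch below. The character offsets are an assumption read off the sample in the question (id in the first 5 characters, tag padded to 12 characters, a one-character flag, then the payload), and `slice_columns` is a hypothetical helper name; check the offsets against your real file before relying on them.

```python
def slice_columns(line):
    """Split one fixed-width log line into [id, tag, flag, payload]."""
    line = line.rstrip("\n")
    return [
        line[0:5].strip(),   # request id, right-aligned in the first 5 characters (assumed)
        line[6:18].strip(),  # tag such as RxHeader, padded to 12 characters (assumed)
        line[19:20],         # single-character flag, "c" in the sample
        line[21:],           # the rest of the line, kept as one element
    ]


# Two lines copied from the sample log in the question.
sample = (
    "   15 RxHeader     c Host: map.yanue.net\n"
    "   15 SessionClose c pipe\n"
)
sliced_rows = [slice_columns(line) for line in sample.splitlines()]
print(sliced_rows)
```

Unlike the re.split version, this keeps the header name and its value together in the last element, so every row has exactly four columns.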

You can then create the DataFrame using the approach from this answer.

df = pd.DataFrame(df_list)
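
To get the two-column layout the question asks for (Host and User-Agent in one row), one possible approach is to collect the headers per request and emit a row at each ReqEnd. This is only a sketch: it assumes every request is delimited by ReqStart and ReqEnd lines sharing the same leading id, as in the sample, and `extract_host_user_agent` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

WANTED = ("Host", "User-Agent")


def extract_host_user_agent(lines):
    """Return a DataFrame with one row per request and the columns in WANTED."""
    rows = []
    pending = {}  # request id -> headers collected so far
    for line in lines:
        parts = line.split(None, 3)  # id, tag, flag, payload
        if len(parts) < 4:
            continue  # blank or malformed line
        req_id, tag, payload = parts[0], parts[1], parts[3].strip()
        if tag == "ReqStart":
            pending[req_id] = {}
        elif tag == "RxHeader":
            name, _, value = payload.partition(":")
            if name in WANTED:
                pending.setdefault(req_id, {})[name] = value.strip()
        elif tag == "ReqEnd":
            rows.append(pending.pop(req_id, {}))
    # Headers missing from a request show up as NaN in that row.
    return pd.DataFrame(rows, columns=list(WANTED))


# Shortened excerpt of the sample log from the question.
sample_log = """\
   15 ReqStart     c aa.bb.cc.ddd 62667 442374415
   15 RxHeader     c Host: ll.abrstream.channel4.com
   15 RxHeader     c Connection: keep-alive
   15 RxHeader     c User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64)
   15 ReqEnd       c 442374415 1461870946.496117592
   15 SessionClose c pipe
   15 ReqStart     c aa1.bb1.cc1.ddd1 59409 442374416
   15 RxHeader     c Host: map.yanue.net
   15 RxHeader     c User-Agent: Apache-HttpClient/UNAVAILABLE (java 1.4)
   15 ReqEnd       c 442374416 1461870950.580444574
"""
host_agent_df = extract_host_user_agent(sample_log.splitlines())
print(host_agent_df)
```

With a real file you could pass the open file object instead, e.g. `extract_host_user_agent(open(infile))`, since the function only iterates over lines.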

Converting a space-aligned text file to a Pandas DataFrame
