I have data logged in this format:
TIMESTAMP="Jun 7 2010 15:03:49 NZST" ACCESS-TYPE="ABC" TYPE="XYZ" PACKET-
TYPE="St" REASON="bkz" CIRCUIT-ID="UIX eth 1/1/11/20" REMOTE-ID="NBC" CALLING-
STATION-ID="LKP" SUB-ID="JIK"
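How can I read this into a proper DataFrame (rows and columns) using Python? The column names would be TIMESTAMP, ACCESS-TYPE, and so on.
This is just one sample line of the data.
Answer 0 (score: 1)
You can use re to split each line into a list of tuples or a dict. You can then use that to populate a DataFrame.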
import re

def parse_logfile(log_file_handle):
    # every KEY="value" pair on a line becomes a (key, value) tuple
    p = re.compile(r'\s*(.*?)="(.*?)"')
    for line in log_file_handle:
        yield p.findall(line)
For the line you posted, this yields
[('TIMESTAMP', 'Jun 7 2010 15:03:49 NZST'),
('ACCESS-TYPE', 'ABC'),
('TYPE', 'XYZ'),
('PACKET-TYPE', 'St'),
('REASON', 'bkz'),
('CIRCUIT-ID', 'UIX eth 1/1/11/20'),
('REMOTE-ID', 'NBC'),
('CALLING-STATION-ID', 'LKP'),
('SUB-ID', 'JIK')]
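The output above assumes each record sits on one physical line. If the file really is wrapped the way the sample in the question is (with a line break in the middle of PACKET-TYPE), a line-by-line parse would mangle those keys. Here is a minimal sketch for that case, assuming that line breaks are only wrapping and that every record starts with a TIMESTAMP key; the name parse_wrapped_log and the first_key parameter are made up for illustration:

```python
import re

# KEY="value" pairs; keys are assumed to be letters, digits and hyphens.
PAIR = re.compile(r'([A-Za-z0-9-]+)="(.*?)"')

def parse_wrapped_log(text, first_key='TIMESTAMP'):
    """Return one dict per record from text whose records may be wrapped."""
    # Assumption: line breaks carry no meaning, so remove them all.
    joined = ''.join(text.splitlines())
    records = []
    for key, value in PAIR.findall(joined):
        # Assumption: seeing first_key again means a new record begins.
        if key == first_key or not records:
            records.append({})
        records[-1][key] = value
    return records

sample = (
    'TIMESTAMP="Jun 7 2010 15:03:49 NZST" ACCESS-TYPE="ABC" TYPE="XYZ" PACKET-\n'
    'TYPE="St" REASON="bkz" CIRCUIT-ID="UIX eth 1/1/11/20" REMOTE-ID="NBC" CALLING-\n'
    'STATION-ID="LKP" SUB-ID="JIK"\n'
)
records = parse_wrapped_log(sample)
print(records[0]['PACKET-TYPE'])
```

The resulting list of dicts can be handed to pd.DataFrame in the same way as the per-line results. If a line break can also fall inside a quoted value, joining with an empty string may drop a space, so check that against the real file.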
So, in another part of your code, you can do something like this.
import pandas as pd

with open(log_filename, 'r') as log_file_handle:
    # DataFrame.append no longer exists in pandas 2.x, so collect one dict
    # per line first and build the frame in a single step
    rows = [dict(line) for line in parse_logfile(log_file_handle)]
df = pd.DataFrame(rows)
TEST_DATA
TIMESTAMP="Jun 7 2010 15:03:49 NZST" ACCESS-TYPE="ABC" TYPE="XYZ" PACKET-TYPE="St" REASON="bkz" CIRCUIT-ID="UIX eth 1/1/11/20" REMOTE-ID="NBC" CALLING-STATION-ID="LKP" SUB-ID="JIK"
TIMESTAMP="Jun 7 2010 15:03:50 NZST" ACCESS-TYPE1="ABC1" TYPE="XYZ" PACKET-TYPE="St" REASON="bkz" CIRCUIT-ID="UIX eth 1/1/11/20" REMOTE-ID="NBC" CALLING-STATION-ID="LKP" SUB-ID="JIK"
TIMESTAMP="Jun 7 2010 15:03:51 NZST" ACCESS-TYPE="ABC2" TYPE="XYZ" PACKET-TYPE="St" REASON="bkz" CIRCUIT-ID="UIX eth 1/1/11/20" REMOTE-ID="NBC" CALLING-STATION-ID="LKP" SUB-ID="JIK"
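So I changed the timestamps and access types, and the second entry has ACCESS-TYPE1 instead of ACCESS-TYPE.
Result (column order may differ depending on the pandas version):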
   ACCESS-TYPE  CALLING-STATION-ID         CIRCUIT-ID  PACKET-TYPE  REASON  REMOTE-ID  SUB-ID                 TIMESTAMP  TYPE  ACCESS-TYPE1
0          ABC                 LKP  UIX eth 1/1/11/20           St     bkz        NBC     JIK  Jun 7 2010 15:03:49 NZST   XYZ           NaN
1          NaN                 LKP  UIX eth 1/1/11/20           St     bkz        NBC     JIK  Jun 7 2010 15:03:50 NZST   XYZ          ABC1
2         ABC2                 LKP  UIX eth 1/1/11/20           St     bkz        NBC     JIK  Jun 7 2010 15:03:51 NZST   XYZ           NaN
If all lines have the same keys in the same order, appending should be easy. If that changes throughout the file, it may get harder. Can you post more lines?
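On the worry about keys changing between lines: keeping one dict per line and taking the union of all keys copes with that, and the result table above shows the gaps as NaN. The sketch below shows that alignment step using only the standard library. The helper name align_rows is hypothetical, and the two input lines are shortened versions of the test data. Passing the same list of dicts to pd.DataFrame performs an equivalent alignment.

```python
import re

PAIR = re.compile(r'\s*(.*?)="(.*?)"')

def align_rows(lines):
    """Parse lines into dicts and pad each row to the union of all keys."""
    rows = [dict(PAIR.findall(line)) for line in lines]
    # Union of keys in first-seen order (dicts preserve insertion order).
    columns = list(dict.fromkeys(key for row in rows for key in row))
    # A key missing from a line becomes None, where pandas would show NaN.
    table = [[row.get(col) for col in columns] for row in rows]
    return columns, table

lines = [
    'TIMESTAMP="Jun 7 2010 15:03:49 NZST" ACCESS-TYPE="ABC" TYPE="XYZ"',
    'TIMESTAMP="Jun 7 2010 15:03:50 NZST" ACCESS-TYPE1="ABC1" TYPE="XYZ"',
]
columns, table = align_rows(lines)
print(columns)
for row in table:
    print(row)
```

Each row ends up with the same number of cells regardless of which keys its line carried, so the order of keys within a line does not matter either.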
Answer 1 (score: 1)
This is a nice, simple example for creating a small parser with pyparsing:
import pyparsing as pp
key = pp.Word(pp.alphas, pp.alphas+'-')
EQ = pp.Literal('=').suppress()
value = pp.QuotedString('"')
parser = pp.Dict(pp.OneOrMore(pp.Group(key + EQ + value)))
Use parser to parse the input data (joining the separate lines into one, since your sample input breaks some lines in the middle of a key):
sample = """\
TIMESTAMP="Jun 7 2010 15:03:49 NZST" ACCESS-TYPE="ABC" TYPE="XYZ" PACKET-
TYPE="St" REASON="bkz" CIRCUIT-ID="UIX eth 1/1/11/20" REMOTE-ID="NBC" CALLING-
STATION-ID="LKP" SUB-ID="JIK" """
sample = ''.join(sample.splitlines())
# parse the input string
result = parser.parseString(sample)
To get at the results, access them with dict or attribute notation, or call dump() to see the keys and structure:
print(result['PACKET-TYPE'])
print(list(result.keys()))
print(result.TYPE)
print("{TIMESTAMP}/{ACCESS-TYPE}/{CALLING-STATION-ID}".format(**result))
print(result.dump())
Prints:
St
['PACKET-TYPE', 'SUB-ID', 'REASON', 'CALLING-STATION-ID', 'ACCESS-TYPE', 'CIRCUIT-ID', 'REMOTE-ID', 'TYPE', 'TIMESTAMP']
XYZ
Jun 7 2010 15:03:49 NZST/ABC/LKP
[['TIMESTAMP', 'Jun 7 2010 15:03:49 NZST'], ['ACCESS-TYPE', 'ABC'], ['TYPE', 'XYZ'], ['PACKET-TYPE', 'St'], ['REASON', 'bkz'], ['CIRCUIT-ID', 'UIX eth 1/1/11/20'], ['REMOTE-ID', 'NBC'], ['CALLING-STATION-ID', 'LKP'], ['SUB-ID', 'JIK']]
- ACCESS-TYPE: 'ABC'
- CALLING-STATION-ID: 'LKP'
- CIRCUIT-ID: 'UIX eth 1/1/11/20'
- PACKET-TYPE: 'St'
- REASON: 'bkz'
- REMOTE-ID: 'NBC'
- SUB-ID: 'JIK'
- TIMESTAMP: 'Jun 7 2010 15:03:49 NZST'
- TYPE: 'XYZ'