Question

I am trying to remove special characters from a log file. Here are two sample lines:

2016.04.03 23:54:28.257;:;213.210.213.316;:;PDL3_SGW2;:;5F6DBA-093E-0D4D9C-00000001-01;:;userId;:;;:;1000;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;250;:;;:;
2016.04.03 23:54:28.258;:;781.69.243.363;:;PDL3_SGW2;:;;:;userId;:;;:;1001;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;1;:;0x40001;:;Invalid credentials

The output after removing the special characters:

2016.04.03  23  54  48.957  213.210.213.316  PDL3_SGW2  5F6DB03A    093E    0D414D9C    1   1   userId  1000    http    live.skysat.tv  cmdc    services    region  25351   lang    swe count   250 sort    2blogicalChannelNumber  101 0   250                                                                     

2016.04.03  23  54  48.958  781.69.243.363  PDL3_SGW2   userId  1001    http    live.skysat.tv  cmdc    services    region  25351   lang    swe count   250 sort    2blogicalChannelNumber  101 0   1   0xDC40001   Invalid credentials

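As you can see in the second line of the output, "userId" ends up under column [6] instead of column [11], because the data for columns [06] through [10] is missing from the log file. I want to handle this and write out every column, even when the log file has no data for it.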

The output should look like this:

2016.04.03  23  54  48.957  213.210.213.316  PDL3_SGW2  5F6DB03A    093E    0D414D9C    1   1   userId  1000    http    live.skysat.tv  cmdc    services    region  25351   lang    swe count   250 sort    2blogicalChannelNumber  101 0   250                                                                     

2016.04.03  23  54  48.958  781.69.243.363  PDL3_SGW2                                           userId  1001    http    live.skysat.tv  cmdc    services    region  25351   lang    swe count   250 sort    2blogicalChannelNumber  101 0   1   0xDC40001   Invalid credentials

Here is the relevant part of my code:

new_str = re.sub(r'[- - [ " / : ; & ? = % ~ + \n \]]', ' ', line)
text = new_str.rstrip().split()
writer.writerow(text)

Answer 1

This works for the two lines you posted:

import re

lines = ["2016.04.03 23:54:28.257;:;213.210.213.316;:;PDL3_SGW2;:;5F6DBA-093E-0D4D9C-00000001-01;:;userId;:;;:;1000;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;250;:;;:;",
         "2016.04.03 23:54:28.258;:;781.69.243.363;:;PDL3_SGW2;:;;:;userId;:;;:;1001;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;1;:;0x40001;:;Invalid credentials"]

def adjust_columns(list_of_lines):
    widest = [max(len(el) for el in column) for column in zip(*list_of_lines)]
    return [ " ".join("{{:<{}s}}".format(widest[i]).format(e)
             for i,e in enumerate(line)) for line in list_of_lines ]

r = re.compile('[ /:;&?=%~+-]')
list_of_lines = [[r.split(el) for el in line.split(';:;')] for line in lines]
list_of_columns = [  all(len(el) == len(col[0]) for el in col)
                     and  adjust_columns(col)
                     or   [" ".join(el) for el in col]
                     for col in zip(*list_of_lines) ]
text = "\n".join(adjust_columns(list(zip(*list_of_columns))))
print(text)

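This assumes that ;:; is always the field separator. The code splits each line into fields, then splits each field again on the special characters. If every field in a column contains the same number of special characters, the subfields in that column are padded to a common width and joined with spaces; otherwise the subfields are simply joined with single spaces. The final step pads every column to a common width.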

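One possible drawback is that you can no longer process the input line by line, because you first have to find the longest entry in each column.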
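If holding the whole file in memory is a concern, one way around this is to read the input twice: a first pass that only records the widest entry per column, and a second pass that pads and writes each line. The following is a minimal sketch of that idea, not part of the code above; the helper names clean_fields, column_widths and format_line are made up for illustration, and it only does the simple per-field cleaning (no subfield alignment).

```python
import re

# Same character class as in the answer's code.
SPECIAL = re.compile(r'[ /:;&?=%~+-]')

def clean_fields(line):
    """Split a line on ';:;' and replace each special character with a space."""
    return [SPECIAL.sub(" ", field) for field in line.rstrip("\n").split(';:;')]

def column_widths(lines):
    """First pass: remember only the widest cleaned entry of every column."""
    widths = []
    for line in lines:
        for i, field in enumerate(clean_fields(line)):
            if i == len(widths):
                widths.append(len(field))
            else:
                widths[i] = max(widths[i], len(field))
    return widths

def format_line(line, widths):
    """Second pass: pad every field of one line to its column width."""
    return " ".join(f.ljust(w) for f, w in zip(clean_fields(line), widths))
```

With a real file you would iterate over it once for column_widths(), rewind it with seek(0), and then call format_line() on each line as you write the output, so only one line plus the list of widths is in memory at a time.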

If you don't need to align the subfields (as in your example), you can use this simpler code:

r = re.compile('[ /:;&?=%~+-]')
list_of_lines = [[" ".join(r.split(el)) for el in line.split(';:;')] for line in lines]
text = "\n".join(adjust_columns(list_of_lines))

Answer 2

>>> from pprint import pprint

Let's simulate the data file with a list of strings...

>>> lines = [
    '2016.04.03 23:54:28.257;:;213.210.213.316;:;PDL3_SGW2;:;5F6DBA-093E-0D4D9C-00000001-01;:;userId;:;;:;1000;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;250;:;;:;',
    '2016.04.03 23:54:28.258;:;781.69.243.363;:;PDL3_SGW2;:;;:;userId;:;;:;1001;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;1;:;0x40001;:;Invalid credentials']

According to the official documentation, the string method S.split(sep) returns a list of the words in S, using sep as the "delimiter string" (emphasis mine).

In your case the delimiter is the string ';:;', so you can do

>>> data = [line.split(';:;') for line in lines]

data is now a list of lists; each sublist contains an empty string wherever a field is missing in the file.

>>> pprint(data)
[['2016.04.03 23:54:28.257',
  '213.210.213.316',
  'PDL3_SGW2',
  '5F6DBA-093E-0D4D9C-00000001-01',
  'userId',
  '',
  '1000',
  'http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber',
  '101',
  '0',
  '250',
  '',
  ''],
 ['2016.04.03 23:54:28.258',
  '781.69.243.363',
  'PDL3_SGW2',
  '',
  'userId',
  '',
  '1001',
  'http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber',
  '101',
  '0',
  '1',
  '0x40001',
  'Invalid credentials']]
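To see why this matters, here is a small illustrative comparison (the sample string is a shortened fragment of the second log line). Splitting on the full separator keeps the empty field in its position, whereas turning the separator characters into spaces and splitting on whitespace, which is roughly what the code in the question does, drops it and shifts every later column to the left.

```python
sample = "PDL3_SGW2;:;;:;userId"

# Splitting on the whole separator string keeps the empty field in place.
by_separator = sample.split(';:;')
print(by_separator)   # ['PDL3_SGW2', '', 'userId']

# Replacing the separator characters with spaces and splitting on
# whitespace collapses the gap, so the empty field disappears.
by_whitespace = sample.replace(';', ' ').replace(':', ' ').split()
print(by_whitespace)  # ['PDL3_SGW2', 'userId']
```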

You can loop over data and output each group of fields in whatever way you prefer, for example

>>> for record in data:
...     output(record)
...

That's all.

P.S. output() is a function you have to define yourself, according to your needs.
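As one possible shape for output(), here is a hedged sketch that writes tab-separated rows with csv.writer and breaks each field into subfields, padding with empty strings so that a missing field still occupies all of its columns. Everything here (subfields, subfield_counts, make_output, the tab delimiter, and the choice of special characters) is an assumption made for illustration, not something prescribed by the answer.

```python
import csv
import re

# Assumed set of special characters; runs of them count as one break.
SPECIAL = re.compile(r'[ /:;&?=%~+-]+')

def subfields(field):
    """Split one field on runs of special characters, dropping empty pieces."""
    return [piece for piece in SPECIAL.split(field) if piece]

def subfield_counts(data):
    """Largest number of subfields per column (at least 1) over all records."""
    counts = []
    for record in data:
        for i, field in enumerate(record):
            n = len(subfields(field))
            if i == len(counts):
                counts.append(max(1, n))
            else:
                counts[i] = max(counts[i], n)
    return counts

def make_output(writer, counts):
    """Build an output(record) function bound to a csv writer."""
    def output(record):
        row = []
        for field, n in zip(record, counts):
            pieces = subfields(field)
            # Pad so that a short or empty field still fills n columns.
            row.extend(pieces + [''] * (n - len(pieces)))
        writer.writerow(row)
    return output
```

You would then create the function once, for instance with output = make_output(csv.writer(sys.stdout, delimiter='\t'), subfield_counts(data)) after importing sys, and run the loop shown above. Note that this needs the subfield counts up front, so it also requires one pass over the data before anything is written.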

Removing special characters in Python
