Question

I'm new to Python and ran into the problem below while trying to parse a .txt file into csv / xlsx format.

The code that parses the HTML into text:

import requests as req
import re

# Fetch the HTML report ("my_server_url" is a placeholder)
resp = req.get("my_server_url")

content = resp.text

# Remove every HTML tag, keeping only the text between tags
stripped = re.sub('<[^<]+?>', '', content)
with open("Output.txt", "w") as text_file:
    print("Purchase Amount: {}".format(stripped), file=text_file)
print(stripped)

After parsing to a text file, I get the following output in .txt format:

Servicegroup 'app_service' Host State Breakdowns:


Host% Time Up% Time Down% Time Unreachable% Time Undetermined
sever1.domain.com:1717100.000% (100.000%)0.000% (0.000%)0.000%   (0.000%)0.000%
sever2.domain.com:1717100.000% (100.000%)0.000% (0.000%)0.000%  (0.000%)0.000%
sever3.domain.com:1717100.000% (100.000%)0.000% (0.000%)0.000% (0.000%)0.000%
Average100.000% (100.000%)0.000% (0.000%)0.000% (0.000%)0.000%


Servicegroup 'app_service' Service State Breakdowns:


HostService% Time OK% Time Warning% Time Unknown% Time Critical% Time  Undetermined
sever1.domain.com:1717app_availability_check0.000% (0.000%)0.000%    (0.000%)0.000% (0.000%)100.000% (100.000%)0.000%
app_data_size_check0.000% (0.000%)0.000% (0.000%)0.000% (0.000%)100.000%  (100.000%)0.000%
app_hitrate_check0.000% (0.000%)0.000% (0.000%)0.000% (0.000%)100.000% (100.000%)0.000%
app_log_size_check0.000% (0.000%)0.000% (0.000%)0.000% (0.000%)100.000% (100.000%)0.000%
app_sessions_check0.000% (0.000%)0.000% (0.000%)0.000% (0.000%)100.000% (100.000%)0.000%
sever2.domain.com:1717app_availability_check100.000% (100.000%)0.000%    (0.000%)0.000% (0.000%)0.000% (0.000%)0.000%
app_data_size_check100.000% (100.000%)0.000% (0.000%)0.000% (0.000%)0.000% (0.000%)0.000%
app_hitrate_check100.000% (100.000%)0.000% (0.000%)0.000% (0.000%)0.000% (0.000%)0.000%
app_log_size_check100.000% (100.000%)0.000% (0.000%)0.000% (0.000%)0.000% (0.000%)0.000%
app_sessions_check100.000% (100.000%)0.000% (0.000%)0.000% (0.000%)0.000% (0.000%)0.000%
sever3.domain.com:1717app_availability_check100.000% (100.000%)0.000% (0.000%)0.000% (0.000%)0.000% (0.000%)0.000%
app_data_size_check100.000% (100.000%)0.000% (0.000%)0.000% (0.000%)0.000% (0.000%)0.000%
app_hitrate_check100.000% (100.000%)0.000% (0.000%)0.000% (0.000%)0.000% (0.000%)0.000%
app_log_size_check100.000% (100.000%)0.000% (0.000%)0.000% (0.000%)0.000% (0.000%)0.000%
app_sessions_check100.000% (100.000%)0.000% (0.000%)0.000% (0.000%)0.000% (0.000%)0.000%
Average87.500% (87.500%)0.000% (0.000%)0.000% (0.000%)12.500% (12.500%)0.000%

From the output above, I need to parse the same columns into Excel, with the following values in each column:

column 1               column 2       column 3                                      column 4
Host%                  Time Up%       HostService%                                  Time OK% 
sever1.domain.com:1717 100.000%       sever1.domain.com:1717app_availability_check  0.000%
sever2.domain.com:1717 100.000%       sever2.domain.com:1717app_availability_check  100.000%
sever3.domain.com:1717 100.000%       sever3.domain.com:1717app_availability_check  100.000%

Is there a way to import this specific data into csv / Excel? Any help is appreciated.

Answer 1

You need either better data cleaning or a more inclusive regular expression.

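In general, when you still have processing to do, you don't want to replace "something" with "nothing". When you strip the HTML tags, you're left with data values that run into each other. Instead, replace the tags with a delimiter so you can split the data further later in the script.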
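As a small illustration of the difference, the sketch below strips the tags from one table row in both ways. The HTML fragment is made up for this example (the real page markup is not shown in the question), so treat the exact cell layout as an assumption.

```python
import re

# Hypothetical row; the real report's markup may differ.
row = (
    "<tr><td>sever1.domain.com:1717</td>"
    "<td>100.000% (100.000%)</td>"
    "<td>0.000% (0.000%)</td></tr>"
)

TAG = re.compile(r"<[^<]+?>")

# Replacing tags with nothing glues neighbouring cells together.
glued = TAG.sub("", row)

# Replacing tags with a delimiter keeps the cell boundaries.
# Adjacent tags produce empty fields, which are dropped here.
delimited = TAG.sub("||", row)
cells = [c for c in delimited.split("||") if c]

print(glued)
print(cells)
```

The first print shows the host name fused with the percentages, as in the question's output; the second shows one list entry per cell.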

Consider:

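import re
import string

DELIM = '||'

# replace each HTML tag with a delimiter instead of deleting it
# (`content` is the response text from the question's code)
stripped = re.sub('<[^<]+?>', DELIM, content)
lines = []

# split on line breaks, not on whitespace, so cells containing spaces stay whole
for line in stripped.splitlines():
    line = line.strip('| ')

    # make sure it's an actual data line: it starts with a lowercase
    # letter (host or service name) and ends with a percentage
    if not line or line[0] not in string.ascii_lowercase or line[-1] != '%':
        continue

    # adjacent tags leave empty fields between delimiters; skip them
    columns = [c.strip('%() ') for c in line.split(DELIM) if c.strip()]

    lines.append(columns)

# ...
# write lines to csv via `csv` module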

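Now we can tell whether each line is actual data, because host names start with a lowercase letter and the last column is a percentage. We can then split each line again and strip the unwanted characters from each column.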

Finally, we're left with a list of lists, where each element of the outer list is a row and each element of a row is a cell.
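Writing such a list of rows to CSV could look like the sketch below. The sample rows are hand-written stand-ins for the parsed `lines` (values copied from the question's output), and the helper name `write_rows`, the header labels, and the file name `output.csv` are assumptions made for this example.

```python
import csv
import io

# Stand-in for the parsed rows: one inner list per row, one item per cell.
lines = [
    ["sever1.domain.com:1717", "100.000%"],
    ["sever2.domain.com:1717", "100.000%"],
    ["sever3.domain.com:1717", "100.000%"],
]
header = ["Host", "% Time Up"]


def write_rows(rows, header, fileobj):
    """Write a header row followed by the data rows as CSV."""
    writer = csv.writer(fileobj)
    writer.writerow(header)
    writer.writerows(rows)


# An in-memory buffer keeps the demo self-contained. For a real file, use
# open("output.csv", "w", newline="") instead (hypothetical file name).
buf = io.StringIO()
write_rows(lines, header, buf)
print(buf.getvalue())
```

Excel can open the resulting CSV file directly, so a separate xlsx step is only needed if the workbook format itself is required.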

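It's important to remember that str.strip removes any of the given characters, in any order, as long as they sit at the outer ends of the string.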

x = 'abcdefg00001111!!'
print(x.strip('cab0!1'))
# output:
# defg
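Because str.strip only trims the ends, a cell such as `100.000% (100.000%)` keeps its inner `% (` after stripping. If only the first percentage is wanted, as in the desired output in the question, one option is a regular expression. The helper name `first_percent` is made up for this sketch.

```python
import re

# One or more digits, an optional decimal part, then a percent sign.
PERCENT = re.compile(r"\d+(?:\.\d+)?%")


def first_percent(cell):
    """Return the first percentage found in cell, or None if there is none."""
    match = PERCENT.search(cell)
    return match.group(0) if match else None


cell = "100.000% (100.000%)"
trimmed = cell.strip("%() ")      # only the outer characters are removed
leading = first_percent(cell)     # just the first percentage value

print(trimmed)
print(leading)
```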
