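I have a text file that lists the sizes of all files with the extension *.AAA on different servers. From each server, I want to extract the name and size of every file larger than 20 GB. I know how to extract a line from a file and display it, but here is my example and what I am trying to achieve.
A sample of the file itself: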
Pad 1001
Volume in drive \\192.168.0.101\c$ has no label.
Volume Serial Number is XXXX-XXXX
Directory of \\192.168.0.101\c$\TESTUSER\
02/11/2016 02:07 AM 894,889,984 File1.AAA
05/25/2015 07:18 AM 25,673,969,664 File2.AAA
02/11/2016 02:07 AM 17,879,040 File3.AAA
05/25/2015 07:18 AM 12,386,304 File4.AAA
10/13/2008 10:29 AM 1,186,988,032 File3.AAA_oct13
02/15/2016 11:15 AM 2,799,263,744 File5.AAA
6 File(s) 30,585,376,768 bytes
0 Dir(s) 28,585,127,936 bytes free
Pad 1002
Volume in drive \\192.168.0.101\c$ has no label.
Volume Serial Number is XXXX-XXXX
Directory of \\192.168.0.101\c$\TESTUSER\
02/11/2016 02:08 AM 1,379,815,424 File1.AAA
02/11/2016 02:08 AM 18,542,592 File3.AAA
02/15/2016 12:41 AM 853,659,648 File5.AAA
3 File(s) 2,252,017,664 bytes
0 Dir(s) 49,306,902,528 bytes free
Here is the output I want, the Pad # and the files larger than 20 GB:
Pad 1001 05/25/2015 07:18 AM 25,673,969,664 File2.AAA
I will eventually put this into an Excel spreadsheet, but I know how to do that part.
Any ideas?
Thanks
Answer 0 (score: 1)
The following should help you get started:
import re

output = []

with open('input.txt') as f_input:
    text = f_input.read()

# Split the text into (pad, block) pairs; each block runs up to the next "Pad" or the end of the text
for pad, block in re.findall(r'(Pad \d+)(.*?)(?=Pad|\Z)', text, re.M | re.S):
    # Capture each whole .AAA line together with its comma-separated size
    file_list = re.findall(r'^(.*? +([0-9,]+) +.*?\.AAA\w*?)$', block, re.M)
    for line, length in file_list:
        length = int(length.replace(',', ''))
        if length > 2e10:  # Or your choice of what 20GB is
            output.append((pad, line))

print(output)
This displays a list containing a single tuple entry, as follows:
[('Pad 1001', '05/25/2015 07:18 AM 25,673,969,664 File2.AAA')]
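On the "your choice of what 20GB is" comment: the result can differ depending on whether 20 GB means 20 * 10**9 bytes or 20 * 1024**3 bytes. Here is a minimal sketch of the difference. The helper name `parse_size` and the sample size are made up for illustration and do not come from your file.

```python
def parse_size(text):
    """Convert a comma-grouped size such as '25,673,969,664' to an integer byte count."""
    return int(text.replace(',', ''))


GB_DECIMAL = 20 * 10 ** 9    # 20,000,000,000 bytes
GB_BINARY = 20 * 1024 ** 3   # 21,474,836,480 bytes

# Hypothetical size that falls between the two thresholds
size = parse_size('20,500,000,000')

print(size > GB_DECIMAL)  # True
print(size > GB_BINARY)   # False
```

File2.AAA in the sample (25,673,969,664 bytes) is above both thresholds, so either definition gives the same result for the data shown.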
Answer 1 (score: 0)
[Edit] Here is my approach:
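import re

result = []

with open('txtfile.txt', 'r') as f:
    content = [line.strip() for line in f]

# date, time, AM/PM, size, then a name containing ".AAA" (dots escaped so they match literally)
pattern = r'\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2}\s+(A|P)M\s+([0-9,]+)\s+((?!\.AAA).)*\.AAA((?!\.AAA).)*'

for line in content:
    m = re.findall(pattern, line)
    # Keep every "Pad" header, plus any file line whose size exceeds 20 * 1024**3 bytes
    if line.startswith('Pad') or m and int(m[0][1].replace(',', '')) > 20 * 1024 ** 3:
        result.append(line)

# Drop a trailing "Pad N" header that has no large file after it
print(re.sub(r'Pad\s+\d+$', '', ' '.join(result)))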
The output is:
Pad 1001 05/25/2015 07:18 AM 25,673,969,664 File2.AAA
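Joining every "Pad" line and stripping only a trailing one has a limitation: a Pad in the middle of the file with no large files would still show up in the output. A possible variation, offered only as a sketch, remembers the most recent Pad and emits it only next to a matching file line. The names `FILE_LINE` and `large_files` and the shortened sample list are assumptions for illustration, and the pattern assumes the date/time layout shown in the question.

```python
import re

# Date, time, AM/PM, comma-grouped size, then a name containing ".AAA"
FILE_LINE = re.compile(
    r'^\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2}\s+[AP]M\s+([0-9,]+)\s+.*\.AAA\S*$'
)


def large_files(lines, threshold=20 * 1024 ** 3):
    """Yield 'Pad N <file line>' for each .AAA file larger than threshold bytes."""
    pad = None
    for raw in lines:
        line = raw.strip()
        if line.startswith('Pad'):
            pad = line  # remember the current server header
            continue
        m = FILE_LINE.match(line)
        if m and int(m.group(1).replace(',', '')) > threshold:
            yield '{} {}'.format(pad, line)


# Shortened, rearranged sample: the first Pad has no large file
sample = [
    'Pad 1001',
    '02/11/2016 02:07 AM 894,889,984 File1.AAA',
    '6 File(s) 30,585,376,768 bytes',
    'Pad 1002',
    '05/25/2015 07:18 AM 25,673,969,664 File2.AAA',
]

for entry in large_files(sample):
    print(entry)
```

To run it on the real file, pass the open file object instead of the list, for example `large_files(open('txtfile.txt'))`. Each yielded string is one row, which should also make the later export to a spreadsheet simpler.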