Extracting information from a file with Python

Date: 2016-02-22 14:50:06

Tags: python

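I have a text file that lists the sizes of all files with the *.AAA extension on several different servers. For each server, I want to extract the name and size of every file larger than 20 GB. I know how to pull a line out of a file and display it, but here is my example and what I want to achieve.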

A sample of the file itself:

Pad 1001
 Volume in drive \\192.168.0.101\c$ has no label.
 Volume Serial Number is XXXX-XXXX

 Directory of \\192.168.0.101\c$\TESTUSER\

02/11/2016  02:07 AM       894,889,984 File1.AAA
05/25/2015  07:18 AM    25,673,969,664 File2.AAA
02/11/2016  02:07 AM        17,879,040 File3.AAA
05/25/2015  07:18 AM        12,386,304 File4.AAA
10/13/2008  10:29 AM     1,186,988,032 File3.AAA_oct13
02/15/2016  11:15 AM     2,799,263,744 File5.AAA
               6 File(s) 30,585,376,768 bytes
               0 Dir(s)  28,585,127,936 bytes free
Pad 1002
 Volume in drive \\192.168.0.101\c$ has no label.
 Volume Serial Number is XXXX-XXXX

 Directory of \\192.168.0.101\c$\TESTUSER\

02/11/2016  02:08 AM     1,379,815,424 File1.AAA
02/11/2016  02:08 AM        18,542,592 File3.AAA
02/15/2016  12:41 AM       853,659,648 File5.AAA
               3 File(s)  2,252,017,664 bytes
               0 Dir(s)  49,306,902,528 bytes free

Here is the output I want, showing the Pad # and the files larger than 20 GB:

Pad 1001 05/25/2015  07:18 AM    25,673,969,664 File2.AAA

I will eventually put this into an Excel spreadsheet, but I already know how to do that part.

Any ideas?

Thanks

2 answers:

Answer 0 (score: 1)

The following should get you started:

import re

output = []

with open('input.txt') as f_input:
    text = f_input.read()

# Split the text into (pad header, block of lines up to the next Pad or end of text)
for pad, block in re.findall(r'(Pad \d+)(.*?)(?=Pad|\Z)', text, re.M | re.S):
    # Capture each whole .AAA line together with its comma-grouped size
    file_list = re.findall(r'^(.*? +([0-9,]+) +.*?\.AAA\w*?)$', block, re.M)

    for line, length in file_list:
        length = int(length.replace(',', ''))

        if length > 2e10:       # Or your choice of what 20GB is
            output.append((pad, line))

print(output)

For the sample above, this displays a list containing a single tuple:

[('Pad 1001', '05/25/2015  07:18 AM    25,673,969,664 File2.AAA')]
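
As the comment in the code hints, "20 GB" can mean 20 decimal gigabytes (the `2e10` used here) or 20 binary gibibytes (`20 * 1024 ** 3`). A minimal sketch of the difference follows; the names `GB_DECIMAL`, `GB_BINARY` and `parse_size` are illustrative and do not come from the answer itself.

```python
# Two common readings of "20 GB" (constant names are hypothetical)
GB_DECIMAL = 20 * 1000 ** 3   # 20,000,000,000 bytes
GB_BINARY = 20 * 1024 ** 3    # 21,474,836,480 bytes

def parse_size(text):
    """Convert a comma-grouped size such as '25,673,969,664' to an int."""
    return int(text.replace(',', ''))

size = parse_size('25,673,969,664')          # File2.AAA from the sample
print(size > GB_DECIMAL, size > GB_BINARY)   # both limits are exceeded

borderline = parse_size('21,000,000,000')    # made-up size between the two limits
print(borderline > GB_DECIMAL, borderline > GB_BINARY)
```

File2.AAA passes under either definition, but a file between roughly 20.0 and 21.5 billion bytes would be reported only with the decimal limit, so it is worth picking one definition deliberately.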

Answer 1 (score: 0)

[Edit] Here is my approach:

import re

# A listing line: date, time, AM/PM, comma-grouped size, then a name containing .AAA
pattern = re.compile(r'\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2}\s+[AP]M\s+([0-9,]+)\s+.*\.AAA')

result = []
pad = None
with open('txtfile.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if line.startswith('Pad'):
            pad = line          # remember which Pad the following files belong to
            continue
        m = pattern.match(line)
        if m and int(m.group(1).replace(',', '')) > 20 * 1024 ** 3:
            result.append('{} {}'.format(pad, line))
print('\n'.join(result))

The output is:

Pad 1001 05/25/2015  07:18 AM    25,673,969,664 File2.AAA
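
Since the results are headed for a spreadsheet, it may help to split each matching line into separate fields rather than keep it as one string. The following is only a sketch under assumptions: `LINE_RE`, `large_files` and the field names are hypothetical, the limit defaults to the binary reading of 20 GB, and the input is a trimmed copy of the sample listing.

```python
import re

# Hypothetical pattern with named groups for one listing line
LINE_RE = re.compile(
    r'^(?P<date>\d{2}/\d{2}/\d{4})\s+'
    r'(?P<time>\d{2}:\d{2}\s+[AP]M)\s+'
    r'(?P<size>[0-9,]+)\s+'
    r'(?P<name>.*\.AAA\S*)$'
)

def large_files(lines, limit=20 * 1024 ** 3):
    """Yield (pad, date, time, size_in_bytes, name) for files larger than limit."""
    pad = None
    for raw in lines:
        line = raw.strip()
        if line.startswith('Pad '):
            pad = line              # header line: remember the current Pad
            continue
        m = LINE_RE.match(line)
        if m:
            size = int(m.group('size').replace(',', ''))
            if size > limit:
                yield (pad, m.group('date'), m.group('time'), size, m.group('name'))

sample = """Pad 1001
05/25/2015  07:18 AM    25,673,969,664 File2.AAA
02/11/2016  02:07 AM        17,879,040 File3.AAA
Pad 1002
02/11/2016  02:08 AM     1,379,815,424 File1.AAA
"""
rows = list(large_files(sample.splitlines()))
print(rows)
```

Each tuple can then be written out as one spreadsheet row, with the size already converted to an integer.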