Handling blank fields when reading a table from a text file in Python

Date: 2014-11-05 07:26:03

Tags: python split text-parsing missing-data freetexttable

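I need to parse a file like the one shown here: http://bit.ly/1x6yzoX

I wrote the following method to parse this file, but it cannot read the incomplete data for the latest year (2014), where the missing values appear as blank space in the text table. For now I am skipping the rows I cannot read.

Can you help me get started on how to handle this?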

import collections
import csv
from itertools import islice

LINES_TO_IGNORE = 7

def parse_file(data_file):
    result_dict = collections.OrderedDict()
    if not data_file:
        return result_dict

    with open(data_file) as f:
        reader = csv.reader(f, delimiter="\t")
        # Skip the preamble lines above the table
        data = islice(reader, LINES_TO_IGNORE, None)
        # Get file headers
        headers = next(data, None)
        if headers is None:
            return result_dict
        headers = headers[0].split()
        keys = headers[1:]

        for row in data:
            values = row[0].split()
            if len(headers) != len(values):
                # Rows with blank cells (e.g. 2014) yield too few tokens
                print("Skipping")
                continue
            # parse_to_int / parse_to_float are helper functions not shown here
            year = parse_to_int(values[0])
            data_list = [parse_to_float(x) for x in values[1:]]
            # Each line becomes a dict (column_header->value)
            data_dict = collections.OrderedDict(zip(keys, data_list))
            # result_dict is dict of dict (year->data_dict)
            result_dict[year] = data_dict
    return result_dict

3 answers:

Answer 0 (score: 1)

You can do this easily with Pandas:

import pandas as pd
data = pd.read_fwf('UK.txt', skiprows=7, delimiter=' ')

Printing the last few rows with print data[-3:] gives:

    Year    JAN    FEB    MAR    APR    MAY    JUN    JUL    AUG    SEP    OCT  \
102  2012    1.8    1.2    3.4    2.5    6.0    8.8...
103  2013    1.0   -0.1   -0.7    2.2    5.2    8.6...
104  2014    2.1    2.5    2.9    5.3    7.3    9.9...

     NOV    DEC     WIN    SPR    SUM    AUT   ANN  Unnamed: 3  Unnamed: 4  \
102  2.8    1.1    1.73   4.00  10.19   5.23  5.21         NaN         NaN
103  2.4    2.8    0.68   2.26  10.66   6.56  5.21         NaN         NaN
104                       2.48   5.17  10.46   NaN         NaN         NaN

     Unnamed: 5  Unnamed: 6  Unnamed: 7
102         NaN         NaN         NaN
103         NaN         NaN         NaN
104         NaN         NaN         NaN

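I don't think this is 100% correct yet (the blank 2014 cells shift the later values into the wrong columns, and extra Unnamed columns appear), but it is close, and hopefully you can take it the rest of the way. With Pandas you don't need to write this much code by hand.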
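
As a minimal sketch of one way to close that gap, read_fwf also accepts an explicit widths list, so each blank cell stays in its own column and comes back as NaN. The small table below is made up for illustration (it is not the real file), and the widths are an assumption about a right-aligned layout; for the real file you would presumably also keep skiprows and supply the full list of column widths.

```python
import io

import pandas as pd

# Assumed column widths for this made-up sample: a 4-character year
# followed by right-aligned 7-character value columns.
widths = [4, 7, 7, 7]
rows = [
    ("Year", "JAN", "FEB", "MAR"),
    ("2013", "1.0", "-0.1", "-0.7"),
    ("2014", "2.1", "", "2.9"),  # FEB is blank, as in an incomplete year
]

# Build the fixed-width text programmatically so the padding is exact.
fmt = "".join("{:>%d}" % w for w in widths)
text = "\n".join(fmt.format(*row) for row in rows)

# With explicit widths, the blank FEB cell is parsed as NaN
# rather than being swallowed by the whitespace delimiter.
df = pd.read_fwf(io.StringIO(text), widths=widths)
print(df)
```

Passing widths trades convenience for control: the layout has to be known up front, but missing cells no longer move their neighbours.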

Answer 1 (score: 0)

You can use the genfromtxt function from numpy:

import numpy as np
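# skip_header replaces the older skiprows keyword, which newer numpy versions no longer accept
data = np.genfromtxt('UK.txt', skip_header=8, delimiter=(4,7,7,7,7,7,7,7,7,7,7,7,7,8,7,7,7,8))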

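This fills in the missing values automatically, but you still need a way to determine the column widths and the number of lines to skip.

Here is how to get the column widths from the header: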
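
For the number of lines to skip, one possible approach is to scan for the header row instead of hard-coding a count. The sketch below assumes the header is the first line whose first token is "Year"; the function name find_header_index and the sample lines are hypothetical, not taken from the real file.

```python
def find_header_index(lines, first_column="Year"):
    """Return the index of the first line whose first token is `first_column`."""
    for index, line in enumerate(lines):
        # split()[:1] is an empty list for blank lines, so they never match.
        if line.split()[:1] == [first_column]:
            return index
    raise ValueError("no header line starting with %r" % first_column)


# Made-up preamble followed by a header and one data row.
sample_lines = [
    "Some descriptive title",
    "Last updated: unknown",
    "",
    "Year    JAN    FEB",
    "2013    1.0   -0.1",
]
header_index = find_header_index(sample_lines)
print(header_index)
```

The returned index is the number of preamble lines, so the header itself would be one more line to skip when only the data rows are wanted.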

import re
header="Year    JAN    FEB    MAR    APR    MAY    JUN    JUL    AUG    SEP    OCT    NOV    DEC     WIN    SPR    SUM    AUT     ANN"
# Each match is a column name plus the whitespace before it, so its length is the column width
cols=re.findall(r"\s*\S+",header)
delimiter=tuple(len(c) for c in cols)
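
If you would rather not depend on numpy, the same widths can drive a plain string-slicing parser. This is only a sketch under the assumption that every column is right-aligned under its header; the helper names column_widths and split_fixed_width and the sample row are hypothetical.

```python
import re


def column_widths(header):
    """Width of each column, counting the whitespace that precedes its name."""
    return [len(match) for match in re.findall(r"\s*\S+", header)]


def split_fixed_width(line, widths):
    """Slice `line` into fields of the given widths, stripping the padding."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields


# Made-up header and row, built with format() so the padding is exact.
header = "Year" + "".join("{:>7}".format(name) for name in ("JAN", "FEB", "MAR"))
line = "2014" + "{:>7}".format("2.1") + " " * 7 + "{:>7}".format("2.9")

widths = column_widths(header)
fields = split_fixed_width(line, widths)
# Blank cells become NaN; everything after the year is a float.
values = [float(field) if field else float("nan") for field in fields[1:]]
print(widths, fields, values)
```

Because slicing is positional, a blank cell yields an empty string in its own slot instead of disappearing the way it does with split().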

Answer 2 (score: 0)

import collections
import re

def parse_file(data_file):
    result_dict = collections.OrderedDict()
    if not data_file:
        return result_dict

    with open(data_file) as f:
        headers = []
        keys = []
        # Assumes the first line of the file is the header row
        for counter, line in enumerate(f, start=1):
            line = line.strip()
            if counter == 1:
                headers = re.findall(r'\w+', line)
                # Drop 'Year' so the keys line up with values[1:]
                keys = headers[1:]
            else:
                # A field is either a number or a run of 3-4 spaces (a blank cell)
                values = re.findall(r'([\d\-\.]+|(?:\s){3,4})(?:(?:\s){3,4})?', line)
                year = parse_to_int(values[0])

                if len(headers) != len(values):
                    # Pad short rows so every column gets a value
                    diff_list = ['NaN' for i in range(len(headers) - len(values))]
                    values.extend(diff_list)
                data_list = [parse_to_float(x) for x in values[1:]]
                data_dict = collections.OrderedDict(zip(keys, data_list))
                result_dict[year] = data_dict

    return result_dict
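
The parse_to_int and parse_to_float helpers are never shown in the thread. The following is a guess at what they might look like, written so that both the 'NaN' padding and whitespace-only fields produced above turn into a float NaN; treat these bodies as assumptions, not the original author's code.

```python
def parse_to_int(text):
    """Convert a field such as ' 2014 ' to an int (hypothetical helper)."""
    return int(text.strip())


def parse_to_float(text):
    """Convert a field to a float, mapping blank fields to NaN (hypothetical helper)."""
    text = text.strip()
    if not text:
        # A blank cell in the table: report it as missing.
        return float("nan")
    # float() already understands the literal string 'NaN'.
    return float(text)


print(parse_to_int(" 2014 "), parse_to_float("-0.1"), parse_to_float("   "))
```

Returning NaN keeps every row the same length, so incomplete years such as 2014 can be stored without being skipped.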