Question

I have a highly unstructured text data file whose records usually span multiple input lines.

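The fields of each record are separated by spaces, as in ordinary text, so each field has to be recognized from other information rather than from a "CSV field separator".
Many different records can also share the first two fields, which are:
- the day of the month (1 to 31);
- the first three letters of the month.
However, I know that only the first, "special" record carries the day and month-prefix fields; the records that follow it and belong to the same "timestamp" (day/month) do not contain that information.
I know for sure that the third field is an unstructured sentence of many words, for example "operation performed with this tool in this place for this reason".
I know that each record can contain one or two numeric fields as its last fields.
I also know that each new record starts on a new line (both the first record of a day/month and the following records of the same day/month).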

So, in summary, each record should be converted to a CSV record with a structure similar to this: DD,MM,unstructured text bla bla bla,number1,number2

An example of the data is the following:

> 20 Sep This is the first record, bla bla bla 10.45 
> Text unstructured
> of the second record bla bla
> 406.25 10001 
> 6 Oct Text of the third record thatspans on many 
> lines bla bla bla 60 
> 28 Nov Fourth 
> record 
> 27.43 
> Second record of the
> day/month BUT the fifth record of the file 500 90.25

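I developed the following parser in Python, but I can't figure out how to read multiple lines of the input file and logically treat them as a single piece of information. I think I should use two nested loops, but I can't manage the loop indexes.

Thank you very much for your help!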

# I need to deal with is_int() and is_float() functions to handle records with 2 numbers
# that must be separated by a csv_separator in the output record...

import sys

days_in_month = range(1, 32)  # 1..31 inclusive; range() excludes the upper bound
months_in_year = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

csv_separator = '|'

def is_month(s):
    return s in months_in_year


def is_day_in_month(n_int):
    try:
        return int(n_int) in days_in_month
    except ValueError:
        return False

#file_in = open('test1.txt','r')
file_in = open(sys.argv[1],'r')
#file_out = open("out_test1.txt", "w") # Use "a" instead of "w" to append to file
file_out = open(sys.argv[2], "w") # Use "a" instead of "w" to append to file

counter = 0
for line in file_in:
    counter = counter + 1
    line_arr = line.split()
    date_str = ''
    if line_arr and is_day_in_month(line_arr[0]):  # guard against blank lines
        if len(line_arr) > 1 and is_month(line_arr[1]):
            # Date!
            num_month = months_in_year.index(line_arr[1]) + 1
            date_str = '%02d' % int(line_arr[0]) + '/' + '%02d' % num_month + '/' + '2011' + csv_separator
        elif len(line_arr) > 1:
            # No date, but first number less than 31 (number of days in a month)
            date_str = ' '.join(line_arr) + csv_separator
        else:
            # No date, and there is only a number less than 31 (number of days in a month)
            date_str = line_arr[0] + csv_separator
    else:
        # there is not a date (a generic string, or a number higher than 31)
        date_str = ' '.join(line_arr) + csv_separator
    file_out.write(date_str + csv_separator + 'line_number_' + str(counter) + '\n')

file_in.close()
file_out.close()

Answer 1

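You could use something like the following to reformat your input text. The code will most likely need some cleanup, depending on what is allowed in your input.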

lines = file_in.readlines()
records = []
record = ""
i = 0

while i < len(lines):
    ## remove any leading or trailing white space then split on ' '
    line_arr = lines[i].strip().split(' ')

You may need to change this part, because here I assume that a record must end with at least one number. Also, some people frown on using try/except like this. (This part is taken from "How do I check if a string is a number (float) in Python?")

    ## check for a float at the end of the line
    try:
        float(line_arr[-1])
    except ValueError:
        ## not a float
        ## remove the newline and append to the previous line
        record = record.replace('\n', ' ') + lines[i]
    else:
        ## there is a float at the end of the current line
        ## append to the previous text, then add the record to records
        record = record.replace('\n', ' ') + lines[i]
        records.append(record)
        record = ""
    i += 1

When this is added to your code, the output is:

20/09/2011||line_number_1
Text unstructured of the second record bla bla 406.25 10001||line_number_2
06/10/2011||line_number_3
28/11/2011||line_number_4
Second record of the day/month BUT the fifth record of the file 500 90.25||line_number_5

I think this is close to what you are looking for.
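
The grouping idea in this answer can also be written as a generator. This is only a minimal sketch under the same assumption (a record ends on the first line whose last token parses as a float); `group_records` and `sample` are hypothetical names used for illustration. Unlike the loop above, it also emits any trailing text that never reaches a closing number instead of dropping it.

```python
def group_records(lines):
    """Yield one string per record, joining physical lines with spaces.

    Assumption: a record ends on the first line whose last
    whitespace-separated token parses as a float.
    """
    parts = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        parts.extend(tokens)
        try:
            float(tokens[-1])
        except ValueError:
            continue  # no trailing number yet: the record continues
        yield ' '.join(parts)
        parts = []
    if parts:
        # Trailing text with no closing number: emit it rather than drop it.
        yield ' '.join(parts)


sample = """\
20 Sep This is the first record, bla bla bla 10.45
Text unstructured
of the second record bla bla
406.25 10001
6 Oct Text of the third record thatspans on many
lines bla bla bla 60
28 Nov Fourth
record
27.43
Second record of the
day/month BUT the fifth record of the file 500 90.25
"""

records = list(group_records(sample.splitlines()))
for record in records:
    print(record)
```

Note that `float()` also accepts words such as "nan" and "inf", so a text line ending in one of those would be taken as the end of a record; a stricter check may be needed if such words can occur in the data.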

Answer 2

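I believe this is a solution that uses some of the essential elements of your approach. When it recognizes a date, it strips it from the beginning of the line and saves it for later use. Similarly, when numeric items are present, it removes them from the right end of the line, leaving the unstructured text.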

lines = '''\
20 Sep This is the first record, bla bla bla 10.45 
Text unstructured
of the second record bla bla
406.25 10001 
6 Oct Text of the third record thatspans on many 
lines bla bla bla 60 
28 Nov Fourth 
record 
27.43 
Second record of the
day/month BUT the fifth record of the file 500 90.25'''

days_in_month = [str(item) for item in range(1, 32)]  # '1'..'31'
months_in_year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

lines = [line.strip() for line in lines.split('\n') if line]

previous_date = None
previous_month = None
for line in lines:
    item = line.split()
    if len(item) >= 2 and item[0] in days_in_month and item[1] in months_in_year:
        # A date starts the line: remember it and strip it off.
        previous_date = item.pop(0)
        previous_month = item.pop(0)
    # Pop up to two numbers off the right end of the line.
    try:
        number_2 = float(item[-1])
        item.pop(-1)
    except (IndexError, ValueError):
        number_2 = None
    number_1 = None
    if number_2 is not None:
        try:
            number_1 = float(item[-1])
            item.pop(-1)
        except (IndexError, ValueError):
            number_1 = None
    # A single number always goes into number_1.
    if number_1 is None and number_2 is not None:
        number_1 = number_2
        number_2 = None
    # Show whole numbers as ints.
    if number_1 is not None and number_1 == int(number_1):
        number_1 = int(number_1)
    if number_2 is not None and number_2 == int(number_2):
        number_2 = int(number_2)
    print(previous_date, previous_month, ' '.join(item), number_1, number_2)
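
As a follow-up, here is a hedged sketch of how the same field extraction could produce the CSV layout the question asks for (DD,MM,text,number1,number2). It assumes the physical lines have already been joined into one string per record; `to_rows`, `is_number`, and `MONTHS` are illustrative names, not part of the code above, and a single trailing number is placed in the number1 column.

```python
import csv
import io

MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def is_number(token):
    """Return True if the token parses as a float."""
    try:
        float(token)
    except ValueError:
        return False
    return True


def to_rows(records):
    """Turn joined record strings into (day, month, text, n1, n2) tuples."""
    day = month = ''
    for record in records:
        tokens = record.split()
        # A leading "DD Mon" pair sets the date for this and later records.
        if (len(tokens) >= 2 and tokens[0].isdecimal()
                and 1 <= int(tokens[0]) <= 31 and tokens[1] in MONTHS):
            day = '%02d' % int(tokens[0])
            month = '%02d' % (MONTHS.index(tokens[1]) + 1)
            tokens = tokens[2:]
        # Peel off at most two trailing numeric tokens, keeping their order.
        numbers = []
        while tokens and len(numbers) < 2 and is_number(tokens[-1]):
            numbers.insert(0, tokens.pop())
        numbers += [''] * (2 - len(numbers))  # pad missing numbers
        yield (day, month, ' '.join(tokens), numbers[0], numbers[1])


records = [
    '20 Sep This is the first record, bla bla bla 10.45',
    'Text unstructured of the second record bla bla 406.25 10001',
    '6 Oct Text of the third record thatspans on many lines bla bla bla 60',
]

buffer = io.StringIO()
csv.writer(buffer).writerows(to_rows(records))
print(buffer.getvalue())
```

Using `csv.writer` rather than joining with a separator by hand means that text containing a comma, such as the first record, is quoted automatically.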

Reading records that span multiple input lines in Python
