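I have a highly unstructured text data file whose records usually span multiple input lines.
In short, each record should be converted to a CSV record with a structure similar to this: DD,MM,unstructured text bla bla bla,number1,number2
An example of the data is as follows: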
> 20 Sep This is the first record, bla bla bla 10.45
> Text unstructured
> of the second record bla bla
> 406.25 10001
> 6 Oct Text of the third record thatspans on many
> lines bla bla bla 60
> 28 Nov Fourth
> record
> 27.43
> Second record of the
> day/month BUT the fifth record of the file 500 90.25
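I developed the following parser in Python, but I can't figure out how to read multiple lines of the input file and treat them logically as a single piece of information. I think I should nest one loop inside another, but I can't manage the loop indexes.
Thank you very much for your help!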
# I need to deal with is_int() and is_float() functions to handle records with 2 numbers
# that must be separated by a csv_separator in the output record...
import sys

days_in_month = range(1, 32)  # valid days 1..31 (range excludes the upper bound)
months_in_year = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
csv_separator = '|'

def is_month(s):
    return s in months_in_year

def is_day_in_month(n_int):
    try:
        return int(n_int) in days_in_month
    except ValueError:
        return False

#file_in = open('test1.txt','r')
file_in = open(sys.argv[1], 'r')
#file_out = open("out_test1.txt", "w") # Use "a" instead of "w" to append to file
file_out = open(sys.argv[2], "w") # Use "a" instead of "w" to append to file
counter = 0
for line in file_in:
    counter = counter + 1
    line_arr = line.split()
    if not line_arr:
        continue  # skip blank lines, which would otherwise raise IndexError below
    date_str = ''
    if is_day_in_month(line_arr[0]):
        if len(line_arr) > 1 and is_month(line_arr[1]):
            # Date!
            num_month = months_in_year.index(line_arr[1]) + 1
            date_str = '%02d' % int(line_arr[0]) + '/' + '%02d' % num_month + '/' + '2011' + csv_separator
        elif len(line_arr) > 1:
            # No date, but the first token is a number no greater than 31
            date_str = ' '.join(line_arr) + csv_separator
        else:
            # No date, and the line holds only a number no greater than 31
            date_str = line_arr[0] + csv_separator
    else:
        # there is not a date (a generic string, or a number higher than 31)
        date_str = ' '.join(line_arr) + csv_separator
    print(date_str + csv_separator + 'line_number_' + str(counter), file=file_out)
file_in.close()
file_out.close()
Answer 0 (score: 2)
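You can reformat the input text with something like this. The code will most likely need some cleanup depending on what is allowed in your input.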
lines = file_in.readlines()
records = []
record = ""
i = 0
while i < len(lines):
    ## remove any leading or trailing white space then split on ' '
    line_arr = lines[i].strip().split(' ')
You may need to change this part, since here I assume that a record must end with at least one number. Also, some people frown on using try/except like this. (This part comes from How do I check if a string is a number (float) in Python?)
    ## check for float at end of line
    try:
        float(line_arr[-1])
    except ValueError:
        ## not a float
        ## remove new line and add to previous line
        record = record.replace('\n', ' ') + lines[i]
    else:
        ## there is a float at the end of current line
        ## add to previous then add record to records
        record = record.replace('\n', ' ') + lines[i]
        records.append(record)
        record = ""
    i += 1
The output when this is added to your code is:
20/09/2011||line_number_1
Text unstructured of the second record bla bla 406.25 10001||line_number_2
06/10/2011||line_number_3
28/11/2011||line_number_4
Second record of the day/month BUT the fifth record of the file 500 90.25||line_number_5
I think this is close to what you are looking for.
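The grouping step in this answer can also be sketched as a generator, which avoids handling the loop index by hand. This is only a minimal sketch under the same assumption as above, namely that a record ends on a line whose last token is a number; `ends_with_number` and `group_records` are hypothetical names, not part of the code above.

```python
def ends_with_number(line):
    """Return True if the last whitespace-separated token parses as a float.

    Note: float() also accepts tokens such as 'nan' and 'inf'.
    """
    tokens = line.split()
    if not tokens:
        return False
    try:
        float(tokens[-1])
    except ValueError:
        return False
    return True


def group_records(lines):
    """Yield one string per record, joining its physical lines with spaces.

    Assumption: a record ends on a line whose last token is a number.
    """
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        parts.append(line)
        if ends_with_number(line):
            yield ' '.join(parts)
            parts = []
    if parts:
        # trailing text with no closing number: emit it rather than drop it
        yield ' '.join(parts)


sample = [
    "20 Sep This is the first record, bla bla bla 10.45",
    "Text unstructured",
    "of the second record bla bla",
    "406.25 10001",
    "6 Oct Text of the third record thatspans on many",
    "lines bla bla bla 60",
    "28 Nov Fourth",
    "record",
    "27.43",
    "Second record of the",
    "day/month BUT the fifth record of the file 500 90.25",
]

for record in group_records(sample):
    print(record)
```

Because the function accepts any iterable of lines, an open file object could be passed in place of `sample`.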
Answer 1 (score: 0)
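I believe this is a solution that uses some of the basic elements of your approach. When it recognizes a date, it strips it from the start of the line and saves it for later use. Similarly, when numeric items are present, it strips them from the right-hand end of the line, leaving the unstructured text.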
lines = '''\
20 Sep This is the first record, bla bla bla 10.45
Text unstructured
of the second record bla bla
406.25 10001
6 Oct Text of the third record thatspans on many
lines bla bla bla 60
28 Nov Fourth
record
27.43
Second record of the
day/month BUT the fifth record of the file 500 90.25'''
days_in_month = [str(item) for item in range(1, 32)]  # '1'..'31'
months_in_year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
lines = [line.strip() for line in lines.split('\n') if line]
previous_date = None
previous_month = None
for line in lines:
    item = line.split()
    if len(item) >= 2 and item[0] in days_in_month and item[1] in months_in_year:
        # leading date: remember it and strip it from the line
        previous_date = item[0]
        previous_month = item[1]
        item.pop(0)
        item.pop(0)
    # try to strip a number from the right-hand end of the line
    try:
        number_2 = float(item[-1])
        item.pop(-1)
    except (ValueError, IndexError):
        number_2 = None
    number_1 = None
    if number_2 is not None:
        # a second number may precede the one just removed
        try:
            number_1 = float(item[-1])
            item.pop(-1)
        except (ValueError, IndexError):
            number_1 = None
    if number_1 is None and number_2 is not None:
        # only one number found: report it as number_1
        number_1 = number_2
        number_2 = None
    # show whole-number values as ints
    if number_1 and number_1 == int(number_1):
        number_1 = int(number_1)
    if number_2 and number_2 == int(number_2):
        number_2 = int(number_2)
    print(previous_date, previous_month, ' '.join(item), number_1, number_2)
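To get from merged records to the DD,MM,text,number1,number2 layout requested in the question, one option is the standard `csv` module, which quotes the text field when it contains a comma (as the first sample record does). The sketch below rests on several assumptions: each record has already been merged into a single string, a record without a leading date inherits the date of the previous record, at most two trailing tokens are numbers, and numbers are plain decimals. `parse_record` and `to_csv` are hypothetical names.

```python
import csv
import io
import re

MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
DAY = re.compile(r'[0-9]{1,2}')
# Assumption: numbers are plain decimals such as 60, 10.45 or -3.5
NUMBER = re.compile(r'-?[0-9]+(\.[0-9]+)?')


def parse_record(record, last_date=(None, None)):
    """Split a merged record into (day, month, text, number1, number2)."""
    tokens = record.split()
    day, month = last_date
    # leading "DD Mon" date, if present
    if (len(tokens) >= 2 and DAY.fullmatch(tokens[0])
            and 1 <= int(tokens[0]) <= 31 and tokens[1] in MONTHS):
        day, month = int(tokens[0]), MONTHS.index(tokens[1]) + 1
        tokens = tokens[2:]
    # up to two trailing numeric tokens, kept as the original strings
    numbers = []
    while tokens and len(numbers) < 2 and NUMBER.fullmatch(tokens[-1]):
        numbers.insert(0, tokens.pop())
    numbers += [''] * (2 - len(numbers))
    return day, month, ' '.join(tokens), numbers[0], numbers[1]


def to_csv(records):
    """Return CSV text with one row per merged record."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    last_date = (None, None)
    for record in records:
        day, month, text, n1, n2 = parse_record(record, last_date)
        last_date = (day, month)  # carried forward to date-less records
        writer.writerow(['' if day is None else '%02d' % day,
                         '' if month is None else '%02d' % month,
                         text, n1, n2])
    return out.getvalue()


merged = [
    "20 Sep This is the first record, bla bla bla 10.45",
    "Text unstructured of the second record bla bla 406.25 10001",
    "6 Oct Text of the third record thatspans on many lines bla bla bla 60",
    "28 Nov Fourth record 27.43",
    "Second record of the day/month BUT the fifth record of the file 500 90.25",
]
print(to_csv(merged), end='')
```

When a record carries a single number it goes into number1 and number2 is left empty, matching the behaviour of the code above. Keeping the numbers as strings avoids any change in how they are formatted.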