Parsing the following uniform data

Asked: 2012-11-28 06:07:05

Tags: python parsing

I'm trying to write a Python script that takes information of the following kind:

http://ucolick.org/calendar/keckcal2009-20/keck2012.12dec

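http://ucolick.org/calendar/keckcal2009-20/keck2012.18dec (the full data is reproduced below). As you can see, the files are machine generated (and the two files contain slightly different data). Also, several columns contain nothing but whitespace.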

What I ultimately want is a dictionary like

astro_times_dict['DEC 01']['TWILIGHT ENDS']['12'] = '18:33'
astro_times_dict['DEC 01']['TWILIGHT ENDS']['18'] = '19:00'

But I'm not sure of a clean way to do this that isn't by hand. I started with:

# Print only the data rows; header and footer lines split into too few tokens.
with open('keck2012.12dec.txt') as f:
    for line in f:
        if len(line.split()) > 15:
            print(line, end='')

This prints only the data portion, but it isn't clear to me how to handle the (sometimes) missing data.


Here are the full contents of one of the linked files:

                                KECK OBSERVATORY CALENDAR FOR 2012  -ASTRONOMICAL
                                 (computed for altitude 4160.0 m)
                                                                    ASTRONOMICAL(18 deg)  TWILIGHT/DAWN   MOON(midnight)
  DATE(HST)   SUN   TWILIGHT ENDS  MOON   MOON   DAWN BEGINS   SUN     SIDEREAL TIMES     NIGHT (18 deg)          Zenith
    2012      SET                  RISE   SET                  RISE   TWI    MID   DAWN   LENGTH DARK___    RA  DEC Dist
                      12     18                   18     12            18            18      h    h    %   h m  d m  deg
 SAT DEC 01  17 53  18 33  19 00  20 23         05 23  05 50  06 30  23 24  04 25  09 49   10.4  1.4  13  0735 1712   45
 SUN DEC 02  17 53  18 33  19 00  21 15         05 24  05 51  06 31  23 28  04 29  09 54   10.4  2.2  21  0824 1416   56
 MON DEC 03  17 53  18 33  19 00  22 06         05 24  05 51  06 32  23 32  04 33  09 57   10.4  3.1  29  0912 1041   68
 TUE DEC 04  17 53  18 33  19 00  22 59         05 25  05 52  06 32  23 36  04 37  10 02   10.4  4.0  38  1000 0633   79
 WED DEC 05  17 53  18 34  19 01  23 51         05 25  05 52  06 33  23 41  04 40  10 06   10.4  4.8  46  1045 0145  >90
 THU DEC 06  17 53  18 34  19 01  00 46         05 26  05 53  06 33  23 45  04 44  10 11   10.4  5.8  55  1138-0237  >90
 FRI DEC 07  17 54  18 34  19 01  01 42         05 27  05 54  06 34  23 49  04 48  10 16   10.4  6.7  64  1229-0719  >90
 SAT DEC 08  17 54  18 34  19 01  02 42  14 02  05 27  05 54  06 35  23 52  04 52  10 20   10.4  7.7  73  1323-1146  >90
 SUN DEC 09  17 54  18 35  19 02  03 44  14 49  05 28  05 55  06 35  23 57  04 56  10 25   10.4  8.7  83  1420-1541  >90
 MON DEC 10  17 55  18 35  19 02  04 49  15 41  05 28  05 55  06 36  00 01  05 00  10 29   10.4  9.8  93  1520-1842  >90
 TUE DEC 11  17 55  18 35  19 03  05 55  16 39  05 29  05 56  06 36  00 06  05 04  10 34   10.4 10.4 100  1624-2030  >90
 WED DEC 12  17 55  18 36  19 03  06 59  17 42  05 29  05 56  06 37  00 10  05 08  10 38   10.4 10.4 100  1728-2052  >90
 THU DEC 13  17 56  18 36  19 03         18 47  05 30  05 57  06 38  00 14  05 12  10 43   10.5 10.5 100  1831-1945  >90
 FRI DEC 14  17 56  18 37  19 04         19 52  05 30  05 58  06 38  00 19  05 16  10 47   10.4  9.6  92  1933-1718  >90
 SAT DEC 15  17 56  18 37  19 04         20 55  05 31  05 58  06 39  00 23  05 20  10 52   10.4  8.6  82  2031-1349  >90
 SUN DEC 16  17 57  18 37  19 05         21 56  05 32  05 59  06 39  00 28  05 24  10 57   10.4  7.6  72  2125-0938  >90
 MON DEC 17  17 57  18 38  19 05         22 52  05 32  05 59  06 40  00 32  05 28  11 01   10.4  6.7  63  2217-0506  >90
 TUE DEC 18  17 58  18 38  19 05         23 47  05 33  06 00  06 40  00 36  05 32  11 06   10.5  5.8  55  2306-0028  >90
 WED DEC 19  17 58  18 39  19 06         00 39  05 33  06 00  06 41  00 41  05 36  11 10   10.4  4.9  46  2355 0401   84
 THU DEC 20  17 59  18 39  19 06         01 30  05 34  06 01  06 41  00 45  05 40  11 14   10.5  4.1  38  0041 0818   73
 FRI DEC 21  17 59  18 40  19 07         02 21  05 34  06 01  06 42  00 50  05 44  11 18   10.4  3.2  30  0129 1209   61
 SAT DEC 22  18 00  18 40  19 07         03 11  05 35  06 02  06 42  00 54  05 47  11 23   10.5  2.4  22  0216 1527   50
 SUN DEC 23  18 00  18 41  19 08  14 22  04 02  05 35  06 02  06 43  00 59  05 51  11 27   10.5  1.6  14  0305 1804   39
 MON DEC 24  18 01  18 41  19 08  15 05  04 52  05 36  06 03  06 43  01 03  05 55  11 32   10.5  0.7   7  0355 1954   28
 TUE DEC 25  18 01  18 42  19 09  15 50  05 41  05 36  06 03  06 44  01 07  05 59  11 36   10.5  0.0   0  0447 2050   17
 WED DEC 26  18 02  18 42  19 10  16 38  06 29  05 36  06 04  06 44  01 12  06 03  11 40   10.4  0.0   0  0538 2049   06
 THU DEC 27  18 02  18 43  19 10  17 28         05 37  06 04  06 44  01 16  06 07  11 45   10.5  0.0   0  0630 1950   05
 FRI DEC 28  18 03  18 44  19 11  18 19         05 37  06 04  06 45  01 21  06 11  11 49   10.4  0.0   0  0721 1757   17
 SAT DEC 29  18 04  18 44  19 11  19 11         05 38  06 05  06 45  01 25  06 15  11 54   10.5  0.0   0  0811 1513   28
 SUN DEC 30  18 04  18 45  19 12  20 03         05 38  06 05  06 46  01 30  06 19  11 58   10.4  0.8   8  0900 1147   40
 MON DEC 31  18 05  18 45  19 12  20 55         05 38  06 06  06 46  01 34  06 23  12 02   10.4  1.7  16  0949 0746   51

          ONE LINE REFERS TO EVENING DATE       LAST QUARTER   Dec 06   15:32 UT
          AND FOLLOWING MORNING.                NEW MOON       Dec 13   08:41 UT
          All dates and times are zone HST      FIRST QUARTER  Dec 20   05:17 UT
          in upper table (except sid time).     FULL MOON      Dec 28   10:22 UT

3 Answers:

Answer 0: (score: 1)

Here's what I have so far (based on the comments so far).

# Collapse a blank column (six spaces) to two, then split on double spaces.
with open(filename) as f:
    for line in f:
        if len(line.split()) > 15:
            print(line.strip().replace('      ', '  ').split('  '))

Which outputs:

['SAT DEC 01', '17 53', '18 33', '19 00', '20 23', '', ' 05 23', '05 50', '06 30', '22 57', '04 25', '10 16', ' 11.3', '1.8', '16', '0735 1712', ' 45']
['SUN DEC 02', '17 53', '18 33', '19 00', '21 15', '', ' 05 24', '05 51', '06 31', '23 01', '04 29', '10 21', ' 11.3', '2.7', '23', '0824 1416', ' 56']
['MON DEC 03', '17 53', '18 33', '19 00', '22 06', '', ' 05 24', '05 51', '06 32', '23 05', '04 33', '10 25', ' 11.3', '3.6', '31', '0912 1041', ' 68']
['TUE DEC 04', '17 53', '18 33', '19 00', '22 59', '', ' 05 25', '05 52', '06 32', '23 09', '04 37', '10 29', ' 11.3', '4.4', '39', '1000 0633', ' 79']
['WED DEC 05', '17 53', '18 34', '19 01', '23 51', '', ' 05 25', '05 52', '06 33', '23 14', '04 40', '10 33', ' 11.3', '5.3', '46', '1045 0145', '>90']
['THU DEC 06', '17 53', '18 34', '19 01', '00 46', '', ' 05 26', '05 53', '06 33', '23 17', '04 44', '10 38', ' 11.3', '6.2', '54', '1138-0237', '>90']
['FRI DEC 07', '17 54', '18 34', '19 01', '01 42', '', ' 05 27', '05 54', '06 34', '23 21', '04 48', '10 43', ' 11.3', '7.1', '62', '1229-0719', '>90']
['SAT DEC 08', '17 54', '18 34', '19 01', '02 42', '14 02', '05 27', '05 54', '06 35', '23 25', '04 52', '10 47', ' 11.3', '8.1', '71', '1323-1146', '>90']
['SUN DEC 09', '17 54', '18 35', '19 02', '03 44', '14 49', '05 28', '05 55', '06 35', '23 30', '04 56', '10 52', ' 11.3', '9.1', '80', '1420-1541', '>90']
['MON DEC 10', '17 55', '18 35', '19 02', '04 49', '15 41', '05 28', '05 55', '06 36', '23 34', '05 00', '10 56', ' 11.3 10.2', '90', '1520-1842', '>90']
['TUE DEC 11', '17 55', '18 35', '19 03', '05 55', '16 39', '05 29', '05 56', '06 36', '23 38', '05 04', '11 01', ' 11.3 11.3', '99', '1624-2030', '>90']
['WED DEC 12', '17 55', '18 36', '19 03', '06 59', '17 42', '05 29', '05 56', '06 37', '23 43', '05 08', '11 05', ' 11.3 11.3 100', '1728-2052', '>90']
['THU DEC 13', '17 56', '18 36', '19 03', '', ' 18 47', '05 30', '05 57', '06 38', '23 47', '05 12', '11 10', ' 11.3 11.2', '98', '1831-1945', '>90']
['FRI DEC 14', '17 56', '18 37', '19 04', '', ' 19 52', '05 30', '05 58', '06 38', '23 52', '05 16', '11 15', ' 11.3 10.1', '88', '1933-1718', '>90']
['SAT DEC 15', '17 56', '18 37', '19 04', '', ' 20 55', '05 31', '05 58', '06 39', '23 56', '05 20', '11 19', ' 11.3', '9.1', '79', '2031-1349', '>90']
['SUN DEC 16', '17 57', '18 37', '19 05', '', ' 21 56', '05 32', '05 59', '06 39', '00 00', '05 24', '11 24', ' 11.4', '8.1', '70', '2125-0938', '>90']
['MON DEC 17', '17 57', '18 38', '19 05', '', ' 22 52', '05 32', '05 59', '06 40', '00 05', '05 28', '11 28', ' 11.4', '7.1', '62', '2217-0506', '>90']
['TUE DEC 18', '17 58', '18 38', '19 05', '', ' 23 47', '05 33', '06 00', '06 40', '00 09', '05 32', '11 33', ' 11.4', '6.2', '54', '2306-0028', '>90']
['WED DEC 19', '17 58', '18 39', '19 06', '', ' 00 39', '05 33', '06 00', '06 41', '00 14', '05 36', '11 37', ' 11.4', '5.4', '47', '2355 0401', ' 84']
['THU DEC 20', '17 59', '18 39', '19 06', '', ' 01 30', '05 34', '06 01', '06 41', '00 18', '05 40', '11 42', ' 11.4', '4.5', '39', '0041 0818', ' 73']
['FRI DEC 21', '17 59', '18 40', '19 07', '', ' 02 21', '05 34', '06 01', '06 42', '00 23', '05 44', '11 46', ' 11.4', '3.7', '32', '0129 1209', ' 61']
['SAT DEC 22', '18 00', '18 40', '19 07', '', ' 03 11', '05 35', '06 02', '06 42', '00 27', '05 47', '11 50', ' 11.4', '2.8', '25', '0216 1527', ' 50']
['SUN DEC 23', '18 00', '18 41', '19 08', '14 22', '04 02', '05 35', '06 02', '06 43', '00 32', '05 51', '11 54', ' 11.4', '2.0', '17', '0305 1804', ' 39']
['MON DEC 24', '18 01', '18 41', '19 08', '15 05', '04 52', '05 36', '06 03', '06 43', '00 35', '05 55', '11 59', ' 11.4', '1.2', '10', '0355 1954', ' 28']
['TUE DEC 25', '18 01', '18 42', '19 09', '15 50', '05 41', '05 36', '06 03', '06 44', '00 40', '05 59', '12 03', ' 11.3', '0.4', ' 3', '0447 2050', ' 17']
['WED DEC 26', '18 02', '18 42', '19 10', '16 38', '06 29', '05 36', '06 04', '06 44', '00 44', '06 03', '12 08', ' 11.4', '0.0', ' 0', '0538 2049', ' 06']
['THU DEC 27', '18 02', '18 43', '19 10', '17 28', '', ' 05 37', '06 04', '06 44', '00 49', '06 07', '12 12', ' 11.3', '0.0', ' 0', '0630 1950', ' 05']
['FRI DEC 28', '18 03', '18 44', '19 11', '18 19', '', ' 05 37', '06 04', '06 45', '00 54', '06 11', '12 16', ' 11.3', '0.0', ' 0', '0721 1757', ' 17']
['SAT DEC 29', '18 04', '18 44', '19 11', '19 11', '', ' 05 38', '06 05', '06 45', '00 58', '06 15', '12 21', ' 11.3', '0.4', ' 3', '0811 1513', ' 28']
['SUN DEC 30', '18 04', '18 45', '19 12', '20 03', '', ' 05 38', '06 05', '06 46', '01 03', '06 19', '12 25', ' 11.3', '1.3', '11', '0900 1147', ' 40']
['MON DEC 31', '18 05', '18 45', '19 12', '20 55', '', ' 05 38', '06 06', '06 46', '01 07', '06 23', '12 30', ' 11.3', '2.2', '19', '0949 0746', ' 51']

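I think this correctly identifies the columns with no data and keeps the remaining columns "together" well enough to parse easily from here. I'll leave the question open for a while in case someone else spots an error or a better way of doing this.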
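
One caveat with the double-space split: where a value fills its whole column, neighbouring numbers are separated by a single space, so entries such as ' 11.3 10.2' stay merged in the output above. A minimal sketch of an alternative follows, assuming the columns sit at the fixed character offsets seen in the file reproduced in the question. The offsets, the field names in `TIME_FIELDS`, and the helper `slice_times` are my own illustrations, not part of the file format.

```python
# Hypothetical field names for the 11 time columns, in file order.
TIME_FIELDS = [
    "sun_set", "twilight_ends_12", "twilight_ends_18", "moon_rise", "moon_set",
    "dawn_begins_18", "dawn_begins_12", "sun_rise",
    "sidereal_twi", "sidereal_mid", "sidereal_dawn",
]

def slice_times(line):
    """Cut one data row at fixed offsets; blank columns become None."""
    row = {"date": line[5:11]}          # e.g. 'DEC 01'
    for i, name in enumerate(TIME_FIELDS):
        start = 13 + 7 * i              # assumed: a time column starts every 7 characters
        cell = line[start:start + 5].strip()
        row[name] = cell.replace(" ", ":") if cell else None
    return row

sample = " SAT DEC 01  17 53  18 33  19 00  20 23         05 23  05 50  06 30  23 24  04 25  09 49   10.4  1.4  13  0735 1712   45"
row = slice_times(sample)
print(row["date"], row["moon_rise"], row["moon_set"])  # prints: DEC 01 20:23 None
```

Because every cell is taken by position, a blank column cannot shift the ones after it, and tightly packed numeric columns are not affected.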

Answer 1: (score: 0)

Here is a pyparsing solution:

data = """\
                                KECK OBSERVATORY CALENDAR FOR 2012  -ASTRONOMICAL
                                 (computed for altitude 4160.0 m)
                                                                    ASTRONOMICAL(18 deg)  TWILIGHT/DAWN   MOON(midnight)
  DATE(HST)   SUN   TWILIGHT ENDS  MOON   MOON   DAWN BEGINS   SUN     SIDEREAL TIMES     NIGHT (18 deg)          Zenith
    2012      SET                  RISE   SET                  RISE   TWI    MID   DAWN   LENGTH DARK___    RA  DEC Dist
                      12     18                   18     12            18            18      h    h    %   h m  d m  deg
 SAT DEC 01  17 53  18 33  19 00  20 23         05 23  05 50  06 30  23 24  04 25  09 49   10.4  1.4  13  0735 1712   45
 SUN DEC 02  17 53  18 33  19 00  21 15         05 24  05 51  06 31  23 28  04 29  09 54   10.4  2.2  21  0824 1416   56
 MON DEC 03  17 53  18 33  19 00  22 06         05 24  05 51  06 32  23 32  04 33  09 57   10.4  3.1  29  0912 1041   68
 TUE DEC 04  17 53  18 33  19 00  22 59         05 25  05 52  06 32  23 36  04 37  10 02   10.4  4.0  38  1000 0633   79
 WED DEC 05  17 53  18 34  19 01  23 51         05 25  05 52  06 33  23 41  04 40  10 06   10.4  4.8  46  1045 0145  >90
 THU DEC 06  17 53  18 34  19 01  00 46         05 26  05 53  06 33  23 45  04 44  10 11   10.4  5.8  55  1138-0237  >90
 FRI DEC 07  17 54  18 34  19 01  01 42         05 27  05 54  06 34  23 49  04 48  10 16   10.4  6.7  64  1229-0719  >90
 SAT DEC 08  17 54  18 34  19 01  02 42  14 02  05 27  05 54  06 35  23 52  04 52  10 20   10.4  7.7  73  1323-1146  >90
 SUN DEC 09  17 54  18 35  19 02  03 44  14 49  05 28  05 55  06 35  23 57  04 56  10 25   10.4  8.7  83  1420-1541  >90
 MON DEC 10  17 55  18 35  19 02  04 49  15 41  05 28  05 55  06 36  00 01  05 00  10 29   10.4  9.8  93  1520-1842  >90
 TUE DEC 11  17 55  18 35  19 03  05 55  16 39  05 29  05 56  06 36  00 06  05 04  10 34   10.4 10.4 100  1624-2030  >90
 WED DEC 12  17 55  18 36  19 03  06 59  17 42  05 29  05 56  06 37  00 10  05 08  10 38   10.4 10.4 100  1728-2052  >90
 THU DEC 13  17 56  18 36  19 03         18 47  05 30  05 57  06 38  00 14  05 12  10 43   10.5 10.5 100  1831-1945  >90
 FRI DEC 14  17 56  18 37  19 04         19 52  05 30  05 58  06 38  00 19  05 16  10 47   10.4  9.6  92  1933-1718  >90
""".splitlines()


from pyparsing import *

weekday = oneOf("SUN MON TUE WED THU FRI SAT")
month = oneOf("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC")
integer = Word(nums)
real = Regex(r'\d+\.\d*')
time = Regex(r'\d\d \d\d').leaveWhitespace()
time.setParseAction(lambda t: ':'.join(t[0].split()))
spacer = White(' ', exact=2).suppress()
blankTime = White(' ', exact=5)
blankTime.setParseAction(replaceWith(None))

dataParser = (weekday("weekday") + month("month") + integer("mday") + spacer + 
                (((time|blankTime) + spacer).leaveWhitespace()*11)('times') + 
                real('twi_len') + real('dawn_len') + integer('twi_pct') + 
                restOfLine)


fields = """sunset twilight ends moonrise moonset dawn 
            begins sunrise astro_twi astro_mid astro_dawn""".split()

def labelTimes(tokens):
    """parse-time transform to add results names for each time field"""
    for fname, value in zip(fields, tokens.times):
        # assign results name for each field
        tokens[fname] = value
    # no longer need this name, delete it
    del tokens['times']

dataParser.setParseAction(labelTimes)


for line in data[6:]:
    print(line)
    vals = dataParser.parseString(line)
    # uncomment this line to see all field names
    # print(vals.dump())
    print(vals.moonrise, vals.moonset)
    print()

Prints:

 SAT DEC 01  17 53  18 33  19 00  20 23         05 23  05 50  06 30  23 24  04 25  09 49   10.4  1.4  13  0735 1712   45
20:23 None

 SUN DEC 02  17 53  18 33  19 00  21 15         05 24  05 51  06 31  23 28  04 29  09 54   10.4  2.2  21  0824 1416   56
21:15 None

 MON DEC 03  17 53  18 33  19 00  22 06         05 24  05 51  06 32  23 32  04 33  09 57   10.4  3.1  29  0912 1041   68
22:06 None

 TUE DEC 04  17 53  18 33  19 00  22 59         05 25  05 52  06 32  23 36  04 37  10 02   10.4  4.0  38  1000 0633   79
22:59 None

etc.

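Pyparsing returns a ParseResults data structure, which can be used as a simple list, as a dict with keys, or as an object with attributes (if any of the parser elements were given names). In the sample code, I show how to use the field names to access the parsed values for moonrise and moonset. Uncomment the call to vals.dump() to see all the valid field names and values for each line.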

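Pyparsing's default behavior is to implicitly skip whitespace when matching parser elements, so we have to call leaveWhitespace on selected parts of the parser to disable that. In your given data set, it looks like the moonrise and moonset times are the only ones that can be blank, but this parser will detect any missing time and report it as None. (I'm not sure what the rightmost fields are, so those are left as an exercise for the OP.)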
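
The core idea here, matching each cell as either a time or exactly five blanks so that an empty column still occupies a slot, does not depend on pyparsing. As a rough standard-library illustration of the same idea (this is a regular-expression sketch, not a translation of the pyparsing grammar; `CELL`, `ROW_RE`, and `parse_row` are hypothetical names):

```python
import re

# One cell is either 'HH MM' or exactly five spaces (a missing value).
CELL = r"(\d\d \d\d| {5})"
# Weekday, month, day of month, then 11 cells each preceded by two spaces.
ROW_RE = re.compile(r"^ (\w{3}) (\w{3}) (\d\d)" + (r"  " + CELL) * 11)

def parse_row(line):
    """Return (date, [11 times]) with None for blank cells, or None if no match."""
    m = ROW_RE.match(line)
    if m is None:
        return None
    weekday, month, mday, *cells = m.groups()
    times = [c.replace(" ", ":") if c.strip() else None for c in cells]
    return f"{month} {mday}", times

line = " THU DEC 13  17 56  18 36  19 03         18 47  05 30  05 57  06 38  00 14  05 12  10 43   10.5 10.5 100  1831-1945  >90"
date, times = parse_row(line)
print(date, times[3], times[4])  # prints: DEC 13 None 18:47
```

Header and footer lines do not match the pattern, so `parse_row` returns None for them and no token-count filter is needed.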

Answer 2: (score: 0)

The struct module is very useful for parsing fixed-width data:

import struct

line = ' SAT DEC 01  17 53  18 33  19 00  20 23         05 23  05 50  06 30  23 24  04 25  09 49   10.4  1.4  13  0735 1712   45'
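# On Python 3, struct.unpack needs bytes, so encode the line and decode each field.
cols = [s.decode('ascii').strip() for s in struct.unpack('5s8s' + 11 * '7s' + '5s5s4s11s5s', line.encode('ascii'))]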
# ['SAT', 'DEC 01', '17 53', '18 33', '19 00', '20 23', '', '05 23', '05 50', '06 30', '23 24', '04 25', '09 49', '10.4', '1.4', '13', '0735 1712', '45']
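
To get from these columns to the nested dictionary the question asks for, one possible sketch follows. It assumes Python 3 (hence the encode/decode around `struct`) and rows of at most 120 characters. The labels in `LAYOUT` and the helper `row_to_dict` are my own guesses based on the header of the file, so adjust them as needed.

```python
import struct

FMT = '5s8s' + 11 * '7s' + '5s5s4s11s5s'   # same layout as above, 120 bytes

# (outer key, inner key) for the 11 time columns; labels are assumed from the header.
LAYOUT = [
    ('SUN SET', None), ('TWILIGHT ENDS', '12'), ('TWILIGHT ENDS', '18'),
    ('MOON RISE', None), ('MOON SET', None),
    ('DAWN BEGINS', '18'), ('DAWN BEGINS', '12'), ('SUN RISE', None),
    ('SIDEREAL TIMES', 'TWI'), ('SIDEREAL TIMES', 'MID'), ('SIDEREAL TIMES', 'DAWN'),
]

def row_to_dict(line):
    """Unpack one fixed-width row into (date, nested dict); blanks become None."""
    padded = line.rstrip('\n').ljust(120)[:120]
    cols = [s.decode('ascii').strip() for s in struct.unpack(FMT, padded.encode('ascii'))]
    date, times = cols[1], cols[2:13]
    entry = {}
    for (outer, inner), cell in zip(LAYOUT, times):
        value = cell.replace(' ', ':') if cell else None
        if inner is None:
            entry[outer] = value
        else:
            entry.setdefault(outer, {})[inner] = value
    return date, entry

astro_times_dict = {}
line = ' SAT DEC 01  17 53  18 33  19 00  20 23         05 23  05 50  06 30  23 24  04 25  09 49   10.4  1.4  13  0735 1712   45'
date, entry = row_to_dict(line)
astro_times_dict[date] = entry
print(astro_times_dict['DEC 01']['TWILIGHT ENDS'])  # prints: {'12': '18:33', '18': '19:00'}
```

In a real script the last four lines would sit inside a loop over the data rows of the file, with the same token-count filter the question uses to skip headers and footers.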