Filtering and parsing text from "Solar Region Summary" files

Asked: 2018-09-06 11:30:06

Tags: python python-3.x

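I am trying to filter some .txt files that are named after a date in YYYYMMDD format and contain data about active regions on the Sun. I wrote code that, given a date in YYYYMMDD format, lists the files within the time range in which I expect the active region to be present and parses the information for that entry. An example of these files is shown below; more information about them (if you are curious) can be found on the SWPC website.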

:Product: 0509SRS.txt
:Issued: 2012 May 09 0030 UTC
# Prepared jointly by the U.S. Dept. of Commerce, NOAA,
# Space Weather Prediction Center and the U.S. Air Force.
#
Joint USAF/NOAA Solar Region Summary
SRS Number 130 Issued at 0030Z on 09 May 2012
Report compiled from data received at SWO on 08 May
I.  Regions with Sunspots.  Locations Valid at 08/2400Z 
Nmbr Location  Lo  Area  Z   LL   NN Mag Type
1470 S19W68   284  0030 Cro  02   02 Beta
1471 S22W60   277  0120 Cso  05   03 Beta
1474 N14W13   229  0010 Axx  00   01 Alpha
1476 N11E35   181  0940 Fkc  17   33 Beta-Gamma-Delta
1477 S22E73   144  0060 Hsx  03   01 Alpha
IA. H-alpha Plages without Spots.  Locations Valid at 08/2400Z May
Nmbr  Location  Lo
1472  S28W80   297
1475  N05W05   222
II. Regions Due to Return 09 May to 11 May
Nmbr Lat    Lo
1460 N16    126
1459 S16    110

The code I am using to parse these txt files is:

import glob

def seeker(noaa_number, t_start, path = None):
    '''
    This function will open an SRS file
    and look for each line if the given AR
    (specified by its NOAA number) is there.
    If so, this function should grab the
    entries and return them.
    '''

    #defaulting path if none is given
    if path is None:
        #assigning
        path = 'defaultpath'


    #listing the items within the directory
    files = sorted(glob.glob(path+'*.txt'))

    #finding the index in the list of
    #the starting time
    index = files.index(path+str(t_start)+'SRS.txt')

    #looping over each file
    for file in files[index: index+20]:

        #opening the file; the with block closes it afterwards
        with open(file, 'r') as f:

            #reading the lines
            text = f.readlines()

        #looping over each line in the text
        for line in text:

            #checking if the noaa number is mentioned
            #in the given line
            if noaa_number in line:

                #test print
                print('Original line: ', line)

                #slicing the text to get the column values
                nbr = line[:4]
                Location = line[5:11]
                Lo = line[14:18]
                Area = line[19:23]
                Z = line[24:28]
                LL = line[29:31]
                NN = line[34:36]
                MagType = line[37:]

                #test prints
                print('nbr: ', nbr)
                print('location: ', Location)
                print('Lo: ', Lo)
                print('Area: ', Area)
                print('Z: ', Z)
                print('LL: ', LL)
                print('NN: ', NN)
                print('MagType: ', MagType)

    return

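I tested this and it works, but I am a bit uneasy about it for two reasons:

  • Even though these files are produced to a standard, the way I slice each line by index means that a single extra space is enough to break the code. Is there a better option?

  • The information in tables IA and II is not relevant to me, so ideally I would like to prevent my code from scanning them. Since the number of rows in the first table varies, is it possible to tell the code when to stop reading a given document?

Thanks for your time!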

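1 Answer:

Answer 0 (score: 1)

Robustness:

Instead of slicing by absolute position, you can use the .split() method to split the line into a list of columns. With no arguments, .split() treats any run of whitespace as one separator, so extra spaces no longer matter.

So instead of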

Location = line[5:11]
Lo = line[14:18]
Area = line[19:23]
Z = line[24:28]
LL = line[29:31]
NN = line[34:36]

you can use

Location = line.split()[1]
Lo = line.split()[2]
Area = line.split()[3]
Z = line.split()[4]
LL = line.split()[5]
NN = line.split()[6]

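If you want it to be faster, you can split the line once and then pull the relevant fields from the same list, rather than splitting every time: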

data = line.split()
Location = data[1]
Lo = data[2]
Area = data[3]
Z = data[4]
LL = data[5]
NN = data[6]
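
As a minimal sketch of the split-once idea, the helper below returns all eight columns as a dict, including nbr and MagType, which the snippets above leave out. The name parse_region_line is hypothetical (it is not part of the original code), and the sketch assumes the eight-column layout of table I shown in the sample file.

```python
def parse_region_line(line):
    """Split one data row of table I into named fields (hypothetical helper)."""
    data = line.split()  # any run of whitespace counts as one separator

    # Assumption: a valid row has at least the eight columns of table I.
    if len(data) < 8:
        raise ValueError(f"expected at least 8 columns, got {len(data)}: {line!r}")

    keys = ("nbr", "Location", "Lo", "Area", "Z", "LL", "NN")
    fields = dict(zip(keys, data[:7]))

    # Everything after the seventh column is treated as the magnetic type.
    fields["MagType"] = " ".join(data[7:])
    return fields


row = "1476 N11E35   181  0940 Fkc  17   33 Beta-Gamma-Delta\n"
print(parse_region_line(row))
```

Because the fields are located by column order rather than by character offset, a row with an extra space between columns parses to the same result.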

Stopping:

To stop reading the file once it has passed the relevant data, you can exit the loop as soon as a line no longer contains noaa_number:

# In the file function but before looping through the lines. 

started_reading = False ## Set this to false so 
                        ## that it doesn't exit
                        ## before it gets to the 
                        ## relevant data

for line in text:
    if noaa_number in line:
        started_reading = True 

        ## Parsing stuff

    elif started_reading is True:
        break # exits the loop
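
An alternative sketch, which is not the approach shown above: rather than waiting for the first line that no longer contains noaa_number, stop at the header of table IA, so tables IA and II are never scanned. This assumes the section headers start with "I." and "IA." as in the sample file; table_one_rows is a hypothetical helper name.

```python
def table_one_rows(lines):
    """Yield the split data rows of table I only (hypothetical helper)."""
    in_table = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("I. "):
            # Assumed header of table I: "I.  Regions with Sunspots. ..."
            in_table = True
            continue
        if stripped.startswith("IA."):
            # Assumed header of table IA: nothing after it is of interest.
            break
        # Data rows start with the region number; this skips the column header.
        if in_table and stripped[:1].isdigit():
            yield stripped.split()


sample = [
    "SRS Number 130 Issued at 0030Z on 09 May 2012",
    "I.  Regions with Sunspots.  Locations Valid at 08/2400Z",
    "Nmbr Location  Lo  Area  Z   LL   NN Mag Type",
    "1470 S19W68   284  0030 Cro  02   02 Beta",
    "1476 N11E35   181  0940 Fkc  17   33 Beta-Gamma-Delta",
    "IA. H-alpha Plages without Spots.  Locations Valid at 08/2400Z May",
    "Nmbr  Location  Lo",
    "1472  S28W80   297",
]

rows = list(table_one_rows(sample))
matches = [r for r in rows if r[0] == "1476"]
print(matches)
```

Comparing the first column with `r[0] == "1476"` instead of testing `noaa_number in line` also avoids matching a number that happens to appear elsewhere on a line.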