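I'm writing code to extract data from a list of observation stations (an example is given below). I currently have a list of regular expressions that removes any line not containing the data I'm looking for. All of the regular expressions successfully flag the lines containing metadata, except the one that searches for dates. When I test it on regexr.com the expression works fine, but when I run the code the date lines are not removed. What am I missing to remove the lines containing dates?
Sample data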
! CD = 2 letter state (province) abbreviation
! STATION = 16 character station long name
! ICAO = 4-character international id
! IATA = 3-character (FAA) id
! SYNOP = 5-digit international synoptic number
! LAT = Latitude (degrees minutes)
! LON = Longitude (degree minutes)
! ELEV = Station elevation (meters)
! M = METAR reporting station. Also Z=obsolete? site
! N = NEXRAD (WSR-88D) Radar site
! V = Aviation-specific flag (V=AIRMET/SIGMET end point, A=ARTCC T=TAF U=T+V)
! U = Upper air (rawinsonde=X) or Wind Profiler (W) site
! A = Auto (A=ASOS, W=AWOS, M=Meso, H=Human, G=Augmented) (H/G not yet impl.)
! C = Office type F=WFO/R=RFC/C=NCEP Center
! Digit that follows is a priority for plotting (0=highest)
! Country code (2-char) is last column
!
!2345678901234567890123456789012345678901234567890123456789012345678901234567890 1234567890
!
ALASKA 16-DEC-13
CD STATION ICAO IATA SYNOP LAT LONG ELEV M N V U A C
AK ADAK NAS PADK ADK 70454 51 53N 176 39W 4 X T 7 US
AK AKHIOK PAKH AKK 56 56N 154 11W 14 X 8 US
AK AMBLER PAFM AFM 67 06N 157 51W 88 X 7 US
AK ANAKTUVUK PASS PAKP AKP 68 08N 151 44W 642 X 7 US
AK ANCHORAGE INTL PANC ANC 70273 61 10N 150 01W 38 X T X A 5 US
AK ANCHORAGE/WFO PAFC AFC 61 10N 150 02W 48 F 8 US
AK ANCHORAG/NIKISKI PAHG AHG 60 44N 151 21W 74 X 8 US
AK ANCHORAGE/LAKE H PALH LHD 61 11N 149 58W 22 X A 7 US
AK ANCHORAGE/ARTCC PZAN ZAN 61 10N 149 59W 22 A 8 US
AK ANCHORAGE/MERRIL PAMR MRI 61 13N 149 51W 41 X A 7 US
AK ANGOON SEAPLANE PAGN 57 30N 134 35W 2 X 8 US
AK ANIAK PANI ANI 70232 61 35N 159 32W 26 X 7 US
AK ANNETTE ISLAND PANT ANN 70398 55 02N 131 34W 36 X X A 5 US
AK ANVIK PANV ANV 62 39N 160 11W 99 X 7 US
AK ARCTIC VILLAGE PARC ARC 68 07N 145 35W 636 X 7 US
AK ATQASUK BURNELL PATQ ATK 70 28N 157 26W 29 X 7 US
AK ATKA PAAK AKA 52 13N 174 12W 17 X 7 US
AK BARROW PABR BRW 70026 71 17N 156 48W 7 X T X A 5 US
AK BARROW ARM-NSA 70027 71 19N 156 37W 7 X 8 US
AK BARTER ISLAND PABA BTI 70086 70 08N 143 35W 2 X W 7 US
AK BETHEL PABE BET 70219 60 47N 161 51W 41 X T X A 5 US
AK BETHEL/88D PABC ABC 60 48N 161 53W 49 X 8 US
AK BETTLES PABT BTT 70174 66 55N 151 31W 195 X T A 6 US
AK BIG RIVER LAKES PALV LVR 60 49N 152 18W 12 X 7 US
AK BIRCHWOOD PABV BCV 61 25N 149 31W 29 X 7 US
AK BREVIG_MISSION PFKT 65 20N 166 28W 9 X 7 US
AK BUCKLAND PABL BVK 65 59N 161 09W 7 X 7 US
AK CANTWELL PATW TTW 63 23N 148 57W 668 X 7 US
AK CAPE LISBURNE PALU LUR 70104 68 53N 166 08W 3 X T W 6 US
AK CAPE NEWENHAM PAEH EHM 70305 58 39N 162 04W 161 X T 6 US
AK CAPE ROMANZOF PACZ CZF 70212 61 47N 166 02W 146 X T 6 US
AK CENTRAL PARL 65 34N 144 47W 284 X 7 US
AK CENTRAL PACE 65 34N 144 47W 286 X 7 US
AK CENTRAL AK PROF CEN 70197 65 30N 144 41W 259 W 8 US
AK CHANDALAR LAKE PALR WCR 67 30N 148 29W 585 X 7 US
AK CHEVAK PAVA 61 32N 165 36W 23 X 7 US
AK CHIGNIK BAY PAJC AJC 56 19N 158 22W 15 X 7 US
AK CIRCLE/PAFC RFC PACR CRC 65 50N 144 04W 182 X R 7 US
AK COLD BAY PACD CDB 70316 55 12N 162 43W 30 X T X A 5 US
AK CORDOVA PACV CDV 70296 60 30N 145 30W 12 X T A 6 US
AK DEADHORSE PASC SCC 70 12N 148 28W 15 X T A 6 US
AK DEERING PADE DEE 66 04N 162 46W 5 X A 7 US
AK DELTA JUNCTION PABI BIG 70267 64 00N 145 44W 386 X T A 6 US
My code
import re

station_file = open('../DATA/stations.txt', 'r')
data = station_file.read()
skip_res = ['^$', '^.*d{2}\-[A-Z]{3}\-\d{2}','^!'] #List of regular expressions which only appear in lines of metadata (not actual data)
data = data.split('\n')
for loop in data:
    breakcheck = False # In the event a regular expression matches, this will turn to true and skip that line
    for check in skip_res:
        current = re.compile(check)
        if current.search(loop) == None:
            continue
        else:
            breakcheck = True
            break
    if breakcheck:
        continue
    else:
        print(loop) # Should only print out lines containing actual data.
Answer 0 (score: 2)
The pattern that matches the date is missing a \ before the first d. Change it to:
r'\d{2}-[A-Z]{3}-\d{2}'
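Because you are using re.search(), the pattern does not need to match from the beginning of the string, so the leading ^.* can be dropped. You also don't need to escape the -.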
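As a quick check of the points above, here is a minimal sketch that runs the original and the corrected pattern against the date header line from the sample data. The variable names are illustrative and not part of the original post.

```python
import re

# Date header line taken from the sample data in the question.
header = "ALASKA 16-DEC-13"

# Original pattern: the bare d{2} looks for two literal 'd' characters, not digits.
broken = r'^.*d{2}\-[A-Z]{3}\-\d{2}'
# Corrected pattern: \d{2} matches two digits; no anchor or escaped hyphens needed.
fixed = r'\d{2}-[A-Z]{3}-\d{2}'

broken_match = re.search(broken, header)
fixed_match = re.search(fixed, header)
# re.match only tries at the start of the string, so the unanchored pattern fails here.
anchored_match = re.match(fixed, header)

print(broken_match)          # None
print(fixed_match.group())   # 16-DEC-13
print(anchored_match)        # None
```

The last call shows why the leading ^.* was only ever needed with re.match(), which tries the pattern at the start of the string only.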
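Note the use of a raw string (indicated by the r prefix) to specify the pattern. In general you should use raw strings for regular expression patterns, because some string escape sequences are also regex sequences, for example \b. In a normal string it denotes the backspace character. In a raw string it is treated as \ followed by b, which is the regex pattern for "the start or end of a word".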
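The difference between the two string forms can be seen in a small sketch. The test strings and variable names are made up for illustration.

```python
import re

plain = '\bcat\b'    # '\b' is the backspace character (0x08) in a normal string
raw = r'\bcat\b'     # backslash followed by 'b': a regex word boundary

# The normal string is shorter because each '\b' collapsed to one character.
print(len(plain), len(raw))   # 5 7

text = 'a cat sat'
plain_match = re.search(plain, text)   # looks for backspace + "cat" + backspace
raw_match = re.search(raw, text)       # looks for the whole word "cat"

print(plain_match)         # None
print(raw_match.group())   # cat
```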
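Another thing worth mentioning is that you can check for a match against several patterns at once by joining them with |. Think of it as "or". Your code can then be written more concisely: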
import re

skip_res = [r'^$', r'\d{2}-[A-Z]{3}-\d{2}', r'^!']
skip_pattern = r'|'.join(skip_res)
with open('../DATA/stations.txt', 'r') as station_file:
    for line in station_file:
        if re.search(skip_pattern, line):
            continue
        print(line, end='')  # each line already ends with a newline
There is little benefit in compiling regular expression patterns when you only use a few of them, because the re module caches them.
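To illustrate, the sketch below filters a few lines copied from the sample data twice, once through the module-level re.search() and once through an explicitly compiled pattern. Both give the same result; the list sample_lines and the other names are assumptions made for this example, and no timing claim is implied.

```python
import re

skip_res = [r'^$', r'\d{2}-[A-Z]{3}-\d{2}', r'^!']
skip_pattern = '|'.join(skip_res)
compiled = re.compile(skip_pattern)   # optional: re.search() caches the pattern anyway

# A few lines from the sample data, plus an empty line.
sample_lines = [
    "! CD = 2 letter state (province) abbreviation",
    "",
    "ALASKA 16-DEC-13",
    "AK ADAK NAS PADK ADK 70454 51 53N 176 39W 4 X T 7 US",
]

# Module-level function, pattern passed as a string each time.
kept_module = [ln for ln in sample_lines if not re.search(skip_pattern, ln)]
# Pre-compiled pattern object.
kept_compiled = [ln for ln in sample_lines if not compiled.search(ln)]

print(kept_module == kept_compiled)   # True
print(kept_compiled)                  # only the station line remains
```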
Answer 1 (score: 0)
Your date regex is missing a backslash before the first "d".
'^.*d{2}\-[A-Z]{3}\-\d{2}'
should be
'^.*\d{2}\-[A-Z]{3}\-\d{2}'