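How do I extract 2 coordinate pairs from a string?

Time: 2018-08-15 12:57:18

Tags: python regex python-3.x coordinates

I'm new to Python and am stuck on a small problem I can't solve. I'm trying to extract 2 coordinate pairs from a string, and I'm stuck because the string has no common separator such as a comma.

My strings look like this: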

&BBOX=151406.25%2C6579062.5%2C151875%2C6579531.25&
&BBOX=156298.828125%2C6576689.453125%2C156328.125%2C6576718.75
&BBOX=156328.125,6576806.640625%2C156357.421875%2C6576835.9375
&BBOX=156328.125,6576748.046875,156357.421875,6576777.34375& ?BBOX=156328%2C125%2C6576777%2C34375%2C156357%2C421875%2C6576806%2C640625&
&BBOX=156269.53125%2C6576689.453125%2C156298.828125%2C6576718.75&
&BBOX=156298.828125%2C6576718.75%2C156328.125%2C6576748.046875
?BBOX=156386.71875%2C6576806.640625%2C156416.015625%2C6576835.9375&

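Each string starts with "BBOX=", followed by 4 coordinates: x_min, y_min, x_max, y_max. I use "BBOX=" to locate the coordinates within a longer string.

x_min and x_max should have 6 digits, while y_min and y_max should have 7 digits. They can be float or integer values.

My idea was to split each coordinate into the part before the "." and the part after it, but I really wonder whether that is the way to go.

Right now my regex looks like this: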

rexp_bbox = r"(^.+BBOX=(?P<bbox_xmin_before>\d.*?)[.,%&\s](?P<bbox_xmin_after>.*?)[.,%2C&\s](?P<bbox_ymin_before>\d.*?)[.,%&\s](?P<bbox_ymin_after>.*?)[.,%&\s](?P<bbox_xmax_before>\d.*?)[.,%&\s](?P<bbox_xmax_after>.*?)[.,%&\s](?P<bbox_ymax_before>\d.*?)[.,%&\s](?P<bbox_ymax_after>.*?)[.,%&\s])"

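How would you construct a regex to extract the two coordinate pairs?

4 answers:

Answer 0: (score: 1)

The pattern "(?:.*BBOX=)(\d{6}(?:\.?[\d]*))(?:%2C|,)(\d{7}(?:\.?[\d]*))(?:%2C|,)(\d{6}(?:\.?[\d]*))(?:%2C|,)(\d{7}(?:\.?[\d]*))" works and extracts the coordinates as 4 groups: group 1 = min_x, group 2 = min_y, group 3 = max_x, group 4 = max_y.

The following code shows the pattern in action: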

import re

orig_coords = [
  '&BBOX=151406.25%2C6579062.5%2C151875%2C6579531.25&',
  '&BBOX=156298.828125%2C6576689.453125%2C156328.125%2C6576718.75',
  '&BBOX=156328.125,6576806.640625%2C156357.421875%2C6576835.9375',
  '&BBOX=156328.125,6576748.046875,156357.421875,6576777.34375&',
  '?BBOX=156328%2C125%2C6576777%2C34375%2C156357%2C421875%2C6576806%2C640625&',
  '&BBOX=156269.53125%2C6576689.453125%2C156298.828125%2C6576718.75&',
  '&BBOX=156298.828125%2C6576718.75%2C156328.125%2C6576748.046875',
  '?BBOX=156386.71875%2C6576806.640625%2C156416.015625%2C6576835.9375&'
]

# Raw strings keep backslashes such as \d intact for the regex engine
bbox_start = r"(?:.*BBOX=)"
separator = r"(?:%2C|,)"
coord_6 = r"(\d{6}(?:\.?[\d]*))"
coord_7 = r"(\d{7}(?:\.?[\d]*))"
regex_str = bbox_start + coord_6 + separator + coord_7 + separator + coord_6 + separator + coord_7
reg = re.compile(regex_str)

for c in orig_coords:
  r = reg.match(c)
  if r:
    print('Coordinates for {}'.format(c))
    print('x_min: {} x_max: {}'.format(r.group(1), r.group(3)))
    print('y_min: {} y_max: {}'.format(r.group(2), r.group(4)))
  else:
    print('No match for {}'.format(c))

Output:

Coordinates for &BBOX=151406.25%2C6579062.5%2C151875%2C6579531.25&
x_min: 151406.25 x_max: 151875
y_min: 6579062.5 y_max: 6579531.25
Coordinates for &BBOX=156298.828125%2C6576689.453125%2C156328.125%2C6576718.75
x_min: 156298.828125 x_max: 156328.125
y_min: 6576689.453125 y_max: 6576718.75
Coordinates for &BBOX=156328.125,6576806.640625%2C156357.421875%2C6576835.9375
x_min: 156328.125 x_max: 156357.421875
y_min: 6576806.640625 y_max: 6576835.9375
Coordinates for &BBOX=156328.125,6576748.046875,156357.421875,6576777.34375&
x_min: 156328.125 x_max: 156357.421875
y_min: 6576748.046875 y_max: 6576777.34375
No match for ?BBOX=156328%2C125%2C6576777%2C34375%2C156357%2C421875%2C6576806%2C640625&
Coordinates for &BBOX=156269.53125%2C6576689.453125%2C156298.828125%2C6576718.75&
x_min: 156269.53125 x_max: 156298.828125
y_min: 6576689.453125 y_max: 6576718.75
Coordinates for &BBOX=156298.828125%2C6576718.75%2C156328.125%2C6576748.046875
x_min: 156298.828125 x_max: 156328.125
y_min: 6576718.75 y_max: 6576748.046875
Coordinates for ?BBOX=156386.71875%2C6576806.640625%2C156416.015625%2C6576835.9375&
x_min: 156386.71875 x_max: 156416.015625
y_min: 6576806.640625 y_max: 6576835.9375

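You can run the code yourself on repl.it.

The one coordinate string that does not match this pattern does not seem to follow the rules you posted in the question.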
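Since the question asks for two coordinate pairs, the four captured values can be converted to floats and grouped into (x, y) tuples. The following is a minimal sketch under that assumption; the helper name `extract_pairs` and the group names are hypothetical, and it uses `re.search` with a stricter decimal part instead of the `.*BBOX=` prefix.

```python
import re

# Same idea as the pattern above, written as raw strings with named groups.
# extract_pairs and the group names are hypothetical, chosen for illustration.
BBOX_RE = re.compile(
    r"BBOX="
    r"(?P<x_min>\d{6}(?:\.\d+)?)(?:%2C|,)"
    r"(?P<y_min>\d{7}(?:\.\d+)?)(?:%2C|,)"
    r"(?P<x_max>\d{6}(?:\.\d+)?)(?:%2C|,)"
    r"(?P<y_max>\d{7}(?:\.\d+)?)"
)

def extract_pairs(text):
    """Return ((x_min, y_min), (x_max, y_max)) as floats, or None if no match."""
    m = BBOX_RE.search(text)
    if m is None:
        return None
    return (
        (float(m.group("x_min")), float(m.group("y_min"))),
        (float(m.group("x_max")), float(m.group("y_max"))),
    )

print(extract_pairs("&BBOX=151406.25%2C6579062.5%2C151875%2C6579531.25&"))
```

Strings that break the 6-digit/7-digit rule, such as the one using `%2C` as the decimal separator, make this helper return `None` rather than a wrong result.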

Answer 1: (score: 0)

a = "&BBOX=151406.25%2C6579062.5%2C151875%2C6579531.25&"
# Split on the full "%2C" token; splitting on "%" alone would leave "2C" in front of each value
ans = a.split('=')[1].split('&')[0].split('%2C')

Splitting may be more useful here than a complicated regex, but that also depends on exactly what kinds of strings you have.
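If the strings are URL query fragments, decoding `%2C` first makes the separator uniform, which also covers strings that mix `%2C` and `,`. A minimal sketch, assuming each string contains one `BBOX=` value that ends at `&` or at the end of the string; `bbox_values` is a hypothetical helper name.

```python
from urllib.parse import unquote

def bbox_values(query):
    """Split the BBOX value of a query fragment into its comma-separated parts."""
    # Take everything after the first "BBOX=" and up to the next "&".
    raw = query.split("BBOX=", 1)[1].split("&", 1)[0]
    # unquote turns "%2C" into ",", so mixed separators become uniform.
    return unquote(raw).split(",")

print(bbox_values("&BBOX=156328.125,6576806.640625%2C156357.421875%2C6576835.9375"))
```

This sketch raises `IndexError` when the string contains no `BBOX=`, so real input would need a check first.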

Answer 2: (score: 0)

Something like this seems to work as well; I'm not sure whether there is any runtime difference between this and Jim Wright's answer.

import re

coords = ["&BBOX=151406.25%2C6579062.5%2C151875%2C6579531.25&",
"&BBOX=156298.828125%2C6576689.453125%2C156328.125%2C6576718.75",
"&BBOX=156328.125,6576806.640625%2C156357.421875%2C6576835.9375",
"&BBOX=156328.125,6576748.046875,156357.421875,6576777.34375& ?BBOX=156328%2C125%2C6576777%2C34375%2C156357%2C421875%2C6576806%2C640625&",
"&BBOX=156269.53125%2C6576689.453125%2C156298.828125%2C6576718.75&",
"&BBOX=156298.828125%2C6576718.75%2C156328.125%2C6576748.046875",
"?BBOX=156386.71875%2C6576806.640625%2C156416.015625%2C6576835.9375&"]

# Accept both "&BBOX=" and "?BBOX=", since match() anchors at the start of the string
r = re.compile(r"[&?]BBOX=(.+?)(?=&|$)")

x_coords = []

def split_coords(coords_string):
    # Split on either separator, since some strings mix "%2C" and ","
    bbox = re.split(r"%2C|,", coords_string)
    x_min, x_max = bbox[0], bbox[2]
    return (x_min, x_max)

# If a match is found using the regex, split the coords and add the x_min and x_max coords to the x_coords array
for i in coords:
    match = r.match(i)
    if match:
        match = match.group(1)
        x_coords.append(split_coords(match))
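
Whether the two approaches differ in speed can be measured rather than guessed. The sketch below uses the standard `timeit` module on a single sample string from the question; the function names are hypothetical, and results depend on the machine and the input, so no timings are shown.

```python
import re
import timeit

SAMPLE = "&BBOX=151406.25%2C6579062.5%2C151875%2C6579531.25&"

# Single-regex approach: capture all four values in one pattern.
FULL_RE = re.compile(
    r"BBOX=(\d{6}(?:\.\d+)?)(?:%2C|,)(\d{7}(?:\.\d+)?)"
    r"(?:%2C|,)(\d{6}(?:\.\d+)?)(?:%2C|,)(\d{7}(?:\.\d+)?)"
)

# Regex-plus-split approach: capture the whole value, then split it.
VALUE_RE = re.compile(r"BBOX=(.+?)(?=&|$)")
SEP_RE = re.compile(r"%2C|,")

def with_full_regex(s):
    m = FULL_RE.search(s)
    return m.groups() if m else None

def with_split(s):
    m = VALUE_RE.search(s)
    return tuple(SEP_RE.split(m.group(1))) if m else None

# Time each approach over the same number of calls.
for fn in (with_full_regex, with_split):
    seconds = timeit.timeit(lambda: fn(SAMPLE), number=10000)
    print("{}: {:.4f} s for 10000 calls".format(fn.__name__, seconds))
```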

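Answer 3: (score: 0)

Your comments were very helpful in getting me to think about this differently. I simply had not noticed that %2C is the common separator between the coordinates. I modified my regex to: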

rexp_bbox = r"(^.+BBOX=(?P\d.?)(%2C)(?P\d.?)(%2C)(?P\d.?)(%2C)(?P\d.?)(\s|&|\"))"

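It does the job, because I use the regex for log file parsing, where I count how often certain bounding boxes occur (the coordinates in my question are the corner coordinates of bounding boxes).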
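
Counting bounding boxes while parsing lines can be sketched with `collections.Counter`. This is only an illustration: the group names and the helper `count_bboxes` are assumptions, the pattern searches for `BBOX=` instead of anchoring at the start of the line, and the input reuses sample strings from the question in place of real log lines.

```python
import re
from collections import Counter

# Group names are assumptions chosen for illustration.
BBOX_RE = re.compile(
    r'BBOX=(?P<xmin>\d.*?)%2C(?P<ymin>\d.*?)%2C'
    r'(?P<xmax>\d.*?)%2C(?P<ymax>\d.*?)(?:[\s&"]|$)'
)

def count_bboxes(lines):
    """Count how often each (xmin, ymin, xmax, ymax) tuple occurs."""
    counts = Counter()
    for line in lines:
        m = BBOX_RE.search(line)
        if m:
            # group() with several names returns a tuple of the captured strings.
            counts[m.group("xmin", "ymin", "xmax", "ymax")] += 1
    return counts

# Sample strings from the question; the first is repeated to show counting.
lines = [
    "&BBOX=151406.25%2C6579062.5%2C151875%2C6579531.25&",
    "&BBOX=151406.25%2C6579062.5%2C151875%2C6579531.25&",
    "?BBOX=156386.71875%2C6576806.640625%2C156416.015625%2C6576835.9375&",
]
for bbox, n in count_bboxes(lines).items():
    print(n, bbox)
```

Strings that also use `%2C` as the decimal separator, like the fifth sample in the question, would be split at the wrong places by this pattern.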