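I have a KML file that contains a list of destinations along with their coordinates. There are a little over 40 destinations in the file. I'm trying to parse the coordinates out of it. When you look in the file you'll see <coordinates>......</coordinates>, so finding them isn't the hard part, but I can't seem to get the full result. What I mean is that it cuts off the -94. (or whatever negative float the value starts with) and then prints the rest.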
#!/usr/bin/python3.5
import re

def main():
    results = []
    with open("file.kml","r") as f:
        contents = f.readlines()
        if f.mode == 'r':
            print("reading file...")
            for line in contents:
                coords_match = re.search(r"(<coordinates>)[+-]?\d+\.\d+|\d+\,\-?\d+\.\d+|\d+(?=</coordinates)",line)
                if coords_match:
                    coords_matchh = coords_match.group()
                    print(coords_matchh)
Here are some of the results I get:
3502969,38.8555497
7662462,38.8583916
6280323,38.8866337
3655059,39.3983001
In case the formatting makes a difference, this is how the data is laid out in the file:
<coordinates>
-94.5944738,39.031411,0
</coordinates>
If I modify this line and remove the coordinates tag from the beginning:
coords_match = re.search(r"[+-]?\d+\.\d+|\d+\,\-?\d+\.\d+|\d+(?=</coordinates)",line)
This is the result I get:
-94.7662462
-94.6280323
-94.3655059
This is essentially the result I want:
-94.7662462,38.8583916
-94.6280323,38.8866337
-94.3655059,39.3983001
Answer 0: (score: 1)
Using an actual parser is the way to go, as @Kendas suggested in the comments, but you could also try findall instead of search:
>>> import re
>>> s = """<coordinates>
... -94.5944738,39.031411,0
... </coordinates>"""
>>> re.findall(r'[+-]?\d+\.\d+|\d+\,\-?\d+\.\d+|\d+(?=</coordinates)', s)
['-94.5944738', '39.031411']
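If the goal is one lon,lat pair per block rather than a flat list of numbers, capture groups can keep the two values together. The following is only a minimal sketch and not part of the answer above: the pattern and the names `pattern` and `pairs` are assumptions, and it expects every `<coordinates>` element to start with a single tuple whose longitude and latitude are written with a decimal point.

```python
import re

s = """<coordinates>
-94.5944738,39.031411,0
</coordinates>
<coordinates>
-94.7662462,38.8583916,0
</coordinates>"""

# Two capture groups: longitude and latitude. \s* skips the newline and any
# indentation between the opening tag and the first number.
pattern = re.compile(r"<coordinates>\s*([+-]?\d+\.\d+),([+-]?\d+\.\d+)")

# When a pattern has several groups, findall returns one tuple per match.
pairs = pattern.findall(s)
for lon, lat in pairs:
    print(lon + "," + lat)
```

Because the tag is part of the pattern, this has to run on the whole text (for example the result of `f.read()`), not line by line.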
Answer 1: (score: 1)
You can also use BeautifulSoup to get the coordinates, since this is an XML/HTML-style parsing task.
from bs4 import BeautifulSoup
text = """<coordinates>
-94.5944738,39.031411,0
</coordinates>
<coordinates>
-94.59434738,39.032311,0
</coordinates>
<coordinates>
-94.523444738,39.0342411,0
</coordinates>"""
soup = BeautifulSoup(text, "lxml")
coordinates = soup.find_all('coordinates')
for tag in coordinates:
    # Strip surrounding whitespace, then drop the trailing ",0" altitude.
    print(tag.text.strip()[:-2])
Output:
-94.5944738,39.031411
-94.59434738,39.032311
-94.523444738,39.0342411
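Note that the `[:-2]` slice only works while the altitude is a single character such as `0`. As a hedged alternative, the sketch below uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup and splits on commas. The sample document, its `123.5` altitude, and the variable names are made up for illustration; a real KML file may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal KML document, made up for illustration.
kml = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark><Point><coordinates>
      -94.5944738,39.031411,0
    </coordinates></Point></Placemark>
    <Placemark><Point><coordinates>
      -94.59434738,39.032311,123.5
    </coordinates></Point></Placemark>
  </Document>
</kml>"""

root = ET.fromstring(kml)

lon_lat = []
for elem in root.iter():
    # Namespaced tags look like "{http://www.opengis.net/kml/2.2}coordinates",
    # so compare only the local part after the closing brace.
    if elem.tag.rsplit("}", 1)[-1] == "coordinates":
        fields = elem.text.strip().split(",")
        # Keep longitude and latitude, whatever the altitude looks like.
        lon_lat.append(",".join(fields[:2]))

for pair in lon_lat:
    print(pair)
```

Comparing only the local tag name avoids hard-coding the namespace, which can vary between files.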
Answer 2: (score: 1)
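If you only want to extract simple, delimited data, an XML parser is overkill.
The main thing is to use a simpler regular expression and to search the whole file at once. Focus on capturing everything between the tags: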
import re

with open("file.kml","r") as f:
    contents = f.read()
coords_match = re.findall(r'<coordinates>(.*?)</coordinates>', contents, re.DOTALL)
This returns a list of matches. Each item in that list will look something like this:
'\n -94.5944738,39.031411,0\n '
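So for each item you need to strip the surrounding whitespace and drop everything after the last comma. You can do that like this: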
results = [c.strip().rsplit(',', 1)[0] for c in coords_match]
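This gives you a list of the strings you want.
If you actually want to work with the numbers, I would convert them to floats (using a nested comprehension):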
results = [tuple(float(f) for f in c.strip().split(',')[:2]) for c in coords_match]
This gives you a list of 2-tuples of float.
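KML also allows a single `<coordinates>` element to hold several whitespace-separated tuples (for example in a path), and the altitude field is optional. The sketch below extends the same regex idea to cover both cases. The function name `parse_lon_lat` and the sample text are hypothetical, and it assumes each tuple has at least a longitude and a latitude.

```python
import re

def parse_lon_lat(text):
    """Return a list of (lon, lat) float tuples found in <coordinates> elements."""
    results = []
    for block in re.findall(r"<coordinates>(.*?)</coordinates>", text, re.DOTALL):
        # split() with no arguments splits on any run of whitespace and
        # ignores leading/trailing whitespace: one item per tuple.
        for item in block.split():
            lon, lat = item.split(",")[:2]
            results.append((float(lon), float(lat)))
    return results

# Made-up sample: one single-tuple element, one element with two tuples,
# the first of which has no altitude.
sample = """<coordinates>
  -94.5944738,39.031411,0
</coordinates>
<coordinates>
  -94.6,39.0 -94.7,39.1,10
</coordinates>"""

print(parse_lon_lat(sample))
```

Taking the first two fields rather than removing the last one means a tuple without an altitude still yields both values.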
Demo in IPython:
In [1]: import re
In [2]: text = """<coordinates>
...: -94.5944738,39.031411,0
...: </coordinates>
...: <coordinates>
...: -94.59434738,39.032311,0
...: </coordinates>
...: <coordinates>
...: -94.523444738,39.0342411,0
...: </coordinates>"""
In [3]: coords_match = re.findall(r'<coordinates>(.*?)</coordinates>', text, re.DOTALL)
Out[3]:
['\n -94.5944738,39.031411,0\n ',
'\n -94.59434738,39.032311,0\n ',
'\n -94.523444738,39.0342411,0\n ']
In [4]: results1 = [c.strip().rsplit(',', 1)[0] for c in coords_match]
Out[4]: ['-94.5944738,39.031411', '-94.59434738,39.032311', '-94.523444738,39.0342411']
In [5]: results2 = [tuple(float(f) for f in c.strip().split(',')[:2]) for c in coords_match]
Out[5]:
[(-94.5944738, 39.031411),
(-94.59434738, 39.032311),
(-94.523444738, 39.0342411)]
Edit: If you want to save the data as JSON, converting to floats is the better choice, because the result can be converted to JSON directly:
In [6]: import json
In [7]: print(json.dumps(results2, indent=2))
[
  [
    -94.5944738,
    39.031411
  ],
  [
    -94.59434738,
    39.032311
  ],
  [
    -94.523444738,
    39.0342411
  ]
]
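To actually write the data to disk, `json.dump` takes a file object. This is a minimal sketch under assumptions: the file name `coords.json` and the temporary directory are hypothetical choices for illustration, and `results2` is filled with two of the sample values so the block runs on its own.

```python
import json
import os
import tempfile

results2 = [(-94.5944738, 39.031411), (-94.59434738, 39.032311)]

# Hypothetical output location; a temporary directory keeps the sketch tidy.
path = os.path.join(tempfile.mkdtemp(), "coords.json")

with open(path, "w") as f:
    json.dump(results2, f, indent=2)

with open(path) as f:
    loaded = json.load(f)

# JSON has no tuple type, so each pair comes back as a list.
print(loaded)
```

If the pairs need to be tuples again after loading, something like `[tuple(p) for p in loaded]` restores them.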