Question

I want to extract the .zip file names from a given URL. Here is my code:

import re

print(re.findall(r'href=[\'"]?([^\'" >]+)','<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> <a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>'))

For example:

Input - <a href="http://www.example.com/files/world_data1.zip">World Data Part 1</a> <a href="http://www.example.com/files/world_data2.zip">World Data Part 2</a>

Expected output - world_data1.zip,world_data2.zip

I tried using .zip$ in various forms, but I get an empty list. Can anyone help me with this?

Answer 1

You could use

import re

html = """'&nbsp;<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> <a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>'"""

rx = re.compile(r"""href=(["'])(.*?)\1""")
links = [filename 
    for m in rx.finditer(html) 
    for filename in [m.group(2).split('/')[-1]]
    if filename.endswith('.zip')]
print(links)

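Which yields

['world_data1.zip', 'world_data2.zip']

The idea is to get the href attribute first, split it by / and check whether the last part ends with .zip.
However, consider using a parser such as BeautifulSoup, or lxml with some xpath queries, instead. For the expression, see a demo on regex101.com.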
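
A minimal sketch of the parser-based route mentioned above. To keep it free of third-party packages it uses the standard library's html.parser rather than BeautifulSoup, so treat it as a swapped-in illustration and not the answer's own code. The class name ZipLinkParser and the attribute zip_names are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import posixpath


class ZipLinkParser(HTMLParser):
    """Hypothetical helper: collects file names of <a href> targets ending in .zip."""

    def __init__(self):
        super().__init__()
        self.zip_names = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Keep only the last path segment; query strings and fragments are ignored.
                filename = posixpath.basename(urlparse(value).path)
                if filename.lower().endswith(".zip"):
                    self.zip_names.append(filename)


html = ('<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> '
        '<a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>')

parser = ZipLinkParser()
parser.feed(html)
print(parser.zip_names)  # ['world_data1.zip', 'world_data2.zip']
```

Compared with a regex, this does not depend on the quoting style of the href attribute or on the order of attributes inside the tag.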

Answer 2

You can try the following, which is stricter and only matches an href whose last path segment ends in .zip:

import re

s = '&nbsp;<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> <a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>'

print(re.findall(r'href="[^"]+?/([^/"]+\.zip)"', s))
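
As a rough illustration of why anchoring with $ gives an empty list while the pattern above works: $ matches only at the end of the whole string (or just before a trailing newline), not at the end of each href value. The pattern stored in attempt is a hypothetical reconstruction, since the question does not show the exact expression that was tried.

```python
import re

s = ('<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> '
     '<a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>')

# Hypothetical attempt: "$" anchors at the end of the input, which here is "</a>",
# so ".zip" is never immediately followed by the end of the string.
attempt = re.findall(r'href="([^"]+\.zip)$', s)
print(attempt)  # []

# Anchoring on the closing quote instead ties ".zip" to the end of each href value.
names = re.findall(r'href="[^"]+?/([^/"]+\.zip)"', s)
print(names)  # ['world_data1.zip', 'world_data2.zip']
```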

Extracting .zip file names from a given URL using regex in Python
