I have a list named "aList":
[
"<a href='a.html?dataset=1'><tt>outputs</tt></a></td>\n",
"<a href='a.html?dataset=1'><tt>outputs</tt></a></td>\n",
"<a href='a.html?dataset=1'><tt>outputs</tt></a></td>\n",
"<img src='folder.gif' alt='folder'> <a href='catalog.html'><tt>test all files in a directory/</tt></a></td>\n",
"<img src='/thredds/folder.gif' alt='folder'> <a href='enhancedcatalog.html'><tt>test enhanced catalog/</tt></a></td>\n",
"<hr size='1' noshade='noshade'><h3><a href='/abc/catalog.html'>abc</a> at <a href='http://www.abcd.com/'>csiro</a> see <a href='/abcd/serverinfo.html'> info </a><br>\n",
"data server [version 4.6.10 - 2017-04-19t16:32:55-0600] <a href='http://www.unidata.ucar.edu/software/thredds/current/tds/reference/index.html'> documentation</a></h3>\n"
]
I want to retrieve all of the HTML links, as shown below:
a.html?dataset=1
catalog.html
enhancedcatalog.html
/abcd/serverinfo.html
http://www.unidata.ucar.edu/software/thredds/current/tds/reference/index.html
I tried the following, but it does not return the expected result. Any suggestions would be appreciated.
matching = [s for s in aList if ".html" in s]
print(matching)
Answer 0 (score: 2)
You can use a regular expression or BeautifulSoup to get the href values from the HTML. Here I have given code that uses a regular expression. Hope it helps.
import re

# Join the fragments and capture the value of every single-quoted href attribute.
links = set(re.findall(r"href='([^']*)'", "".join(aList)))

# A set removes duplicates but does not preserve the original order.
for link in links:
    print(link)
Output
/abcd/serverinfo.html
enhancedcatalog.html
http://www.unidata.ucar.edu/software/thredds/current/tds/reference/index.html
http://www.abcd.com/
a.html?dataset=1
catalog.html
/abc/catalog.html
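The answer also mentions parsing the HTML instead of using a regular expression. As a rough sketch of that idea, the snippet below uses the standard library's html.parser as a stand-in for BeautifulSoup, so nothing needs to be installed. The names HrefCollector, extract_hrefs, and sample are made up for this illustration and are not part of the original answer.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(fragments):
    """Return the unique hrefs found in the fragments, in first-seen order."""
    collector = HrefCollector()
    for fragment in fragments:
        collector.feed(fragment)
    collector.close()
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(collector.hrefs))


# A shortened, hypothetical sample in the same shape as aList.
sample = [
    "<a href='a.html?dataset=1'><tt>outputs</tt></a></td>\n",
    "<img src='folder.gif' alt='folder'> <a href='catalog.html'><tt>test all files in a directory/</tt></a></td>\n",
    "<a href='a.html?dataset=1'><tt>outputs</tt></a></td>\n",
]
print(extract_hrefs(sample))
```

Unlike a set, this keeps the links in the order they first appear. If only links ending in .html are wanted, the returned list could be filtered further, for example by checking the part of each link before any "?".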