Question

I have a list named "aList":

[
"<a href='a.html?dataset=1'><tt>outputs</tt></a></td>\n", 
"<a href='a.html?dataset=1'><tt>outputs</tt></a></td>\n", 
"<a href='a.html?dataset=1'><tt>outputs</tt></a></td>\n", 
"<img src='folder.gif' alt='folder'> &nbsp;<a href='catalog.html'><tt>test all files in a directory/</tt></a></td>\n", 
"<img src='/thredds/folder.gif' alt='folder'> &nbsp;<a href='enhancedcatalog.html'><tt>test enhanced catalog/</tt></a></td>\n",
"<hr size='1' noshade='noshade'><h3><a href='/abc/catalog.html'>abc</a> at <a href='http://www.abcd.com/'>csiro</a> see <a href='/abcd/serverinfo.html'> info </a><br>\n", 
"data server [version 4.6.10 - 2017-04-19t16:32:55-0600] <a href='http://www.unidata.ucar.edu/software/thredds/current/tds/reference/index.html'> documentation</a></h3>\n"
]

I want to retrieve all the HTML links, as shown below:

a.html?dataset=1
catalog.html
enhancedcatalog.html
/abcd/serverinfo.html
http://www.unidata.ucar.edu/software/thredds/current/tds/reference/index.html

I have tried the following, but it does not return the expected result. Any suggestions would be appreciated.

matching = [s for s in aList if ".html" in s]
print(matching)

Answer 1

You can use a regular expression or BeautifulSoup to get the href values from the HTML. Here I have given code that uses a regular expression. Hope it helps.

import re

# Capture whatever sits between the single quotes of each href attribute.
href_pattern = re.compile(r"href='([^']*)'")

# aList is the list from the question. A set drops the repeated
# a.html?dataset=1 entries; its iteration order is arbitrary.
links = set()
for s in aList:
    links.update(href_pattern.findall(s))

for link in links:
    print(link)

Output

/abcd/serverinfo.html
enhancedcatalog.html
http://www.unidata.ucar.edu/software/thredds/current/tds/reference/index.html
http://www.abcd.com/
a.html?dataset=1
catalog.html
/abc/catalog.html

Find an HTML link address string in a list
