Extracting a pattern with a text editor

Posted: 2014-05-07 11:48:17

Tags: bash awk sed grep

I have the HTML source of a page that contains URLs, like this:

href="http://path/to/file.bz2">german.txt.bz2</a> (2,371,487 bytes)</td>
  <td><a rel="nofollow" class="external text" href="http://a/web/page/">American cities</a></td>
  <td><a rel="nofollow" class="external text" href="http://another/page/to.bz2">us_cities.txt.bz2</a> (77,081 bytes)</td>
  <td><a rel="nofollow" class="external text" href="http://other/page/to/file.bz2">test.txt.bz2</a> (7,158,285 bytes)</td>
  <td>World's largest test password collection!<br />Created by <a rel="nofollow" class="external text" href="http://page/web.com/">Matt Weir</a>

I would like to use a text-processing tool such as sed or awk to extract exactly the URLs that end in .bz2...

Like this:

http://path/to/file.bz2
http://another/page/to.bz2
http://other/page/to/file.bz2

Can you help me?

3 answers:

Answer 0 (score: 4)

Sed and grep:

sed 's/.*href=\"\(.*\)\".*/\1/g' file | grep -oP '.*\.bz2$'
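
Because both `.*` parts of that sed expression are greedy, it keeps at most one URL per input line (the last href), and the capture would also swallow any attributes that follow href on the same line. A minimal sketch of a variant that reports every match is shown below. It assumes a grep that supports `-o` (GNU and BSD grep do) and that `mktemp` is available; the function name `extract_bz2_urls` and the two-link sample line are made up for illustration and are not from the question.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer): print every href value that
# ends in .bz2, one per line.
extract_bz2_urls() {
    # -o prints each match on its own line, so several links on one input
    # line are all reported; [^"]* keeps a match inside one attribute value.
    grep -oE 'href="[^"]*\.bz2"' "$1" |
        # Drop the leading href=" and the trailing quote.
        sed -e 's/^href="//' -e 's/"$//'
}

# Made-up input with two links on the same line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<td><a href="http://another/page/to.bz2">a</a> <a href="http://other/page/to/file.bz2">b</a></td>
<td><a href="http://a/web/page/">American cities</a></td>
EOF

extract_bz2_urls "$sample"
rm -f "$sample"
```

On this sample the function prints both .bz2 URLs from the first line and nothing for the second. It also avoids `-P`, which is specific to GNU grep.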

Answer 1 (score: 1)

Use a proper parser. For example, with xsh:

open :F html input.html ;
for //a/@href['bz2' = xsh:matches(., '\.bz2$')]
    echo (.) ;

Answer 2 (score: 1)

$ sed -n 's/.*href="\([^"]*\.bz2\)".*/\1/p' file
http://path/to/file.bz2
http://another/page/to.bz2
http://other/page/to/file.bz2
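
Since the question also mentions awk, here is a rough awk sketch of the same extraction. It assumes every attribute value is enclosed in double quotes, that the quotes are balanced on each line, and that `mktemp` is available; the function name `extract_bz2_urls_awk` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: the same extraction written in awk.
extract_bz2_urls_awk() {
    # With '"' as the field separator, even-numbered fields are the quoted
    # attribute values (given balanced quotes), and the field before each
    # one ends with the attribute name and an equals sign.
    awk -F'"' '{
        for (i = 2; i <= NF; i += 2)
            if ($(i - 1) ~ /href=$/ && $i ~ /\.bz2$/)
                print $i
    }' "$1"
}

# Two lines taken from the question's sample input.
sample=$(mktemp)
cat > "$sample" <<'EOF'
  <td><a rel="nofollow" class="external text" href="http://a/web/page/">American cities</a></td>
  <td><a rel="nofollow" class="external text" href="http://another/page/to.bz2">us_cities.txt.bz2</a> (77,081 bytes)</td>
EOF

extract_bz2_urls_awk "$sample"
rm -f "$sample"
```

Unlike the sed one-liner, the loop checks every quoted value on the line, so it also handles lines that carry more than one link.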