Deleting an <a> tag in the middle of other tags

Date: 2016-06-07 22:25:47

Tags: bash sed grep find

I have several lines in HTML files that look like this:

<div class="thumb tright">
   <div class="thumbinner" style="width:302px;">
       <a href="https://example.com/en/File:Tools_my_settings.png" class="image">
          <img alt="" src="images_en/thumb/0/0a/tool_settings.png/9dd94c2d99eea9.png" width="300" height="110" class="thumbimage" srcset="/my/en/images_en/thumb/0/0a/my_settings.png/450px-my_settings.png 1.5x, /31/en/images_en/thumb/0/0a/my_settings.png/600px-my_settings.png 2x"/>
       </a> 
       <div class="thumbcaption">
           <div class="magnify">
              <a href="https://example.com/en/File:Tools_my_settings.png" class="internal" title="Enlarge"></a>
           </div>
           Tool settings
       </div>
    </div>
</div>Tools Features - So Far

I need to delete the following <a href> opening tag and the corresponding closing </a> tag that comes immediately after the .png 2x"/> element.

<a href="https://example.com/en/File:Tools_my_settings.png" class="image">...</a>

In the end I need the block to look like this:

<div class="thumb tright">
    <div class="thumbinner" style="width:302px;">
        <img alt="" src="images_en/thumb/0/0a/tool_settings.png/9dd94c2d99eea9.png" width="300" height="110" class="thumbimage" srcset="/my/en/images_en/thumb/0/0a/my_settings.png/450px-my_settings.png 1.5x, /31/en/images_en/thumb/0/0a/my_settings.png/600px-my_settings.png 2x"/>
        <div class="thumbcaption">
            <div class="magnify">
                <a href="https://example.com/en/File:Tools_my_settings.png" class="internal" title="Enlarge"></a>
            </div>
            Tool settings
        </div>
    </div>
</div>Tools Features - So Far

All files contain the same pattern: <a href="https://choopy.com/en/File:... This is what I have tried:

find /var/www/clients/client1/web2/web/lms_docs/ -type f -print0 | xargs -0 sed 's/<a\shref="https:\/\/choopy.com\/en\/File:([--:\w?@%&+~#=]*[a-z])\.png"\sclass="image">//g'

But it doesn't do anything, and I don't know how to delete the corresponding closing tag </a>.

1 Answer:

Answer 0 (score: 0)

This deletes every <a href> tag of class image that points to https://...com, together with the corresponding </a>:

find /var/www/clients/client1/web2/web/lms_docs/ -type f -print0 | xargs -0 sed '/<a href=\"https:\/\/.*\.com\/en\/File:.*\" class=\"image\">/,/<\/a>/{ /<a href=\"https:\/\/.*\.com\/en\/File:.*\" class=\"image\">/d; /<\/a>/d}'

This one is for a specific domain, https://example.com:

find /var/www/clients/client1/web2/web/lms_docs/ -type f -print0 | xargs -0 sed '/<a href=\"https:\/\/example\.com\/en\/File:.*\" class=\"image\">/,/<\/a>/{ /<a href=\"https:\/\/example\.com\/en\/File:.*\" class=\"image\">/d; /<\/a>/d}'
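Note that neither command above passes -i, so sed prints the edited text to standard output and leaves the files untouched. The following is a minimal sketch of an in-place variant, not a tested drop-in for the real tree: the scratch directory, the page.html file, and the -name '*.html' filter are assumptions made for illustration, and find -exec ... {} + is used here in place of the xargs pipeline.

```shell
#!/bin/sh
# Hypothetical scratch directory standing in for the real document root.
dir=$(mktemp -d)
printf '%s\n' \
  '<a href="https://example.com/en/File:x.png" class="image">' \
  '<img src="x.png"/>' \
  '</a>' > "$dir/page.html"

# Same opening-tag pattern as in the answer, kept in a variable to avoid repeating it.
start='<a href="https:\/\/example\.com\/en\/File:.*" class="image">'

# -i.bak rewrites each file in place and keeps the original as <name>.bak.
# The -name filter (an assumption) also keeps the .bak copies out of later runs.
find "$dir" -type f -name '*.html' \
  -exec sed -i.bak "/$start/,/<\/a>/{ /$start/d; /<\/a>/d; }" {} +

edited=$(cat "$dir/page.html")
backup_lines=$(grep -c '' "$dir/page.html.bak")
printf '%s\n' "$edited"
rm -rf "$dir"
```

Keeping the .bak copies makes it possible to diff a few files before deleting the backups.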

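It works as follows: the range address "/start/,/end/" selects every line from the <a href ...> tag with class image through the corresponding </a>. Then, for the selected block, the commands inside "{}" match the same two patterns again and delete just those lines with "/pattern/d".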
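The range-then-delete behaviour can be checked on a small sample. This is only a sketch: the temporary file and its trimmed-down contents are made up for illustration, and it assumes, as in the question's sample, that the opening tag and its </a> sit on separate lines.

```shell
#!/bin/sh
# Hypothetical sample file; the contents are a shortened version of the question's HTML.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<div class="thumbinner" style="width:302px;">
    <a href="https://example.com/en/File:Tools_my_settings.png" class="image">
        <img alt="" src="tool_settings.png" class="thumbimage"/>
    </a>
    <div class="magnify">
        <a href="https://example.com/en/File:Tools_my_settings.png" class="internal" title="Enlarge"></a>
    </div>
</div>
EOF

# Range: from the class="image" opening line to the next line containing </a>.
# Inside the range only the opening and closing lines are deleted; the <img> line survives.
start='<a href="https:\/\/example\.com\/en\/File:.*" class="image">'
result=$(sed "/$start/,/<\/a>/{ /$start/d; /<\/a>/d; }" "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

The class="internal" link is left alone because its line never matches the range's start pattern. sed only starts looking for the end of a range on the line after the start, so if an opening class="image" tag and its </a> shared one line, the range would run on to the next </a> further down.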

More info: section 4.24