Searching effectively in HTML/XML files

Time: 2014-12-20 17:37:15

Tags: html ubuntu awk sed grep

Use case: I have an HTML file and I need to search for text in it.

Say my file is:

</script> <!-- right side bands --> </div> </div> <div class="fclear"></div> <div class="fk-mainfooter fksk-mainfooter tpadding20 bpadding5 new-vd" id="fk-mainfooter-id"> <div class="fk-content fksk-content bpadding10"> <div class="line tpadding20 bpadding20 footer-dark-top-border"> <div class="unit fk-footer-links-container"> <div class="unit size1of4"> <span class="fk-footer-sub-head fk-footer-unit"><strong>Help</strong></span> <a class="fk-footer-unit fk-footer-link" href="/s/help/payments">Payments</a> <a class="fk-footer-unit fk-footer-link" href="/help/savedcard_how">Saved Cards</a> <a class="fk-footer-unit fk-footer-link" href="/s/help/shipping">Shipping</a> <a class="fk-footer-unit fk-footer-link" href="/s/help/cancellation-returns">Cancellation &amp;
 Returns</a> <a class="fk-footer-unit fk-footer-link" href="/s/help">FAQ</a> <a class="fk-footer-unit fk-footer-link" href="https://seller.flipkart.com/fiv">Report Infringement</a> </div> <div class="unit size1of4"> <span class="fk-footer-sub-head fk-footer-unit"><strong>Flipkart</strong></span> <a class="fk-footer-unit fk-footer-link" href="/s/contact">Contact Us</a> <a class="fk-footer-unit fk-footer-link" href="/about-us">About Us</a> <a class="fk-footer-unit fk-footer-link" target="_blank" href="/ol?link=http%3A%2F%2Fflipkartcareers.com%2F">Careers</a> <a class="fk-footer-unit fk-footer-link" href="/ol?link=http%3A%2F%2Fblog.flipkart.com%2F">Blog</a> <a class="fk-footer-unit fk-footer-link" href="/s/press">Press</a> <a class="fk-footer-unit fk-footer-link" target="_blank" href="http://slashn.flipkart.net/">Slash N</a> </div> <div class="unit size1of4 required-tracking" data-tracking-id="ch_vn"> <span class="fk-footer-sub-head fk-footer-unit"><strong>Flipkart eBooks</strong></span> <a class="fk-footer-unit fk-footer-link" href="/ebooks/gettingstarted" target="_blank">eBooks Quick Start Guide</a> <a class="fk-footer-unit fk-footer-link" href="/help/flyteeBookfaq" target="_blank">eBooks FAQ</a> <a class="fk-footer-unit fk-footer-link" href="/ebooks/apps" target="_blank">eBooks App</a> <a href="/mobile-apps" data-tracking-id="mobile_apps"><div class="footer-mobile-apps lazy-bgImage" data-bgImage="//img5a.flixcart.com/www/prod/images/social-sprite-b3c0ada7.png"></div></a> </div> <div class="lastUnit size1of4"> <span class="fk-footer-sub-head fk-footer-unit"><strong>Misc</strong></span> <a class="fk-footer-unit fk-footer-link" href="http://www.flipkart.com">Online Shopping</a> <a class="fk-footer-unit fk-footer-link" href="/affiliate/" target="_blank">Affiliate</a> <a class="fk-footer-unit fk-footer-link" href="/buy-gift-voucher">e-Gift Voucher</a> <a class="fk-footer-unit fk-footer-link" href="/?sitevariant=mobile">Flipkart lite</a> <a class="fk-footer-unit 
fk-footer-link" href="/flipkart-first">Flipkart First Subscription</a> <a class="fk-footer-unit fk-footer-link" href="/elearning-faq">eLearning FAQ</a> </div> </div> <div class="lastUnit size10f4 lpadding15 fk-trust-boosters"> <p class="fk-footer-sub-head"><strong>Safe and Secure Shopping</strong></p> <p class="bpadding15 fk-trust-content">All major credit and debit cards are accepted. We also accept payments by <strong>Internet Banking, Cash on Delivery</strong> and <strong>Equated Monthly Installments(EMI).</strong></p> </div> </div> <div class="fk-footer-ssa"> <a href="/account/orders?srcLink=footer" class="login-required"> <ul class="line ssa-block"> <li class="unit size1of3 ssa-unit"><i class="icon track-icon"></i><span class="text">Track your<br /> order</span></li> <li class="unit size1of3 ssa-unit"><i class="icon return-icon"></i><span class="text">Free &amp; easy<br /> returns</span></li> <li class="lastUnit ssa-unit"><i class="icon cancel-icon"></i><span class="text">Online<br /> cancellations</span></li> </ul> </a> </div> <div class="line fk-footer-policy"> <div class="unit tpadding5 tc-links"> <span><span class="policies-title boldtext">Policies:</span> <a href="/s/terms">Terms of use</a> | <a href="/s/paymentsecurity">Security</a> | <a
 href="/s/privacypolicy">Privacy</a> | <a
 href="https://seller.flipkart.com/fiv">Infringement</a></span> <span class="fk-footet-cr">&copy; 2007-2014 <span>Flipkart.com.</span></span> </div> <div class="fk-footer-kit unitExt fk-inline-block"> <strong class="title fk-float-left">Keep in touch</strong> <a class="facebook_icn inner rmargin5 lazy-bgImage" data-bgImage="//img5a.flixcart.com/www/prod/images/social-sprite-b3c0ada7.png" target="_blank" href="/ol?link=http%3A%2F%2Fwww.facebook.com%2Fflipkart"></a> <a class="twitter_icn inner rmargin5 lazy-bgImage" data-bgImage="//img5a.flixcart.com/www/prod/images/social-sprite-b3c0ada7.png" target="_blank" href="/ol?link=http%3A%2F%2Fwww.twitter.com%2Fflipkart"></a> <a class="google-plus_icn inner fk-sprite-hf rmargin5 lazy-bgImage" data-bgImage="//img5a.flixcart.com/www/prod/images/social-sprite-b3c0ada7.png" target="_blank" href="/ol?link=https%3A%2F%2Fplus.google.com%2F109591199284807005836%2Fposts"></a> <a class="youtube_icn inner rmargin5 lazy-bgImage" data-bgImage="//img5a.flixcart.com/www/prod/images/social-sprite-b3c0ada7.png" target="_blank" href="/ol?link=http%3A%2F%2Fwww.youtube.com%2Fflipkart"></a> </div> </div> <div class="line top-brand-links tpadding10 bpadding10"> <div class="line boldtext bpadding10 top-brands-title">
 Top Stores : <a href="/brands">Brand Directory</a> | <a href="/store-directory">Store Directory</a> </div> <div class="line"> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
 Most searched for on Flipkart: </div> <div class="lastUnit"> <a href="/games/call-of-duty~series/pr?sid=4rr,tg9"> Call Of Duty</a>
 | <a href="/androidone"> Android One</a>
 | <a href="/offers"> Diwali Offers</a> </div> </div> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
 Mobiles: </div> <div class="lastUnit"> <a href="/moto-e/p/itmdvuwsybgnbtha"> Moto E</a>
 | <a href="/q/samsung-mobiles"> Samsung Mobile</a>
 | <a href="/q/micromax-mobiles"> Micromax Mobile</a>
 | <a href="/q/nokia-mobiles"> Nokia Mobile</a>
 | <a href="/q/htc-mobiles"> HTC Mobile</a>
 | <a href="/q/sony-mobiles"> Sony Mobile</a>
 | <a href="/q/apple-mobiles"> Apple Mobile</a>
 | <a href="/q/lg-mobiles"> LG Mobile</a>
 | <a href="/q/karbonn-mobiles"> Karbonn Mobile</a>
 | <a href="/mobiles">View all</a> </div> </div> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
 Camera: </div> <div class="lastUnit"> <a href="/q/nikon-cameras"> Nikon Camera</a>
 | <a href="/q/canon-cameras"> Canon Camera</a>
 | <a href="/q/sony-cameras"> Sony Camera</a>
 | <a href="/cameras/samsung~brand/pr?sid=jek,p31"> Samsung Camera</a>
 | <a href="/q/canon-dslr-cameras"> Canon DSLR</a>
 | <a href="/q/nikon-dslr-cameras"> Nikon DSLR</a>
 | <a href="/cameras/dslr~type/pr?sid=jek,p31"> DSLR Camera</a>
 | <a href="/camera-accessories/lenses/pr?sid=jek,6l2,e9y"> Camera Lens</a>
 | <a href="/camera-accessories/tripods/pr?sid=jek,6l2,ce6"> Camera Tripod</a>
 | <a href="/cameras">View all</a> </div> </div> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
 Laptops: </div> <div class="lastUnit"> <a href="/laptops/apple~brand/pr?sid=6bo,b5g"> Apple Laptop</a>
 | <a href="/laptops/acer~brand/pr?sid=6bo,b5g"> Acer Laptop</a>
 | <a href="/laptops/samsung~brand/pr?sid=6bo,b5g"> Samsung Laptop</a>
 | <a href="/q/lenovo-laptops"> Lenovo Laptop</a>
 | <a href="/q/sony-laptops"> Sony Laptop</a>
 | <a href="/q/dell-laptops"> Dell Laptop</a>
 | <a href="/laptops/asus~brand/pr?sid=6bo,b5g"> Asus Laptop</a>
 | <a href="/laptops/toshiba~brand/pr?sid=6bo,b5g"> Toshiba Laptop</a>
 | <a href="/laptops/lg~brand/pr?sid=6bo,b5g"> LG Laptop</a>
 | <a href="/q/hp-laptops"> HP Laptop</a>
 | <a href="/laptops/~notebook/pr?sid=6bo,b5g"> Notebook</a>
 | <a href="/brands/laptops?sid=6bo,b5g">View all</a> </div> </div> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
 TVs: </div> <div class="lastUnit"> <a href="/home-entertainment/tvs/sony~brand/pr?sid=ckf%2Cczl"> Sony TV</a>
 | <a href="/home-entertainment/tvs/samsung~brand/pr?sid=ckf%2Cczl"> Samsung TV</a>
 | <a href="/home-entertainment/tvs/lg~brand/pr?sid=ckf%2Cczl"> LG TV</a>
 | <a href="/home-entertainment/tvs/panasonic~brand/pr?sid=ckf%2Cczl"> Panasonic TV</a>
 | <a href="/home-entertainment/tvs/onida~brand/pr?sid=ckf%2Cczl"> Onida TV</a>
 | <a href="/home-entertainment/tvs/toshiba~brand/pr?sid=ckf%2Cczl"> Toshiba TV</a>
 | <a href="/home-entertainment/tvs/philips~brand/pr?sid=ckf%2Cczl"> Philips TV</a>

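The problem is that most HTML pages are effectively a single line, so when I search with grep -F "my text of interest" html_file.html and there is a match, the whole file gets dumped. That gives me no usable context and makes debugging very painful. In other words, I am searching for something that sits on one physical line, even though it does not look that way when the page is viewed as HTML.

Example: say I need to search this file for "/laptops/samsung~brand/pr?sid=6bo,b5g". With grep, as described above, I get the whole dump (and more...). How can I search HTML effectively for this use case and get only the logically adjacent context (grep -A 4 -B 4 does not apply directly)? Can I have the file interpreted as HTML and then read the adjacent context?

Sample output for a match: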

<a href="/camera-accessories/lenses/pr?sid=jek,6l2,e9y"> Camera Lens</a>
 | <a href="/camera-accessories/tripods/pr?sid=jek,6l2,ce6"> Camera Tripod</a>
 | <a href="/cameras">View all</a> </div> </div> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
 Laptops: </div> <div class="lastUnit"> <a href="/laptops/apple~brand/pr?sid=6bo,b5g"> Apple Laptop</a>
 | <a href="/laptops/acer~brand/pr?sid=6bo,b5g"> Acer Laptop</a>
 | <a href="/laptops/samsung~brand/pr?sid=6bo,b5g"> Samsung Laptop</a>
 | <a href="/q/lenovo-laptops"> Lenovo Laptop</a>
 | <a href="/q/sony-laptops"> Sony Laptop</a>
 | <a href="/q/dell-laptops"> Dell Laptop</a>

Preferably with the matched term highlighted, "/laptops/samsung~brand/pr?sid=6bo,b5g" in this case.

2 Answers:

Answer 0: (score: 3)

Use an HTML parser rather than tools designed for processing plain text files, for example Mojo::DOM for Perl:

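use strict;
use warnings;
use feature ":5.10";
use Mojo::DOM;
use List::Util qw(first max min);

# construct DOM object from the file named on the command line (or STDIN)
my $d = Mojo::DOM->new(do { local $/; <> });

# get all <a> tags
my $a = $d->find("a");

# find the index of the one we are interested in
# (stop at size-1, and treat a missing href as an empty string)
my $href = '/laptops/samsung~brand/pr?sid=6bo,b5g';
my $index = first { ($a->[$_]->attr('href') // '') eq $href } 0..$a->size-1;
die "no link with href $href found\n" unless defined $index;

# print links, clamping the range to the first and last link
my ($from, $to) = (max(0, $index-4), min($a->size-1, $index+4));
say $a->slice($from..$to)->map("to_string")->join("\n");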

Run it as perl script.pl file.html.

Output:

<a href="/camera-accessories/tripods/pr?sid=jek,6l2,ce6"> Camera Tripod</a>
<a href="/cameras">View all</a>
<a href="/laptops/apple~brand/pr?sid=6bo,b5g"> Apple Laptop</a>
<a href="/laptops/acer~brand/pr?sid=6bo,b5g"> Acer Laptop</a>
<a href="/laptops/samsung~brand/pr?sid=6bo,b5g"> Samsung Laptop</a>
<a href="/q/lenovo-laptops"> Lenovo Laptop</a>
<a href="/q/sony-laptops"> Sony Laptop</a>
<a href="/q/dell-laptops"> Dell Laptop</a>
<a href="/laptops/asus~brand/pr?sid=6bo,b5g"> Asus Laptop</a>

Here are some untested suggestions.

To pass options to the script à la grep, you can use Getopt::Std:

use Getopt::Std;
our($opt_A, $opt_B) = (0, 0);
getopts('A:B:');

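This lets you pass the options -A and -B to the script, for example perl script.pl -A 4 -B 4 file.html. You can then change the hard-coded 4s in the code above to ($index-$opt_B..$index+$opt_A), keeping grep's meaning of -B as context before the match and -A as context after it.

To pass the pattern, you can specify another option.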

To colorize the line you are interested in, you can use Term::ANSIColor and its colored function:

use Term::ANSIColor;

say $a->slice($index-4..$index-1)->map("to_string")->join("\n");
say colored($a->[$index]->to_string, 'green');
say $a->slice($index+1..$index+4)->map("to_string")->join("\n");

Answer 1: (score: 1)

Use grep as follows:

  grep --color -F -A 4 -B 4 '/laptops/samsung~brand/pr?sid=6bo,b5g' 'my_file'
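Note that -A and -B count lines, so on a page that is effectively one long line this command still prints everything. One possible workaround, sketched here as an assumption rather than as part of the original answer, is to let grep -o put each link on its own line first and only then ask for line context. The function name anchors_with_context and the sample markup are made up for illustration. The sketch assumes a grep that supports -o (GNU and BSD grep do) and only handles simple links with no nested tags that sit on one physical line.

```shell
# anchors_with_context PATTERN N: read HTML on stdin, put each simple
# <a ...>text</a> element on its own line, then show N links of context
# around every link that contains the fixed string PATTERN.
anchors_with_context() {
  grep -o '<a [^>]*>[^<]*</a>' |
    grep -F --color=auto -B "$2" -A "$2" -- "$1"
}

# Tiny one-line sample standing in for the real page (hypothetical data).
html='<div><a href="/q/a"> A</a> | <a href="/q/b"> B</a> | <a href="/laptops/samsung~brand/pr?sid=6bo,b5g"> Samsung Laptop</a> | <a href="/q/c"> C</a> | <a href="/q/d"> D</a></div>'

# Show the matching link with one link of context on each side.
printf '%s\n' "$html" | anchors_with_context '/laptops/samsung~brand/pr?sid=6bo,b5g' 1
```

On a real file the call would look like anchors_with_context '/laptops/samsung~brand/pr?sid=6bo,b5g' 4 < my_file. Links whose tag is split across physical lines, or that wrap other elements, are skipped by this pattern, which is one reason a real HTML parser is the more robust choice.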

Edit

Since all the data is on one line, you can use grep -o to print the string together with a few characters around it:

grep -o --color '.\{0,3\}/laptops/samsung~brand/pr?sid=6bo,b5g.\{0,3\}' 'my_file'

This finds the pattern and prints up to 3 characters before and after the string.

. matches any character except a newline.
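The width of the character window can be made a parameter. The following is a minimal sketch, not part of the original answer: char_context is a hypothetical helper name, the sample markup is made up, and the sed step is an added assumption that escapes basic-regex metacharacters so the search string is matched literally, as it would be with grep -F.

```shell
# char_context PATTERN N: read text on stdin and print each occurrence of
# the fixed string PATTERN with up to N characters of context on each side.
char_context() {
  # Escape ] [ \ . * ^ $ so PATTERN is matched literally inside the regex.
  escaped=$(printf '%s\n' "$1" | sed 's/[][\.*^$]/\\&/g')
  grep -o --color=auto ".\{0,$2\}${escaped}.\{0,$2\}"
}

# Tiny one-line sample standing in for the real page (hypothetical data).
html='<a href="/q/x"> X</a> | <a href="/laptops/samsung~brand/pr?sid=6bo,b5g"> Samsung Laptop</a>'

# Print the match with up to 10 characters of context on each side.
printf '%s\n' "$html" | char_context '/laptops/samsung~brand/pr?sid=6bo,b5g' 10
```

With a real file, the call would be char_context '/laptops/samsung~brand/pr?sid=6bo,b5g' 40 < my_file. The window is counted in characters, not tags, so the output can start or end in the middle of an element.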