Use case: I have an HTML file and I need to search for text in it.
Say my file is:
</script> <!-- right side bands --> </div> </div> <div class="fclear"></div> <div class="fk-mainfooter fksk-mainfooter tpadding20 bpadding5 new-vd" id="fk-mainfooter-id"> <div class="fk-content fksk-content bpadding10"> <div class="line tpadding20 bpadding20 footer-dark-top-border"> <div class="unit fk-footer-links-container"> <div class="unit size1of4"> <span class="fk-footer-sub-head fk-footer-unit"><strong>Help</strong></span> <a class="fk-footer-unit fk-footer-link" href="/s/help/payments">Payments</a> <a class="fk-footer-unit fk-footer-link" href="/help/savedcard_how">Saved Cards</a> <a class="fk-footer-unit fk-footer-link" href="/s/help/shipping">Shipping</a> <a class="fk-footer-unit fk-footer-link" href="/s/help/cancellation-returns">Cancellation &
Returns</a> <a class="fk-footer-unit fk-footer-link" href="/s/help">FAQ</a> <a class="fk-footer-unit fk-footer-link" href="https://seller.flipkart.com/fiv">Report Infringement</a> </div> <div class="unit size1of4"> <span class="fk-footer-sub-head fk-footer-unit"><strong>Flipkart</strong></span> <a class="fk-footer-unit fk-footer-link" href="/s/contact">Contact Us</a> <a class="fk-footer-unit fk-footer-link" href="/about-us">About Us</a> <a class="fk-footer-unit fk-footer-link" target="_blank" href="/ol?link=http%3A%2F%2Fflipkartcareers.com%2F">Careers</a> <a class="fk-footer-unit fk-footer-link" href="/ol?link=http%3A%2F%2Fblog.flipkart.com%2F">Blog</a> <a class="fk-footer-unit fk-footer-link" href="/s/press">Press</a> <a class="fk-footer-unit fk-footer-link" target="_blank" href="http://slashn.flipkart.net/">Slash N</a> </div> <div class="unit size1of4 required-tracking" data-tracking-id="ch_vn"> <span class="fk-footer-sub-head fk-footer-unit"><strong>Flipkart eBooks</strong></span> <a class="fk-footer-unit fk-footer-link" href="/ebooks/gettingstarted" target="_blank">eBooks Quick Start Guide</a> <a class="fk-footer-unit fk-footer-link" href="/help/flyteeBookfaq" target="_blank">eBooks FAQ</a> <a class="fk-footer-unit fk-footer-link" href="/ebooks/apps" target="_blank">eBooks App</a> <a href="/mobile-apps" data-tracking-id="mobile_apps"><div class="footer-mobile-apps lazy-bgImage" data-bgImage="//img5a.flixcart.com/www/prod/images/social-sprite-b3c0ada7.png"></div></a> </div> <div class="lastUnit size1of4"> <span class="fk-footer-sub-head fk-footer-unit"><strong>Misc</strong></span> <a class="fk-footer-unit fk-footer-link" href="http://www.flipkart.com">Online Shopping</a> <a class="fk-footer-unit fk-footer-link" href="/affiliate/" target="_blank">Affiliate</a> <a class="fk-footer-unit fk-footer-link" href="/buy-gift-voucher">e-Gift Voucher</a> <a class="fk-footer-unit fk-footer-link" href="/?sitevariant=mobile">Flipkart lite</a> <a class="fk-footer-unit 
fk-footer-link" href="/flipkart-first">Flipkart First Subscription</a> <a class="fk-footer-unit fk-footer-link" href="/elearning-faq">eLearning FAQ</a> </div> </div> <div class="lastUnit size10f4 lpadding15 fk-trust-boosters"> <p class="fk-footer-sub-head"><strong>Safe and Secure Shopping</strong></p> <p class="bpadding15 fk-trust-content">All major credit and debit cards are accepted. We also accept payments by <strong>Internet Banking, Cash on Delivery</strong> and <strong>Equated Monthly Installments(EMI).</strong></p> </div> </div> <div class="fk-footer-ssa"> <a href="/account/orders?srcLink=footer" class="login-required"> <ul class="line ssa-block"> <li class="unit size1of3 ssa-unit"><i class="icon track-icon"></i><span class="text">Track your<br /> order</span></li> <li class="unit size1of3 ssa-unit"><i class="icon return-icon"></i><span class="text">Free & easy<br /> returns</span></li> <li class="lastUnit ssa-unit"><i class="icon cancel-icon"></i><span class="text">Online<br /> cancellations</span></li> </ul> </a> </div> <div class="line fk-footer-policy"> <div class="unit tpadding5 tc-links"> <span><span class="policies-title boldtext">Policies:</span> <a href="/s/terms">Terms of use</a> | <a href="/s/paymentsecurity">Security</a> | <a
href="/s/privacypolicy">Privacy</a> | <a
href="https://seller.flipkart.com/fiv">Infringement</a></span> <span class="fk-footet-cr">© 2007-2014 <span>Flipkart.com.</span></span> </div> <div class="fk-footer-kit unitExt fk-inline-block"> <strong class="title fk-float-left">Keep in touch</strong> <a class="facebook_icn inner rmargin5 lazy-bgImage" data-bgImage="//img5a.flixcart.com/www/prod/images/social-sprite-b3c0ada7.png" target="_blank" href="/ol?link=http%3A%2F%2Fwww.facebook.com%2Fflipkart"></a> <a class="twitter_icn inner rmargin5 lazy-bgImage" data-bgImage="//img5a.flixcart.com/www/prod/images/social-sprite-b3c0ada7.png" target="_blank" href="/ol?link=http%3A%2F%2Fwww.twitter.com%2Fflipkart"></a> <a class="google-plus_icn inner fk-sprite-hf rmargin5 lazy-bgImage" data-bgImage="//img5a.flixcart.com/www/prod/images/social-sprite-b3c0ada7.png" target="_blank" href="/ol?link=https%3A%2F%2Fplus.google.com%2F109591199284807005836%2Fposts"></a> <a class="youtube_icn inner rmargin5 lazy-bgImage" data-bgImage="//img5a.flixcart.com/www/prod/images/social-sprite-b3c0ada7.png" target="_blank" href="/ol?link=http%3A%2F%2Fwww.youtube.com%2Fflipkart"></a> </div> </div> <div class="line top-brand-links tpadding10 bpadding10"> <div class="line boldtext bpadding10 top-brands-title">
Top Stores : <a href="/brands">Brand Directory</a> | <a href="/store-directory">Store Directory</a> </div> <div class="line"> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
Most searched for on Flipkart: </div> <div class="lastUnit"> <a href="/games/call-of-duty~series/pr?sid=4rr,tg9"> Call Of Duty</a>
| <a href="/androidone"> Android One</a>
| <a href="/offers"> Diwali Offers</a> </div> </div> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
Mobiles: </div> <div class="lastUnit"> <a href="/moto-e/p/itmdvuwsybgnbtha"> Moto E</a>
| <a href="/q/samsung-mobiles"> Samsung Mobile</a>
| <a href="/q/micromax-mobiles"> Micromax Mobile</a>
| <a href="/q/nokia-mobiles"> Nokia Mobile</a>
| <a href="/q/htc-mobiles"> HTC Mobile</a>
| <a href="/q/sony-mobiles"> Sony Mobile</a>
| <a href="/q/apple-mobiles"> Apple Mobile</a>
| <a href="/q/lg-mobiles"> LG Mobile</a>
| <a href="/q/karbonn-mobiles"> Karbonn Mobile</a>
| <a href="/mobiles">View all</a> </div> </div> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
Camera: </div> <div class="lastUnit"> <a href="/q/nikon-cameras"> Nikon Camera</a>
| <a href="/q/canon-cameras"> Canon Camera</a>
| <a href="/q/sony-cameras"> Sony Camera</a>
| <a href="/cameras/samsung~brand/pr?sid=jek,p31"> Samsung Camera</a>
| <a href="/q/canon-dslr-cameras"> Canon DSLR</a>
| <a href="/q/nikon-dslr-cameras"> Nikon DSLR</a>
| <a href="/cameras/dslr~type/pr?sid=jek,p31"> DSLR Camera</a>
| <a href="/camera-accessories/lenses/pr?sid=jek,6l2,e9y"> Camera Lens</a>
| <a href="/camera-accessories/tripods/pr?sid=jek,6l2,ce6"> Camera Tripod</a>
| <a href="/cameras">View all</a> </div> </div> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
Laptops: </div> <div class="lastUnit"> <a href="/laptops/apple~brand/pr?sid=6bo,b5g"> Apple Laptop</a>
| <a href="/laptops/acer~brand/pr?sid=6bo,b5g"> Acer Laptop</a>
| <a href="/laptops/samsung~brand/pr?sid=6bo,b5g"> Samsung Laptop</a>
| <a href="/q/lenovo-laptops"> Lenovo Laptop</a>
| <a href="/q/sony-laptops"> Sony Laptop</a>
| <a href="/q/dell-laptops"> Dell Laptop</a>
| <a href="/laptops/asus~brand/pr?sid=6bo,b5g"> Asus Laptop</a>
| <a href="/laptops/toshiba~brand/pr?sid=6bo,b5g"> Toshiba Laptop</a>
| <a href="/laptops/lg~brand/pr?sid=6bo,b5g"> LG Laptop</a>
| <a href="/q/hp-laptops"> HP Laptop</a>
| <a href="/laptops/~notebook/pr?sid=6bo,b5g"> Notebook</a>
| <a href="/brands/laptops?sid=6bo,b5g">View all</a> </div> </div> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
TVs: </div> <div class="lastUnit"> <a href="/home-entertainment/tvs/sony~brand/pr?sid=ckf%2Cczl"> Sony TV</a>
| <a href="/home-entertainment/tvs/samsung~brand/pr?sid=ckf%2Cczl"> Samsung TV</a>
| <a href="/home-entertainment/tvs/lg~brand/pr?sid=ckf%2Cczl"> LG TV</a>
| <a href="/home-entertainment/tvs/panasonic~brand/pr?sid=ckf%2Cczl"> Panasonic TV</a>
| <a href="/home-entertainment/tvs/onida~brand/pr?sid=ckf%2Cczl"> Onida TV</a>
| <a href="/home-entertainment/tvs/toshiba~brand/pr?sid=ckf%2Cczl"> Toshiba TV</a>
| <a href="/home-entertainment/tvs/philips~brand/pr?sid=ckf%2Cczl"> Philips TV</a>
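The problem is that most HTML pages are effectively a single line, so when I search with:
grep -F "my text of interest" html_file.htm
any match dumps the entire file. That shows me no useful context and makes debugging very painful.
Note that what I am searching for sits on one physical line in the file, even though it would not look that way when the HTML is rendered.
Example:
Say I need to search this file for "/laptops/samsung~brand/pr?sid=6bo,b5g", but with grep, as described above, I get the whole dump (and more...).
How can I search HTML efficiently for this use case and get only the logically adjacent context (grep -A 4 -B 4 does not apply directly)? Can I make the tool interpret the file as HTML and then read the adjacent context?
Sample output for a match: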
<a href="/camera-accessories/lenses/pr?sid=jek,6l2,e9y"> Camera Lens</a>
| <a href="/camera-accessories/tripods/pr?sid=jek,6l2,ce6"> Camera Tripod</a>
| <a href="/cameras">View all</a> </div> </div> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
Laptops: </div> <div class="lastUnit"> <a href="/laptops/apple~brand/pr?sid=6bo,b5g"> Apple Laptop</a>
| <a href="/laptops/acer~brand/pr?sid=6bo,b5g"> Acer Laptop</a>
| <a href="/laptops/samsung~brand/pr?sid=6bo,b5g"> Samsung Laptop</a>
| <a href="/q/lenovo-laptops"> Lenovo Laptop</a>
| <a href="/q/sony-laptops"> Sony Laptop</a>
| <a href="/q/dell-laptops"> Dell Laptop</a>
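Preferably with the matched term highlighted, "/laptops/samsung~brand/pr?sid=6bo,b5g" in this case.
Answer 0 (score: 3)
Use an HTML parser rather than tools designed for processing plain text files, for example Mojo::DOM for Perl: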
use strict;
use warnings;
use feature ":5.10";
use Mojo::DOM;
use List::Util "first";
# construct DOM object from file
my $d = Mojo::DOM->new(do { local $/; <> });
# get all <a> tags
my $a = $d->find("a");
# find the index of the one we are interested in
my $href = '/laptops/samsung~brand/pr?sid=6bo,b5g';
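# $a->size - 1 is the last valid index; treat a missing href as an empty string
my $index = first { ($a->[$_]->attr('href') // '') eq $href } 0 .. $a->size - 1;
die "No <a> with href '$href' found\n" unless defined $index;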
# print links
say $a->slice($index-4..$index+4)->map("to_string")->join("\n");
Run it as perl script.pl file.html.
Output:
<a href="/camera-accessories/tripods/pr?sid=jek,6l2,ce6"> Camera Tripod</a>
<a href="/cameras">View all</a>
<a href="/laptops/apple~brand/pr?sid=6bo,b5g"> Apple Laptop</a>
<a href="/laptops/acer~brand/pr?sid=6bo,b5g"> Acer Laptop</a>
<a href="/laptops/samsung~brand/pr?sid=6bo,b5g"> Samsung Laptop</a>
<a href="/q/lenovo-laptops"> Lenovo Laptop</a>
<a href="/q/sony-laptops"> Sony Laptop</a>
<a href="/q/dell-laptops"> Dell Laptop</a>
<a href="/laptops/asus~brand/pr?sid=6bo,b5g"> Asus Laptop</a>
To pass options to the script à la grep, you can use Getopt::Std:
use Getopt::Std;
our($opt_A, $opt_B) = (0, 0);
getopts('A:B:');
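This way you can pass the options -A and -B to the script, e.g. perl script.pl -A 4 -B 4 file.html. You can then change the hard-coded 4 in the code above to ($index-$opt_B..$index+$opt_A), so that, as with grep, -B controls the context before the match and -A the context after it.
To pass the pattern, you could specify another option.
To colorize the line of interest in the output, you can use Term::ANSIColor: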
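use Term::ANSIColor;
# context before the match, the match itself in green, then context after
say $a->slice($index-4..$index-1)->map("to_string")->join("\n");
say colored($a->[$index]->to_string, 'green');
say $a->slice($index+1..$index+4)->map("to_string")->join("\n");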
Answer 1 (score: 1)
Use grep as follows:
grep --color -F -A 4 -B 4 '/laptops/samsung~brand/pr?sid=6bo,b5g' 'my_file'
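The -A and -B options count physical lines, so they only help when the file has line breaks near the match. As a rough sketch of one possible workaround (not part of the original answer), a line break can be inserted before every opening tag and the result piped to grep. The sample links and the variable names tmp and context below are made up for illustration, and the awk pattern is a plain text substitution, not a real HTML parse.

```shell
# Hypothetical sample file: several links on one physical line.
tmp=$(mktemp)
printf '%s\n' '<a href="/q/one"> One</a> | <a href="/q/two"> Two</a> | <a href="/laptops/samsung~brand/pr?sid=6bo,b5g"> Samsung Laptop</a> | <a href="/q/three"> Three</a> | <a href="/q/four"> Four</a>' > "$tmp"

# Put a newline before every opening tag ('<' not followed by '/'),
# then ask grep for one line of context on each side of the match.
context=$(awk '{ gsub("<[^/]", "\n&"); print }' "$tmp" |
  grep -F -A 1 -B 1 '/laptops/samsung~brand/pr?sid=6bo,b5g')
printf '%s\n' "$context"
rm -f "$tmp"
```

With one opening tag per line, the numbers given to -A and -B roughly correspond to a number of neighbouring elements, which is closer to the "logical" context the question asks for.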
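Edit
Since all the data is on a single line, you can use grep to print the string together with a few characters around it:
grep -o --color '.\{0,3\}/laptops/samsung~brand/pr?sid=6bo,b5g.\{0,3\}' 'my_file'
This finds the pattern and prints up to 3 characters before and after the string. The . matches any character except a newline.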
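Three characters of context is usually too little to be readable, and the bounds can be widened. The following is a minimal sketch of the same grep -o idea with 12 characters on each side; the one-line sample and the variable names tmp and result are assumptions made for the demonstration.

```shell
# Hypothetical sample: a single-line HTML fragment written to a temp file.
tmp=$(mktemp)
printf '%s\n' '<a href="/q/a"> A</a> | <a href="/laptops/samsung~brand/pr?sid=6bo,b5g"> Samsung Laptop</a> | <a href="/q/b"> B</a>' > "$tmp"

# -o prints only the matched part of the line.
# .\{0,12\} allows up to 12 arbitrary characters on each side of the string.
# In a basic regular expression the '?' is an ordinary character, so it is not escaped.
result=$(grep -o '.\{0,12\}/laptops/samsung~brand/pr?sid=6bo,b5g.\{0,12\}' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

The context here is counted in characters, not in HTML elements, so the output may start or end in the middle of a tag.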