Get the HTML within an <a> element using WWW::Mechanize

Date: 2017-06-20 13:46:24

Tags: perl hyperlink www-mechanize

I'm extracting special links within an HTML page by using WWW::Mechanize

my $mech = WWW::Mechanize->new();

$mech->get( $uri );

my @links = $mech->find_all_links(url_regex => qr/cgi-bin/);

for my $link ( @links ) {
    # try to get everything between <a href="[...]">HERE</a>
}

The links look like this:

<a href="[...]"><div><div><span>foo bar</span> I WANT THIS TEXT</div></div></a>

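Using $link->text I get foo bar I WANT THIS TEXT, with no way of telling which part of the text was inside the <span> element.

Is there any way to get the raw HTML code instead of the stripped text?

In other words, I need a way to get only I WANT THIS TEXT without knowing the exact text inside the <span> tag.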

1 answer:

Answer 0 (score: 2)

As simbabque has said, you can't do that with WWW::Mechanize.

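In fact, there is very little point in using WWW::Mechanize if you don't want any of its features. If you are using it only to fetch a web page, then use LWP::UserAgent instead. WWW::Mechanize is just a subclass of LWP::UserAgent with a lot of additional stuff that you don't want.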

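Here is an example that uses HTML::TreeBuilder to construct a parse tree of the HTML and locate the links that you want. I have used HTML::TreeBuilder because it is pretty good at tolerating malformed HTML, in a way similar to modern browsers.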

I have been unable to test it, as you haven't provided proper sample data and I'm not inclined to create my own.
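A minimal sketch of the approach described above, not the answerer's original code. It assumes the page content is already available as a string. The helper name link_details, the hash keys it returns, and the sample href /cgi-bin/example are made up for illustration. For each <a> element whose href matches a pattern, it collects the element's markup, the text inside any <span> descendants, and the remaining text with those spans removed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of any module): returns one hash ref
# per <a> element whose href attribute matches $url_re.
sub link_details {
    my ($html, $url_re) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @details;

    for my $link ( $tree->look_down( _tag => 'a', href => $url_re ) ) {

        # Text held in <span> descendants, one entry per <span>
        my @span_text = map { $_->as_text } $link->find_by_tag_name('span');

        # Remove the spans from a copy so the original tree stays intact
        my $copy = $link->clone;
        $_->delete for $copy->find_by_tag_name('span');

        my $text = $copy->as_text;
        $text =~ s/\A\s+|\s+\z//g;    # trim leading and trailing whitespace
        $copy->delete;

        push @details, {
            href      => $link->attr('href'),
            html      => $link->as_HTML,    # markup as re-serialized by HTML::Element
            span_text => \@span_text,
            text      => $text,
        };
    }

    $tree->delete;
    return @details;
}

# Sample markup modelled on the question; the href value is invented
my $html = '<a href="/cgi-bin/example"><div><div>'
         . '<span>foo bar</span> I WANT THIS TEXT</div></div></a>';

for my $d ( link_details( $html, qr/cgi-bin/ ) ) {
    print "href: $d->{href}\n";
    print "span: $_\n" for @{ $d->{span_text} };
    print "text: $d->{text}\n";
    print "html: $d->{html}\n";
}
```

To run this against a live page, the string could come from LWP::UserAgent, for example LWP::UserAgent->new->get($uri)->decoded_content, or from $mech->content if WWW::Mechanize is kept. Note that as_HTML returns markup rebuilt from the parse tree, so it may differ in detail from the bytes the server sent.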