I'm extracting special links within an HTML page by using WWW::Mechanize.
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get( $uri );
my @links = $mech->find_all_links( url_regex => qr/cgi-bin/ );
for my $link ( @links ) {
    # try to get everything between <a href="[...]">HERE</a>
}
The links look like this:
<a href="[...]"><div><div><span>foo bar</span> I WANT THIS TEXT</div></div></a>
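Using $link->text I get foo bar I WANT THIS TEXT, with no way of telling which part of it came from the <span> element.
Is there any way to get the raw HTML code instead of the stripped text?
In other words, I need a way to get only I WANT THIS TEXT without knowing the exact text inside the <span> tag.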
Answer 0 (score: 2)
As simbabque has said, you can't do that with WWW::Mechanize.
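In fact, there is very little point in using WWW::Mechanize if you don't want any of its features. If you are using it only to fetch a web page, use LWP::UserAgent instead. WWW::Mechanize is just a subclass of LWP::UserAgent with a lot of additional functionality that you don't need here.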
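For illustration, fetching a page with plain LWP::UserAgent might look roughly like the sketch below. The helper name fetch_html and the 30-second timeout are assumptions made for this example and are not part of LWP or of the original answer.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# fetch_html is a hypothetical helper name chosen for this sketch.
sub fetch_html {
    my ($url) = @_;

    # The timeout value is an arbitrary assumption.
    my $ua  = LWP::UserAgent->new( timeout => 30 );
    my $res = $ua->get($url);

    # Stop with the HTTP status line if the request did not succeed.
    die 'Request failed: ' . $res->status_line . "\n"
        unless $res->is_success;

    # decoded_content undoes any content encoding and decodes the charset.
    return $res->decoded_content;
}
```

It would be called as my $html = fetch_html($uri); with $uri being the same variable as in the question.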
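Here is an example that uses HTML::TreeBuilder to build a parse tree of the HTML and find the links you want. I have used HTML::TreeBuilder because it tolerates malformed HTML in much the same way as modern browsers do.
I have been unable to test it, since you haven't provided proper sample data and I am not inclined to create my own.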
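A minimal sketch of the parse-tree approach follows. It is not the answerer's original code. It assumes the markup from the question, uses a made-up href value (/cgi-bin/example), parses an inline string instead of a fetched page, and treats "the text you want" as everything inside the link except the contents of any <span>.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup from the question; the href value is a placeholder assumption.
my $html = '<a href="/cgi-bin/example"><div><div>'
         . '<span>foo bar</span> I WANT THIS TEXT</div></div></a>';

my $tree = HTML::TreeBuilder->new_from_content($html);

my @wanted;

# look_down returns every element that satisfies all of the criteria:
# here, <a> elements whose href attribute matches /cgi-bin/.
for my $anchor ( $tree->look_down( _tag => 'a', href => qr/cgi-bin/ ) ) {

    # Work on a copy so that the parsed tree itself stays intact.
    my $copy = $anchor->clone;

    # Drop every <span> from the copy, then read the text that is left.
    $_->delete for $copy->look_down( _tag => 'span' );
    push @wanted, $copy->as_trimmed_text;

    $copy->delete;
}

print "$_\n" for @wanted;

$tree->delete;
```

If the raw markup of a link is needed instead of its text, calling as_HTML on the <a> element returns it as a string.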