How do I get all links from a page in the DOM?

Time: 2016-12-02 12:24:45

Tags: python python-3.x beautifulsoup

I am using BeautifulSoup in Python to get all the links:

links = soup.select('.cover > .card-click-target')
print(links)

But it gives me an array containing a single element and a string value.

My HTML code is:

<div class="cover">
  <div class="cover-image-container"> 
    <div class="cover-outer-align"> 
      <div class="cover-inner-align"> 
        <img alt="Kate Mobile Lite" class="cover-image" data-cover-large="" data-cover-small="" src="" aria-hidden="true"> 
      </div>
    </div>
  </div> 
  <a class="card-click-target" href="/s/kate_new_6" aria-label=" Kate Mobile Lite     ">
    <span class="movies preordered-overlay-container id-preordered-overlay-container" style="display:none"> 
      <span class="preordered-label">Предзаказ</span>
    </span> 
    <span class="preview-overlay-container">  </span>  
  </a> 
</div>

<div class="cover"> 
  <div class="cover-image-container">
    <div class="cover-outer-align">
      <div class="cover-inner-align"> 
        <img alt="Kate Mobile Lite" class="cover-image" data-cover-large="" data-cover-small="" src="" aria-hidden="true">
      </div>
    </div> 
  </div> 
  <a class="card-click-target" href="/s/kate_new_6" aria-label=" Kate Mobile Lite     ">
    <span class="movies preordered-overlay-container id-preordered-overlay-container" style="display:none">
      <span class="preordered-label">Предзаказ</span>
    </span>
    <span class="preview-overlay-container"> 
    </span>  
  </a>  
</div>

2 answers:

Answer 0: (score: 1)

I don't fully trust the CSS selectors in BeautifulSoup. A quick search turns up this answer here, where updating BeautifulSoup fixed the problem the poster was having.

I strongly recommend that you write a function to do the job:

links = soup.find_all(lambda tag: tag.parent.get('class') == ['cover']
                      and tag.get('class') == ['card-click-target'])

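The anonymous lambda function finds every tag whose class is card-click-target and checks that its parent has the class cover. Because each class attribute is compared against a one-element list, the match is exact: a tag carrying any additional class will not match.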
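To get the link targets themselves rather than the tag objects, you can read the href attribute of each match. Here is a minimal sketch of that, assuming a trimmed-down copy of the markup from the question and the built-in html.parser backend. The names HTML, is_card_link and hrefs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the markup from the question (two cards, same link).
HTML = """
<div class="cover">
  <div class="cover-image-container"></div>
  <a class="card-click-target" href="/s/kate_new_6" aria-label=" Kate Mobile Lite ">
    <span class="preview-overlay-container"></span>
  </a>
</div>
<div class="cover">
  <a class="card-click-target" href="/s/kate_new_6" aria-label=" Kate Mobile Lite ">
    <span class="preview-overlay-container"></span>
  </a>
</div>
"""


def is_card_link(tag):
    """Match tags whose class is exactly 'card-click-target'
    and whose direct parent's class is exactly 'cover'."""
    parent = tag.parent
    return (
        parent is not None
        and parent.get("class") == ["cover"]
        and tag.get("class") == ["card-click-target"]
    )


soup = BeautifulSoup(HTML, "html.parser")

# find_all accepts a function and keeps every tag for which it returns True.
links = soup.find_all(is_card_link)

# Each match is a Tag; .get("href") reads the attribute (None if absent).
hrefs = [link.get("href") for link in links]
print(hrefs)  # ['/s/kate_new_6', '/s/kate_new_6']
```

The named function does the same thing as the lambda above. It is only spelled out so the parent check is easier to read.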

Answer 1: (score: 0)

Check this example:

    <?xml version="1.0" encoding="UTF-8" ?>
    <Configuration status="TRACE">
        <Properties>
            <Property name="rotateLogsInterval">6</Property>
            <Property name="log.dir">D:\\Mconnect\\LOGGER</Property>
            <Property name="log.INVALIDMNO.dir">D:\\Mconnect\\LOGGER\\INVALIDMNO</Property>
            <Property name="log.MOBINL1000.dir">D:\\Mconnect\\LOGGER\\MOBINL1000</Property>

            <Property name="log.ECONET1000.dir">D:\\Mconnect\\LOGGER\\ECONET1000</Property>
            <Property name="log.AIRTEL1000.dir">D:\\Mconnect\\LOGGER\\AIRTEL1000</Property>
        </Properties>

        <Appenders>
            <Console name="Console" target="SYSTEM_OUT">
                <PatternLayout pattern="%-5p %d [%t] %c: %m%n" />
            </Console>

            <File name="EIGInformation"
                fileName="C:\\EIG_SOURCE_CODE\\EIG_20140901\\logs\\EIGInformation1.log">
                <PatternLayout>
                    <Pattern>%5p | %m%n</Pattern>
                </PatternLayout>
            </File>

            <!-- Debug logger -->
            <RollingRandomAccessFile name="debugLogger"
                fileName="${log.dir}/mconnectDebugLogger.log"
                filePattern="${log.dir}/$${date:yyyy-MM}/mconnectDebugLogger-%d{yyyy-MM-dd-HH}-%i.log.gz">

                <PatternLayout>
                    <Pattern>%5p | %d | %m%n</Pattern>
                </PatternLayout>
                <!-- <DefaultRolloverStrategy> <Delete basePath="${log.dir}" maxDepth="2"> 
                    <IfFileName glob="*/mconnectDebugLogger-*.log.gz" /> <IfLastModified age="60d" 
                    /> </Delete> </DefaultRolloverStrategy> -->
                <Policies>
                    <TimeBasedTriggeringPolicy interval="${rotateLogsInterval}" />
                </Policies>
            </RollingRandomAccessFile>
            <!-- Transaction tdr file -->
            <RollingRandomAccessFile name="transactionDetails"
                fileName="${log.dir}/TDR.log"
                filePattern="${log.dir}/$${date:yyyy-MM}/TDR-%d{yyyy-MM-dd-HH}-%i.log.gz">

                <PatternLayout>
                    <Pattern>%5p | %d | %t:: | %m%n</Pattern>
                </PatternLayout>
                <!-- <DefaultRolloverStrategy> <Delete basePath="${log.dir}" maxDepth="2"> 
                    <IfFileName glob="*/TDR-*.log.gz" /> <IfLastModified age="60d" /> </Delete> 
                    </DefaultRolloverStrategy> -->
                <Policies>
                    <TimeBasedTriggeringPolicy interval="${rotateLogsInterval}" />
                </Policies>
            </RollingRandomAccessFile>

            <!-- Connect Info General log. -->
            <RollingRandomAccessFile name="connectInfoLogGeneral"
                fileName="${log.INVALIDMNO.dir}/connectInfoLogGeneral.log"
                filePattern="${log.INVALIDMNO.dir}/$${date:yyyy-MM}/connectInfoLogGeneral-%d{yyyy-MM-dd-HH}-%i.log.gz">

                <PatternLayout>
                    <Pattern>%5p | %d | %m%n</Pattern>
                </PatternLayout>
                <!-- <DefaultRolloverStrategy> <Delete basePath="${log.INVALIDMNO.dir}" 
                    maxDepth="2"> <IfFileName glob="*/connectInfoLogGeneral-*.log.gz" /> <IfLastModified 
                    age="60d" /> </Delete> </DefaultRolloverStrategy> -->
                <Policies>
                    <TimeBasedTriggeringPolicy interval="${rotateLogsInterval}" />
                </Policies>
            </RollingRandomAccessFile>

            <!-- Connect Process log. -->
            <RollingRandomAccessFile name="connectProcessLogGeneral"
                fileName="${log.INVALIDMNO.dir}/connectProcessLogGeneral.log"
                filePattern="${log.INVALIDMNO.dir}/$${date:yyyy-MM}/connectProcessLogGeneral-%d{yyyy-MM-dd-HH}-%i.log.gz">
                <PatternLayout>
                    <Pattern>%5p | %d | %m%n</Pattern>
                </PatternLayout>
                <!-- <DefaultRolloverStrategy> <Delete basePath="${log.log.INVALIDMNO.dir.dir}" 
                    maxDepth="2"> <IfFileName glob="*/connectProcessLogGeneral-*.log.gz" /> <IfLastModified 
                    age="60d" /> </Delete> </DefaultRolloverStrategy> -->
                <Policies>
                    <TimeBasedTriggeringPolicy interval="${rotateLogsInterval}" />
                </Policies>
            </RollingRandomAccessFile>

            <!-- Connect Info log -->
            <RollingRandomAccessFile name="connectInfoLogMOBINL1000"
                fileName="${log.MOBINL1000.dir}/connectInfoMOBINL1000.log"
                filePattern="${log.MOBINL1000.dir}/$${date:yyyy-MM}/connectInfoMOBINL1000-%d{yyyy-MM-dd-HH}-%i.log.gz">

                <PatternLayout>
                    <Pattern>%5p | %d | %m%n</Pattern>
                </PatternLayout>
                <!-- <DefaultRolloverStrategy> <Delete basePath="${log.MOBINL1000.dir}" 
                    maxDepth="2"> <IfFileName glob="*/connectInfoMOBINL1000-*.log.gz" /> <IfLastModified 
                    age="60d" /> </Delete> </DefaultRolloverStrategy> -->
                <Policies>
                    <TimeBasedTriggeringPolicy interval="${rotateLogsInterval}" />
                </Policies>
            </RollingRandomAccessFile>

            <!-- Connect Process log -->
            <RollingRandomAccessFile name="connectProcessLogMOBINL1000"
                fileName="${log.MOBINL1000.dir}/connectProcessMOBINL1000.log"
                filePattern="${log.MOBINL1000.dir}/$${date:yyyy-MM}/connectProcessMOBINL1000-%d{yyyy-MM-dd-HH}-%i.log.gz">

                <PatternLayout>
                    <Pattern>%5p | %d | %m%n</Pattern>
                </PatternLayout>
                <!-- <DefaultRolloverStrategy> <Delete basePath="${log.MOBINL1000.dir}" 
                    maxDepth="2"> <IfFileName glob="*/connectProcessMOBINL1000-*.log.gz" /> <IfLastModified 
                    age="60d" /> </Delete> </DefaultRolloverStrategy> -->
                <Policies>
                    <TimeBasedTriggeringPolicy interval="${rotateLogsInterval}" />
                </Policies>
            </RollingRandomAccessFile>


        </Appenders>

        <Loggers>

            <!-- CXF is used heavily by Mule for web services -->
            <AsyncLogger name="org.apache.cxf" level="WARN" />

            <!-- Apache Commons tend to make a lot of noise which can clutter the log -->
            <AsyncLogger name="org.apache" level="INFO" />

            <!-- Reduce startup noise -->
            <AsyncLogger name="org.springframework.beans.factory"
                level="WARN" />

            <!-- Mule classes -->
            <AsyncLogger name="org.mule" level="INFO" />
            <AsyncLogger name="com.mulesoft" level="INFO" />


            <AsyncLogger name="EIGInformation" level="INFO">
                <AppenderRef ref="EIGInformation" />
            </AsyncLogger>

            <AsyncLogger
                name="com.comviva.mconnect.webservices.impl.MConnectWebServices"
                level="info">
                <AppenderRef ref="debugLogger" />
            </AsyncLogger>
            <AsyncLogger name="transactionDetails" level="OFF">
                <AppenderRef ref="debugLogger" />
            </AsyncLogger>

            <AsyncLogger name="connectInfoLogGeneral" level="INFO">
                <AppenderRef ref="connectInfoLogGeneral" />
            </AsyncLogger>  
            <AsyncLogger name="connectProcessLogGeneral" level="INFO">
                <AppenderRef ref="connectProcessLogGeneral" />
            </AsyncLogger>

            <AsyncLogger name="connectInfoLogMOBINL1000" level="INFO">
                <AppenderRef ref="connectInfoLogMOBINL1000" />
            </AsyncLogger>
            <AsyncLogger name="connectProcessLogMOBINL1000" level="INFO">
                <AppenderRef ref="connectProcessLogMOBINL1000" />
            </AsyncLogger>

            <AsyncRoot level="INFO">
                <AppenderRef ref="EIGInformation" />
            </AsyncRoot>
        </Loggers>

    </Configuration>



log4j: Using URL [file:/home/contest/prd/muleTomcat/webapps/Connect-1.3.0/WEB-INF/classes/log4j2.xml] for automatic log4j configuration.
log4j: Preferred configurator class: org.apache.log4j.xml.DOMConfigurator
log4j: System property is :null
log4j: Standard DocumentBuilderFactory search succeded.
log4j: DocumentBuilderFactory is: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl

log4j:WARN Continuable parsing error 2 and column 31
log4j:WARN Document root element "Configuration", must match DOCTYPE root "null".
log4j:WARN Document root element "Configuration", must match DOCTYPE root "null".
log4j:WARN Continuable parsing error 2 and column 31
log4j:WARN Document is invalid: no grammar found.log4j:WARN Document is invalid: no grammar found.

log4j:ERROR DOM element is - not a <log4j:configuration> element.
log4j:WARN No appenders could be found for logger (com.mchange.v2.log.MLog).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
[localhost-startStop-1] INFO org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean - Building JPA container EntityManagerFactory for persistence