How do I get text using Scrapy in Python?

Time: 2017-10-11 07:02:32

Tags: python-2.7 scrapy

<div id="eventInfoContainer">
        <table>
            <tbody><tr>
                <td class="verticalTop">
                    <script type="text/javascript"><!--
                    google_ad_client = "ca-pub-2475575566915822";
                    /* listing page */
                    google_ad_slot = "4647770957";
                    google_ad_width = 160;
                    google_ad_height = 600;
                    //-->
                    </script>
                <script type="text/javascript" src="https://pagead2.googlesyndication.com/pagead/show_ads.js">
                </script><ins id="aswift_0_expand" style="display:inline-table;border:none;height:600px;margin:0;padding:0;position:relative;visibility:visible;width:160px;background-color:transparent;"><ins id="aswift_0_anchor" style="display:block;border:none;height:600px;margin:0;padding:0;position:relative;visibility:visible;width:160px;background-color:transparent;"><iframe width="160" height="600" frameborder="0" marginwidth="0" marginheight="0" vspace="0" hspace="0" allowtransparency="true" scrolling="no" allowfullscreen="true" onload="var i=this.id,s=window.google_iframe_oncopy,H=s&amp;&amp;s.handlers,h=H&amp;&amp;H[i],w=this.contentWindow,d;try{d=w.document}catch(e){}if(h&amp;&amp;d&amp;&amp;(!d.body||!d.body.firstChild)){if(h.call){setTimeout(h,0)}else if(h.match){try{h=s.upd(h,i)}catch(e){}w.location.replace(h)}}" id="aswift_0" name="aswift_0" style="left:0;position:absolute;top:0;width:160px;height:600px;"></iframe></ins></ins>
                </td>
                <td class="spacer30w"></td>
                <td class="verticalTop">
            <span id="eventNameHeader">The Future of Medicine, Health Care and Biological Studies</span>
    <br>
    <br>
    <span id="smallerHeading">Conference</span>
    <br>
    <br>
    <span id="eventDate">16th   to 17th October 2017</span>
    <br>
    <span id="eventCountry">Rockville, Maryland, United States of America</span>
    <br>
    <br>
    <span id="eventWebsite">
        <span id="smallerHeading">Website: </span>
        <a href="http://rais.education/the-future-of-medicine-health-care-and-biological-studies/" target="_blank" onclick="trackOutboundLink('http://rais.education/the-future-of-medicine-health-care-and-biological-studies/'); return false;">http://rais.education/the-future-of-medicine-health-care-and-biological-studies/</a>
    </span>
    <br>
    <span id="eventContactPerson"><span id="smallerHeading">Contact person: </span>Eduard David</span>
    <br>
    <br>
    <span id="eventDescription">We gladly invite you to attend the International Conference The Future of Medicine, Health Care and Biological Studies which will be held at Johns Hopkins University, just 20 miles away from Washington DC. </span>
    <br>
    <br>
    <span id="eventOrganiser"><span style="font-weight: bold; color: #696969;">Organized by: </span>Research Association for Interdisciplinary Studies (RAIS)</span>        <br><span id="eventDeadline"><span style="font-weight: bold; color: #696969;">Deadline for abstracts/proposals: </span>21st August 2017</span>        <br>
    <br>
    Check the <a href="http://rais.education/the-future-of-medicine-health-care-and-biological-studies/" target="_blank">event website</a> for more details.
    <br>
    <br>
            <br>
    <br>
    <br>
    <br>
    <table>
        <tbody><tr>
            <td class="verticalMiddle">
                <form><input type="button" value="Back" onclick="history.go(-1); return true;"></form>
            </td>
            <td class="spacer15w"></td>
            <td class="verticalMiddle">
                <a title="Share this conference on Facebook" href="http://www.facebook.com/sharer.php?&#10;&#9;&#9;&#9;&#9;&#9;   s=100&#10;&#9;&#9;&#9;&#9;&#9;   &amp;p[url]=http://www.conferencealerts.com/show-event?id=187457&#9;&#9;&#9;&#9;&#9;   &amp;p[title]=The Future of Medicine, Health Care and Biological Studies&#9;&#9;&#9;&#9;&#9;   &amp;p[summary]=We gladly invite you to attend the International Conference The Future of Medicine, Health Care and Biological Studies which will be held at Johns Hopkins University, just 20 miles away from Washington DC. " target="_blank" class="fb_share_link">Share on Facebook</a>
            </td>
            <td class="spacer15w"></td>
            <td>
                <a href="http://www.google.com/calendar/event?action=TEMPLATE&amp;text=CONFERENCE%3A+6th+The+Future+of+Medicine%2C+Health+Care+and+Biological+Studies&amp;dates=20171016%2F20171017&amp;details=We+gladly+invite+you+to+attend+the+International+Conference+The+Future+of+Medicine%2C+Health+Care+and+Biological+Studies+which+will+be+held+at+Johns+Hopkins+University%2C+just+20+miles+away+from+Washington+DC.+%0D%0AFurther+details%3A+http%3A%2F%2Fwww.conferencealerts.com%2Fshow-event%3Fid%3D187457&amp;location=Rockville%2C+United+States+of+America&amp;trp=false&amp;sprop=http%3A%2F%2Fwww.conferencealerts.com&amp;sprop=name:Conference%20Alerts" target="_blank"><img src="http://www.google.com/calendar/images/ext/gc_button6.gif" border="0" align="left"></a>
            </td>
        </tr>
        <tr><td class="spacer5"></td></tr>
        <tr>
            <td colspan="5">
                <script type="text/javascript"><!--
                    google_ad_client = "ca-pub-2475575566915822";
                    /* show event under content */
                    google_ad_slot = "8943315143";
                    google_ad_width = 300;
                    google_ad_height = 250;
                    //-->
                </script>
                <script type="text/javascript" src="https://pagead2.googlesyndication.com/pagead/show_ads.js">
                </script><ins id="aswift_1_expand" style="display:inline-table;border:none;height:250px;margin:0;padding:0;position:relative;visibility:visible;width:300px;background-color:transparent;"><ins id="aswift_1_anchor" style="display:block;border:none;height:250px;margin:0;padding:0;position:relative;visibility:visible;width:300px;background-color:transparent;"><iframe width="300" height="250" frameborder="0" marginwidth="0" marginheight="0" vspace="0" hspace="0" allowtransparency="true" scrolling="no" allowfullscreen="true" onload="var i=this.id,s=window.google_iframe_oncopy,H=s&amp;&amp;s.handlers,h=H&amp;&amp;H[i],w=this.contentWindow,d;try{d=w.document}catch(e){}if(h&amp;&amp;d&amp;&amp;(!d.body||!d.body.firstChild)){if(h.call){setTimeout(h,0)}else if(h.match){try{h=s.upd(h,i)}catch(e){}w.location.replace(h)}}" id="aswift_1" name="aswift_1" style="left:0;position:absolute;top:0;width:300px;height:250px;"></iframe></ins></ins>
            </td>
        </tr>
    </tbody></table>
    <br>

                </td>
            </tr>
        </tbody></table>
</div>

How can I use Scrapy in Python to get the text "The Future of Medicine, Health Care and Biological Studies" from the HTML above?

I tried this code:

response.css('div.eventInfoContainer table tbody tr td:nth-child(3) span::text').extract()

But the output is just an empty list: []

1 answer:

Answer 0: (score: 1)

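Since the span element containing the desired information has an id attribute (which should be unique within the page), selecting by that id should be enough: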

text = response.css('span#eventNameHeader::text').extract_first()

Edit: with XPath, it is similar:

text = response.xpath('//span[@id="eventNameHeader"]/text()').extract_first()