scrapy shell http://www.zvon.org/comp/r/tut-XPath_1.html
response.css("div.description")
response.xpath('//div[@class="description"]')
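I am new to Scrapy. While writing my own spider, I tried to scrape text from http://www.zvon.org/comp/r/tut-XPath_1.html, including the description text and the right-hand menu text, so that I can build the next-page URL. I spent 5 hours on it but could not write a working CSS selector or XPath, for example an XPath for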
<div class="right_menu_body_item">List of XPaths</div>
and
<div class="description">XPath is described in <a href="http://www.w3.org/TR/xpath" target="_blank" id="cglh" title="XPath 1.0 standard">XPath 1.0 standard</a>.
Can anyone help? Thanks!
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-15189975-1']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(ga);
})();
</script>
<div id="page">
<div id="top"><h1 class="top">XPath 1.0 Tutorial</h1>
</div>
<div id="right" style="width: 230px;"><div style="width:234px; margin-top:10px; height:60px;background:url(http://www.highposition.net/embedded/img/234x60-hpbg.png);color#fff" id="hpban">
<div style="padding:9px;padding-left:78px;font-family:arial;color:#fff;font-size:11px;">For a flurry of SEO tips, tricks, articles and advice - visit <a href="http://www.hpgroup-seo.co.uk" rel="nofollow">HP Group</a>.</div></div>
<div id="right_menu_header">
<div class="right_menu_header_item right_menu_header_item_selected">
Pages
<span id="header_count_Pages" style="font-style:italic; font-weight:normal">(23)</span></div>
<div class="right_menu_header_item">
Keywords
<span id="header_count_Keywords" style="font-style:italic; font-weight:normal">(34)</span></div>
<div id="filter_div">filter: <input name="right_menu_filter" id="right_menu_filter"></div>
<div class="filter_div_comment"><input name="regexpEnabled" id="regexpEnabled" type="checkbox">enable regexp (<a href="/comp/r/zvon.html#Help~Filter">?</a>)</div>
</div>
<div id="right_menu_body"><div class="pn_right_menu_body_ttt"><span class="right_menu_body_first_passive">First</span> - <span class="right_menu_body_prev_passive">Prev</span> - <span class="right_menu_body_next">Next</span></div>
<div id="right_menu_body_head">
1
-
20
<span style="color:red; font-weight:bold">filter: off</span> (23)
</div>
**<div class="right_menu_body_item">List of XPaths</div>**
<div class="right_menu_body_item">XPath as filesystem addressing</div>
<div class="right_menu_body_item">Start with //</div>
<div class="right_menu_body_item">All elements: *</div>
<div class="right_menu_body_item">Further conditions inside []</div>
<div class="right_menu_body_item">Attributes</div>
<div class="right_menu_body_item">Attribute values</div>
<div class="right_menu_body_item">Nodes counting</div>
<div class="right_menu_body_item">Playing with names of selected elements</div>
<div class="right_menu_body_item">Length of string</div>
<div class="right_menu_body_item">Combining XPaths with |</div>
<div class="right_menu_body_item">Child axis</div>
<div class="right_menu_body_item">Descendant axis</div>
<div class="right_menu_body_item">Parent axis</div>
<div class="right_menu_body_item">Ancestor axis</div>
<div class="right_menu_body_item">Following-sibling axis</div>
<div class="right_menu_body_item">Preceding-sibling axis</div>
<div class="right_menu_body_item">Following axis</div>
<div class="right_menu_body_item">Preceding axis</div>
<div class="right_menu_body_item">Descendant-or-self axis</div>
<div class="pn_right_menu_body_bbb"><span class="right_menu_body_first_passive">First</span> - <span class="right_menu_body_prev_passive">Prev</span> - <span class="right_menu_body_next">Next</span></div></div></div>
<div id="left">
<div id="search_div"><div><input id="search_input" name="search_input" value="...loading..."> <a href="http://fusion.google.com/add?source=atgs&moduleurl=http%3A//zvon.org/gadgets/zvon_keywords.xml"><img id="plus_google" src="http://gmodules.com/ig/images/plus_google.gif" style="margin:2px" alt="Add to Google" border="0"></a><div id="search_input_text"></div></div><div id="result_div"></div></div>
<div id="hint_div">
⇒ interactive index to zvon materials
</div>
<div id="category_logo_div">
<table id="category-table">
<tbody><tr>
<td id="category-switch">
<img src="/shared/png/comp.png" height="66" width="70">
</td>
<td id="category-switch-links">
<div class="category-div">
<a href="/" id="switch-comp" class="switch-selected">
comp
<img src="/shared/png/comp_small.png" title="computing resources" style="display: none;" height="15" width="16">
</a>
</div>
<div class="category-div">
<a href="/law" id="switch-law">
law
<img src="/shared/png/law_small.png" title="international law documents" height="15" width="16">
</a>
</div>
<div class="category-div">
<a href="/lib" id="switch-lib">
lib
<img src="/shared/png/lib_small.png" title="resources for librarians" height="15" width="16">
</a>
</div>
<div class="category-div">
<a href="/eco" id="switch-eco">
eco
<img src="/shared/png/eco_small.png" title="eco resources" height="15" width="16">
</a>
</div>
</td>
</tr>
</tbody></table>
</div>
<div id="center" style="width: 500px;">
<div id="noscript" style="display: none;"><div id="noscript_intro">XPath is described in <a href="http://www.w3.org/TR/xpath" target="_blank" id="cglh" title="XPath 1.0 standard">XPath 1.0 standard</a>. In this tutorial selected XPath features are demonstrated on many examples.<br> <br> <div> <b>Standard excerpt:</b> </div> <blockquote class="webkit-indent-blockquote" style="BORDER:none;MARGIN:0 0 0 40px"> <div> XPath is the result of an effort to provide a common syntax and semantics for functionality shared between XSL Transformations and XPointer. The primary purpose of XPath is to address parts of an XML document. In support of this primary purpose, it also provides basic facilities for manipulation of strings, numbers and booleans. XPath uses a compact, non-XML syntax to facilitate use of XPath within URIs and XML attribute values. XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax. XPath gets its name from its use of a path notation as in URLs for navigating through the hierarchical structure of an XML document. </div> </blockquote> <br> Zvon offers other <a href="/comp/m/xpath.html" target="_blank" title="XPath related materials">XPath related materials</a>.<br> <br> <b><br> </b> <div> <b>Prepared by:</b> Miloslav Nic (Mila)<span id="nicmila_details"></span> </div> <br></div></div>
<div id="center_top"></div>
<div id="center_middle"><h1 id="browser_title_line">XPath 1.0 Tutorial</h1><div id="prevNextDiv"><span id="backPageSpanPassive">Back</span>|<span id="forwardPageSpanPassive">Forward</span>||<span id="prevPageSpanPassive">Previous</span>|<span id="nextPageSpan">Next</span></div>**<div class="description">XPath is described in <a href="http://www.w3.org/TR/xpath" target="_blank" id="cglh" title="XPath 1.0 standard">XPath 1.0 standard</a>. In this tutorial selected XPath features are demonstrated on many examples.<br> <br> <div> <b>Standard excerpt:</b> </div> <blockquote class="webkit-indent-blockquote" style="BORDER:none;MARGIN:0 0 0 40px"> <div></div> </blockquote> <br> Zvon offers other <a href="/comp/m/xpath.html" target="_blank" title="XPath related materials">XPath related materials</a> XPath is the result of an effort to provide a common syntax and semantics for functionality shared between XSL Transformations and XPointer. The primary purpose of XPath is to address parts of an XML document. In support of this primary purpose, it also provides basic facilities for manipulation of strings, numbers and booleans. XPath uses a compact, non-XML syntax to facilitate use of XPath within URIs and XML attribute values. XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax. XPath gets its name from its use of a path notation as in URLs for navigating through the hierarchical structure of an XML document. .<br> <br> <b><br> </b>** <div> <b>Prepared by:</b> Miloslav Nic (Mila)<span id="nicmila_details"></span> </div> <br></div><div id="prevNextDivBottom"><span id="prevPageSpanPassive">Previous</span>|<span id="nextPageSpan">Next</span></div></div>
<div id="center_bottom"><h2 class="bottom">XPath 1.0 Tutorial</h2><div id="front_keywords"><i>keywords</i>: <a href="/comp/m/programming.html">programming</a>, <a href="/comp/m/tutorial.html">tutorial</a>, <a href="/comp/m/xml.html">XML</a>, <a href="/comp/m/xpath.html">XPath</a></div> </div>
</div>
<div id="bottom"></div>
<div id="example_div">
<div id="example_menu_div" class="windowMenu">
<span id="close_example_span" class="windowMenuButton">x</span>
<span id="example_title_text" class="windowMenuText"></span>
</div>
<div id="example_body_div"></div>
</div>
</div>
<script type="text/javascript" src="http://www.google.com/jsapi"></script>
<script type="text/javascript">google.load("jquery", "1");</script><script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js" type="text/javascript"></script>
<!--script src="/Javascript/jquery.min.js"></script-->
<script type="text/javascript" src="/Javascript/zvon.js"></script>
<script type="text/javascript">release="20100406"</script>
<script type="text/javascript">indexes={"_gadget": false, "_examples": [], "_display_format": {"Keywords": {"tp": "keyword", "title": ["name"]}, "Pages": {"tp": "page", "title": ["name"]}}, "_indexes": [["Pages", "page"], ["Keywords", "keyword"]], "Pages": ["List of XPaths", "XPath as filesystem addressing", "Start with //", "All elements: *", "Further conditions inside []", "Attributes", "Attribute values", "Nodes counting", "Playing with names of selected elements", "Length of string", "Combining XPaths with |", "Child axis", "Descendant axis", "Parent axis", "Ancestor axis", "Following-sibling axis", "Preceding-sibling axis", "Following axis", "Preceding axis", "Descendant-or-self axis", "Ancestor-or-self axis", "Orthogonal axes", "Numeric operations"], "_matID": "tut-XPath_1", "Keywords": ["", ">", "<", "*", "/", "//", "=", "@", "[]", "absolute path", "ancestor", "attribute", "axis", "ceiling", "child", "contains", "count", "descendant", "div", "division", "floor", "following", "last", "name", "normalize-space", "not", "parent", "preceding", "self", "sibling", "starts-with", "string", "string-length", "|"], "_title": "XPath 1.0 Tutorial"}</script>
<!-- script src="/Javascript/zvon_browser.js"></script>
<script src="/Javascript/zvon_xmlbrowser.js"></script -->
<!--script type="text/javascript">
$.get('http://c.zvon.org/counter/'+encodeURIComponent(window.location));
</script-->
<div id="dynamic_div" style="top: 100.133px; left: 259.5px;">
<div id="dynamic_menu_div" class="windowMenu">
<span id="close_dynamic_span" class="windowMenuButton">x</span>
<span id="dynamic_title_text" class="windowMenuText"></span>
</div>
<div id="inpDiv">
<div id="dynamic_pictogram" style="background-image: url("/shared/png/comp_small.png"); background-repeat: no-repeat;"></div>
<span id="inpStarts">
<input name="inp" value="start" checked="checked" type="radio">
starts with
</span>
<span id="inpContains" class="disabled">
<input name="inp" value="contains" type="radio">
contains
<span id="inpContains3chars"> (at least 3 characters needed)</span>
</span>
</div>
<div id="dynamic_body_div">
</div>
</div>
<img id="key" src="/shared/png/key.png" style="top: 100.133px; left: 119.5px;">
<!-- div id='adsense_right_top'>
<script type="text/javascript"><!- -
google_ad_client = "pub-8853328679404934";
/* refrences_top */
google_ad_slot = "9999918284";
google_ad_width = 234;
google_ad_height = 60;
//-->
<!--/script>
<script type="text/javascript"
src="http://pagead2.googlesyndication.com/pagead/show_ads.js">
</script>
</div -->
Answer 0 (score: 1)
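Scrapy does not render JavaScript, so the HTML it downloads can differ from what your browser shows. Disable JavaScript in your browser and then inspect the page source: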
scrapy shell http://www.zvon.org/comp/r/tut-XPath_1.html
# Disable JavaScript in your browser, then open the downloaded response in it:
view(response)
# Now inspect the body for your fields.
# For example, `response.css("div.description")` becomes:
response.css('div#noscript_intro')
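The same reasoning may apply to the right-hand menu: the `right_menu_body_item` divs in the question look like they are built by JavaScript, so they may be missing from what Scrapy downloads. The posted source does contain an inline `indexes={...}` script whose `"Pages"` list holds the same titles. Below is a minimal sketch of reading that list with only the standard library. It assumes the object literal is valid JSON, as it appears to be in the posted source. The `html` string is a shortened stand-in for `response.text`, and `extract_indexes` is a hypothetical helper name, not part of Scrapy.

```python
import json
import re

# Shortened stand-in for response.text; the real page embeds a much longer object.
html = '''
<script type="text/javascript">release="20100406"</script>
<script type="text/javascript">indexes={"_gadget": false, "Pages": ["List of XPaths", "XPath as filesystem addressing", "Start with //"], "_matID": "tut-XPath_1", "_title": "XPath 1.0 Tutorial"}</script>
'''


def extract_indexes(page_source):
    """Return the object assigned to `indexes` in an inline script, or None.

    Hypothetical helper: assumes the assignment is the last statement in its
    <script> element and that the object literal is valid JSON.
    """
    match = re.search(r'indexes\s*=\s*(\{.*?\})\s*</script>', page_source, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


indexes = extract_indexes(html)
page_titles = indexes["Pages"] if indexes else []
print(page_titles)
```

In the Scrapy shell the same helper could be called as `extract_indexes(response.text)`. If the site ever changes the script so that it is no longer plain JSON, `json.loads` will raise an error and a different parsing approach would be needed.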