XPath for getting information from the next sibling tag with Scrapy

Time: 2014-11-26 00:27:00

Tags: python html xpath scrapy

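I'm trying to get the hang of Scrapy, and right now I'm trying to extract information from an etymology website: http://www.etymonline.com For now, I just want to get the words and their origin descriptions. This is what a typical block of HTML looks like on etymonline: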

<dt><a href="/index.php?term=address&allowed_in_frame=0">address (n.)</a> <a href="http://dictionary.reference.com/search?q=address" class="dictionary" title="Look up address at Dictionary.com"><img src="graphics/dictionary.gif" width="16" height="16" alt="Look up address at Dictionary.com" title="Look up address at Dictionary.com" /></a></dt> <dd>1530s, "dutiful or courteous approach," from <a href="/index.php?term=address&allowed_in_frame=0" class="crossreference">address</a> (v.) and from French <span class="foreign">adresse</span>. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).</dd>

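The word is contained in the <dt> tag, and its description is in the next sibling <dd> tag. To get the list of words on a page such as http://www.etymonline.com/index.php?l=a&p=9&allowed_in_frame=0, one can write word = sel.xpath('//dl/dt/a/text()').extract()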

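Then I tried to loop over this list of words and extract the related information with this line of code: info = selInfo.xpath("//dl/dt[a='"+word[i]+"']/following-sibling::dd"). But it doesn't seem to work. Any ideas?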

3 answers:

Answer 0: (score: 3)

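You are right that the following-sibling axis is the way to reach the <dd> that comes after a <dt>.

However, following-sibling::dd selects all dd elements after the context node, not just the adjacent one. So you need the positional predicate [1] to restrict the XPath to the first one only.

For each dt element matched by //dl/dt, you select following-sibling::dd[1].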
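To make the difference concrete, here is a minimal sketch using lxml on a made-up two-entry list. The HTML snippet and variable names are assumptions for illustration and are not taken from etymonline. Without the [1] predicate the first dt picks up every later dd; with it, only the nearest one is returned.

```python
from lxml import html

# Hypothetical two-entry definition list, for illustration only
snippet = """
<dl>
  <dt><a href="#">alpha</a></dt>
  <dd>first definition</dd>
  <dt><a href="#">beta</a></dt>
  <dd>second definition</dd>
</dl>
"""

root = html.fromstring(snippet)
first_dt = root.xpath('//dl/dt')[0]

# Every dd sibling that comes after the first dt
all_following = first_dt.xpath('following-sibling::dd/text()')

# Only the nearest following dd
nearest = first_dt.xpath('following-sibling::dd[1]/text()')

print(all_following)  # ['first definition', 'second definition']
print(nearest)        # ['first definition']
```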

Here is a sample session using scrapy shell for the term "address":

$ scrapy shell "http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none"
...
2014-11-26 10:34:53+0100 [default] DEBUG: Crawled (200) <GET http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f1396cc6950>
[s]   item       {}
[s]   request    <GET http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none>
[s]   response   <200 http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none>
[s]   settings   <scrapy.settings.Settings object at 0x7f1397399bd0>
[s]   spider     <Spider 'default' at 0x7f13966c05d0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: for dt in response.xpath('//dl/dt'):
   ...:     print "Word:", dt.xpath('string(a)').extract()
   ...:     print "Definition:", dt.xpath('string(following-sibling::dd[1])').extract()
   ...:     print
   ...:
Word: [u'address (n.)']
Definition: [u'1530s, "dutiful or courteous approach," from address (v.) and from French adresse. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).']

Word: [u'addressee (n.)']
Definition: [u'1810; see address (v.) + -ee.']

Word: [u'address (v.)']
Definition: [u'early 14c., "to guide or direct," from Old French adrecier "go straight toward; straighten, set right; point, direct" (13c.), from Vulgar Latin *addirectiare "make straight," from Latin ad "to" (see ad-) + *directiare, from Latin directus "straight, direct" (see direct (v.)). Late 14c. as "to set in order, repair, correct." Meaning "to write as a destination on a written message" is from mid-15c. Meaning "to direct spoken words (to someone)" is from late 15c. Related: Addressed; addressing.']

Word: [u'salutatorian (n.)']
Definition: [u'1841, American English, from salutatory "of the nature of a salutation," here in the specific sense "designating the welcoming address given at a college commencement" (1702) + -ian. The address was originally usually in Latin and given by the second-ranking graduating student.']

...

Word: [u'reverend (adj.)']
Definition: [u'early 15c., "worthy of respect," from Middle French reverend, from Latin reverendus "(he who is) to be respected," gerundive of revereri (see reverence). As a form of address for clergymen, it is attested from late 15c.; earlier reverent (late 14c. in this sense). Abbreviation Rev. is attested from 1721, earlier Revd. (1690s). Very Reverend is used of deans, Right Reverend of bishops, Most Reverend of archbishops.']

Word: [u'nun (n.)']
Definition: [u'Old English nunne "nun, vestal, pagan priestess, woman devoted to religious life under vows," from Late Latin nonna "nun, tutor," originally (along with masc. nonnus) a term of address to elderly persons, perhaps from children\'s speech, reminiscent of nana (compare Sanskrit nona, Persian nana "mother," Greek nanna "aunt," Serbo-Croatian nena "mother," Italian nonna, Welsh nain "grandmother;" see nanny).']


In [2]: 

Answer 1: (score: 2)

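The idea with XPath is not to loop over the extracted list of words, but to loop over the parent nodes and query relative to each of them.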

I don't have scrapy on my Mac at the moment, but the technique should apply just the same, for example:

# I use lxml for loose html string parsing
from lxml import html

s = '''<dt><a href="/index.php?term=address&allowed_in_frame=0">address (n.)</a> <a href="http://dictionary.reference.com/search?q=address" class="dictionary" title="Look up address at Dictionary.com"><img src="graphics/dictionary.gif" width="16" height="16" alt="Look up address at Dictionary.com" title="Look up address at Dictionary.com" /></a></dt>
<dd>1530s, "dutiful or courteous approach," from <a href="/index.php?term=address&allowed_in_frame=0" class="crossreference">address</a> (v.) and from French <span class="foreign">adresse</span>. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).</dd>'''

sel = html.fromstring(s)

# rather than extracting the words straight away, loop over the dt nodes
for node in sel.xpath('//dt'):
    # query relative to the dt node to get the word's text
    print(node.xpath('a/text()'))
    # go back up to the parent and select its dd children; with several
    # entries this matches every dd, so following-sibling::dd[1] is safer
    print(node.xpath('../dd/text()'))

# sample results
['address (n.)']
['1530s, "dutiful or courteous approach," from ', ' (v.) and from French ', '. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).']
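The definition comes back in three pieces because text() selects only the direct text nodes of dd, so the words inside the nested <a> and <span> are dropped. One possible way around this, sketched here with lxml, is the string() function that the scrapy shell session in the first answer uses. The shortened HTML string below is an assumption for illustration.

```python
from lxml import html

# Shortened version of the etymonline block, for illustration only
s = ('<dl><dt><a href="#">address (n.)</a></dt>'
     '<dd>from <a href="#">address</a> (v.) and from French '
     '<span class="foreign">adresse</span>.</dd></dl>')

dt = html.fromstring(s).xpath('//dt')[0]

# text() yields only the direct text nodes of dd, split around child elements
pieces = dt.xpath('following-sibling::dd[1]/text()')

# string() concatenates all descendant text into a single value
whole = dt.xpath('string(following-sibling::dd[1])')

print(pieces)  # ['from ', ' (v.) and from French ', '.']
print(whole)   # from address (v.) and from French adresse.
```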

Hope this helps.

Answer 2: (score: 1)

A solution using following-sibling.

import scrapy


class SingleSpider(scrapy.Spider):
    name = "etym"
    allowed_domains = ["etymonline.com"]
    start_urls = [
        "http://www.etymonline.com/index.php?l=d&allowed_in_frame=0"]

    def parse(self, response):
        for dl in response.xpath('//dl'):
            for dt in dl.xpath('dt'):
                # text of the links inside this dt
                print(dt.xpath('a/text()').extract())
                # text nodes of the first dd that follows this dt
                print(dt.xpath('following-sibling::dd[1]/text()').extract())

Basically:

  • You take the dt elements one by one
  • Print the text contained in the link
  • Move to the next sibling and print the text it contains

Here is an excerpt of the output:

  

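[u'daiquiri (n.)'] [u'type of alcoholic drink, 1920 (first recorded in F. Scott Fitzgerald), from ', u', name of a district or village in eastern Cuba.']

[u'dairy (n.)'] [u'late 13c., "building for making butter and cheese; dairy farm," formed with Anglo-French ', u' affixed to Middle English ', u' (in ', u' "dairymaid"), from Old English ', u' "kneader of bread, housekeeper, female servant" (see ', u' (n.1)). The purely native word was ', u'.']

[u'dais (n.)'] [u'mid-13c., from Anglo-French ', u', Old French ', u' "table, platform," from Latin ', u' "disk-shaped object," also, by medieval times, "table," from Greek ', u' "quoit, disk, dish" (see ', u' (n.)). Died out in English c.1600, preserved in Scotland, revived 19c. by antiquarians.']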
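If the goal is structured data rather than printed lists, the same dt-to-dd pairing can be collected into records. Below is a minimal sketch with lxml. The function name extract_entries and the sample markup are hypothetical, and normalize-space() is used here (instead of text()) so that text inside nested tags is kept and whitespace is tidied. The same relative XPath expressions should carry over to a Scrapy selector inside parse(), though that adaptation is not shown.

```python
from lxml import html


def extract_entries(page_source):
    """Return a list of {'word', 'definition'} dicts from <dl> markup."""
    tree = html.fromstring(page_source)
    entries = []
    for dt in tree.xpath('//dl/dt'):
        entries.append({
            # string value of the first link in this dt, whitespace-normalized
            'word': dt.xpath('normalize-space(a[1])'),
            # string value of the nearest following dd, nested tags included
            'definition': dt.xpath('normalize-space(following-sibling::dd[1])'),
        })
    return entries


# Hypothetical sample markup, not copied from the site
sample = """
<dl>
  <dt><a href="#">alpha (n.)</a> <a href="#">lookup</a></dt>
  <dd>first <span>nested</span> definition</dd>
  <dt><a href="#">beta (v.)</a></dt>
  <dd>second definition</dd>
</dl>
"""

entries = extract_entries(sample)
for entry in entries:
    print(entry)
```

Each record pairs a word with the single dd that immediately follows its dt, so entries with nested links or spans keep their full definition text.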