Extracting text from a CSS node with Scrapy

Time: 2018-08-12 04:20:52

Tags: python css scrapy

I am trying to scrape the catalog number from this page:

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

url = 'http://www.enciclovida.mx/busquedas/resultados?utf8=%E2%9C%93&busqueda=basica&id=&nombre=astomiopsis+exserta&button='

response = HtmlResponse(url=url)

using this CSS selector (which works in R with rvest::html_nodes):

".result-nombre-container > h5:nth-child(2) > a:nth-child(1)"

I want to retrieve the catalog ID, which in this case should be:

6011038

If this can be done more easily with XPath, that would be fine too.

3 answers:

Answer 0: (score: 1)

I can't say much about the Scrapy side here, but I tested this XPath and it returns the href:

//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href

If you are having too much trouble with Scrapy and its CSS selector syntax, I would also suggest trying the BeautifulSoup Python package. With BeautifulSoup you can do something like:

link.get('href')
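
For context, a rough sketch of where `link` could come from, assuming the `bs4` package is installed. The sample markup, the href value, and the simplified selector are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
SAMPLE_HTML = """
<div class="result-nombre-container">
  <h5>Astomiopsis exserta</h5>
  <h5><a href="/especies/6011038">Astomiopsis exserta</a></h5>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select_one returns the first matching element, or None if nothing matches.
link = soup.select_one("div.result-nombre-container h5 a")

href = link.get("href") if link is not None else None
print(href)
```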

Answer 1: (score: 1)

If you need to parse the ID out of the href:

catalog_id = response.xpath("//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href").re_first(r'(\d+)$')
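
The pattern `(\d+)$` captures the run of digits at the very end of the href, and `re_first` returns the first such capture as a string. The same regex can be tried in isolation with the standard `re` module; the href value below is an assumed example of what the attribute might contain.

```python
import re

# Assumed shape of the href; the real attribute value may differ.
href = "/especies/6011038"

# (\d+)$ matches one or more digits anchored to the end of the string.
match = re.search(r"(\d+)$", href)
catalog_id = match.group(1) if match else None
print(catalog_id)

# A string that does not end in digits yields no match.
no_match = re.search(r"(\d+)$", "/especies/")
print(no_match)
```

The result is a string, so it needs an explicit `int(catalog_id)` if a number is required.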

Answer 2: (score: 0)

There seems to be only one link in the h5 element, so the selector can be kept short.
