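I am trying to scrape only the text from the body using Python Scrapy, but haven't had any luck yet.
I'm hoping someone can help me scrape all the text inside the <body> tag.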
Answer 0 (score: 4)
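Scrapy uses XPath notation to extract parts of an HTML document. So, have you tried just using the /html/body path to extract <body> (assuming it is nested in <html>)? It might be even simpler to use the //body selector: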
x.select("//body").extract() # extract body
You can find more information about the selectors Scrapy provides in its documentation.
Answer 1 (score: 2)
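It would be nice to get output like that produced by lynx -nolist -dump, which renders the page and then dumps the visible text. I have gotten close by extracting the text of all children of the paragraph elements.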
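I started with //body//text(), which pulled every text node in the body, but that included the contents of script elements. //body//p gets all the paragraph elements in the body, including the implied paragraph tag around untagged text. Extracting the text with //body//p/text() misses text inside sub-tags (such as bold, italic, span, div). //body//p//text() seems to get most of the desired content, as long as the page has no script tags embedded in paragraphs.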
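To illustrate the script-text problem described above outside of Scrapy, here is a minimal sketch that uses only Python's standard-library html.parser. The names BodyTextExtractor and body_text are hypothetical, chosen for this example; it is not the approach used in the answer, just one way to collect body text while skipping script and style content.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Hypothetical helper: collect text inside <body>, skipping <script>/<style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.in_body = False   # True while we are between <body> and </body>
        self.skip_depth = 0    # > 0 while inside a skipped element
        self.chunks = []       # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.in_body and not self.skip_depth and text:
            self.chunks.append(text)


def body_text(html):
    """Return the body text of an HTML string, joined with single spaces."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><title>T</title></head><body>"
    "<p>Hello <b>world</b></p>"
    "<script>var x = 1;</script>"
    "<div>Bye</div>"
    "</body></html>"
)
print(body_text(sample))
```

Unlike //body//p//text(), this sketch also keeps text that sits outside paragraph elements, so it may return more navigation and footer text than you want.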
/ denotes a direct child, while // includes all descendants.
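The difference can be seen in a small sketch using the standard-library xml.etree.ElementTree on a well-formed snippet. This is an assumption-laden stand-in for Scrapy's selectors: ElementTree supports only a subset of XPath and has no text() step, so direct text nodes are emulated with .text and .tail, and all descendant text with itertext().

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed document made up for this example.
doc = ET.fromstring(
    "<html><body>"
    "<p>top</p>"
    "<div><p>nested</p></div>"
    "</body></html>"
)
body = doc.find("body")

# "./p" matches only direct children of body (like body/p).
direct_children = [p.text for p in body.findall("./p")]
# ".//p" matches p elements at any depth below body (like body//p).
all_descendants = [p.text for p in body.findall(".//p")]

# Emulating p/text() versus p//text() on a paragraph with a sub-tag.
para = ET.fromstring("<p>Hello <b>bold</b> world</p>")
# Direct text nodes: the paragraph's own text plus the tail after each child.
direct_text = [t for t in [para.text] + [child.tail for child in para] if t]
# All text nodes, including those inside sub-tags.
all_text = list(para.itertext())

print(direct_children, all_descendants)
print(direct_text, all_text)
```

The direct-text list drops the word inside the bold tag, which is the same loss described for //body//p/text().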
% scrapy shell
In[1]: fetch('http://stackoverflow.com/questions/5390133/scrapy-body-text-only')
In[2]: hxs.select('//body//p//text()').extract()
Out[2]:
[u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet.",
u'Wishing some scholars might be able to help me here scraping all the text from the ',
u'<body>',
u' tag.',
u'Thank you in advance for your time.',
u'Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the ',
u'/html/body',
u' path to extract ',
u'<body>',
u"? (assuming it's nested in ",
u'<html>',
u'). It might be even simpler to use the ',
u'//body',
u' selector:',
u'You can find more information about the selectors Scrapy provides ',
u'here',
Joining the strings together with a space gives you pretty good output:
In [43]: ' '.join(hxs.select("//body//p//text()").extract())
Out[43]: u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet. Wishing some scholars might be able to help me here scraping all the text from the <body> tag. Thank you in advance for your time. Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the /html/body path to extract <body> ? (assuming it's nested in <html> ). It might be even simpler to use the //body selector: You can find more information about the selectors Scrapy provides here . This is a collaboratively edited question and answer site for professional and enthusiast programmers . It's 100% free, no registration required. about \xbb \xa0\xa0\xa0 faq \xbb \r\n tagged asked 1 year ago viewed 280 times active 1 year ago"
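The joined output above still contains non-breaking spaces (\xa0) and a stray \r\n. A minimal sketch for collapsing them, assuming Python 3 and a hypothetical helper named normalize_whitespace, could look like this:

```python
def normalize_whitespace(fragments):
    """Join text fragments and collapse every run of whitespace to one space.

    str.split() with no arguments splits on any Unicode whitespace,
    which in Python 3 includes the non-breaking space and \r\n.
    """
    joined = " ".join(fragments)
    return " ".join(joined.split())


# Fragments made up for this example, modeled on the tail of the output above.
fragments = ["about \xbb", "\xa0\xa0\xa0", " faq \xbb \r\n tagged"]
print(normalize_whitespace(fragments))
```

This only tidies whitespace; it does not remove the space that ends up before punctuation when a fragment boundary falls there.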