Are there any web crawlers suited to parsing many unstructured websites (news, articles) and extracting the main content block from them without previously defined rules?
I mean that when I parse a news feed, I want to extract the main content block from each article to do some NLP work. I have a lot of websites, and it would take forever to inspect each site's DOM and write rules for every one of them.
I tried using Scrapy to get all the text in the body, without tags and scripts, but it includes a lot of irrelevant content such as menu items and ad blocks:
site_body = selector.xpath('//body').extract_first()
Doing NLP over this kind of content will not be very precise.
So are there any other tools or approaches for such tasks?
Answer 0 (score: 0)
I tried to solve this problem with pattern matching. You annotate the source of the web page itself and use it as a matching template, so you do not need to write special rules.
For example, if you look at the source of this page, you will see:
<td class="postcell">
<div>
<div class="post-text" itemprop="text">
<p>Are there any web-crawlers adapted for parsing many unstructured websites (news, articles) and extracting a main block of content from them without previously defined rules?</p>
Then you remove the text and add {.} to mark the spot as relevant, getting:
<td class="postcell">
<div>
<div class="post-text" itemprop="text">
{.}
(Normally you would also need the closing tags, but for a single element they are not necessary.)
Then you pass it as a pattern to Xidel (SO appears to block the default user agent, so it needs to be changed):
xidel 'http://stackoverflow.com/questions/36066030/web-crawler-for-unstructured-data' --user-agent "Mozilla/5.0 (compatible; Xidel)" -e '<td class="postcell"><div><div class="post-text" itemprop="text">{.}'
and it outputs your text:
Are there any web-crawlers adapted for parsing many unstructured websites (news, articles) and extracting a main block of content from them without previously defined rules?
I mean when I'm parsing a news feed, I want to extract the main content block from each article to do some NLP stuff. I have a lot of websites and it will take forever to look into their DOM model and write rules for each of them.
I was trying to use Scrapy and get all text without tags and scripts, placed in a body, but it include a lot of un-relevant stuff, like menu items, ad blocks, etc.
site_body = selector.xpath('//body').extract_first()
But doing NLP over such kind of content will not be very precise.
So is there any other tools or approaches for doing such tasks?
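If you prefer staying in Python, the same extraction can be sketched with the standard library's html.parser, assuming the target div keeps its post-text class. This is a minimal sketch, not a substitute for Xidel's pattern engine; void tags such as a bare &lt;br&gt; would throw off the depth counter:

```python
# Sketch: extract text from the first <div class="post-text"> element
# using only the standard library (assumes the page keeps that class).
from html.parser import HTMLParser

class PostTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside the target div (0 = outside)
        self.chunks = []  # collected text fragments

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1  # any tag opened inside the target div
        elif tag == 'div' and dict(attrs).get('class') == 'post-text':
            self.depth = 1   # entering the target div

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def extract_post_text(page_source):
    parser = PostTextExtractor()
    parser.feed(page_source)
    return ''.join(parser.chunks).strip()
```

For production use, an HTML-aware library (lxml, BeautifulSoup) with the XPath `//div[@class="post-text"]` would be more robust than hand-counting tag depth.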
Answer 1 (score: 0)
You can use BeautifulSoup's get_text() in your parse() callback:
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(response.body, 'html.parser')
yield {'body': soup.get_text() }
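The Comment import above suggests one more cleanup step: get_text() also returns the text of HTML comments, so stripping them first keeps them out of the NLP input. A minimal sketch (the function name visible_text is illustrative):

```python
from bs4 import BeautifulSoup, Comment

def visible_text(html_doc):
    soup = BeautifulSoup(html_doc, 'html.parser')
    # Remove HTML comments, which get_text() would otherwise include
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return soup.get_text()
```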
You can also manually remove things you don't want (and if you find that certain tags, such as <H1>
or <b>
, carry useful signal, you can keep them):
# Remove invisible tags
for i in soup.findAll(lambda tag: tag.name in ['script', 'link', 'meta']):
    i.extract()
You can do something similar to whitelist certain tags instead.
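A whitelist version might look like the following sketch. The tag set is an assumption to adjust per site, and note that a whitelisted tag nested inside another whitelisted tag will have its text collected twice:

```python
from bs4 import BeautifulSoup

# Assumed whitelist of content-bearing tags; tune this per site.
KEEP_TAGS = ['h1', 'h2', 'p', 'b']

def whitelisted_text(html_doc):
    soup = BeautifulSoup(html_doc, 'html.parser')
    # Collect text only from whitelisted tags, skipping menus, ads, etc.
    parts = [tag.get_text(' ', strip=True) for tag in soup.find_all(KEEP_TAGS)]
    return '\n'.join(p for p in parts if p)
```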