Python Goose fails to extract mashable / usatoday / politicalwire articles

Time: 2014-01-28 05:51:30

Tags: python text-extraction goose

I am using the Python Goose extractor, and it fails on every article from mashable.com and usatoday.com. Can someone suggest a fix?

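For a usatoday.com article:

from goose import Goose

g = Goose()
article = g.extract(url='http://www.usatoday.com/story/tech/columnist/talkingtech/2014/01/25/namm-2014---ik-multimedias-rings-to-make-music/4863193/')
# The assertion passes, i.e. no text was extracted (this is the problem)
assert article.cleaned_text == ''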

For a mashable.com article:

g = Goose()
article = g.extract(url='http://mashable.com/2014/01/26/square-cofounder-jim-mckelvey/')
assert article.cleaned_text == ''

For a politicalwire.com article:

g = Goose()
article = g.extract(url='http://politicalwire.com/archives/2014/01/27/some_republicans_go_off_script_in_sotu_response.html')
assert article.cleaned_text == ''

I think these are very important sites for text extraction. Can someone suggest a fix? Thanks.

1 answer:

Answer 0: (score: 2)

The latest version of Goose from here can extract articles from usatoday.com and mashable.com.
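To check whether an upgrade actually helps, the three URLs from the question can be re-run in one pass. The following is only a minimal sketch: `find_empty_extractions` and `goose_text` are hypothetical helper names, not part of Goose, and it assumes the `goose` package is installed and exposes `Goose().extract(url=...)` with a `cleaned_text` attribute, as used in the question.

```python
def find_empty_extractions(urls, extract_text):
    """Return the URLs for which extract_text(url) yields no usable text."""
    failed = []
    for url in urls:
        text = extract_text(url)
        # Treat None, '' and whitespace-only results as failed extractions
        if not text or not text.strip():
            failed.append(url)
    return failed


def goose_text(url):
    # Assumption: the `goose` package from the question is installed.
    # Imported lazily so find_empty_extractions can be used without it.
    from goose import Goose
    return Goose().extract(url=url).cleaned_text


# The three article URLs from the question
URLS = [
    'http://www.usatoday.com/story/tech/columnist/talkingtech/2014/01/25/namm-2014---ik-multimedias-rings-to-make-music/4863193/',
    'http://mashable.com/2014/01/26/square-cofounder-jim-mckelvey/',
    'http://politicalwire.com/archives/2014/01/27/some_republicans_go_off_script_in_sotu_response.html',
]
```

Calling `find_empty_extractions(URLS, goose_text)` after upgrading returns the URLs that still produce empty text. Note that the answer only mentions usatoday.com and mashable.com, so the politicalwire.com URL may still show up in that list.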