Unable to parse New York Times articles using boilerpipe

Time: 2015-02-19 12:29:41

Tags: java rss boilerpipe

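I am trying to fetch a news article from a New York Times URL, but I get no output, whereas any other newspaper I try does produce output. I would like to know whether something is wrong with my code or whether boilerpipe simply cannot fetch it. Also, the output is sometimes not in English, meaning it is displayed as Unicode, mostly for "Daily News", and I would like to know the reason for that as well.

import java.io.InputStream;
import java.net.URL;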

import org.xml.sax.InputSource;

import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.extractors.DefaultExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

class ExtractData
{
    public static void main(final String[] args) throws Exception 
    {
        URL url;
        url = new URL(
                "http://www.nytimes.com/2013/03/02/nyregion/us-judges-offer-addicts-a-way-to-avoid-prison.html?hp&_r=0");

        // NOTE We ignore HTTP-based character encoding in this demo...
        final InputStream urlStream = url.openStream();
        final InputSource is = new InputSource(urlStream);
        final BoilerpipeSAXInput in = new BoilerpipeSAXInput(is);
        final TextDocument doc = in.getTextDocument();
        urlStream.close();

        // You have the choice between different Extractors

        //System.out.println(DefaultExtractor.INSTANCE.getText(doc));
        System.out.println(ArticleExtractor.INSTANCE.getText(doc));
    }
}

1 Answer:

Answer 0 (score: 1)

Nytimes.com has a paywall, and it responds to your request with HTTP 303. You could try to handle the redirect and cookies. Trying a different user agent string might also work.
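
As a rough sketch of that suggestion (not a verified fix for nytimes.com), the code below follows redirects by hand with `HttpURLConnection`, carries cookies from one hop to the next, and sends a custom user agent. The class name `RedirectingFetcher`, the user agent string, and the redirect limit are assumptions made for illustration. The cookie handling is deliberately naive: it keeps only the `name=value` part and ignores attributes such as domain, path, and expiry.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RedirectingFetcher
{
    // Hypothetical values, chosen only for this example.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleFetcher/1.0)";
    static final int MAX_REDIRECTS = 10;

    // Keeps the name=value part of each Set-Cookie header; attributes are ignored.
    static void storeCookies(final List<String> setCookieHeaders, final Map<String, String> jar)
    {
        if (setCookieHeaders == null) return;
        for (final String header : setCookieHeaders) {
            final String pair = header.split(";", 2)[0].trim();
            final int eq = pair.indexOf('=');
            if (eq > 0) jar.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
    }

    // Joins the stored cookies into one Cookie request header value.
    static String cookieHeader(final Map<String, String> jar)
    {
        final StringBuilder sb = new StringBuilder();
        for (final Map.Entry<String, String> e : jar.entrySet()) {
            if (sb.length() > 0) sb.append("; ");
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    static boolean isRedirect(final int status)
    {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    // Opens the address, following redirects manually so cookies survive each hop.
    static InputStream open(final String address) throws IOException
    {
        final Map<String, String> jar = new LinkedHashMap<>();
        URL url = new URL(address);
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            final HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setInstanceFollowRedirects(false);
            conn.setRequestProperty("User-Agent", USER_AGENT);
            if (!jar.isEmpty()) conn.setRequestProperty("Cookie", cookieHeader(jar));

            final int status = conn.getResponseCode();
            // Header names are compared case-insensitively; the status line has a null key.
            for (final Map.Entry<String, List<String>> h : conn.getHeaderFields().entrySet()) {
                if ("Set-Cookie".equalsIgnoreCase(h.getKey())) storeCookies(h.getValue(), jar);
            }
            if (!isRedirect(status)) return conn.getInputStream();

            final String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) throw new IOException("Redirect without a Location header");
            url = new URL(url, location); // resolves relative Location values
        }
        throw new IOException("Too many redirects");
    }
}
```

In the question's code, the stream returned by `RedirectingFetcher.open(...)` could replace `url.openStream()` and be passed to `new InputSource(...)` as before. Whether the site then serves the full article still depends on its paywall rules.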