Browser and web scraper get different sources from the same link?

Time: 2014-07-09 10:35:34

Tags: scala xml-parsing xhtml web-scraping

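I am currently trying to scrape some data from a few web pages using Scala and Eclipse. My problem is this: when I view the page source in my browser, the content looks easy to read with Scala's XML package: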

<!doctype html>
  <html lang="de">
  <head>
  <meta charset="utf-8">
<title>some text</title>

<meta name="keywords" content="some text" />
<meta name="description" content="some text" />
<meta name="robots" content="noodp"/>
<meta name="page-topic" content="some text" />

<meta http-equiv="x-ua-compatible" content="ie=edge"/>
...

But when my small program requests the page through the same link to read its content, it receives the following instead:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de">
<head>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1" />
  <title>some text</title>
  <link rel="shortcut icon" type="image/ico" href="/favicon.ico" />
  <link href="/res/im.min.css" media="all, handheld" rel="stylesheet" type="text/css" />
</head>
...

and I get the following error:

Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: http://www.wapforum.org/DTD/xhtml-mobile10.dtd
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1625)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:633)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1271)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1238)
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:260)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1153)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1049)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:962)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:607)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:489)
    ...

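So why does my program only get a different (mobile?) version of the page rather than the one I can view in the browser? And why do I get this error message?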
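
A possible explanation, offered as a sketch rather than a confirmed diagnosis: some servers choose between desktop and mobile markup based on the User-Agent request header, and a plain Java URL connection does not identify itself as a desktop browser. The 403 in the trace is raised while the XML parser tries to download the DTD named in the DOCTYPE, not while fetching the page itself. The Scala sketch below assumes the scala-xml module is on the classpath. The names ScrapeSketch, lenientParser, parse and fetch are hypothetical, and whether this particular site honours the User-Agent header is an assumption.

```scala
import java.io.InputStream
import java.net.{HttpURLConnection, URI}
import javax.xml.parsers.{SAXParser, SAXParserFactory}
import scala.xml.{Elem, XML}

// Hypothetical helper object; not taken from the original program.
object ScrapeSketch {

  // Build a non-validating SAX parser that does not download the
  // external DTD referenced in the DOCTYPE declaration.
  def lenientParser(): SAXParser = {
    val factory = SAXParserFactory.newInstance()
    factory.setValidating(false)
    factory.setFeature(
      "http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
    factory.newSAXParser()
  }

  // Parse an already opened stream with the lenient parser.
  def parse(in: InputStream): Elem =
    XML.withSAXParser(lenientParser()).load(in)

  // Request the page with an explicit User-Agent header (assumption:
  // the server picks desktop or mobile markup based on this header).
  def fetch(url: String, userAgent: String): Elem = {
    val conn = URI.create(url).toURL.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestProperty("User-Agent", userAgent)
    val in = conn.getInputStream
    try parse(in)
    finally in.close()
  }
}
```

Calling ScrapeSketch.fetch with the page URL and the User-Agent string of the browser in use would show whether the server then returns the desktop markup. Two caveats: with the DTD skipped, named entities such as &nbsp; that the DTD would have defined can no longer be resolved, and the desktop source shown above contains an unclosed meta tag, so it is not well-formed XML and may need an HTML parser instead.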

Thanks

0 answers:

No answers