Question

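I use YQL to fetch some HTML pages and read information from them. Since today, I get the response "html table is no longer supported. See https://policies.yahoo.com/us/en/yahoo/terms/product-atos/yql/index.htm for YQL Terms of Use".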

Example in the console: https://developer.yahoo.com/yql/console/#h=select+*+from+html+where+url%3D%22http%3A%2F%2Fwww.google.de%22

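Did Yahoo discontinue this service? Does anyone know of an announcement from Yahoo? I am wondering whether this is just a bug or whether they really stopped the service.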

All the documentation (HTML scraping) is still there: https://developer.yahoo.com/yql/guide/yql-select-xpath.html, https://developer.yahoo.com/yql/

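Some time ago I posted in Yahoo's YQL forum, but it no longer exists (or at least I cannot find it). How can you contact Yahoo to find out whether this service has really been discontinued?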

Best regards, hebr3

Answer 1

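It looks like Yahoo did indeed end support for the html library as of June 8, 2017 (according to my error logs). There does not appear to be any official announcement.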

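Luckily, there is a YQL community library that can be used in place of the official html library with only a few changes to your codebase. See the htmlstring table in the YQL Console.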

Change your YQL query to reference htmlstring instead of html, and include the community environment in your REST query. For example:

/*/ Old code /*/

var site = "http://www.test.com/foo.html";

var yql = "select * from html where url='" + site + "' AND xpath='//div'";

var resturl = "https://query.yahooapis.com/v1/public/yql?q="
    + encodeURIComponent(yql) + "&format=json";

/*/ New code /*/

var site = "http://www.test.com/foo.html";

var yql = "select * from htmlstring where url='" + site + "' AND xpath='//div'";

var resturl = "https://query.yahooapis.com/v1/public/yql?q="
    + encodeURIComponent(yql) + "&format=json"
    + "&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys";
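The same change can be wrapped in a small helper so that the table name and the community environment live in one place. This is only a sketch: `buildYqlUrl` and its parameters are hypothetical names, not part of any Yahoo API, and it assumes the endpoint and `env` value used in the code above.

```javascript
// Hypothetical helper: builds the REST URL for a query against the
// community htmlstring table. The endpoint and env value come from the
// answer above; the function name and parameters are illustrative only.
function buildYqlUrl(site, xpath) {
    // Single quotes delimit the string literals in this query,
    // so reject values that contain them.
    if (site.indexOf("'") !== -1 || xpath.indexOf("'") !== -1) {
        throw new Error("site and xpath must not contain single quotes");
    }
    var yql = "select * from htmlstring where url='" + site
        + "' AND xpath='" + xpath + "'";
    return "https://query.yahooapis.com/v1/public/yql?q="
        + encodeURIComponent(yql)
        + "&format=json"
        + "&env=" + encodeURIComponent("store://datatables.org/alltableswithkeys");
}

var resturl = buildYqlUrl("http://www.test.com/foo.html", "//div");
```

Encoding the `env` value with `encodeURIComponent` produces the same `store%3A%2F%2F...` string that is hard-coded above, so the two forms should be interchangeable.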

Answer 2

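Thank you very much for your code.

It helped me create my own script to read the pages I need. I had never written PHP before, but with your code and the wisdom of the internet I was able to adapt the script to my needs.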

PHP

<?php
    header('Access-Control-Allow-Origin: *'); // allow all origins

    // Only proxy pages from this site; compare against the full prefix length
    $allowedPrefix = "https://www.xxxx.yy";
    $url = isset($_GET['url']) ? $_GET['url'] : '';
    if (strncmp($url, $allowedPrefix, strlen($allowedPrefix)) !== 0) {
        echo "Only https://www.xxxx.yy allowed!";
        return;
    }
    $xpathQuery = isset($_GET['xpath']) ? $_GET['xpath'] : '';

    // Stricter security checks are still needed; this is only a basic one
    function check($target_url) {
        $check = curl_init();
        //curl_setopt($check, CURLOPT_HTTPHEADER, array("REMOTE_ADDR: $ip", "HTTP_X_FORWARDED_FOR: $ip"));
        //curl_setopt($check, CURLOPT_INTERFACE, "xxx.xxx.xxx.xxx");
        curl_setopt($check, CURLOPT_COOKIEJAR, 'cookiemon.txt');
        curl_setopt($check, CURLOPT_COOKIEFILE, 'cookiemon.txt');
        curl_setopt($check, CURLOPT_TIMEOUT, 40000); // note: this option is measured in seconds
        curl_setopt($check, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($check, CURLOPT_URL, $target_url);
        curl_setopt($check, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
        curl_setopt($check, CURLOPT_FOLLOWLOCATION, false);
        $tmp = curl_exec($check);
        curl_close($check);
        return $tmp;
    }

    // get html
    $html = check($url);
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ suppresses warnings about malformed markup

    // apply xpath filter
    $xpath = new DOMXPath($dom);
    $elements = $xpath->query($xpathQuery);
    $temp_dom = new DOMDocument();
    if ($elements !== false) { // query() returns false for an invalid expression
        foreach ($elements as $n) {
            $temp_dom->appendChild($temp_dom->importNode($n, true));
        }
    }
    $renderedHtml = $temp_dom->saveHTML();

    // return html in json response
    // json structure:
    // {html: "xxxx"}
    $post_data = array(
        'html' => $renderedHtml
    );
    echo json_encode($post_data);

?>

JavaScript

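$.ajax({
    url: "url of service",   // address of the PHP script above
    dataType: "json",
    data: {
        url: url,            // page to fetch; must start with the allowed prefix
        xpath: "//*"
    },
    type: 'GET',
    success: function(data) {
        // data.html contains the HTML matched by the XPath query
    },
    error: function(jqXHR, textStatus, errorThrown) {
        // handle the failed request here
    }
});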
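The PHP script returns a JSON object of the form `{html: "xxxx"}`, so the client has to encode the two query parameters and then pull the `html` field out of the response. A rough sketch of both steps without jQuery follows; `buildServiceUrl` and `extractHtml` are made-up names, and `https://example.com/proxy.php` is a placeholder for wherever the script is hosted.

```javascript
// Hypothetical helpers for talking to the PHP service above.
function buildServiceUrl(serviceUrl, targetUrl, xpath) {
    // URLSearchParams percent-encodes both values
    var params = new URLSearchParams({ url: targetUrl, xpath: xpath });
    return serviceUrl + "?" + params.toString();
}

function extractHtml(responseText) {
    var data = JSON.parse(responseText);
    if (!data || typeof data.html !== "string") {
        throw new Error("Unexpected response: missing 'html' field");
    }
    return data.html;
}

// Placeholder service address and target page, for illustration only
var requestUrl = buildServiceUrl(
    "https://example.com/proxy.php",
    "https://www.xxxx.yy/page",
    "//h1"
);
var html = extractHtml('{"html":"<h1>Hello</h1>"}');
```

Checking the type of `data.html` before using it is a cheap way to notice when the service returned its plain-text "Only ... allowed!" message or some other unexpected body instead of JSON.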

Answer 3

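Even though YQL no longer supports the html table, I have realized that instead of making one network call and parsing the result, you can make several calls. For example, my call used to look like this: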

select html from rss where url="http://w1.weather.gov/xml/current_obs/KFLL.rss"

which should give me the following information

Now I have to use these two:

select title from rss where url="http://w1.weather.gov/xml/current_obs/KFLL.rss"

select description from rss where url="http://w1.weather.gov/xml/current_obs/KFLL.rss"

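...to get what I want. I don't know why they would deprecate something like this without a clearly stated fallback, but you should be able to get your data this way.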
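If several fields are needed, the per-field statements can be generated instead of written out by hand. The sketch below builds one YQL statement per field for the `rss` table and turns each into a request URL; `buildFieldQueries` is a hypothetical name, and the endpoint is assumed to be the public one used in the first answer.

```javascript
// Hypothetical helper: one "select <field> from rss" statement per field.
function buildFieldQueries(feedUrl, fields) {
    return fields.map(function (field) {
        return 'select ' + field + ' from rss where url="' + feedUrl + '"';
    });
}

var queries = buildFieldQueries(
    "http://w1.weather.gov/xml/current_obs/KFLL.rss",
    ["title", "description"]
);

// Each statement is URL-encoded and would be sent as its own request.
var restUrls = queries.map(function (q) {
    return "https://query.yahooapis.com/v1/public/yql?q="
        + encodeURIComponent(q) + "&format=json";
});
```

The trade-off is one network round trip per field, so this only makes sense for a small number of fields.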

Answer 4

I built an open source tool called CloudQuery (source code) that provides functionality similar to YQL. It can turn most websites into an API with just a few clicks.

YQL: html table is no longer supported
