WebHarvest - Scraping data from sites that require authentication

Time: 2014-04-14 07:47:14

Tags: webharvest

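I am using the WebHarvest tool to scrape web data from a few websites. I have gone through the samples, but I could not find a way to authenticate against a site and then scrape data from it. Can anyone point to a sample configuration that scrapes web data behind a login? How do I send the login parameters and then receive the home page content? Thanks for your help.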

1 answer:

Answer 0 (score: 0)

I just modified one of the Web Harvest samples (http://web-harvest.sourceforge.net/samples.php?num=4), and it runs fine with login credentials. You can take the updated code and try it:

<?xml version="1.0" encoding="UTF-8"?>

<config charset="ISO-8859-1">

    <!-- sends post request with needed login information -->
    <http method="post" url="http://www.nytimes.com/auth/login">
        <http-param name="is_continue">true</http-param>
        <http-param name="URI">http://</http-param>
        <http-param name="OQ"></http-param>
        <http-param name="OP"></http-param>
        <http-param name="USERID">web-harvest</http-param>
        <http-param name="PASSWORD">web-harvest</http-param>
    </http>

    <var-def name="startUrl">http://www.nytimes.com/pages/todayspaper/index.html</var-def>

    <file action="write" path="D:/nytimes/nytimes${sys.date()}.xml" charset="UTF-8">
        <template>
            <![CDATA[ <newyork_times date="${sys.datetime("dd.MM.yyyy")}"> ]]>
        </template>

        <loop item="articleUrl" index="i">
            <!-- collects every story block (div with class "story") from the front page;
                 each block is bound to the loop variable articleUrl -->
            <list>
                <xpath expression="//div[@class='story']">
                    <html-to-xml>
                        <http url="${startUrl}"/>
                    </html-to-xml>
                </xpath>
            </list>

            <!-- passes each story block to XQuery, which writes it out unchanged -->
            <body>
                <xquery>
                    <xq-param name="doc">
                        <var name="articleUrl"/>
                    </xq-param>
                    <xq-expression><![CDATA[
                        declare variable $doc as node() external;
                        $doc
                    ]]></xq-expression>
                </xquery>
            </body>
        </loop>

        <![CDATA[ </newyork_times> ]]>
    </file>

</config>
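
For a site other than nytimes.com, the same two steps (post the login form, then request the protected page) can be reduced to a minimal sketch like the one below. The URLs, the form field names `username` and `password`, the credentials, and the output path `home.xml` are all placeholders I made up for illustration; take the real values from the login form of your target site. It also assumes, as the sample above does, that the session cookie set by the login response is reused for later requests in the same run.

```xml
<?xml version="1.0" encoding="UTF-8"?>

<config charset="UTF-8">

    <!-- Hypothetical URLs: replace with the login form action and the protected page of your site -->
    <var-def name="loginUrl">http://example.com/login</var-def>
    <var-def name="homeUrl">http://example.com/home</var-def>

    <!-- Step 1: submit the login form. The field names and values are assumptions;
         use the input names found in the site's login form.
         The response is kept in a variable only so that it can be inspected. -->
    <var-def name="loginResponse">
        <http method="post" url="${loginUrl}">
            <http-param name="username">my-user</http-param>
            <http-param name="password">my-password</http-param>
        </http>
    </var-def>

    <!-- Step 2: request the protected page in the same run and save it as cleaned-up XML -->
    <file action="write" path="home.xml" charset="UTF-8">
        <html-to-xml>
            <http url="${homeUrl}"/>
        </html-to-xml>
    </file>

</config>
```

If the saved page still shows the login form, the site probably expects extra hidden form fields (like `is_continue`, `URI`, `OQ` and `OP` in the nytimes example), so check the form's HTML and add one `http-param` for each of them.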