Scraping data from a website with Web-Harvest

Time: 2014-06-07 08:00:46

Tags: html webharvest

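I am trying to scrape all the HTML pages of the website http://www.tecomdirectory.com/ using Web-Harvest. However, the script does not fetch all of them; it only downloads a few HTML pages. I am using the following script: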

<!-- set initial page -->
<var-def name="home">http://www.tecomdirectory.com</var-def>

<!-- define script functions and variables -->
<script><![CDATA[
    /* checks if specified URL is valid for download */
    boolean isValidUrl(String url) {
        String urlSmall = url.toLowerCase();
        return urlSmall.startsWith("http://www.tecomdirectory.com/") && urlSmall.endsWith(".html");
    }

    /* create filename based on specified URL */
    String makeFilename(String url) {
        return url.replaceAll("http://|https://|file://", "");
    }

    /* set of unvisited URLs */
    Set unvisited = new HashSet();
    unvisited.add(home);

    /* pushes to web-harvest context initial set of unvisited pages */
    SetContextVar("unvisitedVar", unvisited);

    /* set of visited URLs */
    Set visited = new HashSet();
]]></script>

<!-- loop while there are any unvisited links -->
<while condition="${unvisitedVar.toList().size() != 0}">
    <loop item="currUrl">
        <list><var name="unvisitedVar"/></list>
        <body>
            <empty>
                <var-def name="content">
                    <html-to-xml>
                        <http url="${currUrl}"/>
                    </html-to-xml>
                </var-def>

                <script><![CDATA[
                    currentFullUrl = sys.fullUrl(home, currUrl);
                ]]></script>

                <!--  saves downloaded page -->
                <file action="write" path="spider/${makeFilename(currentFullUrl)}.html">
                    <var name="content"/>
                </file>

                <!-- adds current URL to the list of visited -->
                <script><![CDATA[
                    visited.add(sys.fullUrl(home, currUrl));
                    Set newLinks = new HashSet();
                    print(currUrl);
                ]]></script>

                <!-- loop through all collected links on the downloaded page -->
                <loop item="currLink">
                    <list>
                        <xpath expression="//a/@href">
                            <var name="content"/>
                        </xpath>
                    </list>
                    <body>
                        <script><![CDATA[
                            String fullLink = sys.fullUrl(home, currLink);
                            if ( isValidUrl(fullLink.toString()) && !visited.contains(fullLink) && !unvisitedVar.toList().contains(fullLink) ) {
                                newLinks.add(fullLink);
                            }
                        ]]></script>
                    </body>
                </loop>
            </empty>
        </body>
    </loop>

    <!-- unvisited links are now all the new links collected from the downloaded pages -->
    <script><![CDATA[
         SetContextVar("unvisitedVar", newLinks);
    ]]></script>
</while>

Please help. Thanks in advance.

1 answer:

Answer 0 (score: 0)

Try using Visual Web Ripper for web harvesting. You will run into a lot of problems with Web-Harvest.
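
As a side note on the script in the question: one plausible reason why only a few pages are saved is that newLinks is re-created inside the per-page loop body. When a round of the while loop ends, unvisitedVar therefore receives only the links collected from the last page processed in that round, and the links found on every other page are lost. The following is a minimal sketch of a possible fix, not a tested solution. It assumes the rest of the configuration stays as posted and uses only the elements and calls that already appear in the question. The only changes are that newLinks is created once per round, before the loop, and that the per-page script no longer re-declares it.

```xml
<!-- loop while there are any unvisited links -->
<while condition="${unvisitedVar.toList().size() != 0}">
    <!-- assumption: creating the set here, once per round, keeps links from every page -->
    <script><![CDATA[
        Set newLinks = new HashSet();
    ]]></script>

    <loop item="currUrl">
        <list><var name="unvisitedVar"/></list>
        <body>
            <empty>
                <var-def name="content">
                    <html-to-xml>
                        <http url="${currUrl}"/>
                    </html-to-xml>
                </var-def>

                <script><![CDATA[
                    currentFullUrl = sys.fullUrl(home, currUrl);
                ]]></script>

                <!-- saves downloaded page -->
                <file action="write" path="spider/${makeFilename(currentFullUrl)}.html">
                    <var name="content"/>
                </file>

                <!-- marks the current URL as visited; newLinks is no longer reset here -->
                <script><![CDATA[
                    visited.add(sys.fullUrl(home, currUrl));
                    print(currUrl);
                ]]></script>

                <!-- loop through all collected links on the downloaded page -->
                <loop item="currLink">
                    <list>
                        <xpath expression="//a/@href">
                            <var name="content"/>
                        </xpath>
                    </list>
                    <body>
                        <script><![CDATA[
                            String fullLink = sys.fullUrl(home, currLink);
                            if ( isValidUrl(fullLink.toString()) && !visited.contains(fullLink) && !unvisitedVar.toList().contains(fullLink) ) {
                                newLinks.add(fullLink);
                            }
                        ]]></script>
                    </body>
                </loop>
            </empty>
        </body>
    </loop>

    <!-- the next round visits every new link collected during this round -->
    <script><![CDATA[
        SetContextVar("unvisitedVar", newLinks);
    ]]></script>
</while>
```

Independently of this change, isValidUrl in the question accepts only links that start with http://www.tecomdirectory.com/ and end in .html, so pages linked with a query string, a different extension, or a host name without www are skipped by that filter. If the missing pages fall into one of those groups, the filter would need to be relaxed as well.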