Web-Harvest: scraping multiple URLs from a list

Time: 2013-07-31 23:29:48

Tags: xml web-crawler webharvest

I am trying to fetch multiple web pages from a predefined list. Here is the code:

<?xml version="1.0" encoding="UTF-8"?>
<config>

  <script>
    <![CDATA[
      String[] codes = new String[] {"18","21","24","25","26"};
      SetContextVar("codes", codes);
    ]]>
  </script>

  <loop item="link">
    <list>
      <var name="codes" />
    </list>
    <body>
      <var-def name="webpage">
        <html-to-xml>
          <http url="${sys.fullUrl('http://www.someurl.com/',link)}"/>
        </html-to-xml>
      </var-def>
    </body>
  </loop>
</config>

The error is: "Variable assignment: codes: cannot assign org.webharvest.runtime.variables.ListVariable to java.lang.String"

What am I missing here?

1 answer:

Answer 0 (score: 1)

Try this example:

<config>

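  <!-- The tags are wrapped in CDATA so they are stored as literal text;
       the variable then holds a small XML document listing the codes. -->
  <var-def name="Codes">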
    <![CDATA[<Codes>]]>
    <![CDATA[<Code>]]>18<![CDATA[</Code>]]>
    <![CDATA[<Code>]]>21<![CDATA[</Code>]]>
    <![CDATA[<Code>]]>24<![CDATA[</Code>]]>
    <![CDATA[<Code>]]>25<![CDATA[</Code>]]>
    <![CDATA[</Codes>]]>
  </var-def>

  <!-- Iterate over the text of every Code element; CodesLoop holds the current code. -->
  <loop item="CodesLoop" index="i">
    <list>
      <xpath expression="//Code/text()">
        <var name="Codes"/>
      </xpath>
    </list>
    <body>
      <!-- Write each code to its own file, named after the code. -->
      <file action="write" path="D:\ABC\${CodesLoop}.txt" charset="UTF-8">
        <template>${CodesLoop}</template>
      </file>
    </body>
  </loop>
</config>
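
To connect this back to the question, here is a minimal sketch that combines the list from the example above with the HTTP fetch from the original configuration. It uses only elements that already appear in this thread (var-def, loop, xpath, html-to-xml, http, sys.fullUrl). It has not been run against a real site, and http://www.someurl.com/ is the placeholder URL from the question. The idea is to avoid the script block entirely, so no Java array is ever assigned to a Web-Harvest variable:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config>

  <!-- The codes from the question, stored as a small XML document.
       The tags are wrapped in CDATA so they are kept as literal text. -->
  <var-def name="Codes">
    <![CDATA[<Codes>]]>
    <![CDATA[<Code>]]>18<![CDATA[</Code>]]>
    <![CDATA[<Code>]]>21<![CDATA[</Code>]]>
    <![CDATA[<Code>]]>24<![CDATA[</Code>]]>
    <![CDATA[<Code>]]>25<![CDATA[</Code>]]>
    <![CDATA[<Code>]]>26<![CDATA[</Code>]]>
    <![CDATA[</Codes>]]>
  </var-def>

  <!-- On each pass, "link" holds one code selected by the XPath expression. -->
  <loop item="link">
    <list>
      <xpath expression="//Code/text()">
        <var name="Codes"/>
      </xpath>
    </list>
    <body>
      <!-- Fetch the page for the current code and convert it to well-formed XML. -->
      <var-def name="webpage">
        <html-to-xml>
          <http url="${sys.fullUrl('http://www.someurl.com/', link)}"/>
        </html-to-xml>
      </var-def>
    </body>
  </loop>
</config>
```

As in the question, webpage is overwritten on every iteration, so in practice you would probably process or save it inside the loop body, for example with a file element like the one shown above.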