Question

I have a bash script that reads all the XML file URLs from a robots.txt file and writes the HTTP server response code for each one to an output file:

#!/usr/bin/env bash

#usage ./script.sh robots.txt

# Read one URL per line and record the HTTP status code returned for each.
while IFS= read -r url
do
    urlstatus=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "$url")
    echo "$url  $urlstatus" >> results.txt
done < "$1"

An example robots.txt might look like this:

http://www.youraddress.com/file1.xml
http://www.youraddress.com/file2.xml
http://www.youraddress.com/file3.xml

Example output:

http://www.youraddress.com/file1.xml 200
http://www.youraddress.com/file2.xml 200
http://www.youraddress.com/file3.xml 200

However, each XML file contains loc tags.

Inside the XML:

<url>
<loc>
    http://myother.address.com/
</loc>
<changefreq>daily</changefreq>
<priority>0.8</priority>

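What I need to do is go through each file (file1, file2, ...), take every URL from its loc tags, and print the HTTP server response for each of those URLs.

Could someone give me some hints on how to extend this script to do that?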

Answer 1

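I would consider running xmllint --noout "$url" to make sure the XML has no syntax errors.

robots.txt will (or at least should) also contain lines such as Allow: /, so you need to extract only the lines you want: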

sed -n -e 's/^ *sitemap: *//p'
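
As a small sketch of how that sed expression might be wrapped for reuse: the function name extract_sitemaps is hypothetical, and the [Ss] bracket is an added assumption so that both "Sitemap:" and "sitemap:" lines are matched (the expression above matches only the lowercase form).

```shell
# extract_sitemaps (hypothetical helper): print the URL from every
# "Sitemap:" line of the robots.txt file given as $1, skipping
# User-agent, Allow, Disallow and any other lines.
extract_sitemaps() {
    # [Ss] accepts both capitalisations without GNU-only sed flags.
    sed -n -e 's/^ *[Ss]itemap: *//p' "$1"
}
```

Calling extract_sitemaps robots.txt should then print one sitemap URL per line, ready to be fed to a while read loop.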

To process the XML, you can use XSLT, for example:

<xsl:stylesheet 
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!--* emit plain text so no XML declaration is written *-->
  <xsl:output method="text"/>

  <!--* extract URLs one per line from loc elements;
      * match loc in any namespace (XSLT 1 method):
      *-->
  <xsl:template match="*[name() = 'loc']">
    <xsl:value-of select="."/>
    <xsl:text>&#xa;</xsl:text><!--* newline *-->
  </xsl:template>

  <xsl:template match="*"><xsl:apply-templates/></xsl:template>
  <xsl:template match="text()"></xsl:template>
</xsl:stylesheet>

(You can run it with the xsltproc command.) Alternatively, write the whole thing in Perl or Python using one of their XML modules.
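
One way the pieces might fit together with the original script is sketched below. It assumes curl and xsltproc are installed, that the stylesheet from this answer is saved as loc.xsl (a hypothetical file name) with the xsl namespace declared correctly, and that every loc value is an http or https URL. The function names clean_urls and check_sitemap are illustrative only. The shell acts purely as glue here; the XML itself is parsed by xsltproc.

```shell
#!/usr/bin/env bash

# clean_urls (hypothetical helper): trim surrounding whitespace and keep
# only lines that start with http:// or https://. This drops blank lines
# and any stray XML declaration, since <loc> values are often wrapped in
# newlines and indentation.
clean_urls() {
    sed -n -E -e 's/^[[:space:]]+//' -e 's/[[:space:]]+$//' -e '/^https?:\/\//p'
}

# check_sitemap (hypothetical helper): download one sitemap ($1), extract
# its loc URLs with the stylesheet ($2, default loc.xsl), and print
# "URL STATUS" for each of them.
check_sitemap() {
    local sitemap=$1 stylesheet=${2:-loc.xsl} url status
    curl --silent --fail "$sitemap" |
        xsltproc "$stylesheet" - |
        clean_urls |
        while IFS= read -r url; do
            # stdin is redirected so curl cannot consume the URL list
            status=$(curl -o /dev/null --silent --head \
                          --write-out '%{http_code}' "$url" < /dev/null)
            printf '%s %s\n' "$url" "$status"
        done
}
```

A possible driver loop, with one sitemap URL per line in the file passed as $1:

    while IFS= read -r sitemap; do check_sitemap "$sitemap"; done < "$1" >> results.txt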

Do not try to parse XML with the shell.

Answer 2

You could do something like the following:

$data['digit1']
