Question

我正在使用bash构建一个爬虫程序，它抓取网页并将html发送到api。 api期望有效json。

有很多网页都有这样的内容

<script type="text/javascript">
<!--//--><![CDATA[//><!--
window.jQuery || document.write("<script src='/sites/all/modules/contrib/jquery_update/replace/jquery/1.8/jquery.min.js'>\x3C/script>")
//--><!]]>
</script>

我使用curl下载页面

curl -s --cookie "$cookie_content" "$crawl_url" -o "$PWD/temp/curl.txt"

这是我对响应进行编码的原因：

body=`cat "$PWD/temp/curl.txt"`
body=`printf "%q" "$body"`
body=`echo ${body:2:${#body} - 3}` # this removes the $ and the ' around the string

转义"有效：

body=`echo "$body" | LC_ALL=C sed s/\"/'\\\"'/g`

但输出中仍有\'，即无效json。如果有更多网站我还没有测试，可能会有更多无效的json字符。

如何让每个网页转换为有效的json？

可能的结果可能是：

{
  "crawledAt": "2015-07-20...",
  "page": {
    "content": "<!DOCTYPE html>\n<html>\n..."
  }
}

在抓取网页时，bash中的curl无效JSON输出

0 个答案: