How to parse HTTP headers using Bash?

Asked: 2014-07-24 20:17:07

Tags: linux bash curl

I need to get 2 values from the headers of a web page using curl. I've been able to get the values individually like this:

response1=$(curl -I -s http://www.example.com | grep HTTP/1.1 | awk {'print $2'})
response2=$(curl -I -s http://www.example.com | grep Server: | awk {'print $2'})

But I can't figure out how to grep the values separately using a single curl request:

response=$(curl -I -s http://www.example.com)
http_status=$response | grep HTTP/1.1 | awk {'print $2'}
server=$response | grep Server: | awk {'print $2'}

Each attempt results in either an error message or an empty value. I'm sure it's just a syntax issue.
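For reference, `$response | grep ...` fails because the shell tries to execute the contents of `$response` as a command; the saved response has to be written to grep's standard input instead. A minimal sketch (offline, with a made-up canned response standing in for curl's output):

```shell
#!/usr/bin/env bash

# Canned response standing in for: response=$(curl -I -s http://www.example.com)
response=$'HTTP/1.1 200 OK\r\nServer: Apache\r\nContent-Length: 0\r\n'

# Pipe the variable into grep instead of executing it.
# tr strips the CRs that real curl header output contains.
http_status=$(printf '%s\n' "$response" | tr -d '\r' | grep '^HTTP/1.1' | awk '{print $2}')
server=$(printf '%s\n' "$response" | tr -d '\r' | grep '^Server:' | awk '{print $2}')

echo "$http_status"  # 200
echo "$server"       # Apache
```

A here-string works just as well: `grep '^Server:' <<< "$response"`.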

4 answers:

Answer 0 (score: 13)

A complete bash solution. Demonstrates how to parse additional headers easily, with no need for awk:

shopt -s extglob # Required to trim whitespace; see below

while IFS=':' read key value; do
    # trim whitespace in "value"
    value=${value##+([[:space:]])}; value=${value%%+([[:space:]])}

    case "$key" in
        Server) SERVER="$value"
                ;;
        Content-Type) CT="$value"
                ;;
        HTTP*) read -r PROTO STATUS MSG <<< "$key${value:+:$value}"
                ;;
    esac
done < <(curl -sI http://www.google.com)
echo $STATUS
echo $SERVER
echo $CT

Produces:

302
GFE/2.0
text/html; charset=UTF-8

Per RFC-2616, HTTP headers are modeled on the "Standard for the Format of ARPA Internet Text Messages" (RFC822), which states explicitly in section 3.1.2:

  

  The field-name must be composed of printable ASCII characters (i.e., characters that have values between 33. and 126., decimal, except colon). The field-body may be composed of any ASCII characters, except CR or LF. (While CR and/or LF may be present in actual text, they are removed by the action of unfolding the field.)

So the script above should capture any RFC-[2]822-compliant header, with the exception of folded headers.
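The parsing loop can be exercised without a network round-trip by feeding it a canned response through a here-string (a sketch; the sample header values are made up):

```shell
#!/usr/bin/env bash
shopt -s extglob # Required by the whitespace-trimming expansions

# Made-up response; real curl -sI output has the same CRLF line endings
sample=$'HTTP/1.1 200 OK\r\nServer: nginx/1.18.0\r\nContent-Type: text/plain\r\n'

while IFS=':' read -r key value; do
    # Trim leading/trailing whitespace (including the trailing CR)
    value=${value##+([[:space:]])}; value=${value%%+([[:space:]])}
    case "$key" in
        Server) SERVER="$value" ;;
        Content-Type) CT="$value" ;;
        HTTP*) read -r PROTO STATUS MSG <<< "$key${value:+:$value}" ;;
    esac
done <<< "$sample"

echo "$STATUS"  # 200
echo "$SERVER"  # nginx/1.18.0
echo "$CT"      # text/plain
```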

Answer 1 (score: 2)

If you want to extract more than a couple of headers, you can stuff all the headers into a bash associative array. Here is a simple function which assumes that any given header appears only once. (Don't use it for Set-Cookie; see below.)

# Call this as: headers ARRAY URL
headers () {
  {
    # (Re)define the specified variable as an associative array.
    unset $1;
    declare -gA $1;
    local line rest

    # Get the first line, assuming HTTP/1.0 or above. Note that these fields
    # have Capitalized names.
    IFS=$' \t\n\r' read $1[Proto] $1[Status] rest
    # Drop the CR from the message, if there was one.
    declare -gA $1[Message]="${rest%$'\r'}"
    # Now read the rest of the headers. 
    while true; do
      # Get rid of the trailing CR if there is one.
      IFS=$'\r' read line rest;
      # Stop when we hit an empty line
      if [[ -z $line ]]; then break; fi
      # Make sure it looks like a header
      # This regex also strips leading and trailing spaces from the value
      if [[ $line =~ ^([[:alnum:]_-]+):\ *(( *[^ ]+)*)\ *$ ]]; then
        # Force the header to lower case, since headers are case-insensitive,
        # and store it into the array
        declare -gA $1[${BASH_REMATCH[1],,}]="${BASH_REMATCH[2]}"
      else
        printf "Ignoring non-header line: %q\n" "$line" >> /dev/stderr
      fi
    done
  } < <(curl -Is "$2")
}

Example:

$ headers so http://stackoverflow.com/
$ for h in ${!so[@]}; do printf "%s=%s\n" $h "${so[$h]}"; done | sort
Message=OK
Proto=HTTP/1.1
Status=200
cache-control=public, no-cache="Set-Cookie", max-age=43
content-length=224904
content-type=text/html; charset=utf-8
date=Fri, 25 Jul 2014 17:35:16 GMT
expires=Fri, 25 Jul 2014 17:36:00 GMT
last-modified=Fri, 25 Jul 2014 17:35:00 GMT
set-cookie=prov=205fd7f3-10d4-4197-b03a-252b60df7653; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
vary=*
x-frame-options=SAMEORIGIN

Note that the SO response includes one or more cookies in Set-Cookie headers, but we can only see the last one because the naive script overwrites entries that share a header name. (As it happens, there was only one, but we have no way of knowing that.) While the script could be extended to special-case Set-Cookie, a better approach would probably be to supply a cookie-jar file and use the -b and -c curl options to maintain it.
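If you do need every Set-Cookie value from a single response, one workaround (a sketch, fed with made-up headers rather than a live curl call) is to append repeated headers to an indexed array instead of overwriting a single associative-array entry:

```shell
#!/usr/bin/env bash

# Canned response standing in for: curl -Is "$url"
response=$'HTTP/1.1 200 OK\r\nSet-Cookie: a=1; path=/\r\nSet-Cookie: b=2; path=/\r\nServer: test\r\n'

cookies=()
while IFS=':' read -r key value; do
    value=${value%$'\r'}   # drop the trailing CR
    value=${value# }       # drop the leading space after the colon
    if [[ $key == Set-Cookie ]]; then
        cookies+=("$value")  # accumulate instead of overwrite
    fi
done <<< "$response"

printf '%s\n' "${cookies[@]}"
# a=1; path=/
# b=2; path=/
```

For maintaining cookies across requests, though, a cookie jar (`curl -c jar.txt -b jar.txt`) remains the simpler option, as noted above.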

Answer 2 (score: 1)

Using process substitution (<( ... )), you can read into shell variables:

sh$ read STATUS SERVER < <(
      curl -sI http://www.google.com | 
      awk '/^HTTP/ { STATUS = $2 } 
           /^Server:/ { SERVER = $2 } 
           END { printf("%s %s\n",STATUS, SERVER) }'
    )

sh$ echo $STATUS
302
sh$ echo $SERVER
GFE/2.0
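The same awk pass extends to any number of headers; a sketch that also captures Content-Type (run here against a canned response instead of curl, so the header values are made up):

```shell
#!/usr/bin/env bash

read -r STATUS SERVER CTYPE < <(
  printf 'HTTP/1.1 301 Moved\r\nServer: gws\r\nContent-Type: text/html\r\n' |
  tr -d '\r' |               # normalize CRLF line endings for awk
  awk '/^HTTP/          { status = $2 }
       /^Server:/       { server = $2 }
       /^Content-Type:/ { ctype  = $2 }
       END { printf("%s %s %s\n", status, server, ctype) }'
)

echo "$STATUS $SERVER $CTYPE"  # 301 gws text/html
```

One caveat of this single-line approach: a header value containing spaces (e.g. `text/html; charset=utf-8`) would be truncated at the first space, both by awk's `$2` and by the final `read`.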

Answer 3 (score: 0)

@rici's answer, improved and modernized using Bash >= 4.2 features:

  • Uses a declare -n nameref variable to reference the associative array.
  • Uses declare -l to automatically lowercase a variable's value.
  • Uses ${var@a} to query a variable's declaration attributes.
  • Changed to process an input stream rather than invoking the curl command itself.
  • Made compatible with RFC-2822's Folded Headers.

#!/usr/bin/env bash

shopt -s extglob # Requires extended globbing

# Process the input headers stream into an associative ARRAY
# @Arguments
# $1: The associative array receiving headers
# @Input
# &1: The headers stream
parse_headers() {
  if [ $# -ne 1 ]; then
    printf 'Need an associative array name argument\n' >&2
    return 1
  fi
  local -n header=$1 # Nameref argument
  # Check that argument is the name of an associative array
  case ${header@a} in
    A | At) ;;
    *)
      printf \
      'Variable %s with attributes %s is not a suitable associative array\n' \
      "${!header}" "${header@a}" >&2
      return 1
      ;;
  esac
  header=() # Clear the associative array
  local -- line rest v
  local -l k # Automatically lowercased

  # Get the first line, assuming HTTP/1.0 or above. Note that these fields
  # have Capitalized names.
  IFS=$' \t\n\r' read -r header['Proto'] header['Status'] rest
  # Drop the CR from the message, if there was one.
  header['Message']="${rest%%*([[:space:]])}"
  # Now read the rest of the headers.
  while IFS=: read -r line rest && [ -n "$line$rest" ]; do
    rest=${rest%%*([[:space:]])}
    rest=${rest##*([[:space:]])}
    line=${line%%*([[:space:]])}
    [ -z "$line" ] && break # Blank line is end of headers stream
    if [ -n "$rest" ]; then
      k=$line
      v=$rest
    else
      # Handle folded header
      # See: https://tools.ietf.org/html/rfc2822#section-2.2.3
      v+=" ${line##*([[:space:]])}"
    fi
    header["$k"]="$v"
  done
}

declare -A HTTP_HEADERS

parse_headers HTTP_HEADERS < <(
  curl \
    --silent \
    --head \
    --location \
    https://stackoverflow.com/q/24943170/7939871
)

for k in "${!HTTP_HEADERS[@]}"; do
  printf '[%q]=%q\n' "$k" "${HTTP_HEADERS[$k]}"
done

typeset -p HTTP_HEADERS