Parsing a URL in a shell script

Time: 2011-05-30 08:59:26

Tags: parsing shell url

I have a URL:

sftp://user@host.net/some/random/path

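I want to extract the user, host, and path from this string. Any part can be of arbitrary length.

16 answers:

Answer 0 (score: 37)

Assuming your URL is passed as the first parameter to the script: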

#!/bin/bash

# extract the protocol
proto="$(echo $1 | grep :// | sed -e's,^\(.*://\).*,\1,g')"
# remove the protocol
url="$(echo ${1/$proto/})"
# extract the user (if any)
user="$(echo $url | grep @ | cut -d@ -f1)"
# extract the host
host="$(echo ${url/$user@/} | cut -d/ -f1)"
# by request - try to extract the port
port="$(echo $host | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"
# extract the path (if any)
path="$(echo $url | grep / | cut -d/ -f2-)"

echo "url: $url"
echo "  proto: $proto"
echo "  user: $user"
echo "  host: $host"
echo "  port: $port"
echo "  path: $path"

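I must admit this is not the cleanest solution, but it does not rely on another scripting language such as perl or python. (Providing a solution using one of them would produce cleaner results ;))

Using your example, the result is: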

url: user@host.net/some/random/path
  proto: sftp://
  user: user
  host: host.net
  port:
  path: some/random/path

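This also works for URLs without a protocol, username, or path. In that case the corresponding variable will contain an empty string.

[Edit]
If your bash version cannot cope with the substitutions (${1/$proto/}), try this: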

#!/bin/bash

# extract the protocol
proto="$(echo $1 | grep :// | sed -e's,^\(.*://\).*,\1,g')"

# remove the protocol -- updated
url=$(echo $1 | sed -e s,$proto,,g)

# extract the user (if any)
user="$(echo $url | grep @ | cut -d@ -f1)"

# extract the host -- updated
host=$(echo $url | sed -e s,$user@,,g | cut -d/ -f1)

# by request - try to extract the port
port="$(echo $host | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"

# extract the path (if any)
path="$(echo $url | grep / | cut -d/ -f2-)"

Answer 1 (score: 18)

The above, refined (added password and port parsing), and working in /bin/sh:

# extract the protocol
proto="`echo $DATABASE_URL | grep '://' | sed -e's,^\(.*://\).*,\1,g'`"
# remove the protocol
url=`echo $DATABASE_URL | sed -e s,$proto,,g`

# extract the user and password (if any)
userpass="`echo $url | grep @ | cut -d@ -f1`"
pass=`echo $userpass | grep : | cut -d: -f2`
if [ -n "$pass" ]; then
    user=`echo $userpass | grep : | cut -d: -f1`
else
    user=$userpass
fi

# extract the host -- updated
hostport=`echo $url | sed -e s,$userpass@,,g | cut -d/ -f1`
port=`echo $hostport | grep : | cut -d: -f2`
if [ -n "$port" ]; then
    host=`echo $hostport | grep : | cut -d: -f1`
else
    host=$hostport
fi

# extract the path (if any)
path="`echo $url | grep / | cut -d/ -f2-`"

Posting because I needed it, so I wrote it (based on @Shirkin's answer, obviously), and I figured someone else might appreciate it.
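
For comparison, the same split can be sketched with POSIX parameter expansion alone, which avoids spawning grep, sed, and cut for each field. This is a minimal sketch rather than a drop-in replacement for the script above: the function name parse_url_sh and its output variables are assumptions made for illustration, proto is returned without the trailing "://", and the sketch assumes the credentials contain no "/" and the host is not an IPv6 literal.

```shell
# parse_url_sh is a hypothetical helper; it sets the variables
# proto, user, pass, host, port and path from its first argument.
parse_url_sh() {
    url=$1
    proto='' user='' pass='' host='' port='' path=''

    # Scheme: everything before the first "://", if present.
    case $url in
        *://*) proto=${url%%://*}; url=${url#*://} ;;
    esac

    # Path: everything after the first "/" (leading slash dropped, as above).
    case $url in
        */*) path=${url#*/}; url=${url%%/*} ;;
    esac

    # Credentials: everything before the last "@" of what is left.
    case $url in
        *@*)
            userpass=${url%@*}
            url=${url##*@}
            case $userpass in
                *:*) user=${userpass%%:*}; pass=${userpass#*:} ;;
                *)   user=$userpass ;;
            esac
            ;;
    esac

    # Host and optional port.
    case $url in
        *:*) host=${url%%:*}; port=${url#*:} ;;
        *)   host=$url ;;
    esac
}

parse_url_sh 'sftp://user:secret@host.net:2222/some/random/path'
echo "proto=$proto user=$user pass=$pass host=$host port=$port path=$path"
```

Missing parts come back as empty strings, which matches the behaviour of the scripts above.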

Answer 2 (score: 9)

Using Python (the best tool for this job, IMHO):

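#!/usr/bin/env python3

import os
from urllib.parse import urlparse  # Python 3 location of urlparse

uri = os.environ['NAUTILUS_SCRIPT_CURRENT_URI']
result = urlparse(uri)
# netloc looks like "user@host"; rpartition also copes with a missing user
user, _, host = result.netloc.rpartition('@')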
path = result.path
print('user=', user)
print('host=', host)
print('path=', path)

Answer 3 (score: 3)

Here is my take, based on some of the existing answers, but it can also handle GitHub SSH clone URLs:

#!/bin/bash

PROJECT_URL="git@github.com:heremaps/here-aaa-java-sdk.git"

# Extract the protocol (includes trailing "://").
PARSED_PROTO="$(echo $PROJECT_URL | sed -nr 's,^(.*://).*,\1,p')"

# Remove the protocol from the URL.
PARSED_URL="$(echo ${PROJECT_URL/$PARSED_PROTO/})"

# Extract the user (includes trailing "@").
PARSED_USER="$(echo $PARSED_URL | sed -nr 's,^(.*@).*,\1,p')"

# Remove the user from the URL.
PARSED_URL="$(echo ${PARSED_URL/$PARSED_USER/})"

# Extract the port (includes leading ":").
PARSED_PORT="$(echo $PARSED_URL | sed -nr 's,.*(:[0-9]+).*,\1,p')"

# Remove the port from the URL.
PARSED_URL="$(echo ${PARSED_URL/$PARSED_PORT/})"

# Extract the path (includes leading "/" or ":").
PARSED_PATH="$(echo $PARSED_URL | sed -nr 's,[^/:]*([/:].*),\1,p')"

# Remove the path from the URL.
PARSED_HOST="$(echo ${PARSED_URL/$PARSED_PATH/})"

echo "proto: $PARSED_PROTO"
echo "user: $PARSED_USER"
echo "host: $PARSED_HOST"
echo "port: $PARSED_PORT"
echo "path: $PARSED_PATH"

which gives

proto:
user: git@
host: github.com
port:
path: :heremaps/here-aaa-java-sdk.git

and for PROJECT_URL="ssh://sschuberth@git.eclipse.org:29418/jgit/jgit" you get

proto: ssh://
user: sschuberth@
host: git.eclipse.org
port: :29418
path: /jgit/jgit

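Answer 4 (score: 3)

This solution works in principle the same way as Adam Ryczkowski's in this thread, but it has an improved regular expression based on RFC3986 (with some changes) and fixes some errors (for example, userinfo can contain the '_' character). It can also understand relative URIs (for example, to extract the query or fragment).

#!/bin/bash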

# Following regex is based on https://tools.ietf.org/html/rfc3986#appendix-B with
# additional sub-expressions to split authority into userinfo, host and port
#
readonly URI_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?(/([^?#]*))(\?([^#]*))?(#(.*))?'
#                    ↑↑            ↑  ↑↑↑            ↑         ↑ ↑            ↑ ↑        ↑  ↑        ↑ ↑
#                    |2 scheme     |  ||6 userinfo   7 host    | 9 port       | 11 rpath |  13 query | 15 fragment
#                    1 scheme:     |  |5 userinfo@             8 :…           10 path    12 ?…       14 #…
#                                  |  4 authority
#                                  3 //…

parse_scheme () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[2]}"
}

parse_authority () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[4]}"
}

parse_user () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[6]}"
}

parse_host () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[7]}"
}

parse_port () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[9]}"
}

parse_path () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[10]}"
}

parse_rpath () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[11]}"
}

parse_query () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[13]}"
}

parse_fragment () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[15]}"
}
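
Each helper above re-runs the match, so when several components are needed it may be simpler to match once and read BASH_REMATCH directly. The following is a minimal sketch under that assumption: the regex is copied from above (under the name URI_RE) so the block is self-contained, and the function name parse_uri and the URI_* output variables are hypothetical names chosen for illustration.

```shell
#!/bin/bash
# Same regex as above, repeated so that this sketch runs on its own.
URI_RE='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?(/([^?#]*))(\?([^#]*))?(#(.*))?'

# parse_uri (hypothetical helper) matches once and copies the capture
# groups into URI_* variables; it returns non-zero when nothing matches.
parse_uri() {
    [[ $1 =~ $URI_RE ]] || return 1
    URI_SCHEME=${BASH_REMATCH[2]}
    URI_USER=${BASH_REMATCH[6]}
    URI_HOST=${BASH_REMATCH[7]}
    URI_PORT=${BASH_REMATCH[9]}
    URI_PATH=${BASH_REMATCH[10]}
    URI_QUERY=${BASH_REMATCH[13]}
    URI_FRAGMENT=${BASH_REMATCH[15]}
}

if parse_uri 'sftp://user@host.net/some/random/path'; then
    echo "user=$URI_USER host=$URI_HOST path=$URI_PATH"
fi
```

Checking the return value matters here because the variables keep their previous contents when the match fails.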

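Answer 5 (score: 2)

If you really want to do it in shell, you can do something as simple as the following using awk. This requires knowing how many fields you will actually be passed (for example, sometimes there is a password and sometimes there is not).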

#!/bin/bash

FIELDS=($(echo "sftp://user@host.net/some/random/path" \
  | awk '{n = split($0, arr, /[\/@:]+/); for (i = 1; i <= n; i++) print arr[i]}'))
# bash arrays are zero-indexed, so the protocol is element 0
proto=${FIELDS[0]}
user=${FIELDS[1]}
host=${FIELDS[2]}
path=$(echo ${FIELDS[@]:3} | sed 's/ /\//g')

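If you do not have awk but you do have grep, and you can require that each field has at least two characters and that the format is reasonably predictable, then you can do: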

#!/bin/bash

FIELDS=($(echo "sftp://user@host.net/some/random/path" \
   | grep -o "[a-z0-9.-][a-z0-9.-]*" | tr '\n' ' '))
# bash arrays are zero-indexed, so the protocol is element 0
proto=${FIELDS[0]}
user=${FIELDS[1]}
host=${FIELDS[2]}
path=$(echo ${FIELDS[@]:3} | sed 's/ /\//g')

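Answer 6 (score: 2)

I just needed to do the same, so I was curious whether it could be done in a single line, and this is what I ended up with: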

#!/bin/bash

parse_url() {
  eval $(echo "$1" | sed -e "s#^\(\(.*\)://\)\?\(\([^:@]*\)\(:\(.*\)\)\?@\)\?\([^/?]*\)\(/\(.*\)\)\?#${PREFIX:-URL_}SCHEME='\2' ${PREFIX:-URL_}USER='\4' ${PREFIX:-URL_}PASSWORD='\6' ${PREFIX:-URL_}HOST='\7' ${PREFIX:-URL_}PATH='\9'#")
}

URL=${1:-"http://user:pass@example.com/path/somewhere"}
PREFIX="URL_" parse_url "$URL"
echo "$URL_SCHEME://$URL_USER:$URL_PASSWORD@$URL_HOST/$URL_PATH"

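How it works:

  1. A crazy sed regex captures all the parts of the URL, where every part is optional (except for the host name)
  2. Using those capture groups, sed outputs variable names together with the values of the corresponding parts (such as URL_SCHEME or URL_USER)
  3. eval executes that output, which defines those variables and makes them available in the script
  4. PREFIX can optionally be passed to control the names of the output variables
  5. PS: be careful when using this on arbitrary input, because this code is vulnerable to script injection.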
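
Given the injection risk noted in point 5, one possible eval-free variant uses bash's =~ operator together with printf -v to assign the variables. This is a sketch under stated assumptions, not the original one-liner: the function name parse_url_safe is hypothetical, the regex is a tightened ERE adaptation of the sed expression above, and the PREFIX value is assumed to be a valid start of a variable name.

```shell
#!/bin/bash
# parse_url_safe is a hypothetical, eval-free variant of parse_url.
# It assigns <prefix>SCHEME, USER, PASSWORD, HOST and PATH via printf -v,
# so the URL contents are stored as data and never run as shell code.
parse_url_safe() {
    local prefix=${PREFIX:-URL_}
    local re='^(([^:/?#]+)://)?(([^:@/]*)(:([^@/]*))?@)?([^/?#]*)(/(.*))?'
    [[ $1 =~ $re ]] || return 1
    printf -v "${prefix}SCHEME"   '%s' "${BASH_REMATCH[2]}"
    printf -v "${prefix}USER"     '%s' "${BASH_REMATCH[4]}"
    printf -v "${prefix}PASSWORD" '%s' "${BASH_REMATCH[6]}"
    printf -v "${prefix}HOST"     '%s' "${BASH_REMATCH[7]}"
    printf -v "${prefix}PATH"     '%s' "${BASH_REMATCH[9]}"
}

parse_url_safe "http://user:pass@example.com/path/somewhere"
echo "$URL_SCHEME://$URL_USER:$URL_PASSWORD@$URL_HOST/$URL_PATH"
```

As in the original, an empty prefix should be avoided, since it would overwrite variables such as PATH and USER.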

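Answer 7 (score: 2)

You can use bash string manipulation. It is easy to learn. If you have difficulty with regular expressions, give it a try. Since the URI comes from NAUTILUS_SCRIPT_CURRENT_URI, I guess it may contain a port, so I have kept that as an optional part as well.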

#!/bin/bash

#You can also use environment variable $NAUTILUS_SCRIPT_CURRENT_URI
X="sftp://user@host.net/some/random/path"

tmp=${X#*//};usr=${tmp%@*}
tmp=${X#*@};host=${tmp%%/*};[[ ${X#*://} == *":"* ]] && host=${host%:*}
tmp=${X#*//};path=${tmp#*/}
proto=${X%%:*}
[[ ${X#*://} == *":"* ]] && tmp=${X##*:} && port=${tmp%%/*}

echo "Protocol:"$proto" User:"$usr" Host:"$host" Port:"$port" Path:"$path

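Answer 8 (score: 1)

I did not like the approaches above, so I wrote my own. It is meant for ftp links; just replace ftp with http if you need to. The first line is a small validation of the link, which should look like ftp://user:pass@host.com/path/to/something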

if ! echo "$url" | grep -q '^[[:blank:]]*ftp://[[:alnum:]]\+:[[:alnum:]]\+@[[:alnum:]\.]\+/.*[[:blank:]]*$'; then return 1; fi

login=$(  echo "$url" | sed 's|[[:blank:]]*ftp://\([^:]\+\):\([^@]\+\)@\([^/]\+\)\(/.*\)[[:blank:]]*|\1|' )
pass=$(   echo "$url" | sed 's|[[:blank:]]*ftp://\([^:]\+\):\([^@]\+\)@\([^/]\+\)\(/.*\)[[:blank:]]*|\2|' )
host=$(   echo "$url" | sed 's|[[:blank:]]*ftp://\([^:]\+\):\([^@]\+\)@\([^/]\+\)\(/.*\)[[:blank:]]*|\3|' )
dir=$(    echo "$url" | sed 's|[[:blank:]]*ftp://\([^:]\+\):\([^@]\+\)@\([^/]\+\)\(/.*\)[[:blank:]]*|\4|' )

My actual goal was to check ftp access by URL. Here is the full result:

#!/bin/bash

test_ftp_url()  # lftp may hang on some ftp problems, like no connection
    {
    local url="$1"

    if ! echo "$url" | grep -q '^[[:blank:]]*ftp://[[:alnum:]]\+:[[:alnum:]]\+@[[:alnum:]\.]\+/.*[[:blank:]]*$'; then return 1; fi

    local login=$(  echo "$url" | sed 's|[[:blank:]]*ftp://\([^:]\+\):\([^@]\+\)@\([^/]\+\)\(/.*\)[[:blank:]]*|\1|' )
    local pass=$(   echo "$url" | sed 's|[[:blank:]]*ftp://\([^:]\+\):\([^@]\+\)@\([^/]\+\)\(/.*\)[[:blank:]]*|\2|' )
    local host=$(   echo "$url" | sed 's|[[:blank:]]*ftp://\([^:]\+\):\([^@]\+\)@\([^/]\+\)\(/.*\)[[:blank:]]*|\3|' )
    local dir=$(    echo "$url" | sed 's|[[:blank:]]*ftp://\([^:]\+\):\([^@]\+\)@\([^/]\+\)\(/.*\)[[:blank:]]*|\4|' )

    exec 3>&2 2>/dev/null
    exec 6<>"/dev/tcp/$host/21" || { exec 2>&3 3>&-; echo 'Bash network support is disabled. Skipping ftp check.'; return 0; }

    read <&6
    if ! echo "${REPLY//$'\r'}" | grep -q '^220'; then exec 2>&3  3>&- 6>&-; return 3; fi   # 220 vsFTPd 3.0.2+ (ext.1) ready...

    echo -e "USER $login\r" >&6; read <&6
    if ! echo "${REPLY//$'\r'}" | grep -q '^331'; then exec 2>&3  3>&- 6>&-; return 4; fi   # 331 Please specify the password.

    echo -e "PASS $pass\r" >&6; read <&6
    if ! echo "${REPLY//$'\r'}" | grep -q '^230'; then exec 2>&3  3>&- 6>&-; return 5; fi   # 230 Login successful.

    echo -e "CWD $dir\r" >&6; read <&6
    if ! echo "${REPLY//$'\r'}" | grep -q '^250'; then exec 2>&3  3>&- 6>&-; return 6; fi   # 250 Directory successfully changed.

    echo -e "QUIT\r" >&6

    exec 2>&3  3>&- 6>&-
    return 0
    }

test_ftp_url 'ftp://fz223free:fz223free@ftp.zakupki.gov.ru/out/nsi/nsiProtocol/daily'
echo "$?"

Answer 9 (score: 1)

If you have access to Bash >= 3.0 you can do this in pure bash as well, thanks to the re-match operator =~:

pattern='^(([[:alnum:]]+)://)?(([[:alnum:]]+)@)?([^:^@]+)(:([[:digit:]]+))?$'
if [[ "http://us@cos.com:3142" =~ $pattern ]]; then
    proto=${BASH_REMATCH[2]}
    user=${BASH_REMATCH[4]}
    host=${BASH_REMATCH[5]}
    port=${BASH_REMATCH[7]}
fi

It should be faster and less resource-hungry than all the previous examples, because no external process is spawned.

Answer 10 (score: 1)

I do not have enough reputation to comment, but I made a small modification to @patryk-obara's answer.

RFC3986 § 6.2.3, Scheme-Based Normalization, treats

http://example.com
http://example.com/

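as equivalent. However, I found that his regex does not match URLs such as http://example.com, while http://example.com/ (with the trailing slash) does match.

I inserted group 11, changing / to (/|$). This matches either / or the end of the string. Now http://example.com matches.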

readonly URI_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?((/|$)([^?#]*))(\?([^#]*))?(#(.*))?$'
#                    ↑↑            ↑  ↑↑↑            ↑         ↑ ↑            ↑↑    ↑        ↑  ↑        ↑ ↑
#                    ||            |  |||            |         | |            ||    |        |  |        | |
#                    |2 scheme     |  ||6 userinfo   7 host    | 9 port       ||    12 rpath |  14 query | 16 fragment
#                    1 scheme:     |  |5 userinfo@             8 :...         ||             13 ?...     15 #...
#                                  |  4 authority                             |11 / or end-of-string
#                                  3  //...                                   10 path
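
A quick way to check the change is to run both forms of the URL through the modified regex and compare the captured host and path. This is a minimal sketch: the regex is copied from above under the name URI_RE, and the helper name show_host_path is an assumption made for illustration.

```shell
#!/bin/bash
# Modified regex from above, repeated so that this sketch runs on its own.
URI_RE='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?((/|$)([^?#]*))(\?([^#]*))?(#(.*))?$'

# show_host_path (hypothetical helper) prints group 7 (host) and
# group 10 (path) for one URL, or a notice when the regex does not match.
show_host_path() {
    if [[ $1 =~ $URI_RE ]]; then
        printf 'host=%s path=%s\n' "${BASH_REMATCH[7]}" "${BASH_REMATCH[10]}"
    else
        printf 'no match: %s\n' "$1"
    fi
}

show_host_path 'http://example.com'
show_host_path 'http://example.com/'
```

Both calls should report the same host; only the captured path differs (empty versus a single slash).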

Answer 11 (score: 0)

I did some further parsing, extending the solution given by @Shirkrin:

#!/bin/bash

parse_url() {
    local query1 query2 path1 path2

    # extract the protocol
    proto="$(echo $1 | grep :// | sed -e's,^\(.*://\).*,\1,g')"

    if [[ ! -z $proto ]] ; then
            # remove the protocol
            url="$(echo ${1/$proto/})"

            # extract the user (if any)
            login="$(echo $url | grep @ | cut -d@ -f1)"

            # extract the host
            host="$(echo ${url/$login@/} | cut -d/ -f1)"

            # by request - try to extract the port
            port="$(echo $host | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"

            # extract the uri (if any)
            resource="/$(echo $url | grep / | cut -d/ -f2-)"
    else
            url=""
            login=""
            host=""
            port=""
            resource=$1
    fi

    # extract the path (if any)
    path1="$(echo $resource | grep ? | cut -d? -f1 )"
    path2="$(echo $resource | grep \# | cut -d# -f1 )"
    path=$path1
    if [[ -z $path ]] ; then path=$path2 ; fi
    if [[ -z $path ]] ; then path=$resource ; fi

    # extract the query (if any)
    query1="$(echo $resource | grep ? | cut -d? -f2-)"
    query2="$(echo $query1 | grep \# | cut -d\# -f1 )"
    query=$query2
    if [[ -z $query ]] ; then query=$query1 ; fi

    # extract the fragment (if any)
    fragment="$(echo $resource | grep \# | cut -d\# -f2 )"

    echo "url: $url"
    echo "   proto: $proto"
    echo "   login: $login"
    echo "    host: $host"
    echo "    port: $port"
    echo "resource: $resource"
    echo "    path: $path"
    echo "   query: $query"
    echo "fragment: $fragment"
    echo ""
}

parse_url "http://login:password@example.com:8080/one/more/dir/file.exe?a=sth&b=sth#anchor_fragment"
parse_url "https://example.com/one/more/dir/file.exe#anchor_fragment"
parse_url "http://login:password@example.com:8080/one/more/dir/file.exe#anchor_fragment"
parse_url "ftp://user@example.com:8080/one/more/dir/file.exe?a=sth&b=sth"
parse_url "/one/more/dir/file.exe"
parse_url "file.exe"
parse_url "file.exe#anchor"

Answer 12 (score: 0)

If you have access to Node.js:

export MY_URI=sftp://user@host.net/some/random/path
node -e "console.log(url.parse(process.env.MY_URI).auth)"
node -e "console.log(url.parse(process.env.MY_URI).host)"
node -e "console.log(url.parse(process.env.MY_URI).path)"

This will output:

user
host.net
/some/random/path

Answer 13 (score: 0)

A simple way to get only the domain from a full URL:

echo https://stackoverflow.com/questions/6174220/parse-url-in-shell-script | cut -d/ -f1-3

# OUTPUT>>> https://stackoverflow.com

To get only the path:

echo https://stackoverflow.com/questions/6174220/parse-url-in-shell-script | cut -d/ -f4-

# OUTPUT>>> questions/6174220/parse-url-in-shell-script

Not perfect, because the second command strips the leading slash, so you need to add it back by hand.

Here is an awk-based version for getting only the path:

echo https://stackoverflow.com/questions/6174220/parse-url-in-shell-script/59971653 | awk -F"/" '{ for (i=4; i<=NF; i++) printf"/%s", $i }'

# OUTPUT>>> /questions/6174220/parse-url-in-shell-script/59971653
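
If spawning cut or awk is not desired, shell parameter expansion can produce the same two pieces while keeping the leading slash. This is a minimal sketch, assuming the URL always starts with a scheme followed by "://"; the variable names are illustrative only.

```shell
#!/bin/bash
url='https://stackoverflow.com/questions/6174220/parse-url-in-shell-script'

scheme=${url%%://*}               # text before "://"
rest=${url#*://}                  # text after "://"
origin="$scheme://${rest%%/*}"    # scheme plus host (and port, if any)

# Everything from the first slash of "rest" onward; "/" when there is no path.
case $rest in
    */*) path=/${rest#*/} ;;
    *)   path=/ ;;
esac

echo "$origin"
echo "$path"
```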

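Answer 14 (score: 0)

Here is a pure bash URL parser. It supports git ssh clone style URLs as well as standard proto:// ones. The example ignores the protocol, auth, and port, but you can modify it to collect them as needed. I used regex101 for convenient testing: https://regex101.com/r/5QyNI5/1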

TEST_URLS=(
  https://github.com/briceburg/tools.git
  https://foo:12333@github.com:8080/briceburg/tools.git
  git@github.com:briceburg/tools.git
  https://me@gmail.com:12345@my.site.com:443/p/a/t/h
)

for url in "${TEST_URLS[@]}"; do
  without_proto="${url#*:\/\/}"
  without_auth="${without_proto##*@}"
  [[ $without_auth =~ ^([^:\/]+)(:[[:digit:]]+\/|:|\/)?(.*) ]]
  PROJECT_HOST="${BASH_REMATCH[1]}"
  PROJECT_PATH="${BASH_REMATCH[3]}"

  echo "given: $url"
  echo "  -> host: $PROJECT_HOST path: $PROJECT_PATH"
done

Result:

given: https://github.com/briceburg/tools.git
  -> host: github.com path: briceburg/tools.git
given: https://foo:12333@github.com:8080/briceburg/tools.git
  -> host: github.com path: briceburg/tools.git
given: git@github.com:briceburg/tools.git
  -> host: github.com path: briceburg/tools.git
given: https://me@gmail.com:12345@my.site.com:443/p/a/t/h
  -> host: my.site.com path: p/a/t/h

Answer 15 (score: 0)

I found Adam Ryczkowski's answer helpful. The original solution did not handle the /path in the URL, so I enhanced it a little.

pattern='^(([[:alnum:]]+):\/\/)?(([[:alnum:]]+)@)?([^:^@\/]+)(:([[:digit:]]+))?(\/?[^:^@]*)$'
url="http://us@cos.com:3142/path"
if [[ "$url" =~ $pattern ]]; then
    proto=${BASH_REMATCH[2]}
    user=${BASH_REMATCH[4]}
    host=${BASH_REMATCH[5]}
    port=${BASH_REMATCH[7]}
    path=${BASH_REMATCH[8]}
    echo "proto: $proto"
    echo "user: $user"
    echo "host: $host"
    echo "port: $port"
    echo "path: $path"
else
    echo "URL did not match pattern: $url"
fi

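The pattern is complex, so please use this site to understand it better: https://regex101.com/

I tested it with a bunch of URLs. However, if there are any issues, please let me know.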