Question

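In my bash script I need to extract just the path from a given URL. For example, from a variable containing the string:

http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth

I want to extract into some other variable only the:

/one/more/dir/file.exe

part. Of course the login, password, filename and parameters are optional.

Since I am new to sed and awk, I am asking you for help. Please advise me how to do it. Thank you!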

Answer 1

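bash has built-in features to handle this, for example the string pattern-matching operators:

'#' removes the shortest matching prefix
'##' removes the longest matching prefix
'%' removes the shortest matching suffix
'%%' removes the longest matching suffix

For example: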

FILE=/home/user/src/prog.c
echo ${FILE#/*/}  # ==> user/src/prog.c
echo ${FILE##/*/} # ==> prog.c
echo ${FILE%/*}   # ==> /home/user/src
echo ${FILE%%/*}  # ==> nil
echo ${FILE%.c}   # ==> /home/user/src/prog

All of this comes from the excellent book "A Practical Guide to Linux Commands, Editors, and Shell Programming" by Mark G. Sobell (http://www.sobell.com/)
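Applied to the URL from the question, the same operators can be chained. This is only a sketch: the variable names are made up for illustration, and it assumes the URL contains "://" and at least one "/" after the host.

```bash
url='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

no_query=${url%%\?*}   # drop the longest suffix that starts at the first '?'
rest=${no_query#*://}  # drop the shortest prefix that ends in '://'
path=/${rest#*/}       # drop everything up to the first '/': login, password and host
file=${path##*/}       # longest prefix ending in '/' removed: the file name
dir=${path%/*}         # shortest suffix starting with '/' removed: the directory

printf '%s\n' "$path" "$dir" "$file"
```

The query string is removed first so that a "/" or "://" inside a parameter value cannot confuse the later steps.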

Answer 2

In bash:

URL='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
URL_NOPRO=${URL:7}
URL_REL=${URL_NOPRO#*/}
echo "/${URL_REL%%\?*}"

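This only works if the URL starts with http:// or with a protocol of the same length. Otherwise, use sed, grep or cut ...

It is probably easier to use a regular expression.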
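As a sketch of the regular-expression route, bash's own [[ =~ ]] operator can do the match without an external tool. The names re and path are illustrative, and the pattern is a simplification rather than a full URL grammar.

```bash
url='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

# scheme, then "://", then everything up to the next "/", then the path
# (captured) up to the first "?" or "#".
re='^[^:/?#]+://[^/?#]*(/[^?#]*)?'

if [[ $url =~ $re ]]; then
  path=${BASH_REMATCH[1]:-/}   # fall back to "/" when the URL has no path
  echo "$path"
fi
```

Because the scheme is matched by a pattern and not by a fixed offset, this does not depend on the protocol being exactly seven characters long.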

Answer 3

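This uses bash and cut as another way of doing this. It's ugly, but it works (at least for this example). Sometimes I like to use what I call cut sieves to whittle down the information to what I am actually looking for.

Note: performance-wise, this may be a problem.

Given those caveats:

First let's echo the line: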

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

That gives us:

http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth

Then let's cut the line at the @ as a convenient way to strip off the http://login:password part:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2

That gives us this:

example.com/one/more/dir/file.exe?a=sth&b=sth

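To get rid of the hostname, let's do another cut with / as the delimiter, asking cut to give us the second field and everything after it (essentially, to the end of the line). It looks like this: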

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2 | \
cut -d/ -f2-

Which, in turn, results in:

one/more/dir/file.exe?a=sth&b=sth

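Finally, we want to strip all the parameters from the end. Again we use cut, this time with ? as the delimiter, and tell it to give us only the first field. That brings us to the end and looks like this: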

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2 | \
cut -d/ -f2- | \
cut -d? -f1

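The output is:

one/more/dir/file.exe

This is just another way to do it; the approach lets you interactively whittle away the data you don't need until you are left with what you do need.

If I wanted to stuff this into a variable in a script, I would do something like this: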

#!/bin/bash

url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"
file_path=$(echo "${url}" | cut -d@ -f2 | cut -d/ -f2- | cut -d'?' -f1)
echo "${file_path}"

Hope it helps.
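The question says the login and password are optional. When there is no @ in the line, the first cut passes it through unchanged, so the later fields shift. A possible variation, offered only as a sketch, is to skip the @ step and take everything from the fourth /-separated field onward. It assumes the login and password contain no literal "/".

```bash
url='http://example.com/one/more/dir/file.exe?a=sth&b=sth'   # no credentials

# Fields split on "/": 1="http:", 2="", 3=host (with optional login), 4...=path
file_path=$(printf '%s\n' "$url" | cut -d/ -f4- | cut -d'?' -f1)
echo "${file_path}"
```

As in the pipeline above, the result has no leading slash.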

Answer 4

gawk

echo "http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth" | awk -F"/" '
{
 $1=$2=$3=""
 gsub(/\?.*/,"",$NF)
 print substr($0,3)
}' OFS="/"

Output

# ./test.sh
/one/more/dir/file.exe

Answer 5

If you have gawk:

$ echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
  gawk '$0=gensub(/http:\/\/[^/]+(\/[^?]+)\?.*/,"\\1",1)'

or

$ echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
  gawk -F'(http://[^/]+|?)' '$0=$2'

GNU awk can use a regular expression as the field separator (FS).
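gensub is a gawk extension. Where only a standard awk is available, a similar result can be sketched with two sub calls. The variable names here are illustrative, and the patterns are a simplification of URL syntax.

```bash
url='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

path=$(printf '%s\n' "$url" | awk '{
  sub(/^[^:]+:\/\/[^\/]*/, "")  # remove scheme, credentials and host
  sub(/[?#].*/, "")             # remove query string and fragment
  print
}')
echo "$path"
```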

Answer 6

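The Perl snippet is appealing, and since Perl is present in most Linux distributions it is quite useful, but... it does not do the job completely. Specifically, there is a problem converting the URL/URI format from UTF-8 into path Unicode. Let me give an example of the problem. The original URI might be: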

file:///home/username/Music/Jean-Michel%20Jarre/M%C3%A9tamorphoses/01%20-%20Je%20me%20souviens.mp3

The corresponding path is:

/home/username/Music/Jean-Michel Jarre/Métamorphoses/01 - Je me souviens.mp3

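%20 became a space, and %C3%A9 became 'é'. Is there a Linux command, bash feature or Perl script that can handle this conversion, or do I have to write a huge series of sed substring substitutions? What about the reverse conversion, from path to URL/URI?

(Follow-up)

Looking at http://search.cpan.org/~gaas/URI-1.54/URI.pm, I first noticed the as_iri method, but my Linux apparently lacks it (or it somehow does not apply). It turns out the solution is to replace the "->path" part with "->file". You can then break the result down further with basename, dirname and so on. The solution is: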

path=$( echo "$url" | perl -MURI -le 'chomp($url = <>); print URI->new($url)->file' )

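Oddly, using "->dir" instead of "->file" does not extract the directory part: rather, it formats the URI so that it can be used as an argument to mkdir and the like.

(Further follow-up)

Is there any reason why this line could not be shortened to this?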

path=$( echo "$url" | perl -MURI -le 'print URI->new(<>)->file' )
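For the percent-decoding step alone (%20 to a space, %C3%A9 to 'é'), a bash-only sketch is possible. The function name urldecode is made up for illustration. It relies on the bash printf builtin expanding \xHH escapes under %b, and it would also interpret any literal backslash sequences already present in the input, so treat it as a starting point rather than a robust decoder.

```bash
urldecode() {
  # Turn every %XX into \xXX, then let printf's %b expand the escapes.
  local encoded=$1
  printf '%b\n' "${encoded//%/\\x}"
}

urldecode '/home/username/Music/Jean-Michel%20Jarre/M%C3%A9tamorphoses/01%20-%20Je%20me%20souviens.mp3'
```

The decoded bytes are emitted as-is, so the 'é' displays correctly when the terminal uses UTF-8.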

Answer 7

Your best bet is to find a language that has a URL parsing library:

url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"
path=$( echo "$url" | ruby -ruri -e 'puts URI.parse(gets.chomp).path' )

or

path=$( echo "$url" | perl -MURI -le 'chomp($url = <>); print URI->new($url)->path' )

Answer 8

How about this?

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
sed 's|.*://[^/]*/\([^?]*\)?.*|/\1|g'

.*://[^/]*/   : http://login:password@example.com/
\([^?]*\)     : one/more/dir/file.exe
?.*           : ?a=sth&b=sth
/\1           : /one/more/dir/file.exe
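The expression above only substitutes when a "?" is present. Since the question says the parameters are optional, here is a hedged variant using extended regular expressions (sed -E) that does not require a query string. The function name url_path_sed is hypothetical, and the sketch still assumes the URL has a path after the host.

```bash
url_path_sed() {
  # Capture from the first "/" after the host up to the first "?" or "#".
  printf '%s\n' "$1" | sed -E 's|^[^:]+://[^/]*(/[^?#]*).*|\1|'
}

url_path_sed 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
url_path_sed 'http://example.com/one/more/dir/file.exe'
```

Both calls print /one/more/dir/file.exe.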

Answer 9

I agree that "cut" is a great tool on the command line. However, a more purely bash solution is to use the power of variable expansion in bash. For example:

pass_first_last='password,firstname,lastname'

pass=${pass_first_last%%,*}

first_last=${pass_first_last#*,}

first=${first_last%,*}

last=${first_last#*,}

or, alternatively,

last=${pass_first_last##*,}
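The same expansions can be pointed at the URL from the question to pull out the optional login and password. This is a sketch with illustrative variable names; it assumes the login itself contains no ":" and the host contains no "@".

```bash
url='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

authority=${url#*://}        # login:password@example.com/one/more/...
authority=${authority%%/*}   # login:password@example.com
login='' password=''
if [[ $authority == *@* ]]; then
  credentials=${authority%@*}    # login:password
  login=${credentials%%:*}       # text before the first ':'
  [[ $credentials == *:* ]] && password=${credentials#*:}
fi
host=${authority##*@}        # example.com

printf 'login=%s password=%s host=%s\n' "$login" "$password" "$host"
```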

Answer 10

I wrote a function that extracts any part of a URL. I have only tested it with bash. Usage:

url_parse <url> [url-part]

Example:

$ url_parse "http://example.com:8080/home/index.html" path
home/index.html

Code:

url_parse() {
  local -r url=$1 url_part=$2
  #define url tokens and url regular expression
  local -r protocol='^[^:]+' user='[^:@]+' password='[^@]+' host='[^:/?#]+' \
    port='[0-9]+' path='\/([^?#]*)' query='\?([^#]+)' fragment='#(.*)'
  local -r auth="($user)(:($password))?@"
  local -r connection="($auth)?($host)(:($port))?"
  local -r url_regex="($protocol):\/\/($connection)?($path)?($query)?($fragment)?$"
  #parse url and create an array
  IFS=',' read -r -a url_arr <<< $(echo $url | awk -v OFS=, \
    "{match(\$0,/$url_regex/,a);print a[1],a[4],a[6],a[7],a[9],a[11],a[13],a[15]}")

  [[ ${url_arr[0]} ]] || { echo "Invalid URL: $url" >&2 ; return 1 ; }

  case $url_part in
    protocol) echo ${url_arr[0]} ;;
    auth)     echo ${url_arr[1]}:${url_arr[2]} ;; # ex: john.doe:1234
    user)     echo ${url_arr[1]} ;;
    password) echo ${url_arr[2]} ;;
    host-port)echo ${url_arr[3]}:${url_arr[4]} ;; #ex: example.com:8080
    host)     echo ${url_arr[3]} ;;
    port)     echo ${url_arr[4]} ;;
    path)     echo ${url_arr[5]} ;;
    query)    echo ${url_arr[6]} ;;
    fragment) echo ${url_arr[7]} ;;
    info)     echo -e "protocol:${url_arr[0]}\nuser:${url_arr[1]}\npassword:${url_arr[2]}\nhost:${url_arr[3]}\nport:${url_arr[4]}\npath:${url_arr[5]}\nquery:${url_arr[6]}\nfragment:${url_arr[7]}";;
    "")       ;; # used to validate url
    *)        echo "Invalid URL part: $url_part" >&2 ; return 1 ;;
  esac
}

Answer 11

Using only bash builtins:

path="/${url#*://*/}" && [[ "/${url}" == "${path}" ]] && path="/"

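What this does:

Removes the prefix *://*/ (so this would be your protocol and hostname+port)
Checks whether we actually succeeded in removing anything; if not, there was no third slash (assuming this is a well-formed URL)
If there was no third slash, then the path is just /

Note: the quotes are not actually needed here, but I find that in ...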
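The one-liner above keeps any query string. A sketch that wraps it in a function and strips the query and fragment first could look like this; the function name url_path is hypothetical.

```bash
url_path() {
  local url=$1 path
  url=${url%%[?#]*}        # drop query string and fragment first
  path="/${url#*://*/}"    # remove protocol and hostname+port
  # If nothing was removed there was no third slash, so the path is "/".
  [[ "/${url}" == "${path}" ]] && path="/"
  printf '%s\n' "$path"
}

url_path 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
url_path 'http://example.com'
```

The first call prints /one/more/dir/file.exe and the second prints /.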

Answer 12

url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"

GNU `grep`

$ grep -Po '\w\K/\w+[^?]+' <<<$url
/one/more/dir/file.exe

BSD `grep`

$ grep -o '\w/\w\+[^?]\+' <<<$url | tail -c+2
/one/more/dir/file.exe

ripgrep

$ rg -o '\w(/\w+[^?]+)' -r '$1' <<<$url
/one/more/dir/file.exe

To get other parts of the URL, check: Getting parts of a URL (Regex).

Answer 13

This perl one-liner works for me on the command line, so it could be added to your script.

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | perl -n -e 'm{http://[^/]+(/[^?]+)};print $1'

Note that this assumes there will always be a '?' character at the end of the string you want to extract.
