How do I remove URLs from a string in Ruby?

Asked: 2017-09-09 19:04:22

Tags: ruby regex url replace

I'm using Ruby 2.4. I want to remove URLs from my string, so I tried this:

require 'uri'

puts "str before: #{my_str}"
# Replace every match of the stock URI pattern with an empty string
my_str.gsub!(/#{URI::regexp}/, '')
puts "str after url sub: #{my_str}"

But only the "http:" was stripped. Here is the output of the lines above:

str before: Top (http://www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROUP%20RESULTS.HTM" \l "Top)
str after url sub: Top (//www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROUP%20RESULTS.HTM" \l "Top)

What is the right way to remove URLs from a string?

Edit: Here is what I get from 'puts "#{URI::regexp}"':

(?x-mi:
        ([a-zA-Z][\-+.a-zA-Z\d]*):                           (?# 1: scheme)
        (?:
           ((?:[\-_.!~*'()a-zA-Z\d;?:@&=+$,]|%[a-fA-F\d]{2})(?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*)                    (?# 2: opaque)
        |
           (?:(?:
             \/\/(?:
                 (?:(?:((?:[\-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*)@)?        (?# 3: userinfo)
                   (?:((?:(?:[a-zA-Z0-9\-.]|%\h\h)+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\]))(?::(\d*))?))? (?# 4: host, 5: port)
               |
                 ((?:[\-_.!~*'()a-zA-Z\d$,;:@&=+]|%[a-fA-F\d]{2})+)                 (?# 6: registry)
               )
             |
             (?!\/\/))                           (?# XXX: '\/\/' is the mark for hostport)
             (\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)?                    (?# 7: path)
           )(?:\?((?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))?                 (?# 8: query)
        )
        (?:\#((?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))?                  (?# 9: fragment)
      )

1 Answer:

Answer 0: (score: 0)

It seems to work fine with a regular string:

my_str = "Top (http://www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROU;5DP%20RESULTS.HTM\" \\l \"Top)"
puts "str before: #{my_str}"          # => str before: Top (http://www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROU;5DP%20RESULTS.HTM" \l "Top)    
my_str.gsub!(/#{URI::regexp}/, '')
puts "str after url sub: #{my_str}"   # => str after url sub: Top (" \l "Top)

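However, your string may contain some junk, non-printable characters. For example, a stray null character just before the first slash reproduces your output: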

#                   vv - random null character
my_str = "Top (http:\0//www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROU;5DP%20RESULTS.HTM\" \\l \"Top)"
#                                                looks the same vv
puts "str before: #{my_str}"          # => str before: Top (http://www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROU;5DP%20RESULTS.HTM" \l "Top)
my_str.gsub!(/#{URI::regexp}/, '')
puts "str after url sub: #{my_str}"   # => str after url sub: Top (//www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROU;5DP%20RESULTS.HTM" \l "Top)
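
One way to check whether a string really contains such hidden characters is to look at its escaped form rather than its printed form. The following is only a minimal sketch: the sample string and the example.org URL are made up for illustration, and `hidden` is a hypothetical variable name.

```ruby
# Hypothetical sample string with a null character hidden right after "http:"
my_str = "Top (http:\0//www.example.org/RESULTS.HTM)"

# puts writes the characters as they are, so the null character is usually
# invisible in a terminal; String#inspect shows it in escaped form instead
puts my_str.inspect

# Collect every non-printable character together with its character index
hidden = my_str.each_char.with_index.reject { |ch, _| ch =~ /[[:print:]]/ }
hidden.each { |ch, i| puts "index #{i}: #{ch.inspect}" }
```

If `hidden` comes back empty, the problem is probably somewhere else and stripping non-printable characters will not help.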

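Now, if you copy the printed output of that string from a web page and paste it back in, the null character is lost along the way, so the substitution works again: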

# I copied this from the output from the line below `looks the same vv`
my_str = 'Top (http://www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROU;5DP%20RESULTS.HTM" \l "Top)'
puts "str before: #{my_str}"          # => str before: Top (http://www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROU;5DP%20RESULTS.HTM" \l "Top)
my_str.gsub!(/#{URI::regexp}/, '')
puts "str after url sub: #{my_str}"   # => str after url sub: Top (" \l "Top)

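So in the end the pasted string works for us, which would explain why the problem is hard to reproduce. You could try removing all non-printable characters first and see whether that works for you: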

my_str = "Top (http:\0//www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROU;5DP%20RESULTS.HTM\" \\l \"Top)"
my_str.gsub!(/[^[:print:]]/, '')   # drop non-printable characters first (the /i flag is not needed here)
my_str.gsub!(/#{URI::regexp}/, '')
puts "str after url sub: #{my_str}"   # => str after url sub: Top (" \l "Top)
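
The two steps above could be wrapped in a small helper. This is only a sketch under a few assumptions: `strip_urls` is a hypothetical name, the example.org URL is made up, and the pattern is limited to http and https by passing a scheme list to `URI::RFC2396_Parser#make_regexp` (the parser class behind `URI::regexp` in Ruby 2.4). Note that `[^[:print:]]` also matches newlines and tabs, so this helper is only suitable for single-line strings as written.

```ruby
require 'uri'

# Hypothetical helper: remove non-printable characters, then remove URLs.
# schemes restricts which URL schemes are matched (assumed default: http, https).
def strip_urls(text, schemes = %w[http https])
  url_pattern = URI::RFC2396_Parser.new.make_regexp(schemes)
  # gsub (without !) returns a new string and leaves the argument untouched
  text.gsub(/[^[:print:]]/, '').gsub(url_pattern, '')
end

# Made-up input with a null character hidden after "http:"
cleaned = strip_urls("Top (http:\0//www.example.org/a%20b.HTM\" \\l \"Top)")
puts cleaned
```

Restricting the schemes is a design choice rather than a requirement: without a scheme list the stock pattern matches any `word:` prefix, so ordinary text such as `note: see below` would lose its `note:` part.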