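Truncating a string in Julia

Date: 2016-09-15 01:23:53

Tags: string julia

Is there a convenience function for truncating a string to a given length?

It would be equivalent to something like this: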

test_str = "test"
if length(test_str) > 8
   out_str = test_str[1:8]
else
   out_str = test_str
end

6 answers:

Answer 0: (score: 7)

In the naive ASCII world:

truncate_ascii(s,n) = s[1:min(sizeof(s),n)]

will do. If it is preferable to share memory with the original string and avoid copying, SubString can be used:

truncate_ascii(s,n) = SubString(s,1,min(sizeof(s),n))

But in a Unicode world (and it is a Unicode world), this is better:

truncate_utf8(s,n) = SubString(s,1, (eo=endof(s) ; neo=0 ; 
  for i=1:n 
    if neo<eo neo=nextind(s,neo) ; else break ; end ;
  end ; neo) )

Finally, @IsmaelVenegasCastelló reminded us of grapheme complexity (arrrgh), in which case this is what is needed:

function truncate_grapheme(s,n)
    eo = endof(s) ; tt = 0 ; neo=0
    for i=1:n
        if (neo<eo)
            tt = nextind(s,neo)
            while neo>0 && tt<eo && !Base.UTF8proc.isgraphemebreak(s[neo],s[tt])
                (neo,tt) = (tt,nextind(s,tt))
            end
            neo = tt
        else
            break
        end
    end
    return SubString(s,1,neo)
end

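The last two implementations try to avoid computing length (which can be slow because it has to scan the string), avoid allocating or copying, and avoid looping over the whole string when n is shorter than it.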

This answer draws on contributions from @MichaelOhlrogge, @FengyangWang, @Oxinabox and @IsmaelVenegasCastelló.
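
For readers on Julia 1.x, where endof has been renamed to lastindex, the code-point version above can be approximated more compactly. The following is a minimal sketch, not the answer author's code: truncate_chars is a hypothetical name, and it relies on the three-argument nextind(s, i, n) to locate the n-th character.

```julia
# Sketch for Julia 1.x; `truncate_chars` is a hypothetical name, not from the answer.
function truncate_chars(s::AbstractString, n::Integer)
    n <= 0 && return SubString(s, 1, 0)          # empty view
    # nextind(s, 0, n) is the index of the n-th character; when s has fewer
    # than n characters it runs past the end, so clamp to the last valid index.
    stop = min(nextind(s, 0, n), lastindex(s))
    return SubString(s, 1, stop)                 # shares memory with s, no copy
end

truncate_chars("αβγδ", 3)   # "αβγ"
truncate_chars("test", 8)   # "test"
```

Like the SubString variants above, this returns a view into the original string; wrap the result in String(...) if an independent copy is wanted.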

Answer 1: (score: 4)

I would do strtruncate(str, n) = join(take(str, n))

Example:

julia> strtruncate("αβγδ", 3)
"αβγ"

julia> strtruncate("αβγδ", 5)
"αβγδ"

Note that your code is not entirely valid for Unicode strings.
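
To make the Unicode caveat concrete, here is a small sketch assuming Julia 1.x, where take is no longer exported from Base and is reached as Iterators.take. It shows byte-range indexing failing on a multi-byte string while the iterator-based one-liner works; Base's first(s, n) is included for comparison.

```julia
# Julia 1.x sketch; `strtruncate` mirrors the one-liner above with Iterators.take.
strtruncate(str, n) = join(Iterators.take(str, n))

s = "αβγδ"                      # each letter occupies 2 bytes in UTF-8

# Byte index 2 falls inside "α", so range indexing with it throws.
byte_slice_fails = try
    s[1:2]
    false
catch err
    err isa StringIndexError
end

strtruncate(s, 3)               # "αβγ"
first(s, 3)                     # "αβγ", built into Base on Julia 1.x
```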

Answer 2: (score: 3)

If the string is ASCII, this is very efficient:

String(resize!(str.data, n))

Or in place:

resize!(str.data, n)

For Unicode, @Fengyang Wang's method is very fast, but converting to a Char array can be slightly faster if you only truncate the last part of the string:

trunc1(str::String, n) = String(collect(take(str, n)))
trunc2(str::String, n) = String(Vector{Char}(str)[1:n])
trunc3(str::String, n) = String(resize!(Vector{Char}(str), n))
trunc4(str::String, n::Int)::String = join(collect(graphemes(str))[1:n])

function trunc5(str::String, n)
    if isascii(str)
        return String(resize!(str.data, n))
    else
        trunc1(str, n)
    end
end

Timings:

julia> time_trunc(100, 100000, 25)
  0.112851 seconds (700.00 k allocations: 42.725 MB, 7.75% gc time)
  0.165806 seconds (700.00 k allocations: 91.553 MB, 11.84% gc time)
  0.160116 seconds (600.00 k allocations: 73.242 MB, 11.58% gc time)
  1.167706 seconds (31.60 M allocations: 1.049 GB, 11.12% gc time)
  0.017833 seconds (100.00 k allocations: 1.526 MB)
true
julia> time_trunc(100, 100000, 98)
  0.367191 seconds (700.00 k allocations: 83.923 MB, 5.23% gc time)
  0.318507 seconds (700.00 k allocations: 132.751 MB, 9.08% gc time)
  0.301685 seconds (600.00 k allocations: 80.872 MB, 6.19% gc time)
  1.561337 seconds (31.80 M allocations: 1.122 GB, 9.86% gc time)
  0.061827 seconds (100.00 k allocations: 1.526 MB)
true

Edit: Oops... I just realized that I was actually destroying the original string in trunc5. This should be correct, but it performs worse:

function trunc5(str::String, n)
    if isascii(str)
        return String(str.data[1:n])
    else
        trunc1(str, n)
    end
end

New timings:

julia> time_trunc(100, 100000, 25)
  0.123629 seconds (700.00 k allocations: 42.725 MB, 7.70% gc time)
  0.162332 seconds (700.00 k allocations: 91.553 MB, 11.41% gc time)
  0.152473 seconds (600.00 k allocations: 73.242 MB, 9.19% gc time)
  1.152640 seconds (31.60 M allocations: 1.049 GB, 11.54% gc time)
  0.066662 seconds (200.00 k allocations: 12.207 MB)
true

julia> time_trunc(100, 100000, 98)
  0.369576 seconds (700.00 k allocations: 83.923 MB, 5.10% gc time)
  0.312237 seconds (700.00 k allocations: 132.751 MB, 9.42% gc time)
  0.297736 seconds (600.00 k allocations: 80.872 MB, 5.95% gc time)
  1.545329 seconds (31.80 M allocations: 1.122 GB, 10.02% gc time)
  0.080399 seconds (200.00 k allocations: 19.836 MB, 5.07% gc time)
true

Aaand new edit: Aargh, I forgot the timing function. I fed it an ASCII string:

function time_trunc(m, n, m_)
    str = randstring(m)
    @time for _ in  1:n trunc1(str, m_) end
    @time for _ in  1:n trunc2(str, m_) end
    @time for _ in  1:n trunc3(str, m_) end
    @time for _ in  1:n trunc4(str, m_) end
    @time for _ in  1:n trunc5(str, m_) end
    trunc1(str, m_) == trunc2(str, m_) == trunc3(str, m_) == trunc4(str, m_) == trunc5(str, m_)
end

Final edit (I hope): trying out @Dan Getz's truncate_grapheme and using a Unicode string:

function time_trunc(m, n, m_)
    # str = randstring(m)
    str = join(["αβγπϕ1t_Ω₃!" for i in 1:100])
    @time for _ in  1:n trunc1(str, m_) end
    @time for _ in  1:n trunc2(str, m_) end
    @time for _ in  1:n trunc3(str, m_) end
    # @time for _ in  1:n trunc4(str, m_) end  # too slow
    @time for _ in  1:n trunc5(str, m_) end
    @time for _ in  1:n truncate_grapheme(str, m_) end
    trunc1(str, m_) == trunc2(str, m_) == trunc3(str, m_) == trunc5(str, m_) == truncate_grapheme(str, m_)
end

Timings:

julia> time_trunc(100, 100000, 98)
  0.690399 seconds (800.00 k allocations: 103.760 MB, 3.69% gc time)
  1.828437 seconds (800.00 k allocations: 534.058 MB, 3.66% gc time)
  1.795005 seconds (700.00 k allocations: 482.178 MB, 3.19% gc time)
  0.667831 seconds (800.00 k allocations: 103.760 MB, 3.17% gc time)
  0.347953 seconds (100.00 k allocations: 3.052 MB)
true

julia> time_trunc(100, 100000, 25)
  0.282922 seconds (800.00 k allocations: 48.828 MB, 4.01% gc time)
  1.576374 seconds (800.00 k allocations: 479.126 MB, 3.98% gc time)
  1.643700 seconds (700.00 k allocations: 460.815 MB, 3.70% gc time)
  0.276586 seconds (800.00 k allocations: 48.828 MB, 4.59% gc time)
  0.091773 seconds (100.00 k allocations: 3.052 MB)
true

So the last one clearly looks like the best (and this post is now too long).
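
The str.data field used in trunc5 reflects the Julia version in use at the time; on Julia 1.x a String no longer exposes such a field, as far as I know. A rough equivalent of the ASCII fast path, under that assumption, might look like the sketch below. The name trunc_ascii_fast is hypothetical, and no timing claims are made for it.

```julia
# Julia 1.x sketch; `trunc_ascii_fast` is a hypothetical name.
function trunc_ascii_fast(str::String, n::Integer)
    n = max(n, 0)
    if isascii(str)
        # ASCII: one byte per character, so byte indices are character indices.
        return str[1:min(n, ncodeunits(str))]
    else
        # General case: take the first n characters (code points).
        return first(str, n)
    end
end

trunc_ascii_fast("hello world", 5)   # "hello"
trunc_ascii_fast("αβγδ", 3)          # "αβγ"
```

Unlike the first trunc5, this never mutates its argument, and it also clamps n to the string length.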

Answer 3: (score: 2)

You can use:

"test"[1:min(end,8)]

Or:

 SubString("test", 1, 8)
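
One caveat, as far as I can tell: on Julia 1.x SubString checks its bounds, so an end index beyond the string needs to be clamped first. A minimal sketch for ASCII input follows; clamped_view is a hypothetical helper name.

```julia
# Julia 1.x sketch; `clamped_view` is a hypothetical helper.
# Valid for ASCII strings, where every byte index starts a character.
clamped_view(s::AbstractString, n::Integer) = SubString(s, 1, min(lastindex(s), n))

clamped_view("test", 8)            # "test"
clamped_view("testing 1 2 3", 8)   # "testing "
```

For non-ASCII input the clamped index may land in the middle of a character, so a character-aware approach is the safer choice there.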

Answer 4: (score: 2)

You can use the graphemes function:

C:\Users\Ismael
λ julia5
               _
   _       _ _(_)_     |  By greedy hackers for greedy hackers.
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _' |  |
  | | |_| | | | (_| |  |  Version 0.5.0-rc3+0 (2016-08-22 23:43 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-w64-mingw32

help?> graphemes
search: graphemes

  graphemes(s) -> iterator over substrings of s

  Returns an iterator over substrings of s that correspond to the extended
  graphemes in the string, as defined by Unicode UAX #29.
  (Roughly, these are what users would perceive as single characters, even
  though they may contain more than one codepoint; for example a letter 
  combined with an accent mark is a single grapheme.)

Example:

julia> s = "αβγπϕ1t_Ω₃!"; n = 8;

julia> length(s)
11

julia> graphemes(s)
length-11 GraphemeIterator{String} for "αβγπϕ1t_Ω₃!"

julia> collect(ans)[1:n]
8-element Array{SubString{String},1}:
 "α"
 "β"
 "γ"
 "π"
 "ϕ"
 "1"
 "t"
 "_"

julia> join(ans)
"αβγπϕ1t_"

Take a look at the truncate function:

julia> methods(truncate)
# 2 methods for generic function "truncate":
truncate(s::IOStream, n::Integer) at iostream.jl:43
truncate(io::Base.AbstractIOBuffer, n::Integer) at iobuffer.jl:140

help?> truncate
search: truncate

  truncate(file,n)

  Resize the file or buffer given by the first argument to exactly n bytes,
  filling previously unallocated space with '\0' if the file or buffer is 
  grown.

So the solution looks like this:

julia> @doc """
           truncate(s::String, n::Int)::String

       truncate a `String`; `s` up to `n` graphemes.

       # Example

       ```julia
       julia> truncate("αβγπϕ1t_Ω₃!", 8)
       "αβγπϕ1t_"

       julia> truncate("test", 8)
       "test"
       ```
       """ ->
       function Base.truncate(s::String, n::Int)::String
           if length(s) > n
               join(collect(graphemes(s))[1:n])
           else
               s
           end
       end
Base.truncate

Testing it:

julia> methods(truncate)
# 3 methods for generic function "truncate":
truncate(s::String, n::Int64)
truncate(s::IOStream, n::Integer) at iostream.jl:43
truncate(io::Base.AbstractIOBuffer, n::Integer) at iobuffer.jl:140

help?> truncate
  truncate(file,n)

  Resize the file or buffer given by the first argument to exactly n bytes,
  filling previously unallocated space with '\0' if the file or buffer is 
  grown.

  truncate(s::String, n::Int)::String

  truncate a String; s up to n graphemes.

     Example
    ≡≡≡≡≡≡≡≡≡

  julia> truncate("αβγπϕ1t_Ω₃!", 8)
  "αβγπϕ1t_"

  julia> truncate("test", 8)
  "test"

julia> truncate("αβγπϕ1t_Ω₃!", n)
"αβγπϕ1t_"

julia> truncate("test", n)
"test"

Benchmarking it:

julia> Pkg.add("BenchmarkTools")
INFO: Nothing to be done
INFO: METADATA is out-of-date — you may not have the latest version of BenchmarkTools
INFO: Use `Pkg.update()` to get the latest versions of your packages

julia> using BenchmarkTools

julia> @benchmark truncate("αβγπϕ1t_Ω₃!", 8)
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     9
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  1.72 kb
  allocs estimate:  48
  minimum time:     1.96 μs (0.00% GC)
  median time:      2.10 μs (0.00% GC)
  mean time:        2.45 μs (7.80% GC)
  maximum time:     353.75 μs (98.40% GC)

julia> Sys.cpu_info()[]
Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz:
        speed         user       nice        sys       idle        irq ticks
     2494 MHz     937640          0     762890   11104468     144671 ticks
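
For Julia 1.x, where graphemes lives in the Unicode standard library, the same idea can be sketched without collecting every grapheme first. This is only an illustration: truncate_graphemes is a hypothetical name, chosen so that no String method is added to Base.truncate. Taking lazily also sidesteps a corner case in the version above, which compares the code point count against n and so, as far as I can tell, can index past the end when a string has more than n code points but fewer than n graphemes.

```julia
using Unicode   # standard library; provides graphemes on Julia 1.x

# Hypothetical name, chosen to avoid adding a String method to Base.truncate.
truncate_graphemes(s::AbstractString, n::Integer) =
    join(Iterators.take(graphemes(s), n))

s = "e\u0301a"                 # "e" + combining acute accent, then "a"
length(s)                      # 3 code points
length(graphemes(s))           # 2 graphemes
truncate_graphemes(s, 1)       # "e" plus its accent, still two code points
truncate_graphemes("test", 8)  # "test"
```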

Answer 5: (score: 0)

This handles any UTF-8 string:

function trim_str(str, max_length)
    edge = nextind(str, 0, max_length)
    if edge >= ncodeunits(str)
        str
    else
        str[1:edge]
    end
end
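
To see why comparing edge against ncodeunits(str) works, it may help to look at what nextind(str, 0, k) returns for a short multi-byte string. The snippet below is an illustrative sketch for Julia 1.x; the variable names are made up for the example.

```julia
s = "αβγ"                                   # 3 characters, 6 code units
edges = [nextind(s, 0, k) for k in 0:5]     # [0, 1, 3, 5, 7, 8]

# For k up to 3 the result is the start index of the k-th character.
# Beyond the last character the result exceeds ncodeunits(s),
# which is the case where trim_str returns the string unchanged.
past_end = [e >= ncodeunits(s) for e in edges]
```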