Extracting hashtags with a regular expression

Time: 2017-07-28 06:19:32

Tags: elixir

Test string:

str = "#www #SoulMusic #50_shades_of_Blue # ##WorldWideWeb 
      #okie_dokkie #fr!ends #!alPacino #wonderfulRide 
      #good#club #rhônealpes #trèsbon #øypålandet http://example.com/#comment 
      #moreTags #www nobody #h3y!boy #EMAIL"

This is my attempt:

String.split(str, ~r/\B(#[á-úÁ-Úä-üÄ-Üa-zA-Z0-9_]+)/, trim: true, 
            include_captures: true)

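However, it does not exclude the hashtag inside the URL, and because String.split/3 is used with include_captures: true, the text between the matches is returned as well. This is what I get: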

["#www", " ", "#SoulMusic", " ", "#50_shades_of_Blue", " # #", "#WorldWideWeb", " ", "#okie_dokkie", " ", "#fr", "!ends #!alPacino ", "#wonderfulRide", " ", "#good", "#club ", "#rhônealpes", " ", "#trèsbon", " ", "#øypålandet", " http://example.com/", "#comment", " ", "#moreTags", " ", "#www", " nobody ", "#h3y", "!boy ", "#EMAIL"]

My goal is:

["#www", "#SoulMusic", "#50_shades_of_Blue", "#WorldWide",
"#okie_dokkie", "#fr", "#wonderfulRide", "#good",
"#rhônealpes", "#trèsbon", "#øypålandet", "#moreTags", "#www", 
"#h3y", "#EMAIL"]

Any help with this would be greatly appreciated.

1 answer:

Answer 0 (score: 2)

If you only need the matches themselves, you are looking for Regex.scan/2:

iex(1)> str = "#www #SoulMusic #50_shades_of_Blue # ##WorldWideWeb
...(1)>       #okie_dokkie #fr!ends #!alPacino #wonderfulRide
...(1)>       #good#club #rhônealpes #trèsbon #gøypålandet http://example.com/#comment
...(1)>       #moreTags #www nobody #EMAIL"
"#www #SoulMusic #50_shades_of_Blue # ##WorldWideWeb \n      #okie_dokkie #fr!ends #!alPacino #wonderfulRide \n      #good#club #rhônealpes #trèsbon #gøypålandet http://example.com/#comment \n      #moreTags #www nobody #EMAIL"
iex(2)> Regex.scan(~r/\B#[á-úÁ-Úä-üÄ-Üa-zA-Z0-9_]+/, str)
[["#www"], ["#SoulMusic"], ["#50_shades_of_Blue"], ["#WorldWideWeb"],
 ["#okie_dokkie"], ["#fr"], ["#wonderfulRide"], ["#good"], ["#rhônealpes"],
 ["#trèsbon"], ["#gøypålandet"], ["#comment"], ["#moreTags"], ["#www"],
 ["#EMAIL"]]

This returns a list of lists, one inner list per match. You can flatten it with Enum.concat/1 to get a list of strings:

iex(3)> Regex.scan(~r/\B#[á-úÁ-Úä-üÄ-Üa-zA-Z0-9_]+/, str) |> Enum.concat
["#www", "#SoulMusic", "#50_shades_of_Blue", "#WorldWideWeb", "#okie_dokkie",
 "#fr", "#wonderfulRide", "#good", "#rhônealpes", "#trèsbon",
 "#gøypålandet", "#comment", "#moreTags", "#www", "#EMAIL"]
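Note that the result above still contains "#comment" from the URL. The reason is that \B only requires the character before "#" to be a non-word character, and "/" qualifies. One possible way to skip it, offered as a sketch rather than a definitive fix, is to replace \B with a negative lookbehind that rejects both word characters and "/". The module name HashtagSketch and the function extract/1 are hypothetical names introduced for illustration. The u modifier is also an addition: it makes \w match Unicode letters, so the explicit accented ranges are no longer needed.

```elixir
defmodule HashtagSketch do
  # Hypothetical helper, not part of the original answer.
  # Returns all hashtags in `text` as a flat list of strings.
  def extract(text) do
    # (?<![\w/]) : the "#" must not be preceded by a word character or "/"
    # #\w+       : "#" followed by letters, digits or underscores
    # u modifier : \w also matches Unicode letters such as "ô" or "ø"
    ~r|(?<![\w/])#\w+|u
    |> Regex.scan(text)
    |> Enum.concat()
  end
end

IO.inspect(
  HashtagSketch.extract("#good#club #trèsbon http://example.com/#comment ##WorldWideWeb")
)
# prints ["#good", "#trèsbon", "#WorldWideWeb"]
```

This only rules out a "#" that directly follows "/". A URL fragment that follows a word character, as in "page#section", is already rejected by the word-character part of the lookbehind (the original \B rejects it as well).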