Question

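My text is structured as follows:

book_name:SoftwareEngineering;author:John;author:Smith;book_name:DesignPatterns;author:Foo;author:Bar;

The element separator is ;

Up to two author elements can follow each book_name element

There may be 2 to 10 books

Each book has at least one author and at most 2 authors

I want to extract the book_name and the individual authors for each book.

I tried a regular expression with Regex.scan (which collects all matches):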

iex> regex = ~r/book_name:(.+?;)(author:.+?;){1,2}/
iex> text = "book_name:SoftwareEngineering;author:John;author:Smith;book_name:DesignPatterns;author:Foo;author:Bar;"

iex> Regex.scan(regex, text, capture: :all_but_first)
[["SoftwareEngineering;", "author:Smith;"], ["DesignPatterns;", "author:Bar;"]]

But it does not collect the authors correctly: it only returns the second author of each book. Can someone help me fix this?

Answer 1

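In many regex engines, including the one Elixir uses, you cannot repeat a capture group like this and get a result for every repetition: a repeated capture group only returns its last match. Instead, write out each possible group separately, then filter out the empty matches: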

book_name:(.+?;)author:(.+?);(?:author:(.+?);)?

https://regex101.com/r/LPgzcG/1
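
As a rough sketch of how this pattern could be used from Elixir (the variable names and the single-author second book are illustrative assumptions, not part of the answer): the first group still includes the trailing ;, so it is trimmed here, and the optional third group is filtered out when it is empty or absent.

```elixir
# Pattern from the answer: one mandatory author group, one optional one.
regex = ~r/book_name:(.+?;)author:(.+?);(?:author:(.+?);)?/

# Sample input; the second book deliberately has a single author.
text =
  "book_name:SoftwareEngineering;author:John;author:Smith;" <>
    "book_name:DesignPatterns;author:Foo;"

books =
  regex
  |> Regex.scan(text, capture: :all_but_first)
  |> Enum.map(fn [title | authors] ->
    # Group 1 captures the trailing ";", so strip it.
    # Drop the optional author group when it did not participate.
    {String.trim_trailing(title, ";"), Enum.reject(authors, &(&1 == ""))}
  end)

IO.inspect(books)
```

Each book comes back as a {title, authors} tuple, with the authors list holding one or two names.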

Answer 2

You don't need a regex for this; you can use String.split/3:

defmodule Book do
  def extract(text) do
    text
    |> String.split("book_name:", trim: true)
    |> Enum.map(&String.split(&1, [":", ";"], trim: true))
    |> Enum.map(fn [title, _, author1, _, author2] -> {title, author1, author2} end)
  end
end

Output:

iex> Book.extract(text)
[{"SoftwareEngineering", "John", "Smith"}, {"DesignPatterns", "Foo", "Bar"}]

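For simplicity, I assumed there are always two authors. The last Enum.map can be replaced with the following one, which also handles a book without a second author: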

|> Enum.map(fn
  [title, _, author1] -> {title, author1, nil}
  [title, _, author1, _, author2] -> {title, author1, author2}
end)
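
Put together, the variant above might look like the following sketch. The module name BookSplit and the sample input with a single-author book are assumptions for illustration. Note that because the split also happens on ":", this approach presumably only works while titles and author names contain neither ":" nor ";".

```elixir
defmodule BookSplit do
  # Hypothetical module name; the clauses are the ones shown above.
  def extract(text) do
    text
    # One chunk per book, e.g. "DesignPatterns;author:Foo;"
    |> String.split("book_name:", trim: true)
    # Each chunk becomes [title, "author", name, ...]
    |> Enum.map(&String.split(&1, [":", ";"], trim: true))
    |> Enum.map(fn
      [title, _, author1] -> {title, author1, nil}
      [title, _, author1, _, author2] -> {title, author1, author2}
    end)
  end
end

text =
  "book_name:SoftwareEngineering;author:John;author:Smith;" <>
    "book_name:DesignPatterns;author:Foo;"

books = BookSplit.extract(text)
IO.inspect(books)
```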

Answer 3

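This part of the pattern, (author:.+?;){1,2}, repeats 1 to 2 times, matching author: and everything up to the next semicolon, but repeating a capture group like this only gives you the last captured value. This page may be helpful.

Instead of using the non-greedy quantifier .*?, you could also use a negated character class [^;]+ to match one or more characters that are not a semicolon, and then match the semicolon itself.

You could also use a capture group and a backreference for author. The book name is in group 1, the first author's name is in group 3, and the optional second author is in group 4.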

book_name:([^;]+);(author):([^;]+);(?:\2:([^;]+);)?

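This will match

book_name: Match literally
([^;]+); Group 1: match one or more characters other than ;, then match ;
(author): Group 2: match author, followed by a literal :
([^;]+); Group 3: match one or more characters other than ;, then match ;
(?: Non-capturing group
- \2: Backreference to what was captured in group 2, followed by :
- ([^;]+); Group 4: match one or more characters other than ;, then match ;
)? Close the non-capturing group and make it optional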

regex101 demo
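
A minimal sketch of running this pattern with Regex.scan in Elixir (the variable names and the single-author second book are assumptions for illustration). Group 2 only holds the literal author label, so it is ignored, and the optional group 4 is dropped when it is empty or absent.

```elixir
# Pattern from the answer: \2 refers back to the "author" label in group 2.
regex = ~r/book_name:([^;]+);(author):([^;]+);(?:\2:([^;]+);)?/

# Sample input; the second book deliberately has a single author.
text =
  "book_name:SoftwareEngineering;author:John;author:Smith;" <>
    "book_name:DesignPatterns;author:Foo;"

books =
  regex
  |> Regex.scan(text, capture: :all_but_first)
  |> Enum.map(fn [title, _label, first | rest] ->
    # rest is either empty or holds group 4 (possibly as "").
    {title, [first | Enum.reject(rest, &(&1 == ""))]}
  end)

IO.inspect(books)
```

Because [^;]+ stops before the semicolon, the captured title and names need no trimming.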

Regex to match 1 or 2 occurrences
