R - Substring a phrase from the middle of a string

Time: 2016-05-11 19:32:11

Tags: regex r dataframe substring substr

I want to pull a phrase out of the middle of a string in R. The string is a field in a data frame. The data frame looks like this:

Common.name price description
Animal 1    $50   Field Collected\nRoughly 2-3 Inches In Length\nVibrant Red Coloration\nWill Do Fine In Groups\nFeeding On Various Vegetation & Fruits\nSizes Range From 1-2.5 Feet In Total Length\nField Collected\nSizes Vary From Juvenile ...
Animal 2    $40   Captive Bred\nApproximately 10-12 Inches In Length\nBeautiful Hybrid to Add To Your Collection\nInsane Patterning And Beautiful Vibrant Colors\nMales And Females...
...    ...  ...   ...
Animal 500  $29   Field Collected Beauties\nMales And Females Available\nApproximate Length: no bigger than a penny\nAmazingly Friendly! Make Great Pets!\nOnly Reach About 9 Inches At Most!\nFeeding On Vitamin Dusted Greens And ...

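I want to extract the length and the unit from each description into new fields. For Animal 1, that would be 2-3 in one field and Inches in the other. For Animal 500, the length would be "no bigger than a penny" and the unit field would be NA.

How can I do this in R?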

1 answer:

Answer 0: (score: 1)

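Description

This regular expression will do the following:

  • Match lines that start with Animal followed by a number
  • Capture the animal number
  • Find the first occurrence of Length somewhere in the description field
  • If the length is expressed as a number
    • Capture the length as a single number such as 234 or as a range of numbers such as 3-342
    • Assume the string that follows the number is the unit of measure
  • If the length is expressed as some odd text
    • Capture everything after the :
    • Leave UnitOfMeasure null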
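
The steps listed above can also be sketched in R without one large pattern. This is not the regular expression from this answer; it is a simpler split-then-match variant of the same idea, shown only to make the logic concrete. The function name `extract_length`, its `sep` argument, and the abbreviated sample strings are assumptions made for illustration, and the separator is assumed to be a literal backslash followed by n.

```r
# Hypothetical helper (not part of the original answer): extract the length
# and unit from one description string.
# `sep` is assumed to be a literal backslash followed by "n".
extract_length <- function(description, sep = "\\n") {
  empty <- c(Length = NA_character_, UnitOfMeasure = NA_character_)

  # Split the description into segments on the separator (no regex involved).
  segments <- strsplit(description, sep, fixed = TRUE)[[1]]

  # Keep the first segment that mentions "Length".
  hit <- segments[grepl("Length", segments, fixed = TRUE)][1]
  if (is.na(hit)) return(empty)

  # Case 1: a number or a range of numbers, then whitespace, then a unit word.
  num_rx <- "([0-9]+\\s*(?:-\\s*[0-9]+)?)\\s+(\\S+)"
  num <- regmatches(hit, regexec(num_rx, hit, perl = TRUE))[[1]]
  if (length(num) == 3) {
    return(c(Length = num[2], UnitOfMeasure = num[3]))
  }

  # Case 2: free text after "Length:"; the unit stays NA.
  txt <- regmatches(hit, regexec("Length:\\s*(.*)", hit, perl = TRUE))[[1]]
  if (length(txt) == 2) {
    return(c(Length = txt[2], UnitOfMeasure = NA_character_))
  }

  empty
}

# Abbreviated versions of the sample rows; "\\n" in R source is a literal
# backslash followed by n.
descriptions <- c(
  "Field Collected\\nRoughly 2-3 Inches In Length\\nVibrant Red Coloration",
  "Captive Bred\\nApproximately 10 Inches In Length\\nBeautiful Hybrid",
  "Males And Females Available\\nApproximate Length: no bigger than a penny"
)

# One row per description, with columns Length and UnitOfMeasure.
result <- do.call(rbind, lapply(descriptions, extract_length))
print(result)
```

If the descriptions contain real newline characters instead, calling the helper with `sep = "\n"` should be the only change needed.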

Regular expression

^(?<Animal>Animal\s[0-9]+)\s+\S+\s+(?:(?:(?!\\n|$).)*\\n)*?(?=(?:(?!\\n).)*Length)(?:(?:(?!\\n).)*?(?<Length>[0-9]+\s*(?:-\s*[0-9]+)?)\s+(?<UnitOfMeasure>\S+)|(?:(?!\\n).)*?Length:\s*(?<Length>(?:(?!\\n).)*))?


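Notes

  • I used the following flags: multiline, global, and allow duplicate subpattern names
  • I was not sure whether the \n strings in the source text are a literal \n or represent newline characters, so this regular expression was built on the assumption that they are actually a \ character followed by an n character. If they are real newline characters in your data, change every \\n in the regular expression to \n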
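
To illustrate the note about \n, here is a rough sketch of how the pattern might be adapted for a data frame column in R. Several things in it are assumptions rather than part of this answer: the Animal and price prefix is dropped because the description is its own column in a data frame; the named groups are replaced by numbered groups, so the duplicate-names flag is not needed; `build_pattern`, `pick`, and the abbreviated `animals` data frame are hypothetical; and it assumes an R version in which `regexec` accepts `perl = TRUE`.

```r
# Build the description-only pattern around a separator regex.
# sep_rx = "\\\\n" is the regex \\n (a literal backslash followed by n);
# sep_rx = "\\n"   is the regex \n  (a real newline character).
build_pattern <- function(sep_rx = "\\\\n") {
  # One character that does not start a separator.
  not_sep <- paste0("(?:(?!", sep_rx, ").)")
  paste0(
    "^(?:(?:(?!", sep_rx, "|$).)*", sep_rx, ")*?",  # lazily skip whole segments
    "(?=", not_sep, "*Length)",                      # this segment mentions Length
    "(?:", not_sep, "*?([0-9]+\\s*(?:-\\s*[0-9]+)?)\\s+(\\S+)",  # groups 1, 2
    "|", not_sep, "*?Length:\\s*(", not_sep, "*))?"  # group 3: text after "Length:"
  )
}

# Abbreviated sample rows; "\\n" in R source is a literal backslash plus n.
animals <- data.frame(
  Common.name = c("Animal 1", "Animal 3", "Animal 500"),
  description = c(
    "Field Collected\\nRoughly 2-3 Inches In Length\\nVibrant Red Coloration",
    "Captive Bred\\nApproximately 10 Inches In Length\\nBeautiful Hybrid",
    "Field Collected Beauties\\nApproximate Length: no bigger than a penny\\nAmazingly Friendly!"
  ),
  stringsAsFactors = FALSE
)

# Each list element holds the full match followed by groups 1 to 3.
m <- regmatches(animals$description,
                regexec(build_pattern(), animals$description, perl = TRUE))

# Return group i, or NA when the group is missing or empty.
pick <- function(x, i) {
  if (length(x) >= i && nzchar(x[i])) x[i] else NA_character_
}
num  <- vapply(m, pick, character(1), i = 2)  # numeric length or range
unit <- vapply(m, pick, character(1), i = 3)  # unit of measure
txt  <- vapply(m, pick, character(1), i = 4)  # free-text length

# Prefer the numeric length; fall back to the free text.
animals$Length <- ifelse(is.na(num), txt, num)
animals$UnitOfMeasure <- unit
print(animals[, c("Common.name", "Length", "UnitOfMeasure")])

# Same pattern for descriptions that contain real newline characters.
nl_text <- "Captive Bred\nApproximately 10-12 Inches In Length\nBeautiful"
nl <- regmatches(nl_text, regexec(build_pattern("\\n"), nl_text, perl = TRUE))[[1]]
```

Passing the separator in as a parameter keeps the two cases from the note (literal `\n` versus a real newline) down to a single argument change.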

Examples

Live example

https://regex101.com/r/nL1fW1/2

Sample input text

Common.name price description
Animal 1    $50   Field Collected\nRoughly 2-3 Inches In Length\nVibrant Red Coloration\nWill Do Fine In Groups\nFeeding On Various Vegetation & Fruits\nSizes Range From 1-2.5 Feet In Total Length\nField Collected\nSizes Vary From Juvenile ...
Animal 2    $40   Captive Bred\nApproximately 10-12 Inches In Length\nBeautiful Hybrid to Add To Your Collection\nInsane Patterning And Beautiful Vibrant Colors\nMales And Females...
Animal 3    $40   Captive Bred\nApproximately 10 Inches In Length\nBeautiful Hybrid to Add To Your Collection\nInsane Patterning And Beautiful Vibrant Colors\nMales And Females...
...    ...  ...   ...
Animal 500  $29   Field Collected Beauties\nMales And Females Available\nApproximate Length: no bigger than a penny\nAmazingly Friendly! Make Great Pets!\nOnly Reach About 9 Inches At Most!\nFeeding On Vitamin Dusted Greens And ...

Sample matches

[0][0] = Animal 1    $50   Field Collected\nRoughly 2-3 Inches
[0][Animal] = Animal 1
[0][Length] = 2-3
[0][UnitOfMeasure] = Inches

[1][0] = Animal 2    $40   Captive Bred\nApproximately 10-12 Inches
[1][Animal] = Animal 2
[1][Length] = 10-12
[1][UnitOfMeasure] = Inches

[2][0] = Animal 3    $40   Captive Bred\nApproximately 10 Inches
[2][Animal] = Animal 3
[2][Length] = 10
[2][UnitOfMeasure] = Inches

[3][0] = Animal 500  $29   Field Collected Beauties\nMales And Females Available\nApproximate Length: no bigger than a penny
[3][Animal] = Animal 500
[3][UnitOfMeasure] = 
[3][Length] = no bigger than a penny

Explanation

This was copied from the explanation field at the live link above.

^ assert position at start of a line
(?<Animal>Animal\s[0-9]+) Named capturing group Animal
Animal matches the characters Animal literally (case sensitive)
\s match any white space character [\r\n\t\f ]
[0-9]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
0-9 a single character in the range between 0 and 9
\s+ match any white space character [\r\n\t\f ]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\S+ match any non-white space character [^\r\n\t\f ]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\s+ match any white space character [\r\n\t\f ]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
(?:(?:(?!\\n|$).)*\\n)*? Non-capturing group
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
(?:(?!\\n|$).)* Non-capturing group
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
(?!\\n|$) Negative Lookahead - Assert that it is impossible to match the regex below
1st Alternative: \\n
\\ matches the character \ literally
n matches the character n literally (case sensitive)
2nd Alternative: $
$ assert position at end of a line
. matches any character (except newline)
\\ matches the character \ literally
n matches the character n literally (case sensitive)
(?=(?:(?!\\n).)*Length) Positive Lookahead - Assert that the regex below can be matched
(?:(?!\\n).)* Non-capturing group
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
(?!\\n) Negative Lookahead - Assert that it is impossible to match the regex below
\\ matches the character \ literally
n matches the character n literally (case sensitive)
. matches any character (except newline)
Length matches the characters Length literally (case sensitive)
(?:(?:(?!\\n).)*?(?<Length>[0-9]+\s*(?:-\s*[0-9]+)?)\s+(?<UnitOfMeasure>\S+)|(?:(?!\\n).)*?Length:\s*(?<Length>(?:(?!\\n).)*))? Non-capturing group
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
1st Alternative: (?:(?!\\n).)*?(?<Length>[0-9]+\s*(?:-\s*[0-9]+)?)\s+(?<UnitOfMeasure>\S+)
(?:(?!\\n).)*? Non-capturing group
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
(?!\\n) Negative Lookahead - Assert that it is impossible to match the regex below
\\ matches the character \ literally
n matches the character n literally (case sensitive)
. matches any character (except newline)
(?<Length>[0-9]+\s*(?:-\s*[0-9]+)?) Named capturing group Length
[0-9]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
0-9 a single character in the range between 0 and 9
\s* match any white space character [\r\n\t\f ]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
(?:-\s*[0-9]+)? Non-capturing group
- matches the character - literally
\s* match any white space character [\r\n\t\f ]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
[0-9]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
0-9 a single character in the range between 0 and 9
\s+ match any white space character [\r\n\t\f ]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
(?<UnitOfMeasure>\S+) Named capturing group UnitOfMeasure
\S+ match any non-white space character [^\r\n\t\f ]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
2nd Alternative: (?:(?!\\n).)*?Length:\s*(?<Length>(?:(?!\\n).)*)
(?:(?!\\n).)*? Non-capturing group
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
(?!\\n) Negative Lookahead - Assert that it is impossible to match the regex below
\\ matches the character \ literally
n matches the character n literally (case sensitive)
. matches any character (except newline)
Length: matches the characters Length: literally (case sensitive)
\s* match any white space character [\r\n\t\f ]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
(?<Length>(?:(?!\\n).)*) Named capturing group Length
(?:(?!\\n).)* Non-capturing group
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
(?!\\n) Negative Lookahead - Assert that it is impossible to match the regex below
\\ matches the character \ literally
n matches the character n literally (case sensitive)
. matches any character (except newline)
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
g modifier: global. All matches (don't return on first match)
J modifier: Allow duplicate subpattern names