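I am working with some data that requires me to use regex in combination with the strsplit function. I have figured out how to split the string, but I am struggling to apply the guidance in this post about keeping the delimiter.
Here is an example of the string I am scraping: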
text<-c("This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assemblyMaterial: WoodDimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L)Weight: 6.0 poundsHolds up to: 20.0 poundsIntended Pet Type: BirdCare and Cleaning: Hand washPet activity: ClimbTCIN: 16707835UPC: 030172025594Item Number (DPCI): 083-01-0246Report incorrect product information")
And here is the code that successfully splits the string but trims the delimiter:
strsplit(as.character(text), "[0-9](?=[A-Z])|[a-z](?=[A-Z])|[')'](?=[A-Z])", perl=TRUE)
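You'll notice that I am looking for places where a digit, a lowercase letter, or a closing parenthesis is immediately followed by an uppercase letter.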
Unfortunately, the output below shows the problem with my code:
[1] "This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assembl"
[2] "Material: Woo"
[3] "Dimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L"
[4] "Weight: 6.0 pound"
[5] "Holds up to: 20.0 pound"
[6] "Intended Pet Type: Bir"
[7] "Care and Cleaning: Hand was"
[8] "Pet activity: Clim"
[9] "TCIN: 1670783"
[10] "UPC: 03017202559"
[11] "Item Number (DPCI): 083-01-024"
[12] "Report incorrect product information"
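That is, the last character is trimmed from "assembly" in [1], "Wood" in [2], and so on. How can I keep the delimiter when splitting on a combination of regular expressions like mine?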
Answer 0 (score: 2)
You can turn the consuming patterns in your regex into lookbehinds:
> text<-c("This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assemblyMaterial: WoodDimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L)Weight: 6.0 poundsHolds up to: 20.0 poundsIntended Pet Type: BirdCare and Cleaning: Hand washPet activity: ClimbTCIN: 16707835UPC: 030172025594Item Number (DPCI): 083-01-0246Report incorrect product information")
> strsplit(text, "(?<=[0-9])(?=[A-Z])|(?<=[a-z])(?=[A-Z])|(?<=\\))(?=[A-Z])", perl=TRUE)
[[1]]
[1] "This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assembly"
[2] "Material: Wood"
[3] "Dimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L)"
[4] "Weight: 6.0 pounds"
[5] "Holds up to: 20.0 pounds"
[6] "Intended Pet Type: Bird"
[7] "Care and Cleaning: Hand wash"
[8] "Pet activity: Climb"
[9] "TCIN: 16707835"
[10] "UPC: 030172025594"
[11] "Item Number (DPCI): 083-01-0246"
[12] "Report incorrect product information"
Here [0-9] becomes (?<=[0-9]), [a-z] becomes (?<=[a-z]), and [')'] becomes (?<=\)), which is written as (?<=\\)) inside an R string literal.
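To see the difference on a smaller input, the sketch below runs one consuming alternative and its lookbehind counterpart side by side. The string `x` and the variable names are made up for illustration and are not part of the original data.

```r
# Hypothetical short input, not taken from the original data
x <- "WoodWeight: 6.0 poundsUPC"

# Consuming pattern: the lowercase letter is part of the match,
# so strsplit() discards it together with the split point
consumed <- strsplit(x, "[a-z](?=[A-Z])", perl = TRUE)[[1]]

# Zero-width pattern: only a position is matched,
# so no character is removed from the pieces
kept <- strsplit(x, "(?<=[a-z])(?=[A-Z])", perl = TRUE)[[1]]

print(consumed)
print(kept)
```

The first call should lose the "d" of "Wood" and the "s" of "pounds", while the second keeps both.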
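Note that (?<=...) is a positive lookbehind: it matches a position in the string that is immediately preceded by the pattern defined inside it, without consuming any characters.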
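Because all three alternatives share the same lookahead, they can probably be folded into a single lookbehind with one character class. This is only a sketch of that idea, not part of the original answer; the sample string `s` and the variable names are assumptions chosen for illustration.

```r
# Hypothetical sample built from fragments of the question's text
s <- "Material: WoodWeight: 6.0 poundsDimensions: 18.5 inches (L)TCIN: 16707835UPC: 030172025594"

# Pattern from the answer: three lookbehind alternatives
answer_pattern <- "(?<=[0-9])(?=[A-Z])|(?<=[a-z])(?=[A-Z])|(?<=\\))(?=[A-Z])"

# Assumed equivalent: one lookbehind over a combined class
# (a ")" inside [...] is literal, so it needs no escaping)
compact_pattern <- "(?<=[0-9a-z)])(?=[A-Z])"

by_answer  <- strsplit(s, answer_pattern,  perl = TRUE)[[1]]
by_compact <- strsplit(s, compact_pattern, perl = TRUE)[[1]]

print(by_compact)
print(identical(by_answer, by_compact))
```

Both patterns need perl = TRUE, since lookarounds are not supported by R's default regex engine.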