Using regular expressions

Time: 2018-07-18 17:59:44

Tags: r regex

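I am working with a dataset that contains free text with special characters. I need to clean the text before passing it to strsplit for a later function, but I would like to add an escape (\\) before the special characters rather than removing them entirely.

For example, a string like this: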

  

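Do you love long hair? Wrap it up! Curls are a gift Set them free and help maintain natural curl with bounce and definition. Cleanses hair without weighing it down, while reducing frizz Infused with pineapple, Moroccan argan oil and quinoa. Let your natural beauty shine!

Should look like this:

Do you love long hair\\? Wrap it up\\! Curls are a gift Set them free and help maintain natural curl with bounce and definition. Cleanses hair without weighing it down, while reducing frizz Infused with pineapple, Moroccan argan oil and quinoa. Let your natural beauty shine\\!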

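I have found out how to remove a list of several special characters (~!@#$%^&*(){}|<>/), but I cannot find a proper tutorial on adding \\ before them instead.

Note: I do not want to remove all punctuation, because some characters are used in later delimiting logic. Instead, I want to handle a specific subset of special characters.

Sample data: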

>dput(tar$clean.text[1:10])
list(c("Dove go fresh Cucumber and Green Tea Beauty Bar combines the refreshing scent of cucumber and green tea with Dove's gentle cleansers and _ moisturizing cream. Dove Beauty Bar is proven to be more gentle and mild on skin than ordinary soap. It can be used on your hands and as a mild facial cleanser, so if you're also after a fresh face and refreshed hands throughout the day, why not try adding Dove Beauty Bar go fresh Cucumber and Green Tea to your skin care routine? Light, hydrating feel and refreshing formula that effectively nourishes skin. A refreshing shower can be just what you need to start the day off right. Dove's go fresh range blends nourishing ingredients and light, fresh scents in a formula that's gentle on your skin. Dove go fresh beauty bars give you a feeling of hydrating freshness that leaves you and your skin feeling blissfully revived. For best results: Your hands are one of the driest parts of your body so give them a boost and lather your Dove beauty bar between wet hands. Once you've covered your body with the rich lather, making sure to avoid contact with your eyes, rinse away thoroughly. At Dove, our vision is of a world where beauty is a source of confidence, and not anxiety. So, we are on a mission to help the next generation of women develop a positive relationship with the way they look - helping them raise their self-esteem and realize their full potential.", 
    "Scent: Cucumber", "Health Facts: Sulfate-free", "Suggested Age: 5 Years and Up", 
    "Wellness Standard: Aluminum-free, paraben-free", "Recommended Skin Type: Normal", 
    "Beauty Purpose: Moisturizing, basic cleansing", "Package Quantity: 1", 
    "TCIN: 10819409", "UPC: 011111611023", "Item Number (DPCI): 049-00-0604"
    ), c("Me! Bath Bath Bomb Papaya Nectar 6 ct is a great idea to add to a spa gift basket. These bath bombs are like scoops for your bath to make mini bath ice cream that gives you super soft skin.", 
    "Scent: Fruit", "Health Facts: Vegan, paraben-free, aluminum-free", 
    "Product Form: Bath bomb", "Suggested Age: Adult Use Only", "Wellness Standard: Aluminum-free, cruelty-free, paraben-free, vegan", 
    "Recommended Skin Type: Normal", "Sustainability Claims: Cruelty-free", 
    "TCIN: 18828570", "UPC: 858858000358", "Item Number (DPCI): 037-08-1164"
    ), NA_character_, NA_character_, c("Aura Cacia pure essential oils in 4 fl oz Body Oil has a lavender and cocoa butter scent. This natural skin care oil shows skin tone improvement that you can feel.", 
    "Scent: Lavender, Cocoa Butter", "Health Facts: Contains lavender, butylparaben-free, phthalate-free, formaldehyde donor-free, formaldehyde-free, nonylphenol ethoxylate free, propylparaben-free, Sulfate-free, paraben-free, dye-free, aluminum-free", 
    "Product Form: Lotion", "Suggested Age: All Ages", "Recommended Skin Type: Normal", 
    "Beauty Purpose: Skin tone improvement", "Sustainability Claims: Not tested on animals, cruelty-free", 
    "TCIN: 50030689", "UPC: 051381911720", "Item Number (DPCI): 037-05-1378"
    ), c("Deep clean pores with the Facial Cleansing Brush from Eco", 
    "Tools. This compact brush features soft bristles for moderate exfoliation, leaving you with soft, supple skin. Your serums and moisturizers can more effectively penetrate your skin once all the dead skin cells are out of the way. The compact size is ideal for packing in your weekend tote or suitcase for cleansing on the go.", 
    "Material: Nylon", "Suggested Age: All Ages", "Beauty Purpose: Basic cleansing, exfoliating", 
    "TCIN: 52537254", "UPC: 079625074864", "Item Number (DPCI): 037-08-2254"
    ), c("Deep Steep Rosemary Mint Sugar Scrub gently exfoliates dead skin cells while moisturizing, leaving smooth, radiant, polished skin. This formula is made up of a smooth blend of shea butter, cocoa butter and carefully sourced sugar to give you light, blissful fragrance with just the right amount of exfoliation and no harsh scratching. Apply desired amount of Deep Steep Rosemary Mint Sugar Scrub to wet skin from shoulders to ankles. Massage in a circular motion. Rinse.", 
    "Scent: Rosemary", "Health Facts: Contains argan oil, contains coconut oil, contains shea butter, formaldehyde donor-free, gluten-free, dye-free, ethyl alcohol-free, paraben-free, phthalate-free, vegan", 
    "Product Form: Scrub", "Suggested Age: All Ages", "Recommended Skin Type: Dry, normal", 
    "Beauty Purpose: Exfoliating", "TCIN: 53242409", "UPC: 674749101153", 
    "Item Number (DPCI): 037-08-2123"), NA_character_, c("Want to feel gorgeously soft skin every day? Transform your daily shower into an irresistible treat with the exquisitely fragranced Caress Evenly Gorgeous body wash. Indulge your skin with a rich exfoliating lather delicately scented with burnt brown sugar and karite butter that makes this body wash smell good enough to eat. Subtle notes of soft crisp apple and berry open up to a bold floral heart, while rich scents of warm tonka bean, vanilla and balsam together round out the lush lather to leave you with perfectly buffed and glowing skin. Caress Evenly Gorgeous is a revitalizing body wash that blends rich, luxurious lather with expertly crafted fine fragrance It is a body wash that gently cleanses your skin to leave it delicately fragrant, beautifully soft.", 
    "Lather up and indulge in a deeply cleansing and reviving shower experience. With fine floral fragrance and gentle exfoliates, Caress Evenly Gorgeous will leave you feeling delicately perfumed and silky-smooth, making this the perfect body wash for every day? and every night. Caress body wash and beauty bar fragrances are crafted by the world's best perfumers to transform your daily shower into an indulging experience that will make you feel special every day?Scent: Fresh", 
    "Health Facts: Aluminum-free, paraben-free, fluoride-free", "Product Form: Liquid", 
    "Suggested Age: 5 Years and Up", "Wellness Standard: Aluminum-free, paraben-free", 
    "Recommended Skin Type: Normal", "Beauty Purpose: Basic cleansing", 
    "Package Quantity: 1", "TCIN: 13446229", "UPC: 011111014909", 
    "Item Number (DPCI): 049-00-0806"), c("Maintain a sanitary and healthy atmosphere with the MEDLINE n/a READYBATH, PREMIUM,FRAG FREE, 8/PK - 24pks. These sterile swab sticks are pre-treated with povidone-iodine for preparing skin for incision and other medical issues. Comes in disposable packages of 3.", 
    "Scent: Unscented", "Health Facts: No fragrance added", "Suggested Age: Adult Use Only", 
    "Recommended Skin Type: Normal", "Beauty Purpose: Basic cleansing", 
    "Package Quantity: 1", "TCIN: 14339945", "UPC: 080196731445", 
    "Item Number (DPCI): 037-13-0198"))

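Code that removes the list of symbols:

tar$clean.text <- str_replace_all(tar$clean.text, "~|!|@|#|$|%|^|&|\\*|\\(|\\)|\\{|\\}|_|\\\\|<|>|\\?|\\[|\\]|-", "") # Removes a ton of non-UTF characters

I am sure this is a simple modification to my regex, but I cannot seem to figure it out. All the answers I have found so far deal with fixing one particular text pattern rather than replacing in general across many different variations.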

2 answers:

Answer 0: (score: 2)

You can use

str_replace_all(x, "[~!@#$%^&*(){}_\\\\<>?\\[\\]|-]", "\\\\\\0")

A base R approach:

gsub("([]\\~!@#$%^&*(){}_<>?[|-])", "\\\\\\1", "~!@#$%^&*(){}_\\<>?[]|-")

See the regex demo.

Details
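As a minimal sketch of how the base R pattern above might be applied to the question's data: the helper name escape_specials and the small clean_text list are assumptions made for illustration (they stand in for tar$clean.text, which is a list of character vectors with NA entries), so the replacement is applied per element with lapply.

```r
# Hypothetical helper (not part of the answer): prefix each listed special
# character with one literal backslash, using base R's default TRE engine.
escape_specials <- function(x) {
  gsub("([]\\~!@#$%^&*(){}_<>?[|-])", "\\\\\\1", x)
}

one <- escape_specials("Do you love long hair? Wrap it up!")
writeLines(one)  # Do you love long hair\? Wrap it up\!

# Assumed stand-in for tar$clean.text: a list of character vectors with NAs.
clean_text <- list(c("Me! Bath Bomb", "Scent: Fruit"), NA_character_)

# gsub() is vectorised over a character vector, so lapply() handles the list;
# NA entries stay NA.
escaped <- lapply(clean_text, escape_specials)
```

Printing escaped in the console shows each backslash doubled ("Me\\! Bath Bomb"), because print() displays the string in escaped form; writeLines() shows the actual characters.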

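  • [ - start of a character class that matches any one of the following characters:
    • ~ - a ~
    • ! - a !
    • @ - a @
    • # - a #
    • $ - a $
    • % - a %
    • ^ - a ^ (if you put it at the start of the class, escape it with \\)
    • & - a &
    • * - a * (no need to escape it inside a character class)
    • ( - a (
    • ) - a )
    • { - a {
    • } - a }
    • _ - a _ (note that it is a word character, so \W does not match it)
    • \\\\ - a \ character (a literal \ escaped with another literal \)
    • < - a <
    • > - a >
    • ? - a ?
    • \\[ - a [ character (in an ICU regex it must be escaped inside a character class)
    • \\] - a ] character (same as above)
    • | - a | character (it is not an OR operator inside a character class)
    • - - a - character
  • ] - end of the character class.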
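The points in this list can be checked with a small base R sketch. The test strings are made up for illustration. The second half is an assumption about why the alternation-based pattern in the question may leave $ and ^ untouched: outside a character class they act as anchors, not literal characters.

```r
x <- "a|b*c$d^e"

# Inside a character class, | * $ and ^ (when not first) are plain characters.
in_class <- regmatches(x, gregexpr("[|*$^]", x))[[1]]

# Shortened version of the question's alternation: the unescaped $ and ^ are
# anchors, so only "!" is actually removed from the text.
y <- "a$b^c!"
via_alternation <- gsub("~|!|$|^", "", y)

# The same characters inside a class are all matched literally and removed.
via_class <- gsub("[~!$^]", "", y)
```

Here in_class holds the four literal characters, via_alternation keeps the $ and ^, and via_class is reduced to "abc".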

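The "\\\\\\0" replacement string is parsed in two layers. The R string literal yields the four characters \\\0: the first two backslashes define a single literal backslash in the output, and the remaining \0 is the backreference to the whole match in the ICU regex flavor used by str_replace_all.

Note that the gsub TRE regex is a bit trickier: ] must be the first character in the character class, [ should not be escaped, a literal \ should be a single backslash in the regex (regex escape sequences are not supported inside a TRE bracket expression), and - must be at the end. Also, a backreference to the whole match is not supported, so you need to wrap the whole pattern in a capturing group and replace with the \1 backreference.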
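To make the layers of escaping visible, here is a small base R sketch. It uses gsub with a capturing group and \1 instead of stringr's \0, so it runs without extra packages; that substitution is an assumption of this example, not what the stringr call does internally.

```r
repl <- "\\\\\\0"
n_chars <- nchar(repl)  # the literal holds 4 characters, not 7
writeLines(repl)        # \\\0

# Same idea in base R: two backslashes produce one literal backslash,
# then \1 inserts the captured text.
out <- gsub("(!)", "\\\\\\1", "Hi!")
writeLines(out)         # Hi\!
out_len <- nchar(out)   # 4: H, i, one backslash, !
```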
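A minimal sketch comparing the two base R engines on the test string from the answer. The perl = TRUE variant is an addition for illustration, not part of the answer: PCRE accepts the escaped \\[, \\] and \\\\ inside the class, much like the ICU pattern, but still needs a capturing group with \1.

```r
x <- "~!@#$%^&*(){}_\\<>?[]|-"

# Default TRE engine: ] first, [ unescaped, single backslash, - last.
tre <- gsub("([]\\~!@#$%^&*(){}_<>?[|-])", "\\\\\\1", x)

# PCRE engine: brackets and the backslash are escaped inside the class.
pcre <- gsub("([~!@#$%^&*(){}_\\\\<>?\\[\\]|-])", "\\\\\\1", x, perl = TRUE)

same <- identical(tre, pcre)  # both engines give the same result
writeLines(tre)               # every character now has a backslash before it
```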

Answer 1: (score: -1)

With dat = tar$clean.text[1:10], you can do the following:

 Map(gsub,"([[:punct:]])","\\\\\\1",dat)
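One thing to keep in mind is that [[:punct:]] matches every punctuation character, including the colons and commas that the question says are needed later as delimiters. The sketch below, on a made-up string, contrasts it with the explicit subset from answer 0.

```r
x <- "Scent: Fruit, vegan!"

# Every punctuation character gets a backslash, including : and ,
all_punct <- gsub("([[:punct:]])", "\\\\\\1", x)
writeLines(all_punct)    # Scent\: Fruit\, vegan\!

# Only the explicitly listed characters are escaped.
subset_only <- gsub("([]\\~!@#$%^&*(){}_<>?[|-])", "\\\\\\1", x)
writeLines(subset_only)  # Scent: Fruit, vegan\!
```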