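Filter (grep) and trim a list in a data.frame to create new variables

Date: 2018-07-18 02:10:56

Tags: r lapply

I have a data.frame (tar) with a variable (clean.text) that holds a list of text strings for each row.

For example: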

[[2]]
 [1] "Dove go fresh Cucumber and Green Tea Beauty Bar combines the refreshing scent of cucumber and green tea with Dove's gentle cleansers and _ moisturizing cream. Dove Beauty Bar is proven to be more gentle and mild on skin than ordinary soap. It can be used on your hands and as a mild facial cleanser, so if you're also after a fresh face and refreshed hands throughout the day, why not try adding Dove Beauty Bar go fresh Cucumber and Green Tea to your skin care routine? Light, hydrating feel and refreshing formula that effectively nourishes skin. A refreshing shower can be just what you need to start the day off right. Dove's go fresh range blends nourishing ingredients and light, fresh scents in a formula that's gentle on your skin. Dove go fresh beauty bars give you a feeling of hydrating freshness that leaves you and your skin feeling blissfully revived. For best results: Your hands are one of the driest parts of your body so give them a boost and lather your Dove beauty bar between wet hands. Once you've covered your body with the rich lather, making sure to avoid contact with your eyes, rinse away thoroughly. At Dove, our vision is of a world where beauty is a source of confidence, and not anxiety. So, we are on a mission to help the next generation of women develop a positive relationship with the way they look - helping them raise their self-esteem and realize their full potential."
 [2] "Scent: Cucumber"
 [3] "Health Facts: Sulfate-free"
 [4] "Suggested Age: 5 Years and Up"
 [5] "Wellness Standard: Aluminum-free, paraben-free"
 [6] "Recommended Skin Type: Normal"
 [7] "Beauty Purpose: Moisturizing, basic cleansing"
 [8] "Package Quantity: 1"
 [9] "TCIN: 10819409"
[10] "UPC: 011111611023"
[11] "Item Number (DPCI): 049-00-0604"

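I have built a simple grepl function that trims each list down to the line matching a string and then (rather inelegantly) writes those values to a variable. For example, I might filter on the string "Health Facts:" and write the relevant data into a variable named health_facts.

Example: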

tar$health_facts <- lapply(tar$clean.text, function(l) l[grepl("Health Facts:", l)])
tar$health_facts <- gsub(".*: ", "", as.character(tar$health_facts))
tar$health_facts <- ifelse(tar$health_facts == 'character(0)', NA, tar$health_facts)  # Lists that don't contain health facts will say character(0)

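Rather than copy-pasting the same code, I would like to build a function and feed it two lists: the strings I want to grep for and the names of the corresponding variables I want to create.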

top_attribute_names<-c("Item Number (DPCI)","UPC","TCIN","Product Form","Package Quantity",
"Health Facts")

new_attribute_names<-c("DCPI","UPC","TCIN","product_form","package_quantity","health_facts")

I am trying to write a simple loop that filters the text field for the desired values and writes them to new variables:

for (i in seq_along(top_attribute_names)) {
    new_list<-lapply(tar$clean.text, function(l) l[grepl(top_attribute_names[i], l)]) # Write items that match into new list
    tar[new_attribute_names[i]]<-new_list[1] # Only take first row, just in case there is a text duplicate
}

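Update: grepl behaves differently when I reference an item from the list than when I paste the text directly into the same function. I am only asking for help getting this last part to work, though there are surely other opportunities for improvement.

Sample data:

> dput(tar$clean.text[1:10])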
list(c("Dove go fresh Cucumber and Green Tea Beauty Bar combines the refreshing scent of cucumber and green tea with Dove's gentle cleansers and _ moisturizing cream. Dove Beauty Bar is proven to be more gentle and mild on skin than ordinary soap. It can be used on your hands and as a mild facial cleanser, so if you're also after a fresh face and refreshed hands throughout the day, why not try adding Dove Beauty Bar go fresh Cucumber and Green Tea to your skin care routine? Light, hydrating feel and refreshing formula that effectively nourishes skin. A refreshing shower can be just what you need to start the day off right. Dove's go fresh range blends nourishing ingredients and light, fresh scents in a formula that's gentle on your skin. Dove go fresh beauty bars give you a feeling of hydrating freshness that leaves you and your skin feeling blissfully revived. For best results: Your hands are one of the driest parts of your body so give them a boost and lather your Dove beauty bar between wet hands. Once you've covered your body with the rich lather, making sure to avoid contact with your eyes, rinse away thoroughly. At Dove, our vision is of a world where beauty is a source of confidence, and not anxiety. So, we are on a mission to help the next generation of women develop a positive relationship with the way they look - helping them raise their self-esteem and realize their full potential.", 
    "Scent: Cucumber", "Health Facts: Sulfate-free", "Suggested Age: 5 Years and Up", 
    "Wellness Standard: Aluminum-free, paraben-free", "Recommended Skin Type: Normal", 
    "Beauty Purpose: Moisturizing, basic cleansing", "Package Quantity: 1", 
    "TCIN: 10819409", "UPC: 011111611023", "Item Number (DPCI): 049-00-0604"
    ), c("Me! Bath Bath Bomb Papaya Nectar 6 ct is a great idea to add to a spa gift basket. These bath bombs are like scoops for your bath to make mini bath ice cream that gives you super soft skin.", 
    "Scent: Fruit", "Health Facts: Vegan, paraben-free, aluminum-free", 
    "Product Form: Bath bomb", "Suggested Age: Adult Use Only", "Wellness Standard: Aluminum-free, cruelty-free, paraben-free, vegan", 
    "Recommended Skin Type: Normal", "Sustainability Claims: Cruelty-free", 
    "TCIN: 18828570", "UPC: 858858000358", "Item Number (DPCI): 037-08-1164"
    ), NA_character_, NA_character_, c("Aura Cacia pure essential oils in 4 fl oz Body Oil has a lavender and cocoa butter scent. This natural skin care oil shows skin tone improvement that you can feel.", 
    "Scent: Lavender, Cocoa Butter", "Health Facts: Contains lavender, butylparaben-free, phthalate-free, formaldehyde donor-free, formaldehyde-free, nonylphenol ethoxylate free, propylparaben-free, Sulfate-free, paraben-free, dye-free, aluminum-free", 
    "Product Form: Lotion", "Suggested Age: All Ages", "Recommended Skin Type: Normal", 
    "Beauty Purpose: Skin tone improvement", "Sustainability Claims: Not tested on animals, cruelty-free", 
    "TCIN: 50030689", "UPC: 051381911720", "Item Number (DPCI): 037-05-1378"
    ), c("Deep clean pores with the Facial Cleansing Brush from Eco", 
    "Tools. This compact brush features soft bristles for moderate exfoliation, leaving you with soft, supple skin. Your serums and moisturizers can more effectively penetrate your skin once all the dead skin cells are out of the way. The compact size is ideal for packing in your weekend tote or suitcase for cleansing on the go.", 
    "Material: Nylon", "Suggested Age: All Ages", "Beauty Purpose: Basic cleansing, exfoliating", 
    "TCIN: 52537254", "UPC: 079625074864", "Item Number (DPCI): 037-08-2254"
    ), c("Deep Steep Rosemary Mint Sugar Scrub gently exfoliates dead skin cells while moisturizing, leaving smooth, radiant, polished skin. This formula is made up of a smooth blend of shea butter, cocoa butter and carefully sourced sugar to give you light, blissful fragrance with just the right amount of exfoliation and no harsh scratching. Apply desired amount of Deep Steep Rosemary Mint Sugar Scrub to wet skin from shoulders to ankles. Massage in a circular motion. Rinse.", 
    "Scent: Rosemary", "Health Facts: Contains argan oil, contains coconut oil, contains shea butter, formaldehyde donor-free, gluten-free, dye-free, ethyl alcohol-free, paraben-free, phthalate-free, vegan", 
    "Product Form: Scrub", "Suggested Age: All Ages", "Recommended Skin Type: Dry, normal", 
    "Beauty Purpose: Exfoliating", "TCIN: 53242409", "UPC: 674749101153", 
    "Item Number (DPCI): 037-08-2123"), NA_character_, c("Want to feel gorgeously soft skin every day? Transform your daily shower into an irresistible treat with the exquisitely fragranced Caress Evenly Gorgeous body wash. Indulge your skin with a rich exfoliating lather delicately scented with burnt brown sugar and karite butter that makes this body wash smell good enough to eat. Subtle notes of soft crisp apple and berry open up to a bold floral heart, while rich scents of warm tonka bean, vanilla and balsam together round out the lush lather to leave you with perfectly buffed and glowing skin. Caress Evenly Gorgeous is a revitalizing body wash that blends rich, luxurious lather with expertly crafted fine fragrance It is a body wash that gently cleanses your skin to leave it delicately fragrant, beautifully soft.", 
    "Lather up and indulge in a deeply cleansing and reviving shower experience. With fine floral fragrance and gentle exfoliates, Caress Evenly Gorgeous will leave you feeling delicately perfumed and silky-smooth, making this the perfect body wash for every day? and every night. Caress body wash and beauty bar fragrances are crafted by the world's best perfumers to transform your daily shower into an indulging experience that will make you feel special every day?Scent: Fresh", 
    "Health Facts: Aluminum-free, paraben-free, fluoride-free", "Product Form: Liquid", 
    "Suggested Age: 5 Years and Up", "Wellness Standard: Aluminum-free, paraben-free", 
    "Recommended Skin Type: Normal", "Beauty Purpose: Basic cleansing", 
    "Package Quantity: 1", "TCIN: 13446229", "UPC: 011111014909", 
    "Item Number (DPCI): 049-00-0806"), c("Maintain a sanitary and healthy atmosphere with the MEDLINE n/a READYBATH, PREMIUM,FRAG FREE, 8/PK - 24pks. These sterile swab sticks are pre-treated with povidone-iodine for preparing skin for incision and other medical issues. Comes in disposable packages of 3.", 
    "Scent: Unscented", "Health Facts: No fragrance added", "Suggested Age: Adult Use Only", 
    "Recommended Skin Type: Normal", "Beauty Purpose: Basic cleansing", 
    "Package Quantity: 1", "TCIN: 14339945", "UPC: 080196731445", 
    "Item Number (DPCI): 037-13-0198"))`

1 answer:

Answer 0: (score: 0)

OK, after working through the issues above, I think this should do the trick:

top_attribute_names<-c("Item Number \\(DPCI\\)",
                       "UPC",
                       "TCIN", 
                       "Product Form",
                       "Package Quantity",
                       "Health Facts")
new_attribute_names<-c("DCPI",
                       "UPC",
                       "TCIN",
                       "product_form",
                       "package_quantity",
                       "health_facts")

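# Return the first element of x that matches `pattern`, or NA if none does
my_grep_function <- function(x, pattern){
    val <- grepl(pattern, x)
    if (any(val)) x[val][1] else NA_character_
}

for (i in seq_along(top_attribute_names)) {
    # Extra arguments to rapply() are passed on to my_grep_function
    tar[new_attribute_names[i]] <- rapply(tar$clean.text, my_grep_function,
                                          pattern = top_attribute_names[i])
}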

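So the first thing is to make sure that everything in top_attribute_names is a valid regular expression that matches the text you intend. Parentheses are special characters in a regex, which is why you need to double-escape them.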
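To see why the escaping matters, here is a small sketch on a toy vector (`x` is made up for illustration). Unescaped parentheses form a regex group, so the pattern effectively looks for the text `Item Number DPCI` and finds nothing; escaping them, or passing `fixed = TRUE`, matches the literal string.

```r
x <- c("UPC: 011111611023", "Item Number (DPCI): 049-00-0604")

# Unescaped: "(DPCI)" is a group, so the regex is effectively "Item Number DPCI"
unescaped <- grepl("Item Number (DPCI)", x)

# Escaped parentheses are matched literally
escaped <- grepl("Item Number \\(DPCI\\)", x)

# fixed = TRUE treats the whole pattern as a literal string
literal <- grepl("Item Number (DPCI)", x, fixed = TRUE)
```

If none of the search strings needs real regex features, `fixed = TRUE` may be the simpler option, since the names can then stay exactly as they appear in the text.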

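The function you ultimately call inside the apply could be condensed into one line, but that is really ugly, so I defined it separately. It also seems that not every entry in the data.frame has every field you are searching for, which is why the function falls back to NA when nothing matches.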

The most crucial point is that you want to use rapply, because you want to apply this function recursively to a list of lists.
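As a rough illustration of what rapply does here, the sketch below runs on a made-up three-element list (`toy`, `first_match`, `upc` and `upc_value` are hypothetical names, not part of the original data). rapply calls the function once per leaf and, with its default `how = "unlist"`, returns a plain vector with one value per row. The last line additionally strips the label, which the loop above does not do.

```r
toy <- list(c("Some description", "UPC: 123", "TCIN: 456"),
            NA_character_,
            c("Another description", "TCIN: 789"))

# Return the first element matching `pattern`, or NA when nothing matches
first_match <- function(x, pattern) {
    hit <- grepl(pattern, x)
    if (any(hit)) x[hit][1] else NA_character_
}

# rapply() calls first_match() on every leaf and unlists the results;
# the extra `pattern` argument is passed through to first_match()
upc <- rapply(toy, first_match, pattern = "^UPC")

# Drop everything up to and including the first ": "; NA entries stay NA
upc_value <- sub("^.*?:\\s+", "", upc)
```

Anchoring the pattern with `^` is an assumption that the label always starts the string; it keeps a long description that merely mentions "UPC" from matching.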

As an alternative, you can convert the list of lists into its own data.frame (following this very helpful answer):

max_len <- max(sapply(tar$clean.text, length))
# Pad every vector with NA so that all rows have max_len entries
corrected_text <- lapply(tar$clean.text, function(x) {c(x, rep(NA, max_len - length(x)))})
new_data_frame <- as.data.frame(do.call(rbind, corrected_text), stringsAsFactors = FALSE)

You need the first two lines because not every list in tar$clean.text has the same length. I think these would be the next steps:

names(new_data_frame) <- c("Slogan", sub(":.*", "", new_data_frame[1,2:12]))
names(new_data_frame) <- gsub(" ", "_", names(new_data_frame))

for(i in 3:11){
    new_data_frame[[i]] <- sub("^.*?:\\s+", "", new_data_frame[[i]])
}

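The downside of this second strategy is that, because the lists have unequal lengths and everything is padded with NA at the end, you will introduce some errors. For example, in the sample you provided, the DPCI of the 9th entry is pushed into the last column, which is NA for every other row. On the upside, it gets the first part into a usable form. But that can also be achieved another way: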
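The misalignment is easy to reproduce on a tiny made-up list (`toy`, `padded`, `wide` and `col3` are illustrative names only). The second entry has one extra description line, so after padding its fields sit one column further to the right than those of the first entry.

```r
toy <- list(c("Desc A", "Scent: Fruit", "UPC: 111"),
            c("Desc B part 1", "Desc B part 2", "Scent: Fresh", "UPC: 222"))

# Pad each vector with NA up to the longest length, then bind rows together
max_len <- max(lengths(toy))
padded <- lapply(toy, function(x) c(x, rep(NA, max_len - length(x))))
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)

# Column 3 now mixes fields: a UPC for row 1 but a Scent for row 2,
# and the last column is NA for the shorter row
col3 <- wide[[3]]
```

Any column naming based on the first row will therefore mislabel values in rows whose layout differs, which is why the label-based matching of the first approach is the safer of the two.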

get_first_function <- function(x){
    return(x[[1]])
}
tar$slogan <- rapply(tar$clean.text, get_first_function)

Run the do.call approach above, then run tar$slogan_2 <- new_data_frame[[1]]