Replacing strings from a list in R

Time: 2018-08-04 21:34:38

Tags: r

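I have two lists of strings and want to search a column of text, replacing each item from the first list with the corresponding item from the second. The second list is identical to the first, except that its items carry tags for HTML formatting.

I wrote a small loop that tries a grep-based replacement for each item in the first list, but it does not work properly. I also tried str_replace, to no avail.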

top_attribute_names<- c("Item Number \\(DPCI\\)", "UPC", "TCIN", "Product Form", "Health Facts", 
"Beauty Purpose", "Package Quantity", "Features", "Suggested Age", 
"Scent")

top_attributes_html<-ifelse(nchar(top_attribute_names)<30,paste("<b>",top_attribute_names,"</b>",sep=""),top_attribute_names) # List adding bold HTML tags for all strings with under 30 char

clean_free_description<-
c("Give your feathered friends a cozy new home with the Ceramic and Wood Birdhouse from Threshold. This simple birdhouse features a natural color scheme that helps it blend in with the tree you hang it from. The ceramic top is easy to remove when you want to clean out the birdhouse, while the small round hole lets birds in and keeps predators out. Sprinkle some seeds inside and watch your bird buddies become more permanent residents of your backyard.\nMaterial: Ceramic, Wood\nDimensions (Overall): 7.7 inches (H) x 8.5 inches (W) x 8.5 inches (L)\nWeight: 2.42 pounds\nAssembly Details: No assembly requiredpets subtype: Bird houses\nProtective Qualities: Weather-resistant\nMount Type: Hanging\nTCIN: 52754553\nUPC: 490840935721\nItem Number (DPCI): 084-09-3572\nOrigin: Imported\n", 
"House your parakeets in style with this Victorian-style bird cage. Featuring multiple colors and faux brickwork, the cage serves as a charming addition to your dcor. It's also equipped with two perches and feeding dishes, making it instantly functional.\nMaterial: Steel, Plastic\nDimensions (Overall): 21.5 inches (H) x 16.0 inches (W) x 16.0 inches (L)\nWeight: 15.0 pounds\nMaterial: Metal (Frame)\nIntended Pet Type: Bird\nIncludes: Feeding Dish, perch\nAssembly Details: Assembly required, no tools needed\nPets subtype: Bird cages\nBreed size: Small (0-25 pounds)\nSustainability Claims: Recyclable\nWarranty: 90 day limited warranty. To obtain a copy of the manufacturer's warranty for this item, please call Target Guest Services at 1-800-591-3869.\nWarranty Information:To obtain a copy of the manufacturer's warranty for this item, please call Target Guest Services at 1-800-591-3869.\nVictorian-style parakeet cage with 2 perches\nFeatures a molded base, a single front door and faux plastic brickwork\nMade of wire and plastic; 5/8\" spacing\nWash with soap and water18\nLx25.5\nHx18\nW\"TCIN: 10159211\nUPC: 048081002940\nItem Number (DPCI): 083-01-0167\n", 
"The Cockatiel Scalloped Top Bird Cage Kit is an ideal starter kit for cockatiels and other medium sized birds. Designer white scalloped style cage features large front door, easy to clean pull out tray, food and water dishes, wooden perches and swing. To help welcome and pamper your new bird, this starter kit also includes perch covers, kabob bird toy, cuttlebone, flavored mineral treat and a cement perch. Easy to assemble.\nMaterial: Metal\nDimensions (Overall): 27.25 inches (H) x 14.0 inches (W) x 18.25 inches (L)\nWeight: 11.0 pounds\nMaterial: Metal (Frame)\nIntended Pet Type: Bird\nPets subtype: Bird cages\nBreed size: All sizes\nTCIN: 16707833\nUPC: 030172016240\nItem Number (DPCI): 083-01-0248\n")

for(i in top_attribute_names){
  clean_free_description[grepl(i, clean_free_description)] <- top_attributes_html[i]
}

In theory, I think I could also use str_replace to do this:

clean_free_description<-str_replace(clean_free_description,top_attribute_names,top_attributes_html)

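However, this produces an error:

In stri_replace_first_regex(string, pattern, fix_replacement(replacement), : longer object length is not a multiple of shorter object length

Of course, I'm sure there is a better solution that adds the HTML tags by matching the strings with a regular expression and wrapping the match in text, which would eliminate a step. Unfortunately, I'm still weak at regex and haven't figured it out.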

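2 answers:

Answer 0: (score: 1)

You can try stringi::stri_replace_all_regex as shown below. Because of its length, I have not printed the full output here, but I have included a short example that demonstrates the basic behavior. Hopefully this is what you want.

Update: I added a benchmark of the stringi and stringr solutions, which makes clear why I did not stick with your original code and introduced stringi here instead.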

# vectorize_all = FALSE applies every pattern/replacement pair to each string in turn
stringi::stri_replace_all_regex(c("a", "b", "c"), c("b", "c"), c("x", "y"), vectorize_all = FALSE)
#[1] "a" "x" "y"

stringi::stri_replace_all_regex(clean_free_description, top_attribute_names, top_attributes_html, vectorize_all = FALSE)
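The role of vectorize_all may be easier to see on toy input. The following is a minimal sketch, assuming the stringi package is installed; the variable names are made up for illustration. With the default (vectorize_all = TRUE), string i is paired with pattern i and replacement i, and shorter vectors are recycled. That pairing is presumably what triggers the "longer object length" message in the question, where 3 descriptions meet 10 patterns.

```r
library(stringi)

# Default pairing: string k is matched only against pattern k
pairwise <- stri_replace_all_regex(
  c("a", "b", "c"),   # strings
  c("a", "b", "c"),   # patterns
  c("x", "y", "z")    # replacements
)
print(pairwise)       # "x" "y" "z"

# vectorize_all = FALSE: every pattern is applied, in order, to each string
sequential <- stri_replace_all_regex(
  "abc",
  c("a", "b", "c"),
  c("x", "y", "z"),
  vectorize_all = FALSE
)
print(sequential)     # "xyz"
```

Because the replacements are applied one after another, a later pattern can also match text inserted by an earlier replacement, so the order of the pattern vector can matter.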

library(stringr)
library(stringi)

f_stringr = function() {
   names(top_attributes_html) <- top_attribute_names
   str_replace_all(clean_free_description, top_attributes_html)
}

f_stringi = function() {
  stri_replace_all_regex(clean_free_description, top_attribute_names, top_attributes_html, vectorize_all = FALSE)
}

all.equal(f_stringr(), f_stringi())
# TRUE

microbenchmark::microbenchmark(
   f_stringr(), 
   f_stringi()
)
# Unit: microseconds
#        expr     min      lq      mean   median       uq      max neval
# f_stringr() 937.129 956.274 1041.7329 1053.579 1076.276 1296.743   100
# f_stringi() 122.767 128.491  136.6937  137.372  142.899  245.138   100
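For comparison, the loop from the question can also be repaired in base R. As far as I can tell it fails for two reasons: top_attributes_html[i] indexes an unnamed vector by a string and so returns NA, and the assignment overwrites the whole matching description instead of substituting inside it. The sketch below uses shortened, hypothetical stand-ins for the question's vectors (without the parenthesized entry, to keep the replacement strings free of backslashes) and loops over positions with gsub.

```r
# Shortened, hypothetical stand-ins for the question's data
attribute_patterns <- c("UPC", "TCIN")
attribute_html <- paste0("<b>", attribute_patterns, "</b>")
descriptions <- c(
  "Mount Type: Hanging\nTCIN: 52754553\nUPC: 490840935721\n",
  "Breed size: All sizes\nTCIN: 16707833\n"
)

tagged <- descriptions
for (k in seq_along(attribute_patterns)) {
  # gsub() substitutes inside every element; elements without a match stay unchanged
  tagged <- gsub(attribute_patterns[k], attribute_html[k], tagged)
}

cat(tagged, sep = "\n")
```

This version makes one pass over the text per pattern, so it is not expected to be faster than the vectorized calls benchmarked above; it is only meant to show what the original loop was aiming for.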

Answer 1: (score: 1)

I think this should do what you need:

library(stringr)
# Named vector: the names are the regex patterns, the values are their replacements
names(top_attributes_html) <- top_attribute_names
str_replace_all(clean_free_description, top_attributes_html)
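The question also asks whether the tags could be added in one step by matching with a regular expression and wrapping the match. A possible base R sketch is shown below. It assumes every attribute should be bolded (true for the list in the question, since all names are under 30 characters) and uses shortened, hypothetical sample data. The patterns are joined into a single alternation inside a capture group, and the backreference \\1 re-inserts whatever was matched.

```r
# Shortened, hypothetical stand-ins for the question's data
attribute_patterns <- c("Item Number \\(DPCI\\)", "UPC", "TCIN")
descriptions <- c(
  "Mount Type: Hanging\nTCIN: 52754553\nUPC: 490840935721\nItem Number (DPCI): 084-09-3572\n",
  "Breed size: All sizes\nTCIN: 16707833\n"
)

# One regex of the form (pattern1|pattern2|...)
combined <- paste0("(", paste(attribute_patterns, collapse = "|"), ")")

# \\1 stands for the text matched by the capture group, here wrapped in <b> tags
bolded <- gsub(combined, "<b>\\1</b>", descriptions)

cat(bolded, sep = "\n")
```

With this approach there is no need to build top_attributes_html at all. Note that short patterns such as "UPC" would also match inside longer words, so anchoring them (for example with a trailing colon) may be worth considering for real data.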