Looping to match and replace a list of strings that appear in different text formats

Time: 2018-10-24 22:08:45

Tags: r

I'm working on a problem that many people in data mining probably face as well. Here is a working example of it:

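I have two datasets.

  • One dataset (pa) has product attributes (product_attribute) and the user-facing content that will be rendered in a tooltip for each of them (tooltip_descriptions). For ease of reproduction, sample data for both are given in the two vectors below.
  • The second dataset (m) has a variable of text content (Description) that I scan for matches to the product attributes in pa.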

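Here's the catch: product_attribute terms can appear "in the wild" in several formats, and their capitalization can vary when they are embedded in blobs of text. For example, the string paraben-free might show up as: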

  • paraben-free
  • Paraben-free
  • Paraben-Free
  • Paraben Free
  • Paraben free

So I created several different versions of product_attributes to identify the different ways an attribute might appear in the corpus.

library(Hmisc) # assumption: capitalize() comes from Hmisc (R.utils provides an equivalent)

product_attributes <- c("paraben-free", "kosher", "aluminum-free", "gluten-free", "dye-free", "phthalate-free", "sulfate-free")
tooltip_filter_words <- product_attributes # lowercase hyphenated. EXAMPLE: paraben-free
tooltip_filter_spaced_words <- gsub("-", " ", tooltip_filter_words) # EXAMPLE: paraben free
tooltip_filter_caps_words <- capitalize(tooltip_filter_words) # EXAMPLE: Paraben-free
tooltip_filter_caps_spaced_words <- capitalize(tooltip_filter_spaced_words) # EXAMPLE: Paraben free
tooltip_filter_both_caps_spaced_words <- gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", tooltip_filter_spaced_words, perl = TRUE) # EXAMPLE: Paraben Free
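As an aside, the five spellings listed earlier can probably be covered by one case-insensitive pattern per attribute instead of five literal variants. This is only a minimal sketch in base R, not the approach used in this post; variant_patterns, samples and matched are hypothetical names introduced for illustration.

```r
# Sketch: one pattern per attribute instead of five literal variants.
product_attributes <- c("paraben-free", "kosher", "gluten-free")

# Allow the separator to be either a hyphen or a space.
variant_patterns <- gsub("-", "[- ]", product_attributes, fixed = TRUE)

# The five spellings of "paraben-free" listed in the post.
samples <- c("paraben-free", "Paraben-free", "Paraben-Free",
             "Paraben Free", "Paraben free")

# ignore.case = TRUE absorbs the capitalization differences.
matched <- grepl(variant_patterns[1], samples, ignore.case = TRUE)
print(matched)
```

Whether a single pattern is acceptable depends on the corpus; it would also match spellings such as "PARABEN FREE" that the explicit variants do not.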

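The next step for me is to insert additional text around the product_attributes as tooltips that explain what these terms mean. The corresponding tooltip descriptions are in this vector: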

tooltip_descriptions<-c("Parabens are a type of preservative usually used in health and beauty products to increase their shelf life by preventing the growth of mold and bacteria. Research has found that parabens can penetrate the skin and disrupt hormones like estrogen. This disruption has been linked to breast cancer and reproductive issues.", 
    "Foods that are labeled as kosher meet the requirements of Jewish law falling into three categories: meat, dairy, and pareve. Only meat from animals with split hooves and that chew their cud is permissible. Additionally, meat and dairy may not be prepared or eaten together. Pareve are all other food which are netiher meat nor dairy and can be eaten with either. ", 
    "Aluminum is used in health and skin care products such as antacids and antiperspirents often contain aluminum. A small amount may enter the body through skin contact through use of antipersprirents. Low levels of exposure is not harmful. However, scientific studies find mixed results on the risks of high levels of exposure. Some studies find that high levels of exposure may be associated with Alzheimers and Kidney Disease while others find no such health risk.", 
    "Gluten is a group of proteins found in wheat, rye, and barley. Glutens acts a glue that helps foods maintain their shape. About 1 in 100 people worldwide have celiacs disease, an immune condition that hinders the ability to digest gluten. Others may have a gluten sensitivity which results in symptoms similar to celiacs disease but it does not affect the immune system. ", 
    "Dyes are used to color food and medications. Some dyes have been linked to hyperactivity in children. Some research has also linked various dyes to cancer. The United Kingdon has placed a voluntary ban on dyes because of these potential risks. There is currently no such ban in the US. ", 
    "Phthalates are a group of compounds used to make plastic durable and flexible. People come into contact with phthalates by consuming food from plastic containers. Infants and young children may also come into contact with phthalates through plastic toys and hand-to-mouth behaviors. Once consumed, they pass quickly through urine. The effects of phthalates for humans are still unknown. Some test with lab animals show some effects on the reproductive system.", 
    "Sulfates are used in most shampoos, soaps, and household cleaning products to attract dirt and oil and remove them. They also produce a thick lather. However, they can also strip skin and hair of natural oils. In large amounts, such as found in industrial cleaners, they can be irritating to the skin. However, there are no known severe negative consequences of sulfates such as cancer or tumors. "
    )

Unsurprisingly, the text of a tooltip often contains the name of the product_attribute itself, because it is explaining that term!

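Finally, I created a simple HTML-like "wrapper" for the tooltip, which will be read by a WordPress plugin. Think of it as adding an HTML wrapper around each product_attribute found in the wild, with the matching entry of tooltip_descriptions filled in after the content= parameter.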

tooltip_positive_formatting <- "[simple_tooltip content='"

Then I simply run a loop that is as long as the number of filter words, using stringi to find and replace each of the different formats of the string.

for (i in seq_along(tooltip_filter_words)) {
  # Both capitalized spaced
  print(tooltip_filter_both_caps_spaced_words[i])
  tooltip_shortcode <- paste(tooltip_positive_formatting, tooltip_descriptions[i], "']", tooltip_filter_both_caps_spaced_words[i], "[/simple_tooltip]")
  m$Description <- stringi::stri_replace_all_regex(m$Description, tooltip_filter_both_caps_spaced_words[i], tooltip_shortcode, vectorize_all = FALSE) # Replace phrases with HTML
  # Capitalized hyphenated
  tooltip_shortcode <- paste(tooltip_positive_formatting, tooltip_descriptions[i], "']", tooltip_filter_caps_words[i], "[/simple_tooltip]")
  m$Description <- stringi::stri_replace_all_regex(m$Description, tooltip_filter_caps_words[i], tooltip_shortcode, vectorize_all = FALSE) # Replace phrases with HTML
  # Lowercase hyphenated
  tooltip_shortcode <- paste(tooltip_positive_formatting, tooltip_descriptions[i], "']", tooltip_filter_words[i], "[/simple_tooltip]")
  m$Description <- stringi::stri_replace_all_regex(m$Description, tooltip_filter_words[i], tooltip_shortcode, vectorize_all = FALSE) # Replace phrases with HTML
  # Capitalized spaced
  tooltip_shortcode <- paste(tooltip_positive_formatting, tooltip_descriptions[i], "']", tooltip_filter_caps_spaced_words[i], "[/simple_tooltip]")
  m$Description <- stringi::stri_replace_all_regex(m$Description, tooltip_filter_caps_spaced_words[i], tooltip_shortcode, vectorize_all = FALSE) # Replace phrases with HTML
  # Lowercase spaced
  tooltip_shortcode <- paste(tooltip_positive_formatting, tooltip_descriptions[i], "']", tooltip_filter_spaced_words[i], "[/simple_tooltip]")
  m$Description <- stringi::stri_replace_all_regex(m$Description, tooltip_filter_spaced_words[i], tooltip_shortcode, vectorize_all = FALSE) # Replace phrases with HTML
}

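Here's the problem: because the loop runs sequentially over a Description that earlier passes have already modified, it sometimes matches text that is part of the tooltip content it has just inserted. For example, both tooltip_descriptions and Description contain the word "kosher", so repeated passes make a mess.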
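The effect can be reproduced on a single string. The following is a minimal sketch, assuming a shortened kosher description and using base gsub() with fixed = TRUE as a stand-in for stringi::stri_replace_all_regex; desc, text, wrap and count_tags are hypothetical names introduced for illustration.

```r
# Shortened stand-in for the kosher tooltip description (note the lowercase "kosher").
desc <- "Foods that are labeled as kosher meet the requirements of Jewish law."
text <- "Snickers really satisfies. Kosher."

# Build the shortcode that wraps one matched term.
wrap <- function(term) {
  paste0("[simple_tooltip content='", desc, "']", term, "[/simple_tooltip]")
}

# Pass 1: the capitalized variant wraps the term in the product text.
step1 <- gsub("Kosher", wrap("Kosher"), text, fixed = TRUE)

# Pass 2: the lowercase variant now matches inside the description
# that pass 1 inserted, nesting a second tooltip in the first.
step2 <- gsub("kosher", wrap("kosher"), step1, fixed = TRUE)

# Count the opening tags after each pass.
count_tags <- function(x) lengths(gregexpr("[simple_tooltip content=", x, fixed = TRUE))
print(count_tags(step1))
print(count_tags(step2))
```

After the first pass there is one opening tag; after the second there are two, because the inserted description itself contained a match.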

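Note that I can't change the format of the descriptions, because my content will be uploaded back to a user-facing website. I also have no alternative technical solution for getting tooltips onto the site. But I'm sure there is a smarter way to do this in R.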
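One direction that might avoid the re-matching is to locate every attribute in a single pass over the original text and only then substitute, so that inserted descriptions are never scanned. This is a sketch under stated assumptions rather than a tested solution for the full dataset: it uses base R's gregexpr() and regmatches<-, the function add_tooltips and its argument names are hypothetical, and the descriptions are shortened stand-ins.

```r
# Sketch: single-pass tooltip insertion (hypothetical helper, base R only).
add_tooltips <- function(text, terms, descriptions,
                         open_tag = "[simple_tooltip content='") {
  # One alternation covering all terms; hyphen or space as separator.
  patterns <- gsub("-", "[- ]", terms, fixed = TRUE)
  combined <- paste0("\\b(", paste(patterns, collapse = "|"), ")\\b")

  # Match positions are computed once, on the unmodified text.
  hits <- gregexpr(combined, text, ignore.case = TRUE, perl = TRUE)

  regmatches(text, hits) <- lapply(regmatches(text, hits), function(found) {
    if (length(found) == 0) return(character(0))
    # Normalize each match back to its lowercase hyphenated term.
    key <- tolower(gsub(" ", "-", found, fixed = TRUE))
    idx <- match(key, terms)
    # Keep the spelling found in the text as the visible label.
    paste0(open_tag, descriptions[idx], "']", found, "[/simple_tooltip]")
  })
  text
}

terms <- c("kosher", "gluten-free")
descriptions <- c("Foods that are labeled as kosher meet the requirements of Jewish law.",
                  "Gluten is a group of proteins found in wheat, rye, and barley.")
text <- c("Snickers really satisfies. Kosher.",
          "Gluten free\nWith other natural flavors",
          "no match here")

result <- add_tooltips(text, terms, descriptions)
cat(result, sep = "\n\n")
```

Because the replacement text is spliced in at positions found before any change was made, the word "kosher" inside the inserted description is not wrapped again. Terms containing regex metacharacters would need escaping before being pasted into the pattern.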

Here is some sample text from Description:

c("Sharing Size\nResealable Zipper\\!\nMars Real Chocolate\nKosher\nMars, Incorporated\n", 
"14 g whole grains per serving\nSee nutrition facts for sodium content\nGluten free\nWith other natural flavors\nPer serving: 130 calories, 0.\n5 g sat fat, 390 mg sodium, 1 g sugars\nA delicious snack with a light and crispy crunch. Made with rice and grains, we add a little heat and pressure and pop\\! A satisfying, crunchy snack you can feel good about.\n2016 The Quaker Oats Company\n", 
"Made with real fruit\nEqual to 20\\% fruit100\\% vitamin CFat free\nGluten free\nNaturally \\& artificially flavored\nAssorted Disney Princess pieces inside: Ariel, Belle, Anna, Snow White, Jasmine, Cinderella, Elsa, Rapunzel, Aurora.\n2016 Kellogg NA Co.\n", 
"Milk chocolate with smooth caramel filling\nBelgium 1926\nIntroducing Godiva Masterpieces - Godivaâ\u0080\u0099s most exquisite chocolates now available in delightful individually wrapped mini chocolates, perfect to enjoy anytime. Each masterpiece is crafted in the shape of a signature chocolate and filled with smooth and creamy fillings that melt in your mouth.\nThe Lion of Belgium is inspired by Godiva's most cherished chocolate, a majestic chocolate â\u0080\u009cshieldâ\u0080\u009d proudly embossed with the Belgian coat of arms. A creamy milk chocolate filled with a sublime caramel filling.\nHalal\nKosher\nTo ensure product quality, please keep this package stored in a cool place at or below 65°F (18°C). Â\nGodiva\n", 
"High potency B vitamins help convert food into energy\\*Advanced beauty formula\nHigh potency biotin 5000 mcg\nCollagen types I \\& IIISilica\nHyaluronic acid100\\%+ DV 19 vitamins and minerals\nGluten free\nNo sugar, salt, yeast, wheat, soy, synthetic flavors or preservatives\nTake 2 softgels daily. If pregnant, nursing, taking any medications or have kidney dysfunction, consult a healthcare professional before use. Â\n2016 Nature's Way Brands, LLC\n", 
"Milk chocolate, peanuts, caramel, nougat\nFun size\nPerfect for snacking\n80 calories per bar\nPacked with peanuts. Snickers really satisfies. Kosher.\nMars, Incorporated\n", 
"Helps support your immune system\\*\nBlast of vitamin C plus 9 vitamins, minerals \\& herbs\nNaturally and artificially flavored\nTake up to 3x per day\n1,000 mg of vitamin C. With antioxidants (vitamins C \\& E). 35 mg of herbal blend including echinacea \\& ginger. Gluten free.\nAdults and children 14 years and older take 3 gummies. Chew thoroughly before swallowing. Repeat as necessary up to 3 times per day, no more than 9 gummies per day. Children 12 to 13 years of age take 3 gummies. Chew thoroughly before swallowing. Repeat as necessary up to 2 times per day, no more than 6 gummies per day. Not for younger children due to the risk of choking. Store in a cool, dry place. Â\n2016 RB\n", 
"America's pretzel bakery since 1909\nFilled with creamy goodness\nYou'll love our Cheddar Cheese Pretzel Sandwiches. Imagine real, tangy cheddar cheese sandwiched between two bite-sized pretzel snaps. They're a delicious snack on-the-go for kids and grown-ups. One taste and we think you'll agree. Nobody makes bakery pretzels like Snyder's of Hanover. Kosher dairy.\nMade in USA\nSnyder's-Lance, Inc.\n", 
"Made with fresh ricotta \\& cream\nIn Parma, cheese-making has been raised to an art form. Here, fresh ricotta, provolone and aged Parmesan and Romano cheeses add a wonderful richness to this red ripened tomato sauce. Gluten free.\nSimmer and serve\nRefrigerate after opening\nBest if used within 5 days Â\nKraft Heinz Sauces \\& Frozen\n", 
"Fair Trade\nOwned by cocoa farmers\nMade for chocolate lovers\nCocoa solids 38\\% minimum in chocolate. Milk solids 16\\% minimum in chocolate. Fair Trade certified: sugar, cocoa, vanilla; traded in compliance with Fair Trade standards; total 79\\% of the product's ingredients. GMO free. Kosher.\nProduct of Germany\nStore in a cool, dry place.\n"
)

Thanks for your help!

0 answers:

No answers