Question

I'm trying to recode a large amount of text data into text or numeric values.

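My dataset includes the names of coffee shops. I'd like to recode these coffee shops as "corporation" or "small business". The problem is that the shop names are spelled in various ways (e.g., starbucks vs. starbcks, starbucks coffee). I'd like to write code that scans the dataset for the string "star" and recodes the matching entries as "corporation".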

Sample data:

library(data.table)

customers <- data.table(customer_id = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5), 
                        store = c("starbcks", "peets", "coffee bean", "drnk", "starbucks", "coffee ben", "coffee bean", "coffee bean", "drnk", "starbucks coffee"))

I'd like to recode the "store" column into a "type" column, which I would then factor and recode as a numeric value.

customers <- data.table(customer_id = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5), 
                        store = c("starbcks coffee", "portfolios", "coffee bean", "sharkhead", "starbucks", "coffee ben", "cuppa cuppa", "coffee bean", "drnk", "starbucks coffee"),
                        type = c("corporation", "small business", "corporation", "small business", "corporation", "corporation", "small business", "corporation", "corporation", "corporation"),
                        rc_type = c(1, 2, 1, 2, 1, 1, 2, 1, 1, 1))
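
For reference, a minimal sketch of the kind of pattern-based recoding described above, using base R's `grepl()` inside a data.table assignment. The pattern `"star|coffee be"` is a hypothetical stand-in for whatever substrings identify the corporate chains, and the numeric codes (1 = corporation, 2 = small business) are assumed from the desired output. It runs on the first sample table, so it does not reproduce the desired table row for row.

```r
library(data.table)

customers <- data.table(
  customer_id = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  store = c("starbcks", "peets", "coffee bean", "drnk", "starbucks",
            "coffee ben", "coffee bean", "coffee bean", "drnk", "starbucks coffee")
)

# Hypothetical pattern: substrings assumed to identify corporate chains.
# "star" catches starbcks / starbucks / starbucks coffee;
# "coffee be" catches coffee bean and the misspelled coffee ben.
corp_pattern <- "star|coffee be"

# grepl() is TRUE wherever the pattern occurs anywhere in the store name.
customers[, type := ifelse(grepl(corp_pattern, store, ignore.case = TRUE),
                           "corporation", "small business")]

# Fix the factor levels explicitly so the integer codes are predictable:
# 1 = corporation, 2 = small business.
customers[, rc_type := as.integer(factor(type,
                                         levels = c("corporation", "small business")))]

print(customers)
```

Substring matching like this only handles misspellings that still contain the pattern. A name such as "sbux" would fall through to "small business", so the pattern list would need to be extended, or replaced with approximate matching such as base R's `agrepl()`.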

I've looked at the stringr package and tried the standard ways of recoding, but to no avail. Any help is appreciated. Thanks!

How to recode text that contains specific text

0 answers: