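I am trying to recode a large amount of text data into either text categories or numeric values.
My dataset includes the names of coffee shops. I want to recode each shop as "corporation" or "small business". The problem is that the shop names are spelled inconsistently (for example, starbcks vs. starbucks vs. starbucks coffee). I want to write code that scans the dataset for the string "star" and recodes any match as "corporation".
Sample data: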
library(data.table)
customers <- data.table(customer_id = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                        store = c("starbcks", "peets", "coffee bean", "drnk", "starbucks", "coffee ben", "coffee bean", "coffee bean", "drnk", "starbucks coffee"))
I want to recode the "store" column into a "type" column, which I will then tally and recode as a numeric value.
customers <- data.table(customer_id = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                        store = c("starbcks coffee", "portfolios", "coffee bean", "sharkhead", "starbucks", "coffee ben", "cuppa cuppa", "coffee bean", "drnk", "starbucks coffee"),
                        type = c("corporation", "small business", "corporation", "small business", "corporation", "corporation", "small business", "corporation", "corporation", "corporation"),
                        rc_type = c(1, 2, 1, 2, 1, 1, 2, 1, 1, 1))
I have looked at the stringr package and tried the standard ways of recoding, but to no avail. Any help is appreciated. Thanks!
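For illustration, here is a minimal sketch of the kind of pattern-based recode described above, using the first sample table. The pattern `corp_pattern` is a hypothetical choice (it treats anything containing "star" or "coffee be" as a corporation), so it is an assumption about which substrings identify a chain, not a definitive rule. The 1/2 coding follows the `rc_type` column shown above.

```r
library(data.table)

# Sample data from the question
customers <- data.table(
  customer_id = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  store = c("starbcks", "peets", "coffee bean", "drnk", "starbucks",
            "coffee ben", "coffee bean", "coffee bean", "drnk", "starbucks coffee")
)

# Hypothetical pattern (an assumption): substrings that mark a corporate chain.
# "star" catches starbcks/starbucks; "coffee be" catches coffee bean/coffee ben.
corp_pattern <- "star|coffee be"

# Flag rows whose store name matches the pattern, then label them
customers[, type := fifelse(grepl(corp_pattern, store, ignore.case = TRUE),
                            "corporation", "small business")]

# Numeric recode: 1 = corporation, 2 = small business
customers[, rc_type := fifelse(type == "corporation", 1, 2)]

# Tally the result by type
customers[, .N, by = type]
```

A partial substring such as "star" tolerates misspellings after the matched part, but it will also match unrelated names that happen to contain it. If the misspellings are less predictable, base R's `agrepl()` (approximate matching) may be worth trying instead of `grepl()`.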