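Fuzzy matching and replacing strings in a data frame using a custom dictionary

Time: 2019-01-26 15:58:15

Tags: r fuzzy-search fuzzy-comparison

I have a data frame similar to this (strings with small differences in spelling):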

 place1 <- c("pondichery ", "Pondichery", "Pondichéry", "Port-Louis", "Port Louis  ")
 place2 <- c("Lorent", "Pondichery", " Lorient", "port-louis", "Port Louis")
 place3 <- c("Loirent", "Pondchéry", "Brest", "Port Louis", "Nantes")

 places2clean <- data.frame(place1, place2, place3)

This is my custom dictionary:

  dictionnary <- c("Pondichéry", "Lorient", "Port-Louis", "Nantes", "Brest")

  dictionnary <- data.frame(dictionnary)

I want to match all strings against the custom dictionary and replace them. The expected result is the same data frame with every string replaced by its matching dictionary entry.

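How can I use string distance to match and replace across the whole data frame?

2 answers:

Answer 0: (score: 3)

Either the base R function adist or the stringdist::amatch function will work here. There is no reason to turn your dictionary into a data.frame, so I have not done so.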

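If you want to experiment, the stringdist package offers different distance methods, although the default works fine here. Note that both functions select the best match, but return NA when there is no close match (as defined by the maxDist parameter).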
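To make the NA behaviour concrete, here is a minimal sketch using only base R's adist. The cutoff `max_dist` and the query "Marseille" are hypothetical values chosen for illustration; they are not part of the question.

```r
dictionary <- c("Pondichéry", "Lorient", "Port-Louis", "Nantes", "Brest")
queries <- c("Lorent", "Marseille")  # "Marseille" is a made-up value with no close entry

# Generalized Levenshtein distances: one row per query, one column per dictionary entry
dist_matrix <- adist(queries, dictionary)

max_dist <- 2  # hypothetical cutoff, stricter than the answer's default of 5

# Index and distance of the nearest dictionary entry for each query
best <- apply(dist_matrix, 1, which.min)
best_dist <- dist_matrix[cbind(seq_along(queries), best)]

# Keep the nearest entry only when it is within the cutoff, otherwise NA
matched <- ifelse(best_dist <= max_dist, dictionary[best], NA_character_)
matched
```

With this cutoff the first query should resolve to "Lorient" and the second should come back as NA, since no dictionary entry is within two edits of it.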

library(stringdist)
# Using stringdist package
clean_places <- function(places, dictionary, maxDist = 5) {
  dictionary[amatch(places, dictionary, maxDist = maxDist)]
}

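# Using base R
clean_places2 <- function(places, dictionary, maxDist = 5) {
  sm <- adist(places, dictionary)
  sm[sm > maxDist] <- NA
  # which.min() returns integer(0) for an all-NA row, so map that case to NA explicitly
  idx <- apply(sm, 1, function(d) if (all(is.na(d))) NA_integer_ else which.min(d))
  dictionary[idx]
}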

dictionary <- c("Pondichéry", "Lorient", "Port-Louis", "Nantes", "Brest")
place1 <- c("pondichery ", "Pondichery", "Pondichéry", "Port-Louis", "Port Louis  ")
place2 <- c("Lorent", "Pondichery", " Lorient", "port-louis", "Port Louis")
place3 <- c("Loirent", "Pondchéry", "Brest", "Port Louis", "Nantes")

clean_places(place1, dictionary)
# [1] "Pondichéry" "Pondichéry" "Pondichéry" "Port-Louis" "Port-Louis"
clean_places(place2, dictionary)
# [1] "Lorient"    "Pondichéry" "Lorient"    "Port-Louis" "Port-Louis"
clean_places(place3, dictionary)
# [1] "Lorient"    "Pondichéry" "Brest"      "Port-Louis" "Nantes"    

clean_places2(place1, dictionary)
# [1] "Pondichéry" "Pondichéry" "Pondichéry" "Port-Louis" "Port-Louis"
clean_places2(place2, dictionary)
# [1] "Lorient"    "Pondichéry" "Lorient"    "Port-Louis" "Port-Louis"
clean_places2(place3, dictionary)
# [1] "Lorient"    "Pondichéry" "Brest"      "Port-Louis" "Nantes"    
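Since the question asks about the whole data frame, one possible way to apply such a function to every column at once is lapply. The sketch below restates the base R matcher under the hypothetical name `match_to_dictionary` so that it runs on its own; it is an illustration, not part of the original answer.

```r
# Base R matcher, restated so this block is self-contained (hypothetical name)
match_to_dictionary <- function(places, dictionary, maxDist = 5) {
  sm <- adist(places, dictionary)
  sm[sm > maxDist] <- NA
  # Return NA for rows with no entry within maxDist
  idx <- apply(sm, 1, function(d) if (all(is.na(d))) NA_integer_ else which.min(d))
  dictionary[idx]
}

dictionary <- c("Pondichéry", "Lorient", "Port-Louis", "Nantes", "Brest")
places2clean <- data.frame(
  place1 = c("pondichery ", "Pondichery", "Pondichéry", "Port-Louis", "Port Louis  "),
  place2 = c("Lorent", "Pondichery", " Lorient", "port-louis", "Port Louis"),
  place3 = c("Loirent", "Pondchéry", "Brest", "Port Louis", "Nantes"),
  stringsAsFactors = FALSE
)

# Assigning into places2clean[] keeps the data frame shape and column names
places2clean[] <- lapply(places2clean, match_to_dictionary, dictionary = dictionary)
places2clean
```

Extra arguments to lapply are passed on to the function, so `dictionary = dictionary` reaches every call while each column is supplied as `places`.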

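Answer 1: (score: 1)

The approach below first computes the distance matrix between each column and the dictionary, then picks the dictionary string with the smallest distance.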

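library(stringdist)

# Strip leading/trailing whitespace before computing distances
places2clean[] <- lapply(places2clean, trimws)

# as.character() guards against the column being a factor (the default before R 4.0)
dict <- as.character(dictionnary$dictionnary)

# For each column: rows are the strings to clean, columns are dictionary entries
d <- lapply(places2clean, function(x) {
  sapply(dict, function(y) stringdist(x, y))
})
# For each row, pick the dictionary entry with the smallest distance
res <- sapply(d, function(x){
  inx <- apply(x, 1, which.min)
  dict[inx]
})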

as.data.frame(res)
#      place1     place2     place3
#1 Pondichéry    Lorient    Lorient
#2 Pondichéry Pondichéry Pondichéry
#3 Pondichéry    Lorient      Brest
#4 Port-Louis Port-Louis Port-Louis
#5 Port-Louis Port-Louis     Nantes
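The distance matrix described above can also be inspected directly, which helps when checking why a given string was mapped to a given entry. This is a rough sketch that swaps in base R's adist (generalized Levenshtein) for stringdist, so individual distances may differ slightly from the answer's; the names `dist_matrix` and `audit` are made up for illustration.

```r
dictionary <- c("Pondichéry", "Lorient", "Port-Louis", "Nantes", "Brest")
place3 <- c("Loirent", "Pondchéry", "Brest", "Port Louis", "Nantes")

# Rows are the strings to clean, columns are the dictionary entries
dist_matrix <- adist(place3, dictionary)
dimnames(dist_matrix) <- list(place3, dictionary)

# Nearest entry and its distance for every row
nearest_idx <- apply(dist_matrix, 1, which.min)
audit <- data.frame(
  original = place3,
  nearest  = dictionary[nearest_idx],
  distance = unname(apply(dist_matrix, 1, min)),
  stringsAsFactors = FALSE
)
audit
```

Rows with a distance of 0 were already exact dictionary entries; larger distances point at the replacements worth reviewing by hand.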