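Match word patterns and replace words with NA in R

Time: 2018-01-16 11:24:02

Tags: r

I have two data frames in R: one is a list of names and the other is a dictionary of words. If any part of a name matches a word in the dictionary, the name should be replaced by NA; otherwise the name is returned unchanged.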

Names - data frame

Name
Louis
Messi
duplessis
Jegan
Praveen

Word dictionary - data frame

Dictionary
vee
sis

Expected output

Name
Louis
Messi
NA
Jegan
NA

1 answer:

Answer 0: (score: 2)

library(data.table) # needed library

# create data
dt <- data.table("Name"=c("Louis",
                          "Messi",
                          "duplessis",
                          "Jegan",
                          "Praveen"))
dict <- c("vee","sis")

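# combine the dictionary words into a single regex alternation: "vee|sis"
dict_2 <- paste0(dict, collapse = "|")
# set processed_Name to NA where Name matches any dictionary word
# (%like% is data.table's wrapper around grepl)
dt[, processed_Name := ifelse(Name %like% dict_2, NA, Name)]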

Output

        Name processed_Name
1:     Louis          Louis
2:     Messi          Messi
3: duplessis             NA
4:     Jegan          Jegan
5:   Praveen             NA
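
For readers without data.table, the same matching step can be sketched in base R. This is only an illustration of the idea above, not part of the original answer; the names `names_df`, `pattern` and `hit` are made up for the example.

```r
# Base R sketch of the same idea; `names_df`, `pattern` and `hit` are illustrative names.
names_df <- data.frame(
  Name = c("Louis", "Messi", "duplessis", "Jegan", "Praveen"),
  stringsAsFactors = FALSE
)
dict <- c("vee", "sis")

# Join the dictionary words into one regular expression: "vee|sis"
pattern <- paste(dict, collapse = "|")

# TRUE where any dictionary word occurs anywhere inside the name
hit <- grepl(pattern, names_df$Name)

# Replace matching names with NA, keep the others unchanged
names_df$processed_Name <- ifelse(hit, NA_character_, names_df$Name)
print(names_df)
```

Using `NA_character_` keeps the new column explicitly of type character, which matters if every name happens to match.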

Update based on the OP's comment

# Changed the input a bit so that it contains digits that also
# appear in the dictionary generated below.
dt <- data.table("Name"=c("Loui1s",
                          "Messi",
                          "duple2ssis",
                          "Jegan",
                          "Praveen"))

dict_all <- as.character(c(1:5000)) # generate numbers so that all dictionary entries are different
# split the dictionary into chunks of 1000 entries and build one pattern per chunk
dict_split <- split(dict_all, ceiling(seq_along(dict_all)/1000))
dict_split_2 <- lapply(dict_split, function(x){paste0(x, collapse = "|")})
# a name becomes NA if it matches the pattern of any of the five chunks
dt[,processed_Name_2:=ifelse(Name%like%dict_split_2[[1]] | Name%like%dict_split_2[[2]] |
                               Name%like%dict_split_2[[3]] | Name%like%dict_split_2[[4]] |
                               Name%like%dict_split_2[[5]],NA,Name)]
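
The chunked check above hardcodes five chunks. A possible generalisation, sketched here in base R under the assumption that the number of chunks is not known in advance, combines the per-chunk matches with `Reduce`. The names `names_vec`, `chunk_size`, `patterns` and `any_hit` are hypothetical and not from the original answer.

```r
# Sketch: handle any number of dictionary chunks; all variable names are illustrative.
names_vec <- c("Loui1s", "Messi", "duple2ssis", "Jegan", "Praveen")
dict_all <- as.character(1:5000)
chunk_size <- 1000

# Split the dictionary into chunks and build one "a|b|c" pattern per chunk
chunks <- split(dict_all, ceiling(seq_along(dict_all) / chunk_size))
patterns <- vapply(chunks, paste, character(1), collapse = "|")

# One logical vector per chunk, then an element-wise OR across all chunks
hits <- lapply(patterns, function(p) grepl(p, names_vec))
any_hit <- Reduce(`|`, hits)

processed <- ifelse(any_hit, NA_character_, names_vec)
print(processed)
```

With this form, changing `chunk_size` or the dictionary length does not require editing the matching expression.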