Question

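I'm stuck in a nightmare: I have searched the forums without success, so I am asking directly.

I have a vector of irregular strings that mention random cities, and I want to tag each irregular string with the matching value from a key vector of city names. For example,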

Vector <- c("...the life in Paris is ...","In Roma, there is...","...nice weekend in New York with...")
Cities <- c("London","Paris","Madrid","Roma","New York")

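For each string in Vector, there should be a corresponding value from Cities.

I first considered a loop, but with data of this size the search in R takes too long. I would rather use grep for a simple vectorized match, but I keep running into errors.

Do you know whether this is the right approach?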

Answer 1

You can use sapply and grepl:

check_vec <- sapply(Cities, grepl, Vector)
row.names(check_vec) <- Vector

check_vec
#                                    London Paris Madrid  Roma New York
#...the life in Paris is ...          FALSE  TRUE  FALSE FALSE    FALSE
#In Roma, there is...                 FALSE FALSE  FALSE  TRUE    FALSE
#...nice weekend in New York with...  FALSE FALSE  FALSE FALSE     TRUE

If you need the matching keyword for each string:

apply(check_vec, 1, function (x) colnames(check_vec)[which(x)])
#        ...the life in Paris is ...                In Roma, there is... ...nice weekend in New York with... 
#                            "Paris"                              "Roma"                          "New York"
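
The apply() call above returns a tidy character vector only when every string matches exactly one city. As a minimal sketch of one way to handle the other cases, the example below adds two made-up strings (one with two cities, one with none) and collapses the matches into a single label per string. The extra strings, the `labels` name, and the use of `fixed = TRUE` are assumptions for illustration, not part of the original answer.

```r
Vector <- c("...the life in Paris is ...",
            "In Roma, there is...",
            "from London to Madrid by train",   # made-up: two cities
            "no city mentioned here")           # made-up: no city
Cities <- c("London", "Paris", "Madrid", "Roma", "New York")

# fixed = TRUE treats each city name as a literal string, not a regex
check_vec <- vapply(Cities, grepl, logical(length(Vector)),
                    x = Vector, fixed = TRUE)

# One comma-separated label per string; NA when nothing matches
labels <- apply(check_vec, 1, function(x) {
  hits <- colnames(check_vec)[x]
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = ", ")
})
labels
```

Collapsing to one string keeps the result a plain character vector; if the individual matches are needed separately, returning `hits` unchanged gives a list instead.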

Edit

For a safer approach, as @nicola wisely suggested, you can use vapply instead of sapply:

check_vec <- vapply(Cities, grepl, x=Vector, logical(length(Vector)))
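
To illustrate why vapply is considered safer here, the sketch below compares the two functions on an empty key vector, a hypothetical edge case that is not in the original question. sapply has nothing to simplify and falls back to a list, while vapply still returns a logical matrix with one row per string.

```r
Vector <- c("...the life in Paris is ...", "In Roma, there is...")
no_cities <- character(0)  # hypothetical empty key vector

loose  <- sapply(no_cities, grepl, Vector)
strict <- vapply(no_cities, grepl, logical(length(Vector)), x = Vector)

class(loose)   # a list, because there is nothing to simplify
dim(strict)    # 2 rows and 0 columns, still a logical matrix
```

Code that later calls apply() or colnames() on the result keeps working with the vapply version, whereas the sapply version would fail on the list.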

Answer 2

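Here is a way to do it with the text analysis package quanteda. It lets you set up a set of pattern matches for the city names, which helps if you have variant spellings of a city (e.g. "Roma" and "Rome") but want to count them as a single city. The matching below uses the simplified "glob" format, but you can also use regular expression matching.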

require(quanteda)

# only required if you have compound word city names
compoundCities <- dictionary(list(NY = "New York"))
VectorPhrased <- phrasetotoken(Vector, compoundCities)

# uses the "glob" format for Pattern Matching
citiesDict <- dictionary(list(London = c("London", "Londres"), Paris = "Paris", 
                              Rome = "Rom?", NewYork = "New_York"))

dfm(VectorPhrased, dictionary = citiesDict, verbose = FALSE)
# Document-feature matrix of: 3 documents, 4 features.
# 3 x 4 sparse Matrix of class "dfmSparse"
#        features
# docs    London Paris Rome NewYork
#   text1      0     1    0       0
#   text2      0     0    1       0
#   text3      0     0    0       1
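
For comparison, the same dictionary idea can be roughly sketched in base R, without quanteda, by giving each key a regular expression that covers its spelling variants. The `cities_rx` vector, its patterns, and the `first_hit` name are assumptions made for this illustration. Note that `Rom[ae]` is narrower than the glob `Rom?` used above.

```r
Vector <- c("...the life in Paris is ...",
            "In Roma, there is...",
            "...nice weekend in New York with...")

# Hypothetical dictionary: one regular expression per city key
cities_rx <- c(London  = "\\b(London|Londres)\\b",
               Paris   = "\\bParis\\b",
               Rome    = "\\bRom[ae]\\b",
               NewYork = "\\bNew York\\b")

# Rows are strings, columns are dictionary keys
hits <- vapply(cities_rx, grepl, logical(length(Vector)),
               x = Vector, perl = TRUE)

# Label each string with the first matching key (NA if none)
first_hit <- apply(hits, 1, function(x) colnames(hits)[which(x)[1]])
first_hit
```

Because the regular expression can contain a space, no separate compound-word step is needed in this variant.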

R: matching a key-value vector against a vector of irregular strings
