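I have a character vector that is the result of some PDF scraping with pdftotext (the command-line tool).
Everything lines up (blissfully) nicely. However, the vector is riddled with a kind of whitespace that eludes my regular expressions: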
> test
[1] "Address:" "Clinic Information:" "Store " "351 South Washburn" "Aurora Quick Care"
[6] "Info" "St. Oshkosh, WI 54904" "Phone: 920‐232‐0718" "Pewaukee"
> grepl("[0-9]+ [A-Za-z ]+",test)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> dput(test)
c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
"Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
"Pewaukee")
> test.pasted <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
+ "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
+ "Pewaukee")
> grepl("[0-9]+ [A-Za-z ]+",test.pasted)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
> Encoding(test)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
> Encoding(test.pasted)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8" "unknown"
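Apparently some characters are not preserved by dput, as in this question:
How to properly dput internationalized text?
I can't copy/paste the entire vector. How do I search for and destroy this non-whitespace whitespace?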
Edit
Clearly I wasn't being clear, because the answers are all over the place. Here is a simpler test case:
> grepl("Clinic Information:", test[2])
[1] FALSE
> grepl("Clinic Information:", "Clinic Information:") # Where the second phrase is copy/pasted from the screen
[1] TRUE
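There is only a single space between the words "Clinic" and "Information" as printed on screen and in the dput output, but whatever is in the string is not a standard space. My goal is to eliminate it so that I can grep that element out properly.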
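Answer 0 (score: 5)
Upgrading my comment to an answer:
Your string contains a non-breaking space (U+00A0), which is converted to a normal space when you paste it. Use a perl-style regular expression to match all of the odd space-like characters in Unicode: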
grepl("[0-9]+\\p{Zs}[A-Za-z ]+", test, perl=TRUE)
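The perl regexp syntax is \p{categoryName}; the extra backslash is part of the string syntax for including a backslash, and "Zs" is the Unicode category "Separator", subcategory "space". A simpler way that handles only the U+00A0 character is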
grepl("[0-9]+[ \\xa0][A-Za-z ]+", test)
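To confirm which character is actually hiding in a string, and to normalize it before matching, something like the following sketch may help. The vector x is made-up sample data (not the original pdftotext output), with the non-breaking space written explicitly as "\u00a0".

```r
# Hypothetical sample data: "\u00a0" is a non-breaking space, written explicitly
x <- c("351\u00a0South Washburn", "Clinic\u00a0Information:", "Pewaukee")

# Inspect the code points of one element; 160 (0xA0) is the non-breaking space,
# whereas an ordinary ASCII space would show up as 32
utf8ToInt(x[1])

# Replace every Unicode space separator (category Zs) with an ASCII space
x_clean <- gsub("\\p{Zs}", " ", x, perl = TRUE)

# The original pattern with a plain space now matches the street address
grepl("[0-9]+ [A-Za-z ]+", x_clean)
```

Normalizing once up front means later patterns can keep using a plain space instead of repeating \\p{Zs} everywhere.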
Answer 1 (score: 1)
I think you're after trailing and leading whitespace. If so, this function may work:
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
Also watch out for tabs and the like; this may be useful:
clean <- function(text) {
gsub("\\s+", " ", gsub("\r|\n|\t", " ", text))
}
So use clean, then Trim:
Trim(clean(test))
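One caveat: depending on the regex engine and locale, \\s may not match a non-breaking space, so Trim as written could leave one behind. A possible variant, sketched here under the assumption that the input is UTF-8, adds the Unicode "Zs" category and uses perl = TRUE. The name trim_unicode and the sample strings are hypothetical.

```r
# Hypothetical helper: strip leading/trailing whitespace, including Unicode
# space separators such as the non-breaking space (U+00A0)
trim_unicode <- function(x) {
  gsub("^[\\s\\p{Zs}]+|[\\s\\p{Zs}]+$", "", x, perl = TRUE)
}

# Made-up sample values with non-breaking spaces at the edges
padded <- c("\u00a0Store \u00a0", "  Info\t", "Pewaukee")
trim_unicode(padded)
```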
Also be on the lookout for en dashes (U+2013) and em dashes (U+2014).
Answer 2 (score: 1)
I don't see anything unusual about the whitespace, but the dashes in the phone number are U+2010 (HYPHEN), not ASCII hyphens (U+002D).
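If the hyphens matter for later matching, one option is to map U+2010 to the ASCII hyphen first. This is a minimal sketch; the phone string is rebuilt by hand with explicit "\u2010" escapes, and the phone-number pattern is only an illustration.

```r
# Phone string rebuilt by hand with explicit U+2010 HYPHEN characters
phone <- "Phone: 920\u2010232\u20100718"

# Illustrative pattern that expects ASCII hyphens
pattern <- "[0-9]{3}-[0-9]{3}-[0-9]{4}"
grepl(pattern, phone)        # the Unicode hyphens do not match "-"

# Replace U+2010 with the ASCII hyphen (U+002D), then match again
phone_ascii <- gsub("\u2010", "-", phone, fixed = TRUE)
grepl(pattern, phone_ascii)
```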
Answer 3 (score: 0)
test <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
"Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
"Pewaukee")
> grepl("[0-9]+ [A-Za-z ]+",test)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
library(stringr)
test2 <- str_trim(test, side = "both")
> grepl("[0-9]+ [A-Za-z ]+",test2)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# So there were no spaces in the vector, just the screen output in this case.