Extracting values from a large amount of messy data

Time: 2018-05-23 08:22:03

Tags: r regex string web-scraping stringr

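The data I want to extract information from is messy. So far I have not found a convenient way to extract it, and I hope you can help. My data looks like this: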

"\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nChannels\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\
       n\r\nDates\r\nSeptember 25th 2016 To September 26th 
         2016\r\n\r\n\r\nPlatform\r\nIdea\r\n\r\n\r\nCountry\r\nUnited 
         States\r\n\r\n\r\nRestricted Countries\r\n\r\n\t\t\t\t\t\t\t\t\tUnited 
         States\t\t\t\t\t\t\t\t\r\n\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n\r\n"

What I want to get out of it is:

Channels                - 
Dates                   September 25th 2016 To September 26th 2016
Platform                Idea
Country                 United States
Restricted Countries    United States
Initial Price           $0.0692

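I need to do this for a large number of observations and then store each variable as a vector across all observations. So I do not need to store the variable names (e.g. "Platform"), only the values ("Idea"). To do that, though, I think I need to use the variable name "Platform" as an identifier, because the position of each variable in the text changes from observation to observation (and the number of variables also varies slightly).

I think the stringr package is a good way to go, but I have not found a convenient way to do it.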

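2 answers:

Answer 0: (score: 3)

The following regular expression extracts the values you want. They end up in columns 2-7 of the resulting matrix. The code accepts a vector of inputs (each entry forms a new row in the matrix).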

library(stringr)

input <- "\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nChannels\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nDates\r\nSeptember 25th 2016 To September 26th 2016\r\n\r\n\r\nPlatform\r\nIdea\r\n\r\n\r\nCountry\r\nUnited States\r\n\r\n\r\nRestricted Countries\r\n\r\n\t\t\t\t\t\t\t\t\tUnited States\t\t\t\t\t\t\t\t\r\n\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n\r\n"

str_match(input, paste0("[[:space:]]*Channels[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Dates[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Platform[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Country[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Restricted Countries[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Initial Price[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*"))

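Edit: Sorry, I overlooked that the position of the variables in the text can change between inputs. In that case you cannot easily extract all variables in one go with this approach. You can still extract them one at a time, though, using the corresponding line of the regular expression above. If a variable has no value (like "Channels" in the example), that is not a problem: it shows up as NA.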
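The one-at-a-time idea from the edit can be sketched roughly as follows. The helper `extract_field`, the `fields` vector, and the second, reordered input are hypothetical and only serve as illustration. One caveat worth checking on real data: used on its own, the optional capture group after an empty field such as Channels can pick up the next label ("Dates") instead of nothing, so this sketch treats a captured label as missing.

```r
library(stringr)

# Labels taken from the question.
fields <- c("Channels", "Dates", "Platform", "Country",
            "Restricted Countries", "Initial Price")

# Hypothetical helper: value following one label, for every input string.
extract_field <- function(input, label, labels) {
  # Same building blocks as the combined pattern, but for a single label.
  pattern <- paste0(label, "[[:cntrl:]]+([[:print:]]+)?")
  value <- str_match(input, pattern)[, 2]
  # If the field was empty, the group may have captured the next label.
  value[value %in% labels] <- NA
  value
}

inputs <- c(
  # Observation from the question.
  "\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nChannels\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nDates\r\nSeptember 25th 2016 To September 26th 2016\r\n\r\n\r\nPlatform\r\nIdea\r\n\r\n\r\nCountry\r\nUnited States\r\n\r\n\r\nRestricted Countries\r\n\r\n\t\t\t\t\t\t\t\t\tUnited States\t\t\t\t\t\t\t\t\r\n\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n\r\n",
  # Made-up observation with a different order and fewer fields.
  "\r\nPlatform\r\nIdea\r\n\r\nDates\r\nSeptember 25th 2016 To September 26th 2016\r\n"
)

# One character vector per variable, aligned across observations.
result <- lapply(setNames(fields, fields),
                 function(f) extract_field(inputs, f, fields))
```

`result[["Platform"]]` then holds the Platform value of every observation, with NA wherever the label is absent or has no value.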

Answer 1: (score: 2)

Base R solution:

yourstring1 <- "\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nChannels\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\
n\r\nDates\r\nSeptember 25th 2016 To September 26th 
2016\r\n\r\n\r\nPlatform\r\nIdea\r\n\r\n\r\nCountry\r\nUnited 
States\r\n\r\n\r\nRestricted Countries\r\n\r\n\t\t\t\t\t\t\t\t\tUnited 
States\t\t\t\t\t\t\t\t\r\n\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n\r\n"

# make a placeholder (useful when manipulating strings for easier regex)
yourstring2 <- gsub("\r|\t|\nn|\n", "@", yourstring1, perl = TRUE) # please note the double nn - this is needed because a newline character is added when copying the string from here to R
# split on placeholder if it appears twice or more
yourstring2 <- unlist(strsplit(yourstring2, split = "@{2,}"))
# little cleaning needed
yourstring2 <- gsub(" @", " ", yourstring2)
yourstring2[1:2] <- c(yourstring2[2], "-") # this hard-coded solution works for the particular example, if you have many strings with arbitrarily missing values, you may want to make a little condition for that
# prepare your columns by indexing the character vector
variables <- yourstring2[seq(from = 1, to = length(yourstring2), by = 2)]
values <- yourstring2[seq(from = 2, to = length(yourstring2), by = 2)]
# bind them to dataframe
df <- data.frame(variables, values)

Resulting df:

df
             variables                                     values
1             Channels                                          -
2                Dates September 25th 2016 To September 26th 2016
3             Platform                                       Idea
4              Country                              United States
5 Restricted Countries                              United States
6        Initial Price                                    $0.0692
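The hard-coded step for the empty "Channels" value could be replaced by a small condition along these lines. This is only a sketch under assumptions: `parse_observation` and `labels` are hypothetical names, and the string is assumed to be the raw scraped text, without the extra line breaks introduced by copy-pasting.

```r
# Labels expected in the text (taken from the question).
labels <- c("Channels", "Dates", "Platform", "Country",
            "Restricted Countries", "Initial Price")

# Hypothetical helper: named character vector with one value per label,
# for a single observation.
parse_observation <- function(x, labels) {
  # Split on any run of line breaks or tabs, then drop empty pieces.
  tokens <- trimws(unlist(strsplit(x, "[\r\n\t]+")))
  tokens <- tokens[nzchar(tokens)]
  pos <- match(labels, tokens)       # position of each label, NA if absent
  values <- tokens[pos + 1]          # token right after the label
  values[values %in% labels] <- NA   # next token is a label: field was empty
  setNames(values, labels)
}

raw <- "\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nChannels\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nDates\r\nSeptember 25th 2016 To September 26th 2016\r\n\r\n\r\nPlatform\r\nIdea\r\n\r\n\r\nCountry\r\nUnited States\r\n\r\n\r\nRestricted Countries\r\n\r\n\t\t\t\t\t\t\t\t\tUnited States\t\t\t\t\t\t\t\t\r\n\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n\r\n"

parsed <- parse_observation(raw, labels)
```

Because the values are looked up by label rather than by position, the order of the variables in the text does not matter, and an absent or empty variable comes back as NA. For many observations, something like `t(sapply(all_strings, parse_observation, labels = labels, USE.NAMES = FALSE))` (with `all_strings` being your own character vector) would give one column per variable.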

Edit: I only now read the question properly: the desired result is probably a vector rather than a data frame. Here is a two-line solution:

yourstring2 <- gsub("\r|\t|\nn|\n", "", yourstring1, perl = TRUE) # clean the original string (see yourstring1 above)
yourvector <- unlist(strsplit(yourstring2, split = "Channels|Dates|Platform|Country|Restricted Countries|Initial Price", perl = TRUE))[-1]  # split on the variable names and drop the empty first element

Resulting vector:

> yourvector
[1] ""                                          
[2] "September 25th 2016 To September 26th 2016"
[3] "Idea"                                      
[4] "United States"                             
[5] "United States"                             
[6] "$0.0692"