Question

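I'm using readLines() to pull HTML code from a website. Almost every line of the code contains a pattern of the form <td>VALUE1<td>VALUE2<td>. I want to extract the values between the <td> tags. I tried a few patterns, for example: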

output <- gsub(pattern='(.*<td>)(.*)(<td>.*)(.*)(.*<td>)',replacement='\\2',x='<td>VALUE1<td>VALUE2<td>')

But the output returns only one value. Any idea how to do this?

Answer 1

string <- "<td>VALUE1<td>VALUE2<td>"

# one-step version: extract every run of word characters enclosed by <td> tags
regmatches(string, gregexpr("(?<=<td>)\\w+(?=<td>)", string, perl = TRUE))

# step by step: use gregexpr() to get the match indices and the lengths
indices <- gregexpr("(?<=<td>)\\w+(?=<td>)", string, perl = TRUE)
indices
# the printed result should include:

# [1]  5 15
# this means you have two matches: the first one starts at index 5 and the
# second one starts at index 15

# attr(,"match.length")
# [1] 6 6
# this means the first match has length 6, and in this case the second
# match also has length 6

# then pass the result of this match to regmatches() to substring your
# string at these indices
regmatches(string, indices)
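
Since the question reads the page with readLines(), here is a minimal sketch of the same approach applied to several lines at once. The vector `lines` is made-up sample data standing in for the readLines() result, and the names `pattern` and `values` are only illustrative. Both gregexpr() and regmatches() are vectorised, so the result is a list with one character vector per input line.

```r
# Sample input standing in for the result of readLines() (assumed data)
lines <- c("<td>VALUE1<td>VALUE2<td>",
           "<td>A1<td>B2<td>",
           "no cells here")

# Lookbehind/lookahead leave the <td> tags unconsumed, so a tag that closes
# one value can also open the next one
pattern <- "(?<=<td>)\\w+(?=<td>)"

# One list element per line; lines without a match give character(0)
values <- regmatches(lines, gregexpr(pattern, lines, perl = TRUE))
print(values)

# Flatten to a single character vector if the per-line grouping is not needed
all_values <- unlist(values)
print(all_values)
```

Note that \\w+ only matches letters, digits and underscores. If the values can contain spaces or punctuation, something like [^<]+ in place of \\w+ may be a better fit, but that depends on what the real page looks like.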

Answer 2

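Have you looked at the "XML" package, which can extract tables from HTML? You may need to provide more context about the whole message you are trying to parse so that we can see whether it is a good fit.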
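
A rough sketch of what that could look like, assuming the XML package is installed and that the page contains well-formed table cells. The markup in `html` is invented for illustration and is not the asker's actual page, which closes its cells differently and may parse differently.

```r
# Assumption: the XML package is available; skip the example otherwise
if (requireNamespace("XML", quietly = TRUE)) {
  # Made-up sample markup with properly closed <td> cells
  html <- "<table><tr><td>VALUE1</td><td>VALUE2</td></tr></table>"

  # Parse the string as HTML (asText = TRUE: the input is markup, not a file name)
  doc <- XML::htmlParse(html, asText = TRUE)

  # Collect the text content of every <td> node in the document
  cells <- XML::xpathSApply(doc, "//td", XML::xmlValue)
  print(cells)
}
```

Going through a parser rather than a regular expression tends to be more robust when cells carry attributes or nested tags, at the cost of an extra dependency.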

R: Find a pattern and get the values in between
