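I am using readLines() to extract the HTML code from a website. Almost every line of the code contains a pattern of the form <td>VALUE1<td>VALUE2<td>. I want to get the values between the <td> tags. I tried a few patterns, for example: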
output <- gsub(pattern='(.*<td>)(.*)(<td>.*)(.*)(.*<td>)',replacement='\\2',x='<td>VALUE1<td>VALUE2<td>')
But the output contains only one value. Any idea how to do this?
Answer 0 (score: 1)
string <- "<td>VALUE1<td>VALUE2<td>"
regmatches(string, gregexpr("(?<=<td>)\\w+(?=<td>)", string, perl = TRUE))
# Step by step: use gregexpr() to get the match positions and lengths
indices <- gregexpr("(?<=<td>)\\w+(?=<td>)", string, perl = TRUE)
# The result should look like this:
# [1]  5 15
# attr(,"match.length")
# [1] 6 6
# (further attributes such as "useBytes" follow)
# The first vector means there are two matches: the first starts at
# index 5 and the second starts at index 15.
# "match.length" means both the first and the second match are
# 6 characters long.
# Then pass this result to regmatches(), which extracts the
# substrings of your string at these indices
regmatches(string, indices)
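
Since readLines() returns one string per line, the same idea can be applied to a whole character vector at once. The following is only a minimal sketch: the `lines` vector is made-up sample data standing in for the real page, and the character class `[^<]+` is my substitution for `\\w+`, on the assumption that some values may contain spaces or punctuation.

```r
# Hypothetical lines standing in for the result of readLines();
# the real page may look different.
lines <- c(
  "<td>VALUE1<td>VALUE2<td>",
  "<td>3.14<td>two words<td>",
  "no cells on this line"
)

# [^<]+ matches any run of characters up to the next tag, so values
# are not limited to word characters as they are with \\w+.
pattern <- "(?<=<td>)[^<]+(?=<td>)"

# gregexpr() is vectorised over its input, so regmatches() returns
# a list with one character vector per input line.
values <- regmatches(lines, gregexpr(pattern, lines, perl = TRUE))
print(values)
```

Lines without any match yield an empty character vector, so the result keeps the same length as the input. If a flat vector is more convenient, `unlist(values)` collapses the list.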
Answer 1 (score: 1)
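Have you looked at the "XML" package, which can extract tables from HTML? You may need to give more context about the whole document you are trying to parse so that we can see whether it is a good fit.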
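
To illustrate what that could look like, here is a rough sketch using htmlParse() and xpathSApply() from the XML package. The HTML fragment is invented, since the real page structure is unknown, and the code assumes the package is installed. For pages with well-formed tables, the package's readHTMLTable() may be the more direct route.

```r
# Sketch only: requires the XML package (install.packages("XML")).
if (requireNamespace("XML", quietly = TRUE)) {
  # Hypothetical fragment; the structure of the real page is unknown.
  html <- "<table><tr><td>VALUE1<td>VALUE2</table>"

  # Parse the string as HTML; the parser tolerates unclosed <td> tags.
  doc <- XML::htmlParse(html, asText = TRUE)

  # Select every <td> node and return its text content.
  cells <- XML::xpathSApply(doc, "//td", XML::xmlValue)
  print(cells)
}
```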