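I am trying to scrape the contents of a web page in order to check whether a Stata dataset exists.
I put together a few lines of code, but they do not work: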
tempfile page
copy "https://www.stata-press.com/data/r15/u.html" "`page'"
tempname fh
file open `fh' using "`page'", read
file read `fh' line
while r(eof)==0 {
if "`line'"=="regsmpl.dta" dis "Dataset exists"
else dis "Dataset doesn't exist"
file read `fh' line
}
file close `fh'
Any ideas would be highly appreciated.
Answer 0 (score: 2)
You can first use the fileread() function to read the entire page into a string scalar:
local dataset regsmpl
scalar page = fileread("https://www.stata-press.com/data/r15/u.html")
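Before processing the scalar, it may be worth confirming that the download actually worked. The sketch below assumes that fileread() reports a failure by returning a string beginning with "fileread() error" instead of the page contents; the wording of the messages is purely illustrative.

```stata
local dataset regsmpl
scalar page = fileread("https://www.stata-press.com/data/r15/u.html")

* Assumption: on failure the scalar holds an error message, not HTML
if strpos(page, "fileread() error") == 1 {
    display as error "Could not read the page: " page
}
else {
    display "Page read: " strlen(page) " characters"
}
```

If the first branch fires, the checks that follow would be searching the error message rather than the page, so their result should not be trusted.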
Once the scalar has been created successfully, there are two ways to process it.
Solution 1: check whether the page mentions the dataset
if strmatch(page, "*`dataset'.dta*") display "Page mentions dataset"
else display "No trace of dataset in page"
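The same test extends naturally to several dataset names, since the page only has to be downloaded once. This is a minimal sketch: the name auto is a hypothetical second candidate chosen for illustration, and nothing here assumes it is actually listed on the page.

```stata
scalar page = fileread("https://www.stata-press.com/data/r15/u.html")

* Loop over candidate names; "auto" is only an illustrative example
foreach dataset in regsmpl auto {
    if strmatch(page, "*`dataset'.dta*") {
        display "`dataset'.dta: mentioned in page"
    }
    else {
        display "`dataset'.dta: no trace in page"
    }
}
```

Note that this only detects the file name somewhere in the page text, not a working link to it.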
Solution 2: check whether an actual link to the dataset exists
local link = ustrregexm(page, `"<a [^>]*\bhref\s*=\s*"([^"]*`dataset'.dta[^"]*)"')
local url = trim(ustrregexs(1))
if "`url'" != "" display "The link is: `url'"
else display "There is no such link"
Your line-by-line approach can also work, using either strmatch() or a regular expression:
* Variant 1: substring match with strmatch()
tempname fh
file open `fh' using "https://www.stata-press.com/data/r15/u.html", read
file read `fh' line
local tag = 0
while r(eof) == 0 {
    * compound quotes protect against double quotes inside the HTML line
    if strmatch(`"`line'"', "*regsmpl.dta*") local tag = 1
    file read `fh' line
}
file close `fh'
if `tag' == 1 display "Dataset exists"
else display "Dataset doesn't exist"
* Variant 2: regular expression that captures the link target
local dataset regsmpl
tempname fh
file open `fh' using "https://www.stata-press.com/data/r15/u.html", read
file read `fh' line
local tag = 0
while r(eof) == 0 {
    local link = ustrregexm(`"`line'"', `"<a [^>]*\bhref\s*=\s*"([^"]*`dataset'.dta[^"]*)"')
    if `link' == 1 {
        * keep the captured href value; a later match overwrites an earlier one
        local url = trim(ustrregexs(1))
        local tag = 1
    }
    file read `fh' line
}
file close `fh'
if `tag' == 1 display "The link is: `url'"
else display "There is no such link"