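I can't clean up some HTML with a JavaScript regex replace.

The task is to get TV listings for my XBMC from a local source. The URL is http://tv.dir.bg/tv_search.php?step=1&all=1 (in Bulgarian). I'm trying to use a scraper to fetch the data: http://code.google.com/p/epgss/ (credit to Ivan Markov, http://code.google.com/u/113542276020703315321/).

Unfortunately, the TV listings page has changed since that tool was last updated, so I'm trying to get it working again. The problem is that parsing the HTML as XML now fails. I'm trying to clean the HTML by stripping the head and script tags with a regex replace, but it has no effect. Here is my replacer: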
function regexReplace(pattern, value, replacer)
{
var regEx = new RegExp(pattern, "g");
var result = value.replaceAll(regEx, replacer);
if(result == null)
return null;
return result;
}
Here is my call:
var htmlStringCluttered = HTML.getHTML(new URL(url), "WINDOWS-1251");
log("Content grabbed (schedule for next 7 days)");
log(url);
var htmlString = regexReplace("<head>([\\s\\S]*?)<\/head>|<script([\\s\\S]*?)<\/script>", htmlStringCluttered, "");
The getHTML function comes from the original source code; I made a small change to set the User-Agent. This is what it is built on:
public static java.io.Reader open(URL url, String charset) throws UnsupportedEncodingException, IOException
{
URLConnection con = url.openConnection();
con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/3.0.195.38 Safari/532.0");
con.setAllowUserInteraction(false);
con.setReadTimeout(60*1000/*ms*/);
con.connect();
if(charset == null && con instanceof HttpURLConnection) {
HttpURLConnection httpCon = (HttpURLConnection)con;
charset = httpCon.getContentEncoding();
}
if(charset == null)
charset = "UTF-8";
return new InputStreamReader(con.getInputStream(), charset);
}
The result of regexReplace is exactly the same as the original input. Because the XML can't be parsed, the script can't read the elements. Any ideas?
Answer 0 (score: 1)
Update
To convert it to an XMLDocument, you can do the following:
var parseXml,
xml,
htmlStringCluttered = HTML.getHTML(new URL(url), "WINDOWS-1251"),
htmlString = '';
if (typeof window.DOMParser != "undefined") {
parseXml = function (xmlStr) {
return (new window.DOMParser()).parseFromString(xmlStr, "text/xml");
};
} else if (typeof window.ActiveXObject != "undefined" && new window.ActiveXObject("Microsoft.XMLDOM")) {
parseXml = function (xmlStr) {
var xmlDoc = new window.ActiveXObject("Microsoft.XMLDOM");
xmlDoc.async = "false";
xmlDoc.loadXML(xmlStr);
return xmlDoc;
};
} else {
throw new Error("No XML parser found");
}
console.log("Content grabbed (schedule for next 7 days)");
console.log(url);
//eliminate the '<head>' section
htmlString = htmlStringCluttered.replace(/(<head[\s\S]*<\/head>)/ig, '');
//eliminate any remaining '<script>' elements
htmlString = htmlString.replace(/(<script[\s\S]+?<\/script>)/ig, '');
//self-close '<img>' elements
htmlString = htmlString.replace(/<img([^>]*)>/g, '<img$1 />');
//self-close '<br>' elements
htmlString = htmlString.replace(/<br([^>]*)>/g, '<br$1 />');
//self-close '<input>' elements
htmlString = htmlString.replace(/<input([^>]*)>/g, '<input$1 />');
//replace '&nbsp;' entities with an actual non-breaking space
htmlString = htmlString.replace(/&nbsp;/g, String.fromCharCode(160));
//convert to XMLDocument
xml = parseXml(htmlString);
//log new XMLDocument as output
console.log(xml);
//log htmlString as output
console.log(htmlString);
The parseXml function is taken from: XML parsing of a variable string in JavaScript
To try this in a browser, you can simply define htmlStringCluttered as:
htmlStringCluttered = document.documentElement.innerHTML;
instead of:
htmlStringCluttered = HTML.getHTML(new URL(url), "WINDOWS-1251"),
and run it in the console on http://tv.dir.bg/tv_search.php?step=1&all=1. You will also have to comment out the line:
console.log(url);
or declare url and give it a value.
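The tag clean-up steps from the update can also be sketched as one standalone function, which is easier to test outside the browser. This is only a sketch: the name toXmlFriendly and the sample string are made up for illustration and are not part of the original scraper. Unlike the separate replace calls in the update, the pattern here also tolerates tags that are already self-closed, so `<br />` is not turned into `<br / />`.

```javascript
// Hypothetical helper (not part of the original scraper): nudges an HTML
// string toward well-formed XML by self-closing a few void elements and
// replacing &nbsp; entities, which a plain XML parser does not know.
function toXmlFriendly(html) {
  // String() coerces the input to a JavaScript string
  return String(html)
    // self-close <img>, <br> and <input>, whether or not they already end in "/>"
    .replace(/<(img|br|input)\b([^>]*?)\s*\/?>/ig, '<$1$2 />')
    // replace '&nbsp;' entities with an actual non-breaking space (U+00A0)
    .replace(/&nbsp;/g, String.fromCharCode(160));
}

// Made-up sample input for illustration
var xmlSample = '<p>a&nbsp;b<br><img src="x.png"><input type="text" name="q"><br/></p>';

console.log(toXmlFriendly(xmlSample));
```

Regex-based clean-up like this only covers the tags it names; any other unclosed element in the page would still break the XML parser.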
Original
Your RegExp needs some work. It is simpler (and easier to read) when split into two replace statements:
var htmlStringCluttered = HTML.getHTML(new URL(url), "WINDOWS-1251"),
htmlString = '';
console.log("Content grabbed (schedule for next 7 days)");
console.log(url);
//eliminate the '<head>' section
htmlString = htmlStringCluttered.replace(/(<head[\s\S]*<\/head>)/ig, '');
//eliminate any remaining '<script>' elements
htmlString = htmlString.replace(/(<script[\s\S]+?<\/script>)/ig, '');
//log remaining as output
console.log(htmlString);
I tested this by visiting http://tv.dir.bg/tv_search.php?step=1&all=1 and running the following in the console:
console.log(document.documentElement.innerHTML.replace(/(<head[\s\S]*<\/head>)/ig, '').replace(/(<script[\s\S]+?<\/script>)/ig, ''));
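For trying the two-step replace without loading the page, here is a minimal sketch on a made-up sample string. The name stripHeadAndScripts is hypothetical. It differs slightly from the statements above: it uses a lazy match for the head section too, adds a word boundary so that a tag such as `<header>` is not mistaken for `<head>`, and coerces the input with String() on the assumption that the value might not be a native JavaScript string.

```javascript
// Hypothetical helper (not part of the original scraper): strips the <head>
// section and every <script> element from an HTML string.
function stripHeadAndScripts(html) {
  // String() coerces the input so that String.prototype.replace is used
  return String(html)
    // lazy match: stop at the first closing </head>
    .replace(/<head\b[\s\S]*?<\/head>/ig, '')
    // lazy match: remove each <script>...</script> pair separately
    .replace(/<script\b[\s\S]*?<\/script>/ig, '');
}

// Made-up sample input for illustration
var sample = '<html><head><title>TV</title><script>var a = 1;</script></head>' +
  '<body><p>before</p><script type="text/javascript">var b = 2;</script>' +
  '<p>after</p></body></html>';

console.log(stripHeadAndScripts(sample));
// prints: <html><body><p>before</p><p>after</p></body></html>
```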
If it is run on the outerHTML property instead (which is what I would expect the HTML.getHTML(new URL(url), "WINDOWS-1251") method to return), the <body> element will be wrapped:
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
...
</body>
</html>