How would I parse the following:
wr("website-url.com</span>")
out of HTML code with a regular expression?
I can't seem to figure out how to extract website-url.com.
The entire JavaScript in the HTML:
<script type="text/javascript">wr("<span>maddog");wr("@");wr("website-url.com</span>")</script>
The regex I tried:
wr("(.+?)\s*<\/span>")
But I can't seem to get it to work.
Answer 0: (score: 0)
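// Requires "using System.Linq;" for Last().
string a = @"<script type=""text/javascript"">wr(""<span>maddog"");wr(""@"");wr(""website-url.com</span>"")</script>";
// Strip the script tags, then split the remaining calls on ';'.
string[] b = a.Replace(@"<script type=""text/javascript"">", "").Replace("</script>", "").Split(';');
string c = b.Last();   // wr("website-url.com</span>")
string d = c.Replace("wr(", "").Replace("</span>", "").TrimEnd(')');
d is the final result, still wrapped in double quotes; you can modify the code to handle the double quotes in the string.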
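The same split-and-strip idea can be sketched in Python, this time removing the double quotes as well. This is only an illustration: the function name `last_wr_argument` and the `SAMPLE` constant are made up here, and the sketch assumes the script tag looks exactly like the one in the question and that no argument contains a `;`.

```python
SAMPLE = ('<script type="text/javascript">'
          'wr("<span>maddog");wr("@");wr("website-url.com</span>")'
          '</script>')


def last_wr_argument(script_html):
    """Return the argument of the last wr(...) call, without quotes or </span>."""
    # Drop the surrounding script tags (assumes exactly this opening tag).
    body = script_html.replace('<script type="text/javascript">', "")
    body = body.replace("</script>", "")
    # The calls are separated by ';', so the last piece is the last call.
    last_call = body.split(";")[-1]
    # Peel off the call syntax and the surrounding double quotes.
    argument = last_call.replace("wr(", "").rstrip(")").strip('"')
    # Finally remove the closing tag that was inside the string.
    return argument.replace("</span>", "")


print(last_wr_argument(SAMPLE))
```

Like the C# version, this relies on the exact shape of the input, so it is fragile if the page changes its markup.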
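Answer 1: (score: 0)
It seems the website you got this JavaScript from does not want you to parse its HTML. It uses the JavaScript function wr to create dynamic HTML. Below is code that executes this JavaScript and parses the generated markup. However, I can't say it is easy code to follow.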
using System;                          // Type, Guid, Activator, Console
using System.Runtime.InteropServices;  // ComVisible

public void Test()
{
    // C# object which will be accessed by JavaScript
    var csharpObj = new MyCSharpObject();

    // Create the JavaScript engine (COM script control, looked up by CLSID)
    Type scriptType = Type.GetTypeFromCLSID(Guid.Parse("0E59F1D5-1FBE-11D0-8FF2-00A0D10038BC"));
    dynamic obj = Activator.CreateInstance(scriptType, false);
    obj.Language = "Javascript";
    obj.AddObject("csharp", csharpObj);

    // Load the HTML (the string from the question)
    string html = @"<script type=""text/javascript"">wr(""<span>maddog"");wr(""@"");wr(""website-url.com</span>"")</script>";
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // Create the "wr" function, which forwards to the C# object
    string script = "function wr(s){csharp.wr(s);}";

    // Append the text of the script tag
    script += doc.DocumentNode.SelectSingleNode("//script").InnerText;

    // Execute the script
    obj.Eval(script);

    // Load the string created by the JavaScript execution
    doc.LoadHtml(csharpObj.Output);

    // tada.....
    var eMailAddress = doc.DocumentNode.InnerText;
    Console.WriteLine(eMailAddress);
}

[ComVisible(true)]
public class MyCSharpObject
{
    // Accumulates everything the script passes to wr()
    public string Output = "";

    public void wr(string s)
    {
        Output += s;
    }
}
--------EDIT--------
I wasn't sure how to write "get all the wr(*) strings".
It looks like you want a solution like the one below, although I would not rely on Regex to parse HTML.
// Requires "using System.Linq;" and "using System.Text.RegularExpressions;".
public void Test2()
{
    string html = @"<script type=""text/javascript"">wr(""<span>maddog"");wr(""@"");wr(""website-url.com</span>"")</script>";

    // Concatenate the string argument of every wr("...") call
    var parsedHtml = String.Join("", Regex.Matches(html, @"wr\(\""(.+?)\""\)")
        .Cast<Match>()
        .Select(m => m.Groups[1].Value));

    // Let the HTML parser strip the <span> tags
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(parsedHtml);
    var eMailAddress = doc.DocumentNode.InnerText;
}
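For readers not on .NET, roughly the same two-step approach (collect the wr arguments with a regex, then let an HTML parser drop the tags) can be sketched in Python using only the standard library. The names `TextCollector` and `extract_email` are invented for this sketch, and it assumes every wr call takes a single double-quoted string with no `")` sequence inside it.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects only the text nodes of the markup fed to it."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_email(html):
    # Join the string arguments of every wr("...") call, in order.
    fragment = "".join(re.findall(r'wr\("(.+?)"\)', html))
    # Let an HTML parser drop the <span> tags instead of using string replacement.
    collector = TextCollector()
    collector.feed(fragment)
    collector.close()
    return "".join(collector.parts)


if __name__ == "__main__":
    sample = ('<script type="text/javascript">'
              'wr("<span>maddog");wr("@");wr("website-url.com</span>")'
              '</script>')
    print(extract_email(sample))
```

Unlike the script-control version above, this never executes the page's JavaScript, so it only works while the page sticks to plain wr("...") calls.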
Answer 2: (score: 0)
The idea is:
Get all the wr(*) strings.
Remove the quotes.
Remove <span> and </span>.
Here is a solution in Python.
import re


def geturl(text):
    '''
    Get all the wr(*) strings.
    Remove quotes.
    Remove <span> and </span>
    '''
    regex = re.compile(r'wr\(([^)]*)\)')
    # Search the argument, not a global variable
    match = regex.findall(text)
    url = ''.join([s.replace('"', '') for s in match])
    url = url.replace('<span>', '').replace('</span>', '')
    return url


if __name__ == '__main__':
    xx = '''<script type="text/javascript">wr("<span>maddog");wr("@");wr("website-url.com</span>")</script>'''
    url = geturl(xx)
    print(url)
This gives maddog@website-url.com
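One caveat with the `[^)]*` pattern is that it stops at the first `)`, even when that character is inside the quoted string. A slightly stricter pattern, sketched below, matches the double quotes explicitly and captures only what is between them. The names `WR_CALL` and `wr_arguments` are invented for this sketch, and it assumes the arguments are always double-quoted string literals. It also shows why the pattern in the question finds nothing: unescaped parentheses form a regex group, so the literal `(` after wr is never matched.

```python
import re

# Matches wr("...") and captures the text between the quotes; escaped
# characters such as \" are allowed inside the string.
WR_CALL = re.compile(r'wr\("((?:[^"\\]|\\.)*)"\)')


def wr_arguments(text):
    """Return the string argument of every wr("...") call found in text."""
    return WR_CALL.findall(text)


if __name__ == "__main__":
    tricky = 'wr("a)b");wr("c")'
    # The simpler pattern is cut short by the ')' inside the first string.
    print(re.findall(r'wr\(([^)]*)\)', tricky))
    # The stricter pattern keeps each argument whole.
    print(wr_arguments(tricky))
    # The question's pattern: '(' is a group, not a literal, so no match.
    print(re.findall(r'wr("(.+?)\s*<\/span>")', 'wr("website-url.com</span>")'))
```

The captured strings can then be joined and cleaned up exactly as in geturl.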
Answer 3: (score: -1)
If you are using regular expressions to parse HTML, you are probably doing a simple thing the hard way. In C#, try the HTML Agility Pack. See also the definitive question on this matter.