Parse the following from HTML code?

Time: 2012-10-23 20:38:46

Tags: c# regex

How would I parse the following:

wr("website-url.com</span>")

out of HTML code with a regular expression?

I can't seem to figure out how to extract website-url.com

The entire JavaScript in the HTML:

<script type="text/javascript">wr("<span>maddog");wr("@");wr("website-url.com</span>")</script>

I tried this regex:

wr("(.+?)\s*<\/span>")

but I can't seem to get it to work.

4 answers:

Answer 0: (score: 0)

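// Requires: using System.Linq; (for Last())
string a = @"<script type=""text/javascript"">wr(""<span>maddog"");wr(""@"");wr(""website-url.com</span>"")</script>";
// Strip the script tags, then split the remaining code into its wr(...) calls
string[] b = a.Replace(@"<script type=""text/javascript"">", "").Replace("</script>", "").Split(';');
string c = b.Last();   // wr("website-url.com</span>")
// Drop the call syntax and the closing tag; the surrounding double quotes remain
string d = c.Replace("wr(", "").Replace("</span>", "").Replace(")", "");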

d is the final result, but you can modify the code to handle the double quotes in the string.

Answer 1: (score: 0)

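It seems the site you got this JavaScript from doesn't want you to parse its HTML. It uses the JavaScript function wr to create dynamic HTML. Below is code that executes this JavaScript and parses the generated output. However, I can't say this is easy code to follow.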

// Requires: using System; using System.Runtime.InteropServices; plus the HtmlAgilityPack library
public void Test()
{
    //C# object which will be accessed by javascript
    var csharpObj = new MyCSharpObject();

    //Create Javascript object
    Type scriptType = Type.GetTypeFromCLSID(Guid.Parse("0E59F1D5-1FBE-11D0-8FF2-00A0D10038BC"));
    dynamic obj = Activator.CreateInstance(scriptType, false);
    obj.Language = "Javascript";
    obj.AddObject("csharp", csharpObj);

    //Load Html (your string in question)
    string html = @"<script type=""text/javascript"">wr(""<span>maddog"");wr(""@"");wr(""website-url.com</span>"")</script>";
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    //Create "wr" function
    string script = "function wr(s){csharp.wr(s);}";

    //Get the text of script tag                
    script += doc.DocumentNode.SelectSingleNode("//script").InnerText;

    //Execute script
    obj.Eval(script);

    //Load the string created by javascript execution
    doc.LoadHtml(csharpObj.Output);

    //tada.....
    var eMailAddress = doc.DocumentNode.InnerText;

    Console.WriteLine(eMailAddress);
}

[ComVisible(true)]
public class MyCSharpObject
{
    public string Output = "";
    public void wr(string s)
    {
        Output += s;
    }
}

-------- EDIT --------

  

I'm not sure how to write "Get all the wr(*) strings"

Although it seems you want a solution like this, I wouldn't rely on regex to parse HTML.

// Requires: using System; using System.Linq; using System.Text.RegularExpressions;
public void Test2()
{
    string html = @"<script type=""text/javascript"">wr(""<span>maddog"");wr(""@"");wr(""website-url.com</span>"")</script>";

    var parsedHtml = String.Join("",Regex.Matches(html, @"wr\(\""(.+?)\""\)")
                                            .Cast<Match>()
                                            .Select(m => m.Groups[1].Value));

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(parsedHtml);
    var eMailAddress = doc.DocumentNode.InnerText;
}
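As a side note on why the pattern in the question likely never matched: its parentheses are unescaped, so the regex engine treats them as a capture group and looks for the text wr" instead of wr(". The sketch below is in Python only to keep it short and runnable (the same escaping applies to .NET regexes). The variable names are made up for illustration, and swapping .+? for [^"]+? is my own assumption so that a match cannot run across several wr calls.

```python
import re

html = '<script type="text/javascript">wr("<span>maddog");wr("@");wr("website-url.com</span>")</script>'

# Pattern from the question: the bare ( and ) open and close a capture group,
# so the engine searches for the literal text wr" and never finds it.
question_pattern = re.compile(r'wr("(.+?)\s*<\/span>")')

# Escaped parentheses are literal characters. [^"]+? (an assumption, not in
# the original pattern) keeps the capture inside a single quoted argument.
fixed_pattern = re.compile(r'wr\("([^"]+?)\s*</span>"\)')

no_match = question_pattern.search(html)
domain = fixed_pattern.search(html).group(1)

print(no_match)
print(domain)
```

This only pulls out the last argument. Joining all wr arguments, as Test2 does, is still needed to get the whole address.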

Answer 2: (score: 0)

The idea is:

  • Use a regular expression to get all the wr(*) strings.
  • Remove the quotes (")
  • Remove <span> and </span>

Here is a solution in Python.

import re

def geturl(text):
    '''
    Get all the wr(*) strings.
    Remove quotes.
    Remove <span> and </span>
    '''
    regex = re.compile(r'wr\(([^)]*)\)')
    match = regex.findall(text)  # search the function argument, not the global xx
    url = ''.join([s.replace('"', '') for s in match])
    url = url.replace('<span>', '').replace('</span>', '')
    return url

if __name__ == '__main__':
    xx = '''<script type="text/javascript">wr("<span>maddog");wr("@");wr("website-url.com</span>")</script>'''
    url = geturl(xx)
    print(url)

This gives maddog@website-url.com
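One caveat with [^)]* is that a ")" inside a quoted argument would cut the match short. A slightly stricter variant, sketched below, matches the quoted string itself and strips any tag rather than only span. The names WR_CALL, TAG and join_wr_arguments, and the "tricky" input, are hypothetical and only there for illustration.

```python
import re

# Matches wr("...") and captures the quoted argument. \\. allows escaped
# characters such as \" inside the string; [^"\\] covers everything else.
WR_CALL = re.compile(r'wr\(\s*"((?:[^"\\]|\\.)*)"\s*\)')
# Any HTML tag, not just <span> and </span>.
TAG = re.compile(r'<[^>]+>')

def join_wr_arguments(text):
    """Concatenate the arguments of all wr("...") calls and drop any tags."""
    joined = ''.join(WR_CALL.findall(text))
    return TAG.sub('', joined)

sample = '<script type="text/javascript">wr("<span>maddog");wr("@");wr("website-url.com</span>")</script>'
tricky = 'wr("<span>a(b)");wr("@");wr("example.com</span>")'

print(join_wr_arguments(sample))
print(join_wr_arguments(tricky))
```

Note that escaped characters are captured as written and not unescaped, so this is still a sketch and not a JavaScript string parser.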

Answer 3: (score: -1)

If you are using regex to parse HTML, you are probably doing something simple the hard way. In C#, try the HTML Agility Pack. Also see the definitive question on this matter.
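To illustrate the parser-first idea without HTML Agility Pack, here is a rough Python sketch that uses the standard library's html.parser as a stand-in. The class and function names are hypothetical. A small regex is still used for the wr("...") arguments, since those are JavaScript and not HTML, and it assumes the arguments contain no escaped quotes.

```python
import re
from html.parser import HTMLParser

class ScriptCollector(HTMLParser):
    """Collects the text found inside <script> elements."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.scripts.append(data)

class TextCollector(HTMLParser):
    """Collects text nodes and ignores all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def extract_written_text(page):
    # Step 1: let the HTML parser locate the script body.
    collector = ScriptCollector()
    collector.feed(page)
    collector.close()
    script = ''.join(collector.scripts)
    # Step 2: join the wr("...") arguments (assumes no escaped quotes).
    fragment = ''.join(re.findall(r'wr\("([^"]*)"\)', script))
    # Step 3: parse the generated fragment and keep only its text.
    text = TextCollector()
    text.feed(fragment)
    text.close()
    return ''.join(text.parts)

page = '<script type="text/javascript">wr("<span>maddog");wr("@");wr("website-url.com</span>")</script>'
print(extract_written_text(page))
```

The parser handles the two HTML steps (finding the script element and stripping tags from the generated markup), which keeps the regex confined to one narrow job.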