Screen-scraping a page with C# and MVC to inject its HTML into a master page

Time: 2012-08-07 09:34:09

Tags: c# model-view-controller screen-scraping

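I want to develop a way to wrap a .NET MVC web application in the look and feel of the site that links to it.

Basically, I want to store the URL of a "reference page" on the containing site, which my application will screen-scrape for the header/footer HTML to use in its master page.

That way, if or when the site (which is output from a CMS) changes its structure, images, or colours, my application will simply pick up the newly created "template" and wrap itself accordingly.

The "template" has start/end div tags set in it, so I only need to screen-scrape the HTML, split it at the relevant points, and then somehow inject it into my application's master page.

The screen-scraping part looks fairly simple; it is the injection into the master page that I am having trouble working out.

Any help is much appreciated. :)

Edit: I am currently planning this and have no code to post. As I said, the screen-scraping part looks fine, but how do I insert/inject the relevant HTML extracted from the "reference page" header/footer into the master page that my application uses?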

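1 answer:

Answer 0 (score: 0)

I know you have probably solved this already, but here is a solution that works with master pages for both MVC and ASP.NET Web Forms.

I first tried overriding the master page's Render method, rendering the ContentPlaceHolders with RenderControl and replacing certain markers in the template with the rendered result. That works for ASP.NET Web Forms but not for MVC: with that approach, <% using (Html.BeginForm("A","B")) { %> always caused the form tag to be rendered at the very top of the page, before the doctype.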

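Solution

Retrieve the template and split it into parts: some are text parts and some are placeholder parts. In your master page, keep a complete HTML document around the placeholders rather than the placeholders alone, so that the Visual Studio designer does not complain. At run time, however, first clear the Controls collection and then add each part back as either a LiteralControl or a ContentPlaceHolder. The actual rendering is simply left to ASP.NET. The code below is meant as inspiration.

Master page: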

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title runat="server"></title>
    <asp:PlaceHolder ID="HeadPlaceHolder" runat="server">
        <script type="text/javascript" src="/cnnet/Resources/Js/jquery-1.8.1.min.js"></script>
    </asp:PlaceHolder>
    <asp:ContentPlaceHolder ID="HeadContentPlaceHolder" runat="server"/>
</head>
<body>
    <asp:ContentPlaceHolder ID="MainContentPlaceHolder" runat="server" />
</body>
</html>

Master page code-behind:

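// Namespaces used by this fragment: System, System.Collections.Generic, System.IO,
// System.Text.RegularExpressions, System.Web, System.Web.UI, System.Web.UI.HtmlControls.
// Note: the snippet as posted never assigns this field. It presumably has to be set
// (for example from Page.Header) before Render runs.
private HtmlHead originalPageHeader;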
static readonly Regex HeadStartRegex = new Regex(@"^\s*<head[^>]*>");
static readonly Regex HeadEndRegex = new Regex(@"</head>\s*$");
static readonly Regex TitleRegex = new Regex(@"<title>[^<]*</title>");

public Default() { Init += Default_Init; }

private void Default_Init(object sender, EventArgs e) { DoScraping(); }

protected override void Render(HtmlTextWriter writer)
{
    // get content from html head control generated via Page.Header:
    string headHtml = RenderControl(originalPageHeader);
    Controls.Remove(originalPageHeader);
    headHtml = HeadStartRegex.Replace(headHtml, string.Empty);
    headHtml = HeadEndRegex.Replace(headHtml, string.Empty);
    headHtml = TitleRegex.Replace(headHtml, string.Empty);
    // head.Controls.Add(new LiteralControl(headHtml)); doesn't work if the head content placeholder contains code blocks (i.e. <% ... %>)
    // Instead add content this way:
    int headIndex = Controls.IndexOf(HeadContentPlaceHolder);
    if (headIndex != -1)
        Controls.AddAt(headIndex + 1, new LiteralControl(headHtml));

    base.Render(writer);
}

private void DoScraping()
{
    IList<PagePart> parts = ... // do your scraping and splitting into parts
    Controls.Clear();

    foreach (PagePart part in parts)
    {
        var literalPart = part as LiteralPart;
        if (literalPart != null)
        {
            Controls.Add(new LiteralControl(literalPart.Text));
        }
        else
        {
            var placeHolderPart = part as PlaceHolderPart;
            switch (placeHolderPart.Type)
            {
                case PlaceHolderType.Title:
                    Controls.Add(new LiteralControl(HttpUtility.HtmlEncode(Page.Title)));
                    break;
                case PlaceHolderType.Head:
                    Controls.Add(HeadPlaceHolder);
                    Controls.Add(HeadContentPlaceHolder);
                    break;
                case PlaceHolderType.Main:
                    Controls.Add(new LiteralControl("<div class='boxContent'>"));
                    Controls.Add(MainContentPlaceHolder);
                    Controls.Add(new LiteralControl("</div>")); // close the wrapper div opened above
                    break;
            }
        }
    }
}

private string RenderControl(Control control)
{
    string innerHtml;
    using (var stringWriter = new StringWriter())
    {
        using (var writer = new HtmlTextWriter(stringWriter))
        {
            control.RenderControl(writer);
            writer.Flush();
            innerHtml = stringWriter.ToString();
        }
    }
    return innerHtml;
}
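
The scraping step itself is elided in DoScraping above, and the answer does not show it. As a rough sketch only, the reference page could be downloaded and held in memory for a while so that it is not fetched on every request. Everything here is an assumption rather than part of the original answer: the TemplateCache class, its members, the UTF-8 encoding, and the idea of caching by age are all hypothetical.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper, not part of the original answer.
public sealed class TemplateCache
{
    private readonly string referenceUrl;
    private readonly TimeSpan maxAge;
    private readonly Func<string, string> download;
    private readonly object sync = new object();
    private string html;
    private DateTime fetchedAtUtc;

    // The download delegate is injectable so the cache can be exercised without network access.
    public TemplateCache(string referenceUrl, TimeSpan maxAge, Func<string, string> download = null)
    {
        this.referenceUrl = referenceUrl;
        this.maxAge = maxAge;
        this.download = download ?? new Func<string, string>(DownloadWithWebClient);
    }

    // Returns the cached HTML, downloading it again once it is older than maxAge.
    public string GetHtml()
    {
        lock (sync)
        {
            if (html == null || DateTime.UtcNow - fetchedAtUtc > maxAge)
            {
                html = download(referenceUrl);
                fetchedAtUtc = DateTime.UtcNow;
            }
            return html;
        }
    }

    private static string DownloadWithWebClient(string url)
    {
        using (var client = new WebClient())
        {
            client.Encoding = Encoding.UTF8; // assumption: the reference page is served as UTF-8
            return client.DownloadString(url);
        }
    }
}
```

A single static instance per reference URL could then feed its GetHtml() result into the splitting code shown further down. Error handling (for example, falling back to the last good copy when the download fails) is left out of this sketch.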

Parts:

public class PagePart {}

public class LiteralPart : PagePart
{
    public LiteralPart(string text) { Text = text; }
    public string Text { get; private set; }
}

public class PlaceHolderPart : PagePart
{
    public PlaceHolderPart(PlaceHolderType type) { Type = type; }
    public PlaceHolderType Type { get; private set; }
}

public enum PlaceHolderType { Title, Head, Main }

Splitting:

class PlaceHolderInfo
{
    public PlaceHolderInfo(PlaceHolderType type, Regex splitter)
    {
        Type = type;
        Splitter = splitter;
    }

    public PlaceHolderType Type { get; private set; }
    public Regex Splitter { get; private set; }
}

private static readonly List<PlaceHolderInfo> PlaceHolderInfos = new List<PlaceHolderInfo>
    {
        new PlaceHolderInfo(PlaceHolderType.Title, new Regex(TitleString)),
        new PlaceHolderInfo(PlaceHolderType.Head, new Regex(HeadString)),
        new PlaceHolderInfo(PlaceHolderType.Main, new Regex(MainString)),
    };

private static List<PagePart> SplitPage(string html)
{
    var parts = new List<PagePart>(new PagePart[] { new LiteralPart(html) });
    foreach (PlaceHolderInfo info in PlaceHolderInfos)
    {
        var newParts = new List<PagePart>();
        foreach (PagePart part in parts)
        {
            if (part is PlaceHolderPart)
            {
                newParts.Add(part);
            }
            else
            {
                var literalPart = (LiteralPart)part;
                // Note about Regex.Split: if match is found in beginning or end of string, an empty string is returned in corresponding end of returned array.
                string[] split = info.Splitter.Split(literalPart.Text); 
                for (int i = 0; i < split.Length; i++)
                {
                    newParts.Add(new LiteralPart(split[i]));
                    if (i + 1 < split.Length) // If result of Split returned more than one string, it means there was a match and we insert the placeholder between each string
                        newParts.Add(new PlaceHolderPart(info.Type));
                }
            }
        }
        parts = newParts;
    }
    return parts;
}
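
The splitter patterns (TitleString, HeadString, MainString) are not defined anywhere in the answer. The sketch below assumes, purely for illustration, that the template marks each placeholder with an HTML comment; the SplitterDemo class and the marker syntax are hypothetical. It also shows the Regex.Split behaviour that SplitPage relies on. One thing to keep in mind when writing real patterns: Regex.Split also returns the text of any capturing groups in its result array, which would confuse the part-counting logic, so non-capturing groups such as (?:...) are the safer choice.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo, not part of the original answer.
public static class SplitterDemo
{
    // Assumed marker patterns; the answer does not show the real values.
    public const string TitleString = @"<!--\s*TITLE\s*-->";
    public const string HeadString = @"<!--\s*HEAD\s*-->";
    public const string MainString = @"<!--\s*MAIN\s*-->";

    // Splits the HTML on the main-content marker, dropping the marker itself.
    public static string[] SplitOnMain(string html)
    {
        return new Regex(MainString, RegexOptions.IgnoreCase).Split(html);
    }
}
```

With these assumed patterns, SplitterDemo.SplitOnMain("<body><!-- MAIN --></body>") yields the two strings "<body>" and "</body>", so SplitPage would place one Main placeholder between them. If the marker sits at the very start of the input, the first array element is an empty string, and if there is no marker at all the array contains the input unchanged.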

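Note that this solution is easy to extend with more placeholders (breadcrumbs, menus, you name it). It makes no assumptions about the order of the placeholders in the template, or about whether they are present at all.

Edit 1: I originally called the DoScraping method from Render. That turned out to be a problem, because it renumbered the control names in Web Forms (for example ctl00$MainContentPlaceHolder$RequestingRepeater$ctl01$ctl01). The numbering got messed up to the point where OnCommand stopped working for buttons inside a repeater. The controls have to be reordered early enough to avoid this, so the call has now been moved to Init.

Edit 2: Some pages use Page.Header to generate style and script tags. To support this, I added a hack that keeps the original <head> tag and inserts the generated content at render time.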