Question

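I am currently exporting content from one CMS and importing it into another.

I have the export in place. I export all content from the old CMS to an XML file, preserving the document structure and so on. The import is also in place: it maps to the new PageTypes, maps text fields, etc. I have also exported and imported all media from the old CMS to the new one.

The only issue I have left is handling internal links and media item links inside the RichText field of each page.

So each page contains a Header, some generic info, and a RichTextField which contains the HTML page content. This field can contain links to other pages within the same site (internal links) as well as links to media items.

My question is: how can I find these and map them to my new structure?

All internal links look like this: <a href="/mycms/~/link.aspx?_id=D9423CEFED254610A5DC6B096A297E17&_z=z">...</a> (some links may carry more attributes, such as style="..", class="..", etc.). The ID is a reference to the ID in the old CMS, and it is always 32 characters long.

Media items (images) may look like this: <img src="/mycms/~/media/B1FB91AC357347BD84913D56B8791D03.ashx" alt="" width="690" height="202" />. Again, the ID is always 32 characters.

During the import I generated a JSON file that contains all mediaIds from the old CMS, mapped to the new IDs in the new CMS. It looks like this: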

{
    "{0CFBBD0A-9156-4AD9-8A8A-7D30B2D7213B}":1095,
    "{BE9BEAAA-F04D-42DA-B52A-44B4B31A389E}":1096,
    etc.
}

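Note that the old CMS IDs in this file are formatted differently from the ones used in the links and media. Strip the curly braces and dashes, and they match.

What is the best way to do this? I guess a RegEx would be the way to go, but what would it look like?

Thanks :)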

Answer 1

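Your best bet is to use something like HtmlAgilityPack. Plain regular expressions are usually too crude to parse HTML successfully. It is not an impossible task, but it is harder than using HtmlAgilityPack.

The post Eric linked in his comment is a historically infamous StackOverflow post, with multiple replies detailing why parsing HTML with Regex is not recommended. To give a TL;DR from my personal experience: HTML pages are usually full of "small errors". For example, you will often have <img> tags that are not properly closed (e.g. as <img />). Deterministic matching and replacing is also difficult.

So try to use the right tool for the job, and in this case the right tool is HtmlAgilityPack.

As for using HtmlAgilityPack: they have good documentation. In your case you will probably want to look at the Replace Child functionality. To reproduce the example from their documentation, take the following test HTML: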

<body>
    <h1>This is <b>bold</b> heading</h1>
    <p>This is <u>underlined</u> paragraph</p>
</body>

To manipulate it and replace the <h1> node, you would do:

// requires: using HtmlAgilityPack;
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html); // where html = @"content previously mentioned"

var htmlBody = htmlDoc.DocumentNode.SelectSingleNode("//body");
// With the test HTML above, ChildNodes[0] is the whitespace text node
// after <body>, so ChildNodes[1] is the <h1> element.
HtmlNode oldChild = htmlBody.ChildNodes[1];
HtmlNode newChild = HtmlNode.CreateNode("<h2> This is h2 new child heading</h2>");

htmlBody.ReplaceChild(newChild, oldChild);
// now htmlBody has <h2> node instead of old <h1>

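In your case you will probably want to use SelectNodes instead of SelectSingleNode, with an XPath that targets the elements you want to replace. Once you have those elements in a list, you iterate over them and replace content based on your conditions.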
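As a rough illustration of the SelectNodes approach for the links and images in the question, here is a minimal sketch. The class name `LinkRewriter`, the `idMap` dictionary (assumed to be keyed by the 32-character uppercase old ID, for pages as well as media), and the new URL shapes `/page/{id}` and `/media/{id}` are all hypothetical; substitute whatever your new CMS expects.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class LinkRewriter
{
    // The old CMS IDs are always 32 hex characters.
    private static readonly Regex OldId = new Regex("[A-Fa-f0-9]{32}");

    public static string Rewrite(string html, IDictionary<string, int> idMap)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(
            "//a[contains(@href, '/~/link.aspx')] | //img[contains(@src, '/~/media/')]");
        if (nodes == null) return html;

        foreach (var node in nodes)
        {
            bool isLink = node.Name == "a";
            string attr = isLink ? "href" : "src";
            string value = node.GetAttributeValue(attr, string.Empty);

            Match m = OldId.Match(value);
            if (m.Success && idMap.TryGetValue(m.Value.ToUpperInvariant(), out int newId))
            {
                // Hypothetical URL formats for the new CMS.
                string newUrl = (isLink ? "/page/" : "/media/") + newId;
                node.SetAttributeValue(attr, newUrl);
            }
        }

        // Note: the serialized HTML may differ cosmetically from the input.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Nodes whose ID is missing from the map are left untouched, which makes it easy to log and review unmapped links after the import.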

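One thing to keep in mind: since your IDs are very distinctive at 32 characters, you can probably match them with a plain string search. So if you are targeting the IDs rather than particular HTML elements, you do not even need HtmlAgilityPack or Regex; a simple String.Replace("OLDUID", "NEWUID") will do.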
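A minimal sketch of that plain string replacement, assuming the JSON mapping from the question has already been deserialized into a dictionary with the braced GUIDs as keys. The names `IdMapDemo`, `NormalizeOldId` and `ReplaceIds` are made up for illustration. `Guid.Parse` accepts the braced form, and the `"N"` format specifier emits 32 digits without braces or dashes.

```csharp
using System;
using System.Collections.Generic;

public static class IdMapDemo
{
    // "{0CFBBD0A-9156-4AD9-8A8A-7D30B2D7213B}" -> "0CFBBD0A91564AD98A8A7D30B2D7213B"
    public static string NormalizeOldId(string bracedGuid) =>
        Guid.Parse(bracedGuid).ToString("N").ToUpperInvariant();

    public static string ReplaceIds(string html, IDictionary<string, int> bracedMap)
    {
        foreach (var pair in bracedMap)
        {
            // string.Replace is case-sensitive; this assumes the IDs in the
            // HTML are uppercase, as in the examples from the question.
            html = html.Replace(NormalizeOldId(pair.Key), pair.Value.ToString());
        }
        return html;
    }
}
```

This only swaps the ID itself and leaves the surrounding old URL (for example `/mycms/~/media/` and `.ashx`) in place, so it fits best when the new CMS can resolve that URL shape or when a second pass rewrites it.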

Answer 2

If you are mixing non-HTML with HTML, a regular expression is the better choice. Here is one way to do the replacement.

Links:

(?i)(<a)(?=((?:[^>"']|"[^"]*"|'[^']*')*?\shref\s*=\s*(['"])/mycms/~/link\.aspx\?_id=)([a-f0-9]{32})(&_z=z\3(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>

Replace with $1$2 + key{$4} + $5
where key{$4} is the new link ID value from the dictionary.

https://regex101.com/r/xRf1xN/1

 # https://regex101.com/r/ieEBj8/1

 (?i)                              # Case insensitive modifier
 ( < a )                           # (1), The a tag

 (?=                               # Assertion (a pseudo atomic group)

      (                                 # (2 start), Up to the ID num
           (?: [^>"'] | " [^"]* " | ' [^']* ' )*?

           \s href \s* = \s*                 # href attribute
           ( ['"] )                          # (3), Quote
           /mycms/~/link\.aspx\?_id=         # Prefix link static text
      )                                 # (2 end)

      ( [a-f0-9]{32} )                  # (4), hex link ID

      (                                 # (5 start), All past the ID num
           &amp;_z=z                         # Postfix link static text
           \3                                # End quote

                                             # The remainder of the tag parts
           (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
           > 
      )                                 # (5 end)

 )
                                   # All the parts have already been found via assertion
                                   # Just match a normal tag closure to advance the position
 \s+                               
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
 >

Media:

(?i)(<img)(?=((?:[^>"']|"[^"]*"|'[^']*')*?\ssrc\s*=\s*(['"])/mycms/~/media/)([a-f0-9]{32})(\.ashx\3(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>

Replace with $1$2 + key{$4} + $5
where key{$4} is the new media ID value from the dictionary.

https://regex101.com/r/pwyjoK/1

 # https://regex101.com/r/ieEBj8/1

 (?i)                              # Case insensitive modifier
 ( < img )                         # (1), The img tag

 (?=                               # Assertion (a pseudo atomic group)

      (                                 # (2 start), Up to the ID num
           (?: [^>"'] | " [^"]* " | ' [^']* ' )*?

           \s src \s* = \s*                  # src attribute
           ( ['"] )                          # (3), Quote
           /mycms/~/media/                   # Prefix media static text
      )                                 # (2 end)

      ( [a-f0-9]{32} )                  # (4), hex media ID

      (                                 # (5 start), All past the ID num
           \.ashx                            # Postfix media static text
           \3                                # End quote

                                             # The remainder of the tag parts
           (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
           > 
      )                                 # (5 end)

 )
                                   # All the parts have already been found via assertion
                                   # Just match a normal tag closure to advance the position
 \s+                               
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
 >
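The `$1$2 + key{$4} + $5` replacement is pseudocode: the new ID has to come from a dictionary lookup, which a plain replacement string cannot do. In .NET that can be done with a match evaluator. The sketch below is an assumption-laden illustration, not the answer's exact regex: it uses a simplified link pattern (it does not guard against `>` inside quoted attribute values, and it accepts both `&` and `&amp;` before `_z=z`), and `RegexIdSwap`, `SwapLinkIds` and `idMap` (keyed by the 32-character uppercase old ID) are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexIdSwap
{
    // Simplified stand-in for the link pattern above:
    // (1) tag text up to the ID, (2) opening quote, (3) old ID, (4) text after the ID.
    private static readonly Regex LinkId = new Regex(
        @"(<a\b[^>]*?\shref\s*=\s*(['""])/mycms/~/link\.aspx\?_id=)([a-f0-9]{32})(&(?:amp;)?_z=z\2)",
        RegexOptions.IgnoreCase);

    public static string SwapLinkIds(string html, IDictionary<string, int> idMap)
    {
        return LinkId.Replace(html, m =>
        {
            string oldId = m.Groups[3].Value.ToUpperInvariant();
            // Leave the match untouched when the old ID has no entry in the map.
            return idMap.TryGetValue(oldId, out int newId)
                ? m.Groups[1].Value + newId + m.Groups[4].Value
                : m.Value;
        });
    }
}
```

The same shape works for the media pattern by swapping `<a`/`href` for `<img`/`src` and the static prefix and suffix for `/mycms/~/media/` and `\.ashx`.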

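What if I want to a) extract the ID from the link/src tag, and b) replace the entire href=".." or src=".." value (rather than just the ID part)? What would that look like in RegEx?

To do that, just rearrange the capture groups.

Links: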

(?i)(<a)(?=((?:[^>"']|"[^"]*"|'[^']*')*?\s)(href\s*=\s*(['"])/mycms/~/link\.aspx\?_id=([a-f0-9]{32})&_z=z\4)((?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>

Replace with $1$2href='NEWID:key{$5}'$6
where key{$5} is the new link ID value from the dictionary.

https://regex101.com/r/FxpJVl/1

 (?i)                              # Case insensitive modifier
 ( < a )                           # (1), The a tag

 (?=                               # Assertion (a pseudo atomic group)

      (                                 # (2 start), Up to the href attribute
           (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
           \s 
      )                                 # (2 end)
      (                                 # (3 start), href attribute
           href \s* = \s* 
           ( ['"] )                          # (4), Quote
           /mycms/~/link\.aspx\?_id=         # Prefix link static text


           ( [a-f0-9]{32} )                  # (5), hex link ID


           &amp;_z=z                         # Postfix link static text
           \4                                # End quote
      )                                 # (3 end)
      (                                 # (6 start), remainder of the tag parts

           (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
           > 
      )                                 # (6 end)

 )
                                   # All the parts have already been found via assertion
                                   # Just match a normal tag closure to advance the position
 \s+                               
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
 >

Media:

(?i)(<img)(?=((?:[^>"']|"[^"]*"|'[^']*')*?\s)(src\s*=\s*(['"])/mycms/~/media/([a-f0-9]{32})\.ashx\4)((?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>

Replace with $1$2src='NEWID:key{$5}'$6
where key{$5} is the new media ID value from the dictionary.

https://regex101.com/r/EqKYjM/1

 (?i)                              # Case insensitive modifier
 ( < img )                         # (1), The img tag

 (?=                               # Assertion (a pseudo atomic group)

      (                                 # (2 start), Up to the src attribute
           (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
           \s 
      )                                 # (2 end)
      (                                 # (3 start), src attribute
           src \s* = \s* 
           ( ['"] )                          # (4), Quote
           /mycms/~/media/                   # Prefix media static text

           ( [a-f0-9]{32} )                  # (5), hex media ID

           \.ashx                            # Postfix media static text
           \4                                # End quote
      )                                 # (3 end)
      (                                 # (6 start), remainder of the tag parts

           (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
           > 
      )                                 # (6 end)

 )
                                   # All the parts have already been found via assertion
                                   # Just match a normal tag closure to advance the position
 \s+                               
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
 >

Find content in HTML and replace it
