Question

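I'm trying to clean up about 1,400 Markdown files. As part of that, I need to find a string and replace it throughout each file, but only after a certain section.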

Here is a sample file:

---
title: 'This is the post&#8217;s title'
author: foobar
date: 2007-12-04 12:41:01 -0800
layout: post
permalink: /2007/12/04/foo/
categories:
  - General
---


Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta &#8217; sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur.

&#8217;

I want to replace every &#8217; string with ', but only after the header.

I can capture the header with this:

(---((.|\n)*?)---)

But I'm having trouble capturing the rest of the text after the header.

Any suggestions? I'm using TextMate, but I can also do this in the terminal (on a Mac).

Answer 1

awk can do this by counting the header delimiter lines:

awk -v quote="'" '/^---$/ { header++} { if (header >= 2) { gsub("&#8217;", quote); }}1' infile > outfile
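For anyone who would rather script this outside awk, the same delimiter-counting idea can be sketched in Python. This is only an illustration under assumptions: the function names `replace_after_front_matter` and `clean_tree`, the `*.md` glob, and UTF-8 encoding are all hypothetical choices, not something from the question.

```python
from pathlib import Path


def replace_after_front_matter(text, old="&#8217;", new="'"):
    """Replace `old` with `new` only after the second '---' delimiter line."""
    delimiters_seen = 0
    out = []
    for line in text.splitlines(keepends=True):
        # Count lines that consist of exactly '---' (same test as /^---$/ in awk).
        if line.rstrip("\r\n") == "---":
            delimiters_seen += 1
        # Once the closing delimiter has been seen, we are in the body.
        if delimiters_seen >= 2:
            line = line.replace(old, new)
        out.append(line)
    return "".join(out)


def clean_tree(root):
    """Hypothetical helper: rewrite every *.md file under `root` in place."""
    for path in Path(root).rglob("*.md"):
        # newline="" keeps the file's original line endings untouched.
        with open(path, encoding="utf-8", newline="") as f:
            original = f.read()
        cleaned = replace_after_front_matter(original)
        if cleaned != original:
            with open(path, "w", encoding="utf-8", newline="") as f:
                f.write(cleaned)
```

As with the awk version, an entity inside the header (such as the one in the title) is left alone. It would be wise to run this on a copy of the files first.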

Answer 2

In TextMate:

search: ((?:---(?>[^-]++|-(?!--))*---|\G(?<!\A))(?>[^&]++|&(?!#8217;))*)&#8217;
replace: $1'

Pattern details:

(                    # capture group 1: all possible content before &#8217;
    (?:              # non capturing group: possible "anchors"
        ---          # beginning of the header: entry point
        (?>          # atomic group: possible content of the header
            [^-]++   # all that is not a -
          |          # OR
            -(?!--)  # a - not followed by --
        )*           # repeat the atomic group zero or more times
        ---          # end of the header
      |              # OR
        \G(?<!\A)    # contiguous to the previous match (not at the start)
    )                # close the non capturing group
    (?>              # atomic group: all that is not &#8217;
        [^&]++       # any character except &
      |              # OR
        &(?!#8217;)  # & not followed by #8217;
    )*               # repeat the atomic group zero or more times
)                    # close the capturing group
&#8217;

The idea is to use the \G feature to allow only contiguous matches.

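First match: the entry point is the header. Once the header is found (the first alternative in the non-capturing group), the pattern matches everything that is not &#8217; (the second atomic group) up to the next &#8217;.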

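Subsequent matches: \G forces every later match to be contiguous with the previous one. The second match starts where the first one ended, the third starts where the second ended, and so on.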
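To make the contiguous-match idea concrete, here is a rough Python sketch. Python's standard `re` module has no \G, so this emulates it by calling `Pattern.match(text, pos)`, which only matches starting exactly at `pos`. The names `HEADER`, `CHUNK`, and `replace_contiguous` are made up for the illustration, and the patterns are simplified (no atomic groups or possessive quantifiers), so this is not a drop-in equivalent of the TextMate pattern.

```python
import re

# Entry point: '---', then anything that does not begin a '---' run, then '---'.
HEADER = re.compile(r"---(?:[^-]|-(?!--))*---")
# Group 1: everything that is not the entity; then the entity itself.
CHUNK = re.compile(r"((?:[^&]|&(?!#8217;))*)&#8217;")


def replace_contiguous(text, replacement="'"):
    header = HEADER.search(text)
    if header is None:
        return text  # no entry point, so nothing is replaced
    out = [text[:header.end()]]  # header (and anything before it) is kept as-is
    pos = header.end()
    while True:
        # Anchored at pos: each match must start where the previous one ended.
        m = CHUNK.match(text, pos)
        if m is None:
            break
        out.append(m.group(1) + replacement)
        pos = m.end()
    out.append(text[pos:])  # trailing text with no further entity
    return "".join(out)
```

The loop plays the role of the repeated global search: the first iteration continues from the end of the header, and each later iteration continues from the end of the previous match.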

Replace strings in only part of a document
