Question

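I want to convert my WordPress website to a static site on GitHub using Jekyll.

I used a plugin that exports my 62 posts to GitHub as Markdown. I now have these posts, each with extra front matter at the beginning of the file. It looks like this: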

---
ID: 51
post_title: Here's my post title
author: Frank Meeuwsen
post_date: 2014-07-03 22:10:11
post_excerpt: ""
layout: post
permalink: >
  https://myurl.com/slug
published: true
sw_timestamp:
  - "399956"
sw_open_thumbnail_url:
  - >
    https://myurl.com/wp-content/uploads/2014/08/Featured_image.jpg
sw_cache_timestamp:
  - "408644"
swp_open_thumbnail_url:
  - >
    https://myurl.com/wp-content/uploads/2014/08/Featured_image.jpg
swp_open_graph_image_data:
  - '["https://i0.wp.com/myurl.com/wp-content/uploads/2014/08/Featured_image.jpg?fit=800%2C400&ssl=1",800,400,false]'
swp_cache_timestamp:
  - "410228"
---

This block isn't parsed by Jekyll, and I don't need all of this front matter. I would like each file's front matter converted to

---
ID: 51
post_title: Here's my post title
author: Frank Meeuwsen
post_date: 2014-07-03 22:10:11
layout: post
published: true
---

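I would like to do this with regular expressions, but my knowledge of regex isn't that great. Even with the help of this forum and lots of Google searches I didn't get very far. I know how to find the complete front matter, but how do I replace it with the part specified above?

I might have to do this in steps, but I can't work out how.

I use TextWrangler as my editor for the search and replace.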

Answer 1

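Edit: I misunderstood the question the first time. I hadn't realized that the actual post is in the same file, right after the closing ---.

Using egrep and GNU sed (so not just bash built-ins), it's relatively easy: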

# create a working copy
mv file file.old
# get only the fields you need from the frontmatter and redirect that to a new file
egrep '(---|ID|post_title|author|post_date|layout|published)' file.old > file
# get everything from the old file, but discard the frontmatter
cat file.old |gsed '/---/,/---/ d' >> file
# remove working copy
rm file.old

If you want to do it all in one go:

for i in *; do mv "$i" "$i.old"; egrep '(---|ID|post_title|author|post_date|layout|published)' "$i.old" > "$i"; cat "$i.old" | gsed '/---/,/---/ d' >> "$i"; rm "$i.old"; done

For good measure, this is the first reply I wrote:

===========================================================

I think you're making this too complicated.

A simple egrep will do what you want:

egrep '(---|ID|post_title|author|post_date|layout|published)' file

Redirect to a new file:

egrep '(---|ID|post_title|author|post_date|layout|published)' file > newfile

For a complete directory:

for i in *; do egrep '(---|ID|post_title|author|post_date|layout|published)' "$i" > "$i.new"; done

Answer 2

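In a case like yours it's better to use an actual YAML parser and a scripting language. Cut the metadata out of each file into a separate file (or string), then load it with a YAML library. Once the metadata is loaded, you can modify it safely. Then use the serialize method from the same library to create the new metadata, and finally put the pieces back together.

Something like this: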

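<?php
// Front matter keys to preserve
$keep = ['ID', 'post_title', 'author', 'post_date', 'layout', 'published'];
// Split on lines consisting of "---": text before the front matter, the YAML metadata, the post body
list($before, $metadata, $after) = preg_split('/^---[ \t]*(?:\R|\z)/m', file_get_contents($argv[1]), 3);
$yaml = yaml_parse($metadata);
$yaml_copy = [];
foreach ($yaml as $k => $v) {
    // copy the data you wish to preserve to $yaml_copy
    if (in_array($k, $keep, true)) {
        $yaml_copy[$k] = $v;
    }
}
// yaml_emit() writes its own leading "---" and trailing "..."; turn the trailing marker into "---"
$emitted = preg_replace('/^\.\.\.\s*\z/m', "---\n", yaml_emit($yaml_copy));
file_put_contents('new/'.$argv[1], $before.$emitted.$after);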

(This is just an untested draft with no error checking.)
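
The cut-and-reassemble part of this approach can also be sketched in Python using only the standard library; the YAML parsing in between is left to whichever YAML library you choose. `split_front_matter` and `join_front_matter` are hypothetical helper names, not part of any library, and the sketch assumes the front matter starts on the first line and ends at the next line consisting of `---`.

```python
def split_front_matter(text):
    """Return (yaml_text, body); yaml_text is None when there is no front matter."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].rstrip() != "---":
        return None, text
    for i, line in enumerate(lines[1:], start=1):
        if line.rstrip() == "---":
            # Everything between the two delimiters is YAML; the rest is the body.
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    # Opening delimiter without a closing one: treat the whole file as body.
    return None, text


def join_front_matter(yaml_text, body):
    """Reassemble a file from (possibly modified) YAML text and the body."""
    if not yaml_text.endswith("\n"):
        yaml_text += "\n"
    return "---\n" + yaml_text + "---\n" + body


original = "---\nID: 51\nlayout: post\n---\n# Heading\n\nText\n"
yaml_text, body = split_front_matter(original)
print(repr(yaml_text))
print(repr(body))
```

Only `yaml_text` would be handed to the YAML parser, so the Markdown body never reaches it.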

Answer 3

You can do it with gawk like this:

gawk 'BEGIN {RS="---"; FS="\000" } (FNR == 2) { print "---"; n = split($1, fm, "\n"); for (i = 1; i <= n; i++) { if (fm[i] ~ /^(ID|post_title|author|post_date|layout|published):/) {print fm[i]} } print "---" } (FNR > 2) {print}' post1.html > post1_without_frontmatter_fields.html

Answer 4

You basically want to edit the file. That is what sed (the stream editor) is for.

sed -E '1,/^---/{/^(---|(ID|post_title|author|post_date|layout|published):)/!d;}' file

Answer 5

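YAML (like other relatively free-form formats such as HTML, JSON, and XML) is best not transformed with regular expressions: it is easy to get something working for one example and have it break on the next one that has extra whitespace, different indentation, and so on.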
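
To illustrate that fragility, the sketch below uses a hypothetical regular expression that removes the `permalink` entry exactly as it appears in the question (a folded scalar indented by two spaces). It works on that sample and silently does nothing when the same value is indented by four spaces, which is equally valid YAML. The pattern and variable names are made up for this illustration.

```python
import re

# Hypothetical pattern: "permalink: >" followed by one line indented by exactly two spaces.
PERMALINK = re.compile(r"^permalink: >\n  \S.*\n", re.MULTILINE)

as_exported = "layout: post\npermalink: >\n  https://myurl.com/slug\npublished: true\n"
reindented = "layout: post\npermalink: >\n    https://myurl.com/slug\npublished: true\n"

print(PERMALINK.sub("", as_exported))  # the permalink entry is removed
print(PERMALINK.sub("", reindented))   # unchanged: same value, different indentation
```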

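Using a YAML parser in this situation is not trivial, since many parsers either expect a single YAML document in the file (and barf on the Markdown part as extraneous stuff) or expect multiple YAML documents in the file (and barf because the Markdown is not YAML). Moreover, most YAML parsers throw away useful things like comments, and reorder mapping keys.

I have used a similar format (a YAML header followed by reStructuredText) for my ToDo items for years, and use a small Python program to extract and update these files. Given input like this: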

---
ID: 51     # one of the key/values to preserve
post_title: Here's my post title
author: Frank Meeuwsen
post_date: 2014-07-03 22:10:11
post_excerpt: ""
layout: post
permalink: >
  https://myurl.com/slug
published: true
sw_timestamp:
  - "399956"
sw_open_thumbnail_url:
  - >
    https://myurl.com/wp-content/uploads/2014/08/Featured_image.jpg
sw_cache_timestamp:
  - "408644"
swp_open_thumbnail_url:
  - >
    https://myurl.com/wp-content/uploads/2014/08/Featured_image.jpg
swp_open_graph_image_data:
  - '["https://i0.wp.com/myurl.com/wp-content/uploads/2014/08/Featured_image.jpg?fit=800%2C400&ssl=1",800,400,false]'
swp_cache_timestamp:
  - "410228"
---
additional stuff that is not YAML
  and more
  and more

this program¹:

import sys
import ruamel.yaml

from pathlib import Path


def extract(file_name, position=0):
    doc_nr = 0
    if not isinstance(file_name, Path):
        file_name = Path(file_name)
    yaml_str = ""
    with file_name.open() as fp:
        for line_nr, line in enumerate(fp):
            if line.startswith('---'):
                if line_nr == 0:  # don't count --- on first line as next document
                    continue
                else:
                    doc_nr += 1
            if position == doc_nr:
                yaml_str += line
    # YAML() defaults to round-trip mode, which keeps comments and key order
    yaml = ruamel.yaml.YAML()
    yaml.preserve_quotes = True
    return yaml.load(yaml_str)


def reinsert(ofp, file_name, data, position=0):
    doc_nr = 0
    inserted = False
    if not isinstance(file_name, Path):
        file_name = Path(file_name)
    with file_name.open() as fp:
        for line_nr, line in enumerate(fp):
            if line.startswith('---'):
                if line_nr == 0:
                    ofp.write(line)
                    continue
                else:
                    doc_nr += 1
            if position == doc_nr:
                if inserted:
                    continue
                ruamel.yaml.YAML().dump(data, ofp)
                inserted = True
                continue
            ofp.write(line)


data = extract('input.yaml')
for k in list(data.keys()):
    if k not in ['ID', 'post_title', 'author', 'post_date', 'layout', 'published']:
        del data[k]

reinsert(sys.stdout, 'input.yaml', data)

you get this output:

---
ID: 51     # one of the key/values to preserve
post_title: Here's my post title
author: Frank Meeuwsen
post_date: 2014-07-03 22:10:11
layout: post
published: true
---
additional stuff that is not YAML
  and more
  and more

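Note that the comment on the ID line is properly preserved.

¹ This was done using ruamel.yaml, a YAML 1.2 parser that tries to preserve as much information as possible on round-trips, of which I am the author.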

Answer 6

You can also use python-frontmatter:

import frontmatter
import glob

# Where are the files to modify
path = "*.markdown"

# Front matter keys to keep
keep = ['ID', 'post_title', 'author', 'post_date', 'layout', 'published']

# Loop through all files
for fname in glob.glob(path):
    # Parse the file's front matter
    post = frontmatter.load(fname)
    # Iterate over a copy of the keys, since entries are deleted while looping
    for k in list(post.metadata):
        if k not in keep:
            del post[k]

    # Save the modified file (frontmatter.dump writes encoded bytes)
    with open(fname, 'wb') as newfile:
        frontmatter.dump(post, newfile)

If you want to see more examples, visit this page.

Hope it helps.
