Question

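What I'm trying to do

I am building a Jekyll Ruby plugin that replaces the first occurrence of any word in a post's body copy with a hyperlink to the URL of the post that has the same name.

The problem I'm having

I have this working, but I can't figure out two problems in the process_words method:

How do I search for a post title only in the post's main body copy, and not in the meta tags or in the table of contents (which is also generated before the main body copy)? I can't get it to work with Nokogiri, even though that seems to be the tool of choice here.
If the post's URL is not at post.data['url'], where is it?
Also, is there a more efficient, cleaner way to do this?

The current code works, but it replaces the first match even when that match is the value of an HTML attribute, such as in an anchor or a meta tag.

Example result

We have a blog with 3 posts:

Hobbies
Food
Bicycles

In the body text of the "Hobbies" post we have a sentence in which each of these words appears for the first time, like this: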

I love mountain biking and bicycles in general.

The plugin would process that sentence and output it as:

I love mountain biking and <a href="https://example.com/link/to/bicycles/">bicycles</a> in general.

My current code (updated 1)

# _plugins/hyperlink_first_word_occurance.rb
require "jekyll"
require 'uri'


module Jekyll

    # Replace the first occurrence of each post title in the content with the post's title hyperlink
    module HyperlinkFirstWordOccurance
        POST_CONTENT_CLASS = "page__content"
        BODY_START_TAG = "<body"
        ASIDE_START_TAG = "<aside"
        OPENING_BODY_TAG_REGEX = %r!<body(.*)>\s*!
        CLOSING_ASIDE_TAG_REGEX = %r!</aside(.*)>\s*!

        class << self
            # Public: Processes the content and updates the
            # first occurrence of each word that also has a post
            # of the same title, into a hyperlink.
            #
            # content - the document or page to be processed.
            def process(content)
                @title = content.data['title']
                @posts = content.site.posts

                content.output = if content.output.include? BODY_START_TAG
                                    process_html(content)
                                else
                                    process_words(content.output)
                                end
            end


            # Public: Determines if the content should be processed.
            #
            # doc - the document being processed.
            def processable?(doc)
                (doc.is_a?(Jekyll::Page) || doc.write?) &&
                    doc.output_ext == ".html" || (doc.permalink&.end_with?("/"))
            end


            private

            # Private: Processes html content which has a body opening tag.
            #
            # content - html to be processed.
            def process_html(content)
                # Split after the sidebar's closing tag when there is one,
                # otherwise at the post content class.
                head, opener, tail = if content.output.include? ASIDE_START_TAG
                                         content.output.partition(CLOSING_ASIDE_TAG_REGEX)
                                     else
                                         content.output.partition(POST_CONTENT_CLASS)
                                     end
                body_content, *rest = tail.partition("</body>")

                processed_markup = process_words(body_content)

                content.output = String.new(head) << opener << processed_markup << rest.join
            end

            # Private: Processes each word of the content and makes
            # the first occurrence of each word that also has a post
            # of the same title, into a hyperlink.
            #
            # html - the html which includes all the content.
            def process_words(html)
                page_content = html
                @posts.docs.each do |post|
                    post_title = post.data['title'] || post.name
                    post_title_lowercase = post_title.downcase
                    if post_title != @title
                        if page_content.include?(" " + post_title_lowercase + " ") ||
                            page_content.include?(post_title_lowercase + " ") ||
                            page_content.include?(post_title_lowercase + ",") ||
                            page_content.include?(post_title_lowercase + ".")
                            page_content = page_content.sub(post_title_lowercase, "<a href=\"#{ post.url }\">#{ post_title.downcase }</a>")
                        elsif page_content.include?(" " + post_title + " ") ||
                            page_content.include?(post_title + " ") ||
                            page_content.include?(post_title + ",") ||
                            page_content.include?(post_title + ".")
                            page_content = page_content.sub(post_title, "<a href=\"#{ post.data['url'] }\">#{ post_title }</a>")
                        end
                    end
                end
                page_content
            end
        end
    end
end


Jekyll::Hooks.register %i[posts pages], :post_render do |doc|
  # code to call after Jekyll renders a post
  Jekyll::HyperlinkFirstWordOccurance.process(doc) if Jekyll::HyperlinkFirstWordOccurance.processable?(doc)
end

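Update 1

Updated my code following @Keith Mifsud's suggestions. It now selects the body content to process using either the sidebar's aside element or the page__content class.

I also improved how the correct term is checked for and replaced.

PS: The code example I started from when developing this plugin was @Keith Mifsud's jekyll-target-blank plugin.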

Answer 1

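This code looks very familiar :) I suggest you look at the RSpec test files to test your issue: https://github.com/keithmifsud/jekyll-target-blank

I'll try to answer your questions as best I can. Sorry, I can't test these myself at the time of writing.

How do I search for a post title only in the post's main body copy, and not in the meta tags or in the table of contents (which is also generated before the main body copy)? I can't get it to work with Nokogiri, even though that seems to be the tool of choice here.

Your requirements are:

1) Ignore content outside the <body></body> tags.

This already appears to be implemented in the process_html() method. That method states that only body_content is processed, and it should work as is. Do you have tests? How are you debugging it? The same string splitting works in my plugin, i.e. only the content inside the body is processed.

2) Ignore content inside the table of contents (TOC). I suggest you extend the process_html() method by splitting the body_content variable further. Search for the content between the TOC's opening and closing tags (by id, CSS class, etc.) and exclude it, then add it back in its original position, before or after the string returned by process_words.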
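
A minimal sketch of that extra split, in plain Ruby. The TOC_OPEN and TOC_CLOSE markers and the process_outside_toc helper are hypothetical names chosen for illustration; the real markers depend on what your theme emits around the table of contents.

```ruby
# Hypothetical markers: adjust them to whatever your theme emits around the TOC.
TOC_OPEN  = '<nav class="toc">'
TOC_CLOSE = '</nav>'

# Splits body_content around the TOC, yields only the non-TOC parts to the
# block, and stitches everything back together in the original order.
def process_outside_toc(body_content)
  before, toc_open, rest = body_content.partition(TOC_OPEN)
  return yield(body_content) if toc_open.empty? # no TOC found

  toc_inner, toc_close, after = rest.partition(TOC_CLOSE)
  yield(before) + toc_open + toc_inner + toc_close + yield(after)
end

html = 'intro bicycles <nav class="toc"><a href="#bicycles">bicycles</a></nav> more bicycles'
puts process_outside_toc(html) { |part| part.upcase }
```

In the plugin, the block would call process_words instead of upcase. Note that the block runs separately on the text before and after the TOC, so a "first occurrence" rule would need shared state if a title may only be linked once per page.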

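3) Should you use the Nokogiri plugin? That plugin is great for parsing HTML. I think you are parsing strings and then creating HTML, so vanilla Ruby and the URI plugin should be enough. You can still use it if you want, but it won't be any faster than splitting strings in Ruby.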
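
As an illustration of the vanilla Ruby route, here is a rough sketch that links a word only when it sits in text outside of any tag, so attribute values are left alone. TAG_REGEX and link_first_in_text are hypothetical names, and the regex assumes that no attribute value contains a ">" character; it is not a full HTML parser.

```ruby
# Matches a single HTML tag such as <a href="..."> or </p>.
# Assumption: attribute values never contain ">".
TAG_REGEX = /(<[^>]*>)/

# Links the first occurrence of `word` found in text outside any tag.
# Tags (and therefore their attribute values) are passed through untouched.
def link_first_in_text(html, word, url)
  pattern = /\b#{Regexp.escape(word)}\b/i
  linked = false

  # split with a capture group keeps the tags as separate segments
  html.split(TAG_REGEX).map do |segment|
    next segment if linked || segment.start_with?("<")
    next segment unless segment.match?(pattern)

    linked = true
    segment.sub(pattern) { |match| %(<a href="#{url}">#{match}</a>) }
  end.join
end

html = '<p title="bicycles">I love mountain biking and bicycles in general.</p>'
puts link_first_in_text(html, "bicycles", "/bicycles/")
```

This still links text that is already inside an existing anchor or a script element, so those cases would need extra handling, which is where a real parser starts to pay off.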

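If the post's URL is not at post.data['url'], where is it?

I think you should have a method that gets all the post titles and then match the "words" against that array. You can get the whole posts collection from the document itself via doc.site.posts, and for each post return its title. The process_words() method can then check each word to see whether it matches an item in the array. But what if a title is made up of more than one word?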
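
One possible shape for that lookup, sketched with a stand-in Post struct so it runs without Jekyll. In a real plugin the title would come from post.data['title'] and, as far as I know, the URL from post.url, which one branch of the question's code already calls. The title_links and link_titles helpers are hypothetical names. Matching whole titles with a regex rather than word by word is one way to handle titles with several words.

```ruby
# Stand-in for a Jekyll document; in a real plugin the title would come
# from post.data['title'] and the URL from post.url.
Post = Struct.new(:title, :url)

# Builds a lowercase-title => URL lookup, skipping the current page.
def title_links(posts, current_title)
  posts.reject { |post| post.title == current_title }
       .to_h { |post| [post.title.downcase, post.url] }
end

# Links the first occurrence of every known title in plain text.
# Longer titles are tried first so "mountain bikes" wins over "bikes".
def link_titles(text, links)
  return text if links.empty?

  titles = links.keys.sort_by { |title| -title.length }
  pattern = /\b(?:#{titles.map { |title| Regexp.escape(title) }.join("|")})\b/i
  seen = {}

  # A single pass means inserted links are never scanned again.
  text.gsub(pattern) do |match|
    key = match.downcase
    next match if seen[key] # only the first occurrence becomes a link

    seen[key] = true
    %(<a href="#{links[key]}">#{match}</a>)
  end
end

posts = [Post.new("Hobbies", "/hobbies/"), Post.new("Food", "/food/"),
         Post.new("Bicycles", "/bicycles/")]
links = title_links(posts, "Hobbies")
puts link_titles("I love mountain biking and bicycles in general.", links)
```

The sketch works on plain text only, so it would have to be applied to the text between tags rather than to the raw HTML.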

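Also, is there a more efficient, cleaner way to do this?

So far, so good. I would start by solving the problem and then refactor for speed and coding standards.

Again, I suggest you use tests to help you with this.

Let me know if I can help any further :)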

Process Jekyll content to replace the first occurrence of a post title with a hyperlink to that post
