Question

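Some background: we are adding a style guide to a Middleman project. It is meant for other developers, so we want our code examples to be readable. However, when we change a component, we don't want to update code in multiple places.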

We use redcarpet for Markdown parsing and for creating the code examples.

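Usually that looks like this:

<%= partial '../partials/component' %>

```html
    <%= partial '../partials/component' %>
```

This, however, leaves very messy, hard-to-read code examples. We can clean them up nicely with htmlbeautifier, but we still have the problem of multiple spaces and line breaks inside HTML tags.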

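We want to remove extra spaces and line breaks inside tags, i.e. between < and >, as in <article class="default-s-sans teaser-media" data-item-ratio="16x9" data-background-color="d-blue" >. But not between elements, so this should stay unchanged:

<div>
    <span class="price">$100</span>
    <span>
       Word     word
    </span>
</div>

I got this far:

html.gsub(/(?<=<)(\s{2,})(?>)/, ' ')

But it only matches the spaces between < and > if there is nothing else in between.

How do I match the spaces between < and >, but also allow other characters?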

Answer 1

You can pass a block to gsub and clean up each match inside it:

html.gsub(/(?<=<)([^<>]+)(?=>)/) { |match| match.gsub(/\s+/, ' ') }
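As a minimal sketch of the block form, the substitution can be scoped to one tag at a time. This assumes every tag fits the pattern `<[^<>]+>`, i.e. no literal `<` or `>` appears inside an attribute value. The method name `squeeze_tag_whitespace` and the sample markup are made up for illustration:

```ruby
# Hypothetical helper: collapse whitespace runs inside tags only.
# Assumes attribute values never contain a literal "<" or ">".
def squeeze_tag_whitespace(html)
  # /<[^<>]+>/ matches a single tag, from "<" to the nearest ">".
  html.gsub(/<[^<>]+>/) do |tag|
    # \s covers spaces, tabs and line breaks, so one pass is enough.
    tag.gsub(/\s+/, ' ')
  end
end

# Sample input (illustrative only): one tag is spread over two lines,
# and the text node "Word     word" contains a run of spaces.
messy = <<~'HTML'
  <div>
      <span   class="price"
              data-currency="usd">$100</span>
      <span>
         Word     word
      </span>
  </div>
HTML

puts squeeze_tag_whitespace(messy)
```

Only the text between `<` and `>` is handed to the block, so indentation between elements and the spaces inside text nodes are left alone. Runs of whitespace inside quoted attribute values would be collapsed too, which may or may not matter for a style guide.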

Answer 2

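String#squeeze to the rescue:

squeeze([other_str]*) → new_str

Builds a set of characters from the other_str parameter(s) using the procedure described for String#count. Returns a new string where runs of the same character that occur in this set are replaced by a single character. If no arguments are given, all runs of identical characters are replaced by a single character.

"yellow moon".squeeze                  #=> "yelow mon"
"  now   is  the".squeeze(" ")         #=> " now is the"
"putters shoot balls".squeeze("m-z")   #=> "puters shot balls"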

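One caveat when applying this to the question: squeeze only collapses runs of the same character, so a line break followed by spaces is not reduced to a single space on its own. A possible sketch, assuming tags match `<[^<>]+>` and using a made-up helper name `squeeze_tags`, converts line breaks and tabs to spaces first and then squeezes:

```ruby
# Hypothetical helper combining String#tr and String#squeeze per tag.
# Assumes attribute values never contain a literal "<" or ">".
def squeeze_tags(html)
  html.gsub(/<[^<>]+>/) do |tag|
    # Turn line breaks and tabs into spaces, then collapse runs of spaces.
    tag.tr("\n\r\t", ' ').squeeze(' ')
  end
end

puts squeeze_tags("<img  src=\"a.png\"\n\t alt=\"x\">  some   text")
```

The `tr` call relies on Ruby padding the replacement set with its last character, so all three control characters map to a space.
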
Regex to remove multiple spaces and line breaks inside HTML tags
