Question

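I'm writing a parser that pulls data from a hidden iframe.

In the text, I need to replace the \n (↵) characters with a space. I use text.replace(/\n/gi, " ") for this. However, it only works for visible elements (i.e. those without display: none). If the element is not visible (display: none), the line breaks simply disappear and are not replaced with anything.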

HTML example:

<div data-custom="languages">
    <div>
        <div>
            <h2>
                <span>Just a text that will be removed</span>
            </h2>
            <p>A - b</p>
            <p>c - d</p>
        </div>
    </div>
</div>

JS example:

visibleIframe.style.display = "block";
invisibleIframe.style.display = "none";

const visibleDivWithNestedDivs = visibleIframe.querySelector(`[data-custom="languages"]`);
const invisibleDivWithNestedDivs = invisibleIframe.querySelector(`[data-custom="languages"]`);

const visibleText = visibleDivWithNestedDivs.innerText; // "A - b↵c - d"
const invisibleText = invisibleDivWithNestedDivs.innerText; // "A - b↵c - d"

console.log(visibleText.replace(/\n/gi, " ")); // "A - b c - d" (expected result)
console.log(invisibleText.replace(/\n/gi, " ")); // "A - bc - d" (unexpected result, no space between "b" and "c")

Things I've tried:

.replace(/\n/gi, " ")
.replace(/\r\n/gi, " ")
.replace(/↵/gi, " ")
.replace(/↵↵/gi, " ") // in some cases there were two of these.
.split("↵").join(" ") 
.split("\n").join(" ")
white-space: pre
white-space: pre-wrap

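Have you tested it?

I'm 99% sure it's because of display: none. I tested it, and the differently displayed iframes give me different results.

textContent

I don't want textContent, because it returns the text without the \n characters. I use innerText.

Questions:

Could the unexpected result be caused by something other than display: none?
How can I achieve the expected result?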

Answer 1

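First, let's clear up a few misconceptions you seem to have, based on the examples you provided.

↵ is a Unicode character, described as DOWNWARDS ARROW WITH CORNER LEFTWARDS. It is a nice way to visualize a line break or the Return / Enter key, but it has no special meaning in code. If you use this symbol in a regular expression, the regex will try to match text that contains the arrow symbol itself.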
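As a quick check of the point above, here is a small sketch that compares the arrow symbol with an actual line feed. The variable names are only illustrative, and the sample string is assumed to be what innerText returns for a rendered element.

```javascript
const arrowChar = "↵";
const lineFeed = "\n";

// The arrow is U+21B5; the line feed is U+000A.
console.log(arrowChar.codePointAt(0).toString(16)); // "21b5"
console.log(lineFeed.codePointAt(0).toString(16));  // "a"

// Assumed sample: two lines separated by a real line feed.
const renderedText = "A - b\nc - d";

// A pattern built from the arrow never matches a real line break...
const replacedArrow = renderedText.replace(/↵/g, " ");
// ...while \n does.
const replacedLineFeed = renderedText.replace(/\n/g, " ");

console.log(replacedArrow === renderedText); // true (nothing was replaced)
console.log(replacedLineFeed);               // "A - b c - d"
```

The ↵ seen in the browser console is just how DevTools displays a line feed inside a string, so searching for the arrow itself finds nothing.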

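In most programming languages, \n in a string stands for a line break, so you usually don't have to worry about how it is represented under the hood (CR, LF, or both). That's why I wouldn't bother with \r in JavaScript here.

.replace(/\n/gi, " ") is a perfectly valid option, depending on what you're trying to do. However, you may want to replace any sequence of whitespace that contains a line break. In that case, I'd use .replace(/\s+/g, " ") instead. The \s special code in a RegExp matches any kind of whitespace, including line breaks. Adding + makes it match a whole run of whitespace, and the g flag applies the replacement to every run rather than only the first one. This ensures that a string such as "a \n \n b" is turned into "a b".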
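To make the difference between these patterns concrete, here is a minimal sketch on a made-up string. One detail worth checking in your own code is the g flag: without it, replace only touches the first match.

```javascript
// Made-up sample with two separate whitespace runs.
const spacedSample = "a \n \n b\n\nc";

// Only line feeds are replaced; the surrounding spaces stay,
// so runs of several spaces remain in the result.
const newlinesOnly = spacedSample.replace(/\n/g, " ");

// Without the g flag, only the first whitespace run is collapsed.
const firstRunOnly = spacedSample.replace(/\s+/, " ");

// With the g flag, every whitespace run becomes a single space.
const allRuns = spacedSample.replace(/\s+/g, " ");

console.log(JSON.stringify(newlinesOnly));
console.log(JSON.stringify(firstRunOnly)); // "a b\n\nc"
console.log(JSON.stringify(allRuns));      // "a b c"
```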

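Now that the regex issues are out of the way, let's look at innerText. According to the HTML Living Standard, which I found through the MDN article for innerText, the innerText property is an approximation of what the user would get if they copied and pasted the text from that element. It is defined as follows:

If this element is not being rendered, or if the user agent is a non-CSS user agent, then return the same value as the textContent IDL attribute on this element. Note: This step can produce surprising results, as when the innerText attribute is accessed on an element not being rendered, its text contents are returned, but when accessed on an element that is being rendered, all of its children that are not being rendered have their text contents ignored.

This explains why there can be a difference between visible and hidden elements. As for the number of line breaks, the algorithm that determines how many of them end up in the string is defined recursively in the standard, and it is quite confusing, which is why I'd recommend not basing your logic on the behavior of this property. innerText is only an approximation.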

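I'd suggest looking at textContent, which is not affected by CSS.

So, to summarize this long explanation:

Yes, display: none really does affect innerText.
Depending on what you're after, I would probably use foo.textContent.replace(/\s+/g, " ").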
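A minimal sketch of that suggestion follows. The helper name getFlatText, the trailing trim(), and the stand-in object are assumptions made for illustration; the stand-in lets the snippet run outside a browser, whereas in a page you would pass the element returned by querySelector.

```javascript
// Hypothetical helper: collapse every whitespace run in a node's
// textContent into one space and trim the ends (trim is an addition).
function getFlatText(node) {
    return node.textContent.replace(/\s+/g, " ").trim();
}

// Stand-in for a DOM element (assumption): textContent of markup whose
// paragraphs are separated by line breaks and indentation in the source.
const fakeElement = {
    textContent: "\n    A - b\n    c - d\n",
};

console.log(getFlatText(fakeElement)); // "A - b c - d"
```

Keep in mind that textContent simply concatenates the text nodes. If the markup in the iframe has no whitespace at all between the closing and opening paragraph tags, there is nothing for the regex to replace and the two paragraphs run together.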

Answer 2

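So, based on the excellent answer by Jacque Goupil, I created my own workaround. It uses innerHTML.

Algorithm:

Get the innerHTML of the element.
Decode the HTML entities.
Remove the HTML markup (tags and so on).
Replace multiple whitespace characters with a single space.
Replace the spaces between words with the separator.

Warnings:

This is a workaround.
It is very slow and not suitable for regular use!
It parses HTML with regular expressions. This is really dangerous and can break everything. Make sure the regular expressions fit your HTML structure.

Code: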

/**
 * Returns the text value of the element (and its children).
 *
 * @param dcmnt {Document}
 * The `document` in which the element will be searched for.
 *
 * @param selector {string}
 * The selector used for the search.
 *
 * @param separator {string}
 * The separator that replaces each space in the resulting text
 * (between words and between elements).
 * Defaults to `" "` (one space).
 *
 * @returns {string}
 * The element's text value if it was found, otherwise an empty string.
 *
 * Warning!
 *
 * This method is pretty slow, because it parses a slice of HTML
 * instead of just reading a text value. This is necessary because of
 * elements that are not rendered (i.e. that have `display: none`).
 * `innerText` and `textContent` return inappropriate results
 * for this kind of element.
 * For more see:
 *
 * @see https://stackoverflow.com/questions/52480730/replace-n-in-non-render-non-display-element-text
 */
function getTextValue(dcmnt, selector, separator) {
    separator = separator || " ";
    const element = dcmnt.querySelector(selector);

    if (!element) {
        return "";
    }

    /**
     * @see https://stackoverflow.com/questions/7394748/whats-the-right-way-to-decode-a-string-that-has-special-html-entities-in-it#7394787
     */
    const _decodeEntities = (html) => {
        const textArea = document.createElement("textarea");
        textArea.innerHTML = html;

        return textArea.value;
    };

    let innerHTML = element.innerHTML;

    // decode HTML entities, but keep tags and other markup.
    // caveat: escaped text such as `&lt;b&gt;` becomes `<b>` here
    // and is then stripped by the tag regex below.
    innerHTML = _decodeEntities(innerHTML);

    // replace HTML tags with a space.
    // @see https://stackoverflow.com/questions/6743912/get-the-pure-text-without-html-element-by-javascript#answer-6744068
    innerHTML = innerHTML.replace(/<[^>]*>/g, " ");

    // replace each run of whitespace with a single space.
    innerHTML = innerHTML.replace(/\s+/g, " ");

    // remove whitespace from the beginning and the end.
    innerHTML = innerHTML.trim();

    // at this point there is exactly one space between words,
    // so replace each space with the separator.
    innerHTML = innerHTML.replace(/ /g, separator);

    return innerHTML;
}
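To see what the string steps do in isolation, here is a DOM-free sketch of the same pipeline. The function name flattenHtml is hypothetical, and entity decoding is left out on purpose because the textarea trick above needs a browser.

```javascript
// Hypothetical, DOM-free variant of the string steps in getTextValue.
// Assumption: the input contains no HTML entities that need decoding.
function flattenHtml(html, separator) {
    const sep = separator || " ";

    return html
        .replace(/<[^>]*>/g, " ") // replace every tag with a space
        .replace(/\s+/g, " ")     // collapse whitespace runs
        .trim()                   // drop leading and trailing space
        .replace(/ /g, sep);      // swap each remaining space for the separator
}

// Markup with no whitespace between the paragraphs.
const compactHtml = "<div><p>A - b</p><p>c - d</p></div>";

console.log(flattenHtml(compactHtml));      // "A - b c - d"
console.log(flattenHtml(compactHtml, "|")); // "A|-|b|c|-|d"
```

The second call shows a consequence of the last step: the separator replaces every space, including the ones inside a paragraph, not only the boundaries between elements.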

Gist.

Replace ↵ (\n) in the text of a non-rendered (non-displayed) element
