Question

I have a feature column that contains HTML markup, and I want to remove all the tags. An example of one row of data from the "body" column:

"<p>Are questions related to and similar products on-topic?</p>"

I would like the output after using RegexTokenizer() to look like this:

"are questions related to and similar products on-topic?"

Here is what I started with:

val regexTokenizer = new RegexTokenizer()
  .setInputCol("body")
  .setOutputCol("removedTags")
  .setPattern("")

I think I need to fix .setPattern(), but I'm not sure how.

Answer 1

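Assuming your strings contain no other < or > characters, the pattern

<[^>]+>

replaced with an empty string should probably work. If a stray < or > appears elsewhere in the text, it will fail.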
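As a minimal sketch of that replacement in plain Scala (no Spark involved), the snippet below applies the pattern to the sample row from the question. The object name HtmlTagStripper and the helper stripTags are hypothetical names chosen for illustration, and the lowercasing step is an assumption based on the desired output shown in the question.

```scala
object HtmlTagStripper {
  // Matches "<", then one or more characters that are not ">", then ">".
  private val TagPattern = "<[^>]+>"

  // Hypothetical helper: removes every tag-like span, then lowercases the
  // remainder (lowercasing is assumed from the question's expected output).
  def stripTags(html: String): String =
    html.replaceAll(TagPattern, "").toLowerCase

  def main(args: Array[String]): Unit = {
    val body = "<p>Are questions related to and similar products on-topic?</p>"
    // prints: are questions related to and similar products on-topic?
    println(stripTags(body))
  }
}
```

Note that RegexTokenizer does not replace text: by default it treats the pattern as a delimiter to split on and returns an array of lowercased tokens. If a single cleaned string column is what you want, passing the same pattern to regexp_replace from org.apache.spark.sql.functions is likely a closer fit.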

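If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you like, you can also see in this link how it matches against some sample inputs.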