Question

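I've just started playing with regex and seem to be a bit stuck! I have written a bulk find and replace in TextSoap using multiline mode. It is for cleaning up recipes that I have OCR'd. Because the recipes contain both ingredients and directions, I can't simply change "1 " to "1. ", as that would rewrite "1 Tbsp" as "1. Tbsp".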

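So I check whether the following two lines (possibly with extra blank lines in between) start with the next consecutive numbers, using this as the find: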

^(1) (.*)\n?((\n))(^2 (.*)\n?(\n)^3 (.*)\n?(\n))
^(2) (.*)\n?((\n))(^3 (.*)\n?(\n)^4 (.*)\n?(\n))
^(3) (.*)\n?((\n))(^4 (.*)\n?(\n)^5 (.*)\n?(\n))
^(4) (.*)\n?((\n))(^5 (.*)\n?(\n)^6 (.*)\n?(\n))
^(5) (.*)\n?((\n))(^6 (.*)\n?(\n)^7 (.*)\n?(\n))

and the following as the replace for each of them:

$1. $2 $3 $4$5

My problem is that although it works the way I want, it never handles the last three numbers...

An example of the text I want to clean up:

1 This is the first step in the list

2 Second lot if instructions to run through
3 Doing more of the recipe instruction

4 Half way through cooking up a storm

5 almost finished the recipe

6 Serve and eat

I would like it to look like this:

1. This is the first step in the list

2. Second lot if instructions to run through

3. Doing more of the recipe instruction

4. Half way through cooking up a storm

5. almost finished the recipe

6. Serve and eat

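Is there a way to run the check backwards, against the line or two above? I have looked at lookahead and lookbehind, and that is where I get a bit confused. Does anyone have a way to clean up my numbered list, or can anyone help me with the regex I'm after?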

Answer 1

dan1111 is right. You may run into trouble with data that merely looks similar. But given the sample you provided, this should work:

^(\d+)\s+([^\r\n]+)(?:[\r\n]*) // search

$1. $2\r\n\r\n                 // replace

If you are not on Windows, remove the \r from the replace string.

Explanation:

^           // beginning of the line
(\d+)       // capture group 1. one or more digits
\s+         // any spaces after the digit. don't capture
([^\r\n]+)  // capture group 2. all characters up to any EOL
(?:[\r\n]*) // consume additional EOL, but do not capture

Replace:

$1.       // group 1 (the digit), then period and a space
$2        // group 2
\r\n\r\n  // two EOLs, to create a blank line
          // (remove both \r for Linux)
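
If you want to try this outside TextSoap, here is a rough Python sketch of the same search and replace. It assumes Python's `re` module with `re.MULTILINE` (so that `^` matches at the start of every line), writes the backreferences as `\1` and `\2` instead of `$1` and `$2`, and uses plain `\n` line endings. The name `renumber` is only for illustration, and TextSoap's own regex engine may differ in details.

```python
import re

# Same pattern as the search above; re.MULTILINE lets ^ match at each line start.
PATTERN = re.compile(r"^(\d+)\s+([^\r\n]+)(?:[\r\n]*)", re.MULTILINE)


def renumber(text: str) -> str:
    """Add a period after each leading number and follow each item with a blank line."""
    # Python spells the backreferences \1 and \2 rather than $1 and $2.
    return PATTERN.sub(r"\1. \2\n\n", text)


sample = (
    "1 This is the first step in the list\n"
    "\n"
    "2 Second lot if instructions to run through\n"
    "3 Doing more of the recipe instruction\n"
    "\n"
    "4 Half way through cooking up a storm\n"
    "\n"
    "5 almost finished the recipe\n"
    "\n"
    "6 Serve and eat"
)

print(renumber(sample))
```

Note that the replacement always appends two line breaks, so the result ends with a trailing blank line after the last step.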

Answer 2

What about this?

1 Tbsp salt
2 Tsp sugar
3 Eggs

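You have run into a major limitation of regular expressions: they don't work well when your data can't be strictly defined. You can tell intuitively what is an ingredient and what is a step, but it is not easy to turn that intuition into a reliable set of rules for an algorithm.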
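
To make that concrete, here is a small Python sketch of my own. It assumes Python's `re` module and a simple "leading number" rule similar to the one in the other answer. The pattern has no way of knowing that these lines are ingredients, so it numbers them as if they were steps:

```python
import re

# A simple "line starts with a number" rule, assumed here for illustration.
LEADING_NUMBER = re.compile(r"^(\d+)\s+([^\r\n]+)", re.MULTILINE)

ingredients = "1 Tbsp salt\n2 Tsp sugar\n3 Eggs"

# Every ingredient line matches, so every quantity gets a period after it.
mangled = LEADING_NUMBER.sub(r"\1. \2", ingredients)
print(mangled)
```

The quantities even happen to run 1, 2, 3, so checking for consecutive numbers would not save you here either.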

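I suggest you consider an approach based on position within the file. A given cookbook usually formats all of its recipes the same way: for example, ingredients first, then the list of steps. That may be an easier way to tell the two apart.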
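
A minimal sketch of what I mean, in Python. It assumes that each recipe has a heading line that marks where the steps begin. The heading text `"Method"`, the function name `number_steps`, and the sample recipe are all made up for illustration, so adjust them to whatever your OCR'd files actually contain.

```python
import re

# A line that starts with a number, then whitespace, then the rest of the text.
STEP_LINE = re.compile(r"^(\d+)\s+(.*)$")


def number_steps(text: str, steps_heading: str = "Method") -> str:
    """Add a period after leading numbers, but only below the steps heading."""
    out = []
    in_steps = False
    for line in text.splitlines():
        if line.strip() == steps_heading:
            # Everything after this (hypothetical) heading is treated as a step.
            in_steps = True
            out.append(line)
            continue
        match = STEP_LINE.match(line) if in_steps else None
        out.append(f"{match.group(1)}. {match.group(2)}" if match else line)
    return "\n".join(out)


recipe = (
    "Ingredients\n"
    "1 Tbsp salt\n"
    "2 Tsp sugar\n"
    "3 Eggs\n"
    "Method\n"
    "1 Mix the salt and sugar\n"
    "2 Add the eggs"
)

print(number_steps(recipe))
```

The ingredient lines are left alone because they come before the heading; only the lines after it are renumbered. If a file holds several recipes, you would also need to switch back off at the next ingredients heading, which this sketch does not do.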
