Question

I am trying to parse a large number of txt files. The text below is only one part of a larger txt file.

<P STYLE="font: 10pt Times New Roman, Times, Serif; margin: 0; text-align: justify">Prior to this primary offering, there has
been no public market for our common stock. We anticipate that the public offering price of the shares will be between $5.00 and
$6.00. We have applied to list our common stock on the Nasdaq Capital Market (&ldquo;Nasdaq&rdquo;) under the symbol &ldquo;HYRE.&rdquo;
If our application is not approved or we otherwise determine that we will not be able to secure the listing of our common stock
on the Nasdaq, we will not complete this primary offering.</P>

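The output I want: be between $5.00 and $6.00. So I need to extract everything from be between up to the following . (but without stopping at the decimal point in 5.00!). I tried the following (Python 3.7):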

shareprice = re.findall(r"be between\s\$.+?\.", text, re.DOTALL)

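But this code gives me: be between $5. (it stops at the decimal point). I initially added a \s at the end of the pattern to require whitespace after the ., which keeps the decimal point in 5.00 from ending the match, but in many of the other txt files there is no whitespace after the sentence-ending period. Is there any way to specify in the pattern that a \. followed by a digit should be "skipped"?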

Thanks a lot. I hope this is clear. Best

Answer 1

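After parsing the plain text out of the HTML, you may consider matching any 0+ characters, as few as possible, followed by a . that is not followed by a digit: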

r"be between\s*\$.*?\.(?!\d)"

See the regex demo.
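As a minimal sketch of how this pattern behaves, the snippet below runs it on a shortened version of the sample sentence. It assumes the text has already been stripped of HTML tags and entities; the names text, no_space, and PRICE_RANGE are made up for illustration. re.DOTALL is kept from the question because the sentence wraps across a line break.

```python
import re

# Sample text, assumed to be already stripped of HTML tags and entities.
text = (
    "We anticipate that the public offering price of the shares will be between $5.00 and\n"
    "$6.00. We have applied to list our common stock on the Nasdaq Capital Market."
)

# \.(?!\d) matches a period only when the next character is not a digit,
# so the decimal points in 5.00 and 6.00 cannot end the match.
PRICE_RANGE = re.compile(r"be between\s*\$.*?\.(?!\d)", re.DOTALL)

matches = PRICE_RANGE.findall(text)
print(matches)  # ['be between $5.00 and\n$6.00.']

# The same pattern still works when no space follows the sentence-ending period.
no_space = "The price will be between $5.00 and $6.00.We have applied to list."
no_space_matches = PRICE_RANGE.findall(no_space)
print(no_space_matches)  # ['be between $5.00 and $6.00.']
```

Because the match can contain the original line break, you may want to normalize whitespace afterwards, for example with " ".join(m.split()).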

Or, if you want to strictly ignore only a dot that sits between two digits, you can use

r"be between\s*\$.*?\.(?!(?<=\d\.)\d)"

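See this regex demo. The (?!(?<=\d\.)\d) lookahead ensures that only a . inside a \d\.\d pattern is skipped on the way to the first matching ., rather than every \.\d.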
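The difference between the two patterns only shows up when a sentence-ending period is directly followed by a digit. The sketch below uses a hypothetical sentence (not from the original file) in which a footnote marker "1" follows the period; the names loose and strict are illustrative.

```python
import re

# Hypothetical sample: a footnote marker "1" directly follows the sentence-ending period.
text = "The price will be between $5.00 and $6.00 per share.1 We have applied to list."

# Skips every period that is followed by a digit.
loose = re.compile(r"be between\s*\$.*?\.(?!\d)", re.DOTALL)
# Skips a period only when it has a digit on both sides.
strict = re.compile(r"be between\s*\$.*?\.(?!(?<=\d\.)\d)", re.DOTALL)

loose_match = loose.search(text).group()
strict_match = strict.search(text).group()

print(loose_match)   # be between $5.00 and $6.00 per share.1 We have applied to list.
print(strict_match)  # be between $5.00 and $6.00 per share.
```

The first pattern runs past "share." because that period is followed by a digit, while the second stops there because the character before the period is not a digit.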
