Question

I'm trying to write a regular expression for Python that captures the various forms of "archipelago" that appear in a corpus.

Here is a test string:

This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'

I want to capture the following from the string:

archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes

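Attempt 1

Using the regex (archipelag.*?)\b and testing it with Pythex, I capture at least part of all six forms. But there are some problems:

archipelago's is captured only as archipelago. I want to capture the possessive as well.
meta-archipelagic is captured only as archipelagic. I want to be able to capture hyphenated prefixes.
protoarchipelagic is captured only as archipelagic. I want to be able to capture non-hyphenated prefixes.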
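
The behavior described in this attempt can be reproduced locally. The following is a minimal sketch, assuming the question's test string is stored in a variable named corpus (a name chosen here for illustration only). It shows that the lazy .*? stops at the first word boundary, which is why the 's and any prefix are dropped.

```python
import re

# Test string from the question; the variable name `corpus` is an assumption.
corpus = (
    "This is my sentence about islands, archipelagos, and archipelagic spaces. "
    "I want to make sure that the archipelago's cat is not forgotten. "
    "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
    "who tend to spell the plural 'archipelagoes.'"
)

# Attempt 1: lazily extend the match until the nearest word boundary.
attempt1 = re.findall(r"(archipelag.*?)\b", corpus)
print(attempt1)
# ['archipelagos', 'archipelagic', 'archipelago', 'archipelagic', 'archipelagic', 'archipelagoes']
```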

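Attempt 2

If I try the regex (archipelag.*?)\s instead (see Pythex), the possessive archipelago's is now captured, but so is the comma following the first instance (i.e., archipelagos,). It also fails to capture the final archipelagoes at all.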
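
A similar sketch for the second attempt, again assuming the question's test string is held in a hypothetical variable named corpus. Because \s demands a whitespace character right after the match, trailing punctuation gets pulled in, and a word with no whitespace after it before the end of the string is never matched.

```python
import re

# Test string from the question; the variable name `corpus` is an assumption.
corpus = (
    "This is my sentence about islands, archipelagos, and archipelagic spaces. "
    "I want to make sure that the archipelago's cat is not forgotten. "
    "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
    "who tend to spell the plural 'archipelagoes.'"
)

# Attempt 2: lazily extend the match until the next whitespace character.
attempt2 = re.findall(r"(archipelag.*?)\s", corpus)
print(attempt2)
# ['archipelagos,', 'archipelagic', "archipelago's", 'archipelagic', 'archipelagic']
```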

Answer 1

The regex ((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?) works for this. You may need to modify it further if you have additional requirements.

Note the use of the non-capturing group (?:...) to group sub-expressions, so that we can apply ? to the whole group and match it zero or one time.

import re

# Optional hyphenated prefix, the word itself, then an optional possessive 's
pat = re.compile(r"((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?)")

corpus = "This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'"

# findall returns the contents of the single capturing group for each match
for match in pat.findall(corpus):
    print(match)

prints

archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes

Here it is on regex101

Answer 2

Just make the regex more specific. This one could help: \b([a-zA-Z-]*archipelag[a-zA-Z']+)\b

Explanation:

\b asserts position at a word boundary
[a-zA-Z-]* matches zero or more letters or hyphens
[a-zA-Z']+ matches one or more letters or apostrophes

You can check it here
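
For reference, a minimal sketch of how this pattern could be run with re.findall, assuming the question's test string is stored in a variable named corpus (the variable names here are illustrative, not part of the answer). Since the pattern has a single capturing group, findall returns that group's text for each match.

```python
import re

# Test string from the question; the variable name `corpus` is an assumption.
corpus = (
    "This is my sentence about islands, archipelagos, and archipelagic spaces. "
    "I want to make sure that the archipelago's cat is not forgotten. "
    "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
    "who tend to spell the plural 'archipelagoes.'"
)

# Letters/hyphens before the stem, letters/apostrophes after it.
pattern = re.compile(r"\b([a-zA-Z-]*archipelag[a-zA-Z']+)\b")

matches = pattern.findall(corpus)
for m in matches:
    print(m)
```

On the test string this yields the six requested forms, including archipelago's and meta-archipelagic.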

Answer 3

Tried this one and it worked:

[a-zA-Z-]*arch[a-zA-Z']*
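
One caveat worth checking before running this on a larger corpus: because the pattern keys only on arch, it also matches unrelated words that contain those letters. A small sketch, using a made-up sample sentence (not from the question) and illustrative variable names:

```python
import re

# Hypothetical sample sentence, invented for this illustration.
sample = "The monarchy's archive described the archipelago's coast."

# Pattern from this answer: matches any word containing "arch".
loose = re.findall(r"[a-zA-Z-]*arch[a-zA-Z']*", sample)
print(loose)   # ["monarchy's", 'archive', "archipelago's"]

# Same shape with the longer stem "archipelag" to narrow the matches.
strict = re.findall(r"[a-zA-Z-]*archipelag[a-zA-Z']*", sample)
print(strict)  # ["archipelago's"]
```

If the corpus may contain words such as archive or monarchy, the longer stem is probably the safer choice.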

Capturing possessives and prefixes with Python regex
