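Capturing possessives and prefixes with a Python regular expression

Time: 2018-03-22 23:04:27

Tags: python regex

I am trying to write a regular expression for Python that captures the various forms of "archipelago" that appear in a corpus.

Here is a test string: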

This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'

I want to capture the following from the string:

archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes

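Attempt 1

Using the regex (archipelag.*?)\b and testing it with Pythex, I capture at least part of all six forms. But there are some problems:

  1. archipelago's is captured only as archipelago. I want to capture the possessive.
  2. meta-archipelagic is captured only as archipelagic. I want to be able to capture hyphenated prefixes.
  3. protoarchipelagic is captured only as archipelagic. I want to be able to capture unhyphenated prefixes.

Attempt 2

If I try the regex (archipelag.*?)\s (see Pythex), it now captures the possessive archipelago's, but it also captures the comma after the first instance. It also fails entirely to capture the final archipelagoes.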
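To make the two attempts easier to compare, here is a minimal sketch that runs both patterns over the test string with re.findall. The variable names (corpus, attempt_1, attempt_2) are only illustrative and are not part of the original question.

```python
import re

# The test string from the question, split across lines for readability.
corpus = (
    "This is my sentence about islands, archipelagos, and archipelagic spaces. "
    "I want to make sure that the archipelago's cat is not forgotten. "
    "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
    "who tend to spell the plural 'archipelagoes.'"
)

# Attempt 1: lazy match from the stem up to the first word boundary.
attempt_1 = re.findall(r"(archipelag.*?)\b", corpus)

# Attempt 2: lazy match from the stem up to the first whitespace character.
attempt_2 = re.findall(r"(archipelag.*?)\s", corpus)

print(attempt_1)
print(attempt_2)
```

Both patterns start matching at the stem, so neither can include a prefix. The second one also needs a whitespace character after the word, and the test string ends without one, so the last form is missed.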

3 answers:

Answer 0 (score: 1)

The regex ((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?) works for this. You may need to modify it further if you have additional requirements.

Note the use of the non-capturing group (?:) to group expressions so that we can use ? to match zero or one occurrence of the whole group.

    import re

    pat = re.compile(r"((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?)")
    corpus = "This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'"

    # findall returns the text of the single capturing group for each match
    for match in pat.findall(corpus):
        print(match)

prints

archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes

Here it is on regex101
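As a reading aid, the same pattern can be written with re.VERBOSE so each piece carries a comment. This is only a sketch: the outer capturing group is dropped and finditer is used instead of findall, and the shortened sample string and the names sample and matches are assumptions made for illustration.

```python
import re

pat = re.compile(
    r"""
    (?:\b\w+\b-)?   # optional hyphenated prefix, e.g. meta-
    \b\w*           # optional unhyphenated prefix, e.g. proto
    archipelag      # the stem shared by every form
    \w*\b           # suffix such as os, ic, or oes
    (?:'s)?         # optional possessive ending
    """,
    re.VERBOSE,
)

# Shortened, hypothetical sample rather than the full test string.
sample = "the archipelago's cat, meta-archipelagic and protoarchipelagic historians"
matches = [m.group(0) for m in pat.finditer(sample)]
print(matches)
```

Because both optional parts are non-capturing groups, they can be made optional with ? without adding extra groups to the result.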

Answer 1 (score: 1)

Just make the regex more specific. This one could help: \b([a-zA-Z-]*archipelag[a-zA-Z']+)\b

Explanation:

  • \b asserts the position at a word boundary
  • [a-zA-Z-]* matches zero or more letters or hyphens
  • [a-zA-Z']+ matches one or more letters or apostrophes

You can check it here
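For a quick local check, the pattern can be run against the test string from the question. This is a minimal sketch; the names corpus, pattern, and found are illustrative.

```python
import re

# The test string from the question, split across lines for readability.
corpus = (
    "This is my sentence about islands, archipelagos, and archipelagic spaces. "
    "I want to make sure that the archipelago's cat is not forgotten. "
    "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
    "who tend to spell the plural 'archipelagoes.'"
)

pattern = re.compile(r"\b([a-zA-Z-]*archipelag[a-zA-Z']+)\b")

# findall returns the contents of the single capturing group per match.
found = pattern.findall(corpus)
print(found)
```

Note that [a-zA-Z] covers ASCII letters only, and the + requires at least one character after the stem.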

Answer 2 (score: 1)

Tried this one and it worked:

[a-zA-Z-]*arch[a-zA-Z']*
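One caveat worth checking before using this on a real corpus: the stem here is only arch, so any word containing those letters matches as well. The sketch below uses a made-up sentence (not from the original post) to show the effect; the names loose, sample, and hits are illustrative.

```python
import re

loose = re.compile(r"[a-zA-Z-]*arch[a-zA-Z']*")

# Hypothetical sentence with several unrelated words that contain "arch".
sample = "We search the archipelago's monarchy every March."
hits = loose.findall(sample)
print(hits)
```

On the question's test string this makes no difference, because no other word there contains arch, but a longer stem such as archipelag is safer for general text.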