I'm trying to write a regular expression for Python that captures the various forms of "archipelago" that appear in a corpus.
Here is a test string:
This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'
I'd like to capture the following from the string:
archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes
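Using the regex (archipelag.*?)\b and testing with Pythex, I capture at least part of all six forms. But there are some problems:
archipelago's is captured only as archipelago. I'd like to get the possessive.
meta-archipelagic is captured only as archipelagic. I'd like to be able to capture hyphenated prefixes.
protoarchipelagic is captured only as archipelagic. I'd like to be able to capture non-hyphenated prefixes.
If I try the regex (archipelag.*?)\s (see Pythex), the possessive archipelago's is now captured, but so is the comma after the first instance (e.g., archipelagos,). It also fails to capture the final archipelagoes at all.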
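For reference, the two attempts above can be reproduced outside Pythex with a short script. This is only a sketch of the behavior described in the question; the variable names boundary_hits and whitespace_hits are illustrative and not part of the original post.

```python
import re

# The test string from the question, split across lines for readability.
corpus = ("This is my sentence about islands, archipelagos, and archipelagic spaces. "
          "I want to make sure that the archipelago's cat is not forgotten. "
          "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
          "who tend to spell the plural 'archipelagoes.'")

# First attempt: the lazy match stops at the first word boundary,
# so the possessive 's and any prefix are lost.
boundary_hits = re.findall(r"(archipelag.*?)\b", corpus)

# Second attempt: the lazy match runs up to the next whitespace character,
# so trailing punctuation is included and the final word (not followed by
# whitespace) is not matched at all.
whitespace_hits = re.findall(r"(archipelag.*?)\s", corpus)

print(boundary_hits)
print(whitespace_hits)
```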
Answer 0 (score: 1)
The regex ((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?) works for this. It may need further modification if you have additional requirements.
Note the use of the non-capturing group (?:) to group parts of the expression, so that ? can be applied to the whole group to match zero or one occurrence of it.
import re
pat = re.compile(r"((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?)")
corpus = "This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'"
for match in pat.findall(corpus):
    print(match)
prints
archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes
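To illustrate why the non-capturing form matters with findall, here is a small sketch comparing the same pattern written with capturing and non-capturing inner groups. The short sample string and the variable names are made up for illustration, and the \b assertions are left out to keep the comparison short.

```python
import re

# Hypothetical two-word sample, chosen only to exercise both optional groups.
sample = "meta-archipelagic archipelago's"

# With capturing inner groups, findall returns one tuple per match,
# with an empty string for each optional group that did not participate.
with_capturing = re.findall(r"((\w+-)?\w*archipelag\w*('s)?)", sample)

# With non-capturing inner groups, only the outer group remains,
# so findall returns plain strings.
with_non_capturing = re.findall(r"((?:\w+-)?\w*archipelag\w*(?:'s)?)", sample)

print(with_capturing)
print(with_non_capturing)
```

In both versions the ? after each inner group makes the hyphenated prefix and the possessive optional; the only difference is what findall hands back.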
Answer 1 (score: 1)
Just make the regex more specific. This one could help:
\b([a-zA-Z-]*archipelag[a-zA-Z']+)\b
Explanation:
\b asserts position at a word boundary
[a-zA-Z-]* matches zero or more letters or -
[a-zA-Z']+ matches one or more letters or '
You can check it here
Answer 2 (score: 1)
Tried this one and it worked:
[a-zA-Z-]*arch[a-zA-Z']*
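One thing that may be worth checking before using this on a larger corpus: the pattern anchors on arch rather than archipelag, so it would presumably also pick up unrelated words containing those letters. A minimal sketch, assuming a made-up sentence (other_text is hypothetical and not from the question):

```python
import re

pattern = re.compile(r"[a-zA-Z-]*arch[a-zA-Z']*")

# Hypothetical sentence with no form of "archipelago" in it.
other_text = "The monarch went searching"

# Any word containing "arch" matches, whether or not it is related.
other_hits = pattern.findall(other_text)
print(other_hits)
```

If that is a concern, replacing arch with archipelag in the pattern narrows the matches to the intended word family.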