Question

I have URLs like this:

a) <a href=\"http://example.com/path-pattern-to-match/subPath/onemoreSubpath/arbitrary-number-of-subpaths/someArticle1\">

or:

b) <a href=\"http://example.com/path-pattern-to-match/someArticle2\">

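I need to keep the start of the <a> tag together with the base URL and the path pattern, and join that directly to the trailing someArticle segment. Everything in between needs to be removed.

Case 'b' stays untouched. Case 'a' needs to become: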

<a href=\"http://example.com/path-pattern-to-match/someArticle1\">

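Please answer with a RegEx, that is what I need. Other solutions might be interesting if they are explained in detail using Perl or a bash script, but please avoid suggesting some programming module or function to parse it, or just saying that RegEx is not the best solution without offering any real solution.

PS: The files I need to parse are not multi-line. someArticle is variable.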

Answer 1

If your regex engine supports lookbehind, use

(?<=<a href=\\"http:\/\/example\.com\/path-pattern-to-match\/)(?:[^\/]+\/)*([^\/>"]*)(?=\\">)

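See the demo.

Explanation:

(?<=<a href=\\"http:\/\/example\.com\/path-pattern-to-match\/) - a fixed-width lookbehind that makes sure the match is immediately preceded by the literal text <a href=\"http://example.com/path-pattern-to-match/
(?:[^\/]+\/)* - 0 or more sequences of 1 or more characters other than / ([^\/]+) followed by a literal / (i.e. the subpaths)
([^\/>"]*) - a capturing group that matches our keyword "someArticle" (0 or more characters other than ", > or /)
(?=\\">) - a positive lookahead that checks that the preceding subpattern is followed by \">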

With $1 as the replacement string, you remove the subpaths and keep the "someArticle" part.
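
As a rough illustration, the substitution can be tried in Python, whose re module supports fixed-width lookbehind. This is only a sketch: the function name collapse_subpaths and the two sample strings are made up for the example, the input is assumed to contain literal backslashes before the quotes (as in the question), and Python writes the replacement as \1 rather than $1.

```python
import re

# Pattern from the answer, unchanged. It expects a literal backslash
# before each quote in the input, i.e. href=\"...\".
PATTERN = re.compile(
    r'(?<=<a href=\\"http:\/\/example\.com\/path-pattern-to-match\/)'
    r'(?:[^\/]+\/)*([^\/>"]*)(?=\\">)'
)


def collapse_subpaths(text):
    """Replace 'subpaths + article' with just the article (group 1)."""
    # Hypothetical helper name; Python uses \1 where the answer says $1.
    return PATTERN.sub(r"\1", text)


# Sample inputs copied from the question (cases a and b).
case_a = (
    r'<a href=\"http://example.com/path-pattern-to-match/'
    r'subPath/onemoreSubpath/arbitrary-number-of-subpaths/someArticle1\">'
)
case_b = r'<a href=\"http://example.com/path-pattern-to-match/someArticle2\">'

print(collapse_subpaths(case_a))
print(collapse_subpaths(case_b))
```

With these inputs, case a loses its subpaths and case b comes back unchanged. One caveat worth checking against your own data: [^\/]+ also matches quotes and angle brackets, so on a single line that holds several tags the subpath loop can run past the closing \"> of the first link when a later slash and \"> follow. Restricting that class to [^\/>"]+ is one possible tightening; that variant is not part of the answer above.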