Question

Here is an example of what I need to match in the requests I have stored as text:

[{"id":"896","name":"TinyAuras","author_id":"654","author":"Kurisu</span></strong></span></a>","githubFolder":"https://github.com/xKurisu/TinyAuras/blob/master/TinyAuras.csproj","count":9,"countByChampion":{"":9,"total":9},"description":"(Beta) Aura/Buff/Debuff Tracker","udate":"1451971516","createdDays":375,"image":"https://cdn.joduska.me/forum/uploads/assemblydb/image-default.jpg","strudate":"2016-07-22 19:40","champions":null,"forum_link":"165574","assembly_compiles":true,"voted":false,"voted_champions":[]},

I want to select the link up to the point where it should stop (basically the GitHub folder, not the actual .csproj file).

I have a file containing thousands of these entries, and I am trying to extract all of the links and put them in a text file.

Here is the Perl regex I have so far: (?<=githubFolder":").*(?=\/.+\.csproj") but after the first match it ends up selecting more than I need. Any suggestions?

The thing is, I want everything before this.csproj.

So in my example I want to extract: https://github.com/xKurisu/TinyAuras/blob/master/

Answer 1

This regex:

"githubFolder":"([^"]*/)[^"/]*"

selects

https://github.com/xKurisu/TinyAuras/blob/master/

in your example.

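However, using an actual JSON parser is probably the better option, as Jim D.'s answer suggests, so that you don't have to worry about spacing and special characters.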
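For anyone who wants to try the regex outside Perl, here is a minimal Python sketch. The two-entry sample string and the name extract_folders are made up for illustration, and it assumes the input has no whitespace around the colon, as in the question.

```python
import re

# Pattern from this answer: capture everything up to and including the last
# slash of the githubFolder value, then skip over the file name.
PATTERN = re.compile(r'"githubFolder":"([^"]*/)[^"/]*"')

# Hypothetical sample: two entries on one line, as in a minified JSON dump.
sample = (
    '[{"id":"896","githubFolder":"https://github.com/xKurisu/TinyAuras/blob/master/TinyAuras.csproj","count":9},'
    '{"id":"897","githubFolder":"https://github.com/example/Other/blob/master/src/Other.csproj","count":1}]'
)

def extract_folders(text):
    """Return every githubFolder link, cut off after its last slash."""
    return PATTERN.findall(text)

for folder in extract_folders(sample):
    print(folder)
```

Because [^"]* cannot cross a quote, each match stays inside one value even when many entries share a line. Redirecting the printed lines to a file gives the text file the question asks for.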

Answer 2

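Although the accepted answer will probably get the job done here, I want to point out that old-school Linux tools are hard to use if you want 100% accurate results from JSON, so the best practice is to extract the content with an actual JSON parser.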

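One simple reason is that the strings are JSON-encoded, so you need to decode them somehow to be sure you get the correct result. Another is that JSON is not a regular language; it is context-free, so in general you need something more powerful than a regular expression.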
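To make the encoding point concrete, here is a small Python sketch. It assumes a value that contains escaped quotes, like the second object in the test file used in this answer, and compares a quote-delimited regex with the standard json module.

```python
import json
import re

# Hypothetical one-entry input whose value contains escaped quotes.
raw = r'[{"githubFolder":"https://github.com/xKurisu/\"GiantAuras\"/blob/master/GiantAuras.csproj"}]'

# A quote-delimited regex stops at the first quote character, which here is
# the escaped one inside the value.
regex_value = re.search(r'"githubFolder":"([^"]*)', raw).group(1)

# A JSON parser decodes the escapes and returns the whole string.
parsed_value = json.loads(raw)[0]["githubFolder"]

print(regex_value)
print(parsed_value)
```

The regex result is cut short just before "GiantAuras" and still carries the backslash, while the parser returns the complete, decoded link.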

The tool I am familiar with is jq. An array of JSON objects like the OP's can be parsed like this:

$ jq -r ' .[] | .githubFolder ' foo
https://github.com/xKurisu/TinyAuras/blob/master/TinyAuras.csproj
https://github.com/xKurisu/"GiantAuras"/blob/master/GiantAuras.csproj
$

where the file foo is:

[
  {
    "id": "896",
    "name": "TinyAuras",
    "author_id": "654",
    "author": "Kurisu</span></strong></span></a>",
    "githubFolder": "https://github.com/xKurisu/TinyAuras/blob/master/TinyAuras.csproj",
    "count": 9,
    "countByChampion": {
      "": 9,
      "total": 9
    },
    "description": "(Beta) Aura/Buff/Debuff Tracker",
    "udate": "1451971516",
    "createdDays": 375,
    "image": "https://cdn.joduska.me/forum/uploads/assemblydb/image-default.jpg",
    "strudate": "2016-07-22 19:40",
    "champions": null,
    "forum_link": "165574",
    "assembly_compiles": true,
    "voted": false,
    "voted_champions": []
  },
  {
    "id": "888",
    "name": "\"GiantAuras\"",
    "author_id": "666",
    "author": "Astaire</span></strong></span></a>",
    "githubFolder": "https://github.com/xKurisu/\"GiantAuras\"/blob/master/GiantAuras.csproj",
    "count": 90,
    "countByChampion": {
      "": 777,
      "total": 42
    },
    "description": "(Stable) Aura/Buff/Debuff Tracker",
    "udate": "1451971517",
    "createdDays": 399,
    "image": "https://cdn.joduska.me/forum/uploads/assemblydb/image-default.jpg",
    "strudate": "2016-07-22 19:40",
    "champions": null,
    "forum_link": "165574",
    "assembly_compiles": true,
    "voted": false,
    "voted_champions": []
  }
]
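The jq command prints the full link, including the .csproj file name. If jq is not available, or only the folder part is wanted, something along these lines in Python should do both steps with the standard json module. The function name folder_links and the trimmed-down copy of foo are only for illustration.

```python
import json

def folder_links(json_text):
    """Yield each githubFolder value with its trailing file name removed."""
    for entry in json.loads(json_text):
        link = entry["githubFolder"]
        # Keep everything up to and including the last slash.
        yield link.rsplit("/", 1)[0] + "/"

# Trimmed-down copy of the file foo (only the fields used here).
foo = r'''
[
  {"name": "TinyAuras",
   "githubFolder": "https://github.com/xKurisu/TinyAuras/blob/master/TinyAuras.csproj"},
  {"name": "\"GiantAuras\"",
   "githubFolder": "https://github.com/xKurisu/\"GiantAuras\"/blob/master/GiantAuras.csproj"}
]
'''

for link in folder_links(foo):
    print(link)
```

Splitting on the last slash after decoding avoids any dependence on spacing or escaped quotes in the raw text.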

Answer 3

Here is the regex:

("githubFolder":".*)\/(.*\.csproj)

1. "githubFolder":"https://github.com/removed/removed/blob/master/stophere/this.csproj      
    1.1. Group: "githubFolder":"https://github.com/removed/removed/blob/master/stophere
    1.2. Group: this.csproj

You can test it here: http://www.regexe.com
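One caveat worth checking: .* is greedy, so if several entries sit on the same line (as they appear to in the question's sample) the first group can run on to the last .csproj in the line. The Python sketch below uses a made-up two-entry line to compare the pattern with a variant bounded by [^"]. The slashes are left unescaped because Python does not need \/.

```python
import re

# Hypothetical single-line input with two entries.
line = (
    '{"githubFolder":"https://github.com/a/One/blob/master/One.csproj"},'
    '{"githubFolder":"https://github.com/b/Two/blob/master/Two.csproj"}'
)

# Pattern from this answer.
greedy = re.compile(r'("githubFolder":".*)/(.*\.csproj)')
# Variant that cannot cross a quote or, for the file name, a slash.
bounded = re.compile(r'("githubFolder":"[^"]*)/([^"/]*\.csproj)')

greedy_matches = [m.group(0) for m in greedy.finditer(line)]
bounded_files = [m.group(2) for m in bounded.finditer(line)]

print(len(greedy_matches))  # number of matches found by the greedy pattern
print(bounded_files)        # file names found by the bounded pattern
```

On this input the greedy pattern produces a single match spanning both entries, while the bounded variant finds each entry separately.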

Answer 4

This pattern: (http|https):\/\/github\.com\/[\w\/]+\/ selects every directory path that starts with github.com.
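A quick way to see what this pattern returns is the Python sketch below. The second link is a hypothetical one with a hyphen in the repository name, included because \w does not match hyphens.

```python
import re

# Pattern from this answer, written without the optional slash escapes.
pattern = re.compile(r'(http|https)://github\.com/[\w/]+/')

sample_link = "https://github.com/xKurisu/TinyAuras/blob/master/TinyAuras.csproj"
# Hypothetical link whose repository name contains a hyphen.
hyphen_link = "https://github.com/example/my-repo/blob/master/App.csproj"

sample_match = pattern.search(sample_link).group(0)
hyphen_match = pattern.search(hyphen_link).group(0)

print(sample_match)
print(hyphen_match)
```

The question's link comes back as the full folder path, but the hyphenated one is cut at the user name, so the character class may need extending for real data.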

Answer 5

Try this regex:

githubFolder":"([a-zA-Z:\/.]+\/)

It captures the link up to the last slash in a group.
