Question

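Beginner regex question. I have lines of JSON in a text file, each with slightly different fields, but there are 3 fields I want to extract from each line if it has them, ignoring everything else. How would I do this with a regex (in EditPad or anywhere else)?

Example: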

"url":"http://www.netcharles.com/orwell/essays.htm",
"domain":"netcharles.com",
"title":"Orwell Essays & Journalism Section - Charles' George Orwell Links",
"tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],
"index":2931,
"time_created":1345419323,
"num_saves":24

I want to extract url, title, and tags.

Answer 1

/"(url|title|tags)":"((\\"|[^"])*)"/i

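I think this is what you're asking for. An explanation follows. This regular expression (delimited by / characters, which you probably won't need to type in EditPad) matches:

"

A literal ".

(url|title|tags)

Any of the three literal strings "url", "title", or "tags". In regular expressions, parentheses create groups by default, and the pipe character means alternation, like a logical "or". To match those characters literally, you would have to escape them.

":"

Another literal string.

(

The beginning of another group (group 2).

(

Another group (group 3).

\\"

The literal string \". You have to escape the backslash; otherwise it would be interpreted as escaping the next character, and \" would match only a bare quote.

|

...or...

[^"]

Any single character except a double quote. The brackets denote a character class (or set): a list of characters to match. Any given class matches exactly one character in the string. A caret (^) at the beginning of a class negates it, so the matcher matches any character not contained in the class.

)

End of group 3.

*

The asterisk causes the preceding expression (here, group 3) to be repeated zero or more times, so the matcher matches anything that could appear inside the double quotes of a JSON string.

)"

The end of group 2, and a literal ".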

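I've done a few non-obvious things here that may come in handy:

Group 2, when dereferenced with a backreference, is the actual string assigned to the field. This is useful for getting at the value.
The i at the end of the expression makes it case-insensitive.
Group 1 contains the name of the captured field.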
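
As a rough illustration of groups 1 and 2, here is a minimal Python sketch that runs the pattern over part of the sample from the question. Python's re module stands in for EditPad here, whose regex flavor may differ in details, and the variable names are made up for the example.

```python
import re

# The pattern from above; re.IGNORECASE plays the role of the trailing /i.
pattern = re.compile(r'"(url|title|tags)":"((\\"|[^"])*)"', re.IGNORECASE)

sample = (
    '"url":"http://www.netcharles.com/orwell/essays.htm",'
    '"domain":"netcharles.com",'
    '"title":"Orwell Essays & Journalism Section - Charles\' George Orwell Links",'
    '"index":2931'
)

# Group 1 is the field name, group 2 is the string value without its quotes.
fields = {m.group(1): m.group(2) for m in pattern.finditer(sample)}
print(fields)
```

The domain and index fields are skipped because their names are not in the alternation.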

EDIT: I see that tags is an array. I'll update the regular expression once I've had a chance to think about it.

Your new regex is:

/"(url|title|tags)":("(\\"|[^"])*"|\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\])/i

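All I've done here is extend the string regex I had been using ("((\\"|[^"])*)") with an alternative for finding arrays (\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\]). Not so easy to read, is it? Well, if we substitute the letter S for our string regex, we can rewrite the array part as: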

\[(S(,S)*)?\]

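This matches a literal opening bracket (hence the backslash), optionally followed by a comma-separated list of strings, and then a closing bracket. The only new concept here is the question mark (?), which is itself a kind of repetition. Commonly described as "making the previous expression optional", it can also be thought of as matching exactly 0 or 1 times.

Using the same S notation, here is the whole dirty regular expression: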

/"(url|title|tags)":(S|\[(S(,S)*)?\])/i

Here's a view of it in action.
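
The S notation can be mirrored directly in code by building the pattern from pieces. The following is only a sketch in Python: the names S and ARRAY are illustrative, and the inner groups are made non-capturing (an assumption, to keep the group numbering simple), so group 1 is the field name and group 2 is the raw value.

```python
import re

# S: a double-quoted string that may contain escaped quotes (\").
S = r'"(?:\\"|[^"])*"'
# ARRAY: \[(S(,S)*)?\] with non-capturing groups.
ARRAY = r'\[(?:' + S + r'(?:,' + S + r')*)?\]'

pattern = re.compile(r'"(url|title|tags)":(' + S + '|' + ARRAY + ')', re.IGNORECASE)

sample = (
    '"url":"http://www.netcharles.com/orwell/essays.htm",'
    '"tags":["orwell","writing","literature"],'
    '"index":2931'
)

# Group 2 keeps the surrounding quotes or brackets of the value.
matches = [(m.group(1), m.group(2)) for m in pattern.finditer(sample)]
print(matches)
```

Because the whole array part is optional, an empty array such as "tags":[] also matches.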

Answer 2

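This question is a bit old, but I was browsing through my computer and found this expression. I'm passing it along as a Gist in case it's useful to others.

EDIT: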

# Expression was tested with PHP and Ruby
# This regular expression finds a key-value pair in JSON formatted strings
# Match 1: Key
# Match 2: Value
# https://regex101.com/r/zR2vU9/4
# http://rubular.com/r/KpF3suIL10

(?:\"|\')(?<key>[^"]*)(?:\"|\')(?=:)(?:\:\s*)(?:\"|\')?(?<value>true|false|[0-9a-zA-Z\+\-\,\.\$]*)

# test document
[
  {
    "_id": "56af331efbeca6240c61b2ca",
    "index": 120000,
    "guid": "bedb2018-c017-429E-b520-696ea3666692",
    "isActive": false,
    "balance": "$2,202,350",
    "object": {
        "name": "am",
        "lastname": "lang"
    }
  }
]
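
For readers who want to try the expression outside PHP or Ruby, here is a hedged Python adaptation run against the test document. The only change to the pattern is the named-group syntax, which Python writes as (?P<name>...); the variable names are invented for this sketch.

```python
import re

# Same expression as above, with (?<name>...) rewritten as (?P<name>...).
pattern = re.compile(
    r'''(?:\"|\')(?P<key>[^"]*)(?:\"|\')(?=:)(?:\:\s*)(?:\"|\')?(?P<value>true|false|[0-9a-zA-Z\+\-\,\.\$]*)'''
)

document = '''[
  {
    "_id": "56af331efbeca6240c61b2ca",
    "index": 120000,
    "guid": "bedb2018-c017-429E-b520-696ea3666692",
    "isActive": false,
    "balance": "$2,202,350",
    "object": {
        "name": "am",
        "lastname": "lang"
    }
  }
]'''

# Collect every key/value pair the expression finds, nested ones included.
pairs = {m.group("key"): m.group("value") for m in pattern.finditer(document)}
for key, value in pairs.items():
    print(key, "=>", repr(value))
```

Two quirks are worth knowing. Because the comma is part of the value character class, an unquoted number such as 120000 is captured together with its trailing comma. A key whose value is an object, such as "object", yields an empty value.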

Answer 3

Why does it have to be a Regexp object?

Here we can start with a Hash object and then search it.

mh = {"url":"http://www.netcharles.com/orwell/essays.htm","domain":"netcharles.com","title":"Orwell Essays & Journalism Section - Charles' George Orwell Links","tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],"index":2931,"time_created":1345419323,"num_saves":24}

Its output is

=> {:url=>"http://www.netcharles.com/orwell/essays.htm", :domain=>"netcharles.com", :title=>"Orwell Essays & Journalism Section - Charles' George Orwell Links", :tags=>["orwell", "writing", "literature", "journalism", "essays", "politics", "essay", "reference", "language", "toread"], :index=>2931, :time_created=>1345419323, :num_saves=>24}

Not that I want to avoid Regexp, but don't you think it's easier to go step by step until you have the data you want to search further? Just MHO.

mh.values_at(:url, :title, :tags)

Output:

["http://www.netcharles.com/orwell/essays.htm", "Orwell Essays & Journalism Section - Charles' George Orwell Links", ["orwell", "writing", "literature", "journalism", "essays", "politics", "essay", "reference", "language", "toread"]]
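
The same parse-first idea can be sketched in Python with the standard json module. This assumes each line of the file is one complete JSON object (the sample in the question is shown without its surrounding braces), and the names line, wanted, and extracted are illustrative.

```python
import json

line = (
    '{"url":"http://www.netcharles.com/orwell/essays.htm",'
    '"domain":"netcharles.com",'
    '"tags":["orwell","writing","literature"],'
    '"index":2931}'
)

record = json.loads(line)
wanted = ("url", "title", "tags")

# Keep only the wanted fields that are actually present on this line.
extracted = {key: record[key] for key in wanted if key in record}
print(extracted)
```

This line has no title, so only url and tags come back, and tags is already a real list rather than a string that still needs parsing.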

Taking the pattern FrankieTheKneeman gave you:

pattern = /"(url|title|tags)":"((\\"|[^"])*)"/i

We can search the mh hash by first converting it to a JSON string (to_json is available after require 'json').

/#{pattern}/.match(mh.to_json)

Output:

=> #<MatchData "\"url\":\"http://www.netcharles.com/orwell/essays.htm\"" 1:"url" 2:"http://www.netcharles.com/orwell/essays.htm" 3:"m">

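Of course, this is all done in Ruby, which isn't one of the tags on your question, but I hope it's still relevant.

But oops! It looks like a single match call with that pattern won't give us all three at once, so for simplicity's sake I'll just do them one at a time.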

pattern = /"(title)":"((\\"|[^"])*)"/i

/#{pattern}/.match(mh.to_json)

#<MatchData "\"title\":\"Orwell Essays & Journalism Section - Charles' George Orwell Links\"" 1:"title" 2:"Orwell Essays & Journalism Section - Charles' George Orwell Links" 3:"s">

pattern = /"(tags)":"((\\"|[^"])*)"/i

/#{pattern}/.match(mh.to_json)

=> nil

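Sorry about that last one. It has to be handled differently, because the value of tags is an array rather than a quoted string.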
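
The one-at-a-time limitation comes from asking for a single match. As a hedged Python sketch (Python is used here instead of Ruby, and the variable names are made up), re.search behaves like Ruby's Regexp#match and stops at the first hit, while re.finditer walks every hit much like String#scan:

```python
import re

pattern = re.compile(r'"(url|title|tags)":"((\\"|[^"])*)"', re.IGNORECASE)

text = (
    '{"url":"http://www.netcharles.com/orwell/essays.htm",'
    '"title":"Orwell Essays & Journalism Section - Charles\' George Orwell Links",'
    '"tags":["orwell","writing"]}'
)

# A single search returns only the first match.
first = pattern.search(text)
print(first.group(1))

# Iterating returns every match; tags is still absent because its value
# is an array, which this string-only pattern does not cover.
all_names = [m.group(1) for m in pattern.finditer(text)]
print(all_names)
```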

Answer 4

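I use regular expressions to work with JSON in my own library. I describe the algorithm's behavior in detail below.

First, stringify the JSON object. Then you need to store the start and length of the matched substring. For example: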

"matched".search("ch") // yields 3

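For a JSON string it works exactly the same way, unless you are explicitly searching for commas and braces. In that case I would suggest applying some prior transformation to the JSON object before running the regex (think of :, {, and }).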
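
A minimal Python sketch of this first step, stringifying and then recording where the match starts and how long it is, might look like the following. The object and the search term are just sample data from the question, not part of the library described here.

```python
import json
import re

obj = {"url": "http://www.netcharles.com/orwell/essays.htm", "num_saves": 24}

# Step 1: stringify the object.
text = json.dumps(obj)

# Step 2: search the string and keep the start index and length of the match.
m = re.search(r"orwell", text)
start, length = m.start(), m.end() - m.start()
print(start, length, text[start:start + length])
```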

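Next, you need to reconstruct the JSON object. The algorithm I wrote does this by recursively detecting JSON syntax backwards from the match index. For example, the pseudocode might look like this: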

find the next key preceding the match index, call this theKey
then find the number of all occurrences of this key preceding theKey, call this theNumber
using the number of occurrences of all keys with same name as theKey up to position of theKey, traverse the object until keys named theKey has been discovered theNumber times
return this object called parentChain

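With this information, a regex can be used to filter a JSON object and return the key, the value, and the parent object chain.

You can see the library and code I wrote at http://json.spiritway.co/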
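
To make the idea of returning a key, value, and parent chain concrete, here is a Python sketch of a different and simpler technique: it walks the already-parsed object forwards instead of scanning the string backwards from a match index, so it is not the algorithm from the library above. The function name find_matches and its signature are hypothetical.

```python
import re
from typing import Any, Iterator, Tuple


def find_matches(node: Any, pattern: re.Pattern, chain: Tuple = ()) -> Iterator[Tuple[Tuple, str, Any]]:
    """Yield (parent_chain, key, value) for each dict entry whose key or scalar value matches."""
    if isinstance(node, dict):
        for key, value in node.items():
            is_scalar = not isinstance(value, (dict, list))
            if pattern.search(key) or (is_scalar and pattern.search(str(value))):
                yield chain, key, value
            # Descend into the value, extending the chain with this key.
            yield from find_matches(value, pattern, chain + (key,))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            # List positions become part of the chain as integers.
            yield from find_matches(item, pattern, chain + (index,))


doc = [{"_id": "56af331efbeca6240c61b2ca", "object": {"name": "am", "lastname": "lang"}}]
results = list(find_matches(doc, re.compile(r"lang")))
print(results)
```

Working on the parsed structure avoids having to count repeated key names in the string, at the cost of parsing the whole document first.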

Answer 5

Try the following expression:

/"(url|title|tags)":("([^""]+)"|\[[^[]+])/gm

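Explanation:

First capturing group (url|title|tags): an alternation that captures the literal characters 'url', 'title', or 'tags' (case sensitive).

Second capturing group ("([^""]+)"|\[[^[]+]):

The first alternative, "([^""]+)", matches everything between " and ", including the quotes themselves.
The second alternative, \[[^[]+], matches everything between [ and ], including the brackets themselves.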

I have tested it here.
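
For a quick check outside an online tester, the expression can be sketched in Python as below. The character classes are written in an equivalent but slightly tidier form ([^"] for [^""], and the brackets escaped), which is a rewrite for this example and not the answer's exact text; iterating with finditer stands in for the g flag.

```python
import re

# Equivalent rewrite of /"(url|title|tags)":("([^""]+)"|\[[^[]+])/gm
pattern = re.compile(r'"(url|title|tags)":("([^"]+)"|\[[^\[]+\])')

sample = '''"url":"http://www.netcharles.com/orwell/essays.htm",
"domain":"netcharles.com",
"title":"Orwell Essays & Journalism Section - Charles' George Orwell Links",
"tags":["orwell","writing","literature"],
"index":2931'''

# Group 2 is the value including its quotes or brackets.
found = {m.group(1): m.group(2) for m in pattern.finditer(sample)}
for name, value in found.items():
    print(name, "=>", value)
```

Group 3 holds a string value without its quotes and is empty for arrays. Note that the array alternative is greedy: it runs to the last ] that appears before the next [, so a ] inside a later string value on the same input could stretch the match.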
