Question

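Sadly, Google Reader has been announced to shut down in the middle of this year. Since I have a large number of starred items in Google Reader, I would like to back them up. This is possible with Google Reader Takeout, which produces a file in JSON format.

Now I would like to extract all of the article URLs from this file, which is several MB in size.

At first I thought a generic URL regex would be best, but it seems better to use a regex that targets only the article URLs I need. That avoids extracting other URLs that are not wanted.

Here is a short example of what parts of the JSON file look like: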

"published" : 1359723602,
"updated" : 1359723602,
"canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
"alternate" : [ {
  "href" : "http://feeds.arstechnica.com/~r/arstechnica/everything/~3/EphJmT-xTN4/",
  "type" : "text/html"
} ],

I only need the URLs that appear here:

 "canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],

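Perhaps someone is in the mood to show what a regex that extracts all of these URLs would look like?

The benefit would be a quick and dirty way to extract the starred item URLs from Google Reader, so that once processed they can be imported into services such as Pocket or Evernote.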

Answer 1

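I know you asked about a regex, but I think there is a better way to handle this problem. Multi-line regular expressions are a PITA, and in this case there is no need for that kind of brain damage.

I would start with grep rather than a regex. The -A1 parameter says "return the line that matches, and one line after it":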

grep -A1 "canonical" <file>

This will return lines like:

"canonical" : [ {
    "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"

Then I grep again, this time for href:

grep -A1 "canonical" <file> | grep "href"

which gives

"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"

Now I can use awk to get just the URL:

grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }'

This strips off everything up to and including the opening quote of the URL:

http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
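To see why the trailing quote survives, it may help to print both awk fields side by side. The following is only a small illustrative sketch: `show_fields` is a hypothetical helper name and the example.com URL is a placeholder, neither comes from the original answer. The field separator is the same five-character string used above (quote, space, colon, space, quote), so everything after it, including the closing quote, ends up in `$2`.

```shell
#!/bin/sh
# Hypothetical helper: show how awk splits one "href" line
# when the field separator is the literal string  " : "
show_fields() {
  printf '%s\n' "$1" |
    awk -F'" : "' '{ print "1=[" $1 "]"; print "2=[" $2 "]" }'
}

# Placeholder input line in the same shape as the JSON excerpt
show_fields '  "href" : "http://example.com/article/"'
```

The second field begins right after the separator, which is why only the closing quote is left to clean up.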

Now I just need to get rid of the extra quote:

grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }' | tr -d '"'

And that's it!

http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/
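If a single command is preferred, the same steps can probably be folded into one awk program. This is a sketch rather than a tested replacement for the pipeline above: `extract_canonical_urls` is a hypothetical function name, and it assumes the pretty-printed layout from the question, where the "href" line sits directly below the "canonical" line. A file laid out differently (for example, all on one line) would need a real JSON parser instead.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): prints the "href"
# value found on the line right after each "canonical" line.
# Assumes the pretty-printed layout shown in the question.
extract_canonical_urls() {
  awk '
    /"canonical"/ { want = 1; next }   # remember that the next line matters
    want && /"href"/ {
      split($0, parts, "\"")           # parts[4] is the text between the 3rd and 4th quote
      print parts[4]
    }
    { want = 0 }                       # only look one line ahead, like grep -A1
  ' "$1"
}

# Usage with a small sample file (contents copied from the excerpt in the question)
sample=$(mktemp)
cat > "$sample" <<'EOF'
"published" : 1359723602,
"canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
"alternate" : [ {
  "href" : "http://feeds.arstechnica.com/~r/arstechnica/everything/~3/EphJmT-xTN4/",
  "type" : "text/html"
} ],
EOF
extract_canonical_urls "$sample"
rm -f "$sample"
```

Splitting on the quote character removes both quotes in one step, so no separate tr call is needed, and the "alternate" URL is skipped because it does not follow a "canonical" line.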

Regex to extract all starred item URLs from a Google Reader JSON file
