Question

I have a text file that contains some HTML with links to YouTube videos. I use the following command to extract the links, where list is the HTML file:

grep -oP "https:\/\/www.youtube.com\/watch\?v=([A-Za-z0-9-_]+)" list > links

I also need to extract the name of each video, i.e. I need a second list like this:

Lab: K-means Clustering
Lab: Hierarchical Clustering
Interview with John Chambers
Interview with Bradley Efron
Interview with Jerome Friedman
Interviews with statistics graduate students

The problem is that the file also contains other tags, such as

<a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning with Applications in R</a>

so I cannot simply match on the a tag. I need something like pattern grouping, where $1 refers to the first captured group, $2 to the second, and so on, for a pattern like https:\/\/www.youtube.com\/watch\?v=([A-Za-z0-9-_]+)/[SOME INFORMATION ON URL HERE]/([A-Za-z0-9-_]+). How can I do this in the terminal (Bash)?

Answer 1

You can use a non-greedy regular expression, like this:

>([^<]+?)</a>

See the demo.

Or, more precisely, you can use look-arounds:

(?<=>)([^<]+?)(?=</a>)

Result:

Lab: K-means Clustering
Lab: Hierarchical Clustering
Interview with John Chambers
Interview with Bradley Efron
Interview with Jerome Friedman
Interviews with statistics graduate students
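The patterns in this answer are bare regexes, so here is a minimal sketch of one way to run the look-around version from Bash. It assumes GNU grep built with PCRE support (the -P option), and the sample file and video IDs are made up for illustration.

```shell
# Hypothetical sample input; the video IDs are placeholders, not real ones.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<a href="https://www.youtube.com/watch?v=aaaaaaaaaaa">Lab: K-means Clustering</a>
<a href="https://www.youtube.com/watch?v=bbbbbbbbbbb">Lab: Hierarchical Clustering</a>
EOF

# -o prints only the matched text; -P enables Perl-compatible regexes (GNU grep).
# The look-behind and look-ahead are zero-width, so only the link text is printed.
titles=$(grep -oP '(?<=>)[^<]+?(?=</a>)' "$sample")
printf '%s\n' "$titles"

rm -f "$sample"
```

Single quotes around the pattern keep Bash from interpreting any of the regex characters.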

Answer 2

You can do something like this:

grep -oP "(?<=\">).*(?=</a)" your_file

This will print:

Lab: K-means Clustering
Lab: Hierarchical Clustering
Interview with John Chambers
Interview with Bradley Efron
Interview with Jerome Friedman
Interviews with statistics graduate students

Since there is no easy way to print only the captured groups with grep, I used lookahead and lookbehind assertions to make sure only the specified part is printed.
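If numbered groups are really what is wanted, sed is one tool that exposes them, as \1 and \2 in the replacement rather than $1 and $2. The following is only a sketch of that alternative, not part of this answer's method; it assumes one link per line, a sed that accepts -E (GNU and BSD sed both do), and a made-up video ID.

```shell
# Hypothetical input line; the video ID is a placeholder.
line='<a href="https://www.youtube.com/watch?v=aaaaaaaaaaa">Lab: K-means Clustering</a>'

# Group 1 captures the URL, group 2 the link text.
# -n plus the p flag prints only lines where the substitution succeeded.
# "|" is used as the s-command delimiter so the slashes need no escaping.
pair=$(printf '%s\n' "$line" | sed -nE 's|.*<a href="([^"]*)">([^<]*)</a>.*|\1 \2|p')
printf '%s\n' "$pair"
```

Swapping the replacement to just \2 would print the title alone.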

Answer 3

You can use \K to discard everything matched before the part you actually want:

grep -oP "a href=\"[^>]+>\K[^<]+" file

Lab: K-means Clustering
Lab: Hierarchical Clustering
Interview with John Chambers
Interview with Bradley Efron
Interview with Jerome Friedman
Interviews with statistics graduate students

Or, assuming "> does not appear anywhere else:

grep -oP "\">\K[^<]+" file
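The question also mentions a non-YouTube link that should not end up in the list. A possible variation on the \K approach, sketched here under the assumption that every wanted link has the form href="https://www.youtube.com/watch?v=ID", is to put the URL prefix before \K so that other anchors never match. The sample file and video ID are invented for illustration; GNU grep with -P is assumed.

```shell
# Hypothetical sample: one unrelated link and one YouTube link (placeholder ID).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning with Applications in R</a>
<a href="https://www.youtube.com/watch?v=aaaaaaaaaaa">Interview with John Chambers</a>
EOF

# Everything before \K must match but is dropped from the output,
# so only the text of YouTube links is printed.
youtube_titles=$(grep -oP 'href="https://www\.youtube\.com/watch\?v=[A-Za-z0-9_-]+">\K[^<]+' "$sample")
printf '%s\n' "$youtube_titles"

rm -f "$sample"
```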

Answer 4

Using a portable awk solution:

awk -F '<a href[^>]*>|</a>' '{print $2}' file.html
Lab: K-means Clustering
Lab: Hierarchical Clustering
Interview with John Chambers
Interview with Bradley Efron
Interview with Jerome Friedman
Interviews with statistics graduate students

Extracting a string using regex grouping on the terminal
