Question

How can I use a regular expression to get specific links out of a document? I have an html file that contains Google Drive links along with a bunch of html code and other content. I'm trying to use RegEx to find the 50 links in the text by searching for the keywords they have in common.

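Examples: drive, google, & sharing

I want to select the start and end of each link, and then either copy them all and paste them into another file, or delete everything else so that only the links remain in the html document.

I tried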

"https://drive.google.com/file/d/1wXbzf0nvddZ0vlz6-fdN7HV/view?usp=sharing"

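I tried drive, which found nothing, but http & www did return results for other links in the file. I haven't tried clicking those links, but at least they showed some results, unlike the specific keywords I listed.

I'm not sure whether this is the right way to approach the problem, or whether I should use something else (such as javascript) to achieve this.

I'm on a Mac, using Sublime Text to try to solve this. I'm new to regular expressions.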

Answer 1

The following should work:

.*drive.google.com.*sharing

. matches any character
* means the preceding character may occur any number of times, including zero
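
To see how these two metacharacters behave in practice, here is a small Python sketch (Python is used only for illustration, and the surrounding HTML in the sample line is made up; only the link comes from the question). Because .* is greedy and an unescaped . matches any character, the pattern above matches from the start of the line through the last "sharing", not just the link itself. The second, tighter pattern is one possible alternative and is an assumption, not part of this answer.

```python
import re

# Hypothetical line of HTML; only the link itself comes from the question.
line = ('<p>See <a href="https://drive.google.com/file/d/'
        '1wXbzf0nvddZ0vlz6-fdN7HV/view?usp=sharing">file</a></p>')

# Pattern from the answer: ".*" is greedy, so the match starts at the
# beginning of the line and runs through the last "sharing".
loose = re.search(r'.*drive.google.com.*sharing', line)

# Tighter variant (an assumption, not from the answer): escape the dots
# and stop at whitespace, double quotes, or angle brackets.
tight = re.search(r'https?://drive\.google\.com/[^\s"<>]*sharing', line)

print(loose.group(0))  # includes the leading '<p>See <a href="'
print(tight.group(0))  # just the link
```

If only the link text is wanted, the tighter form is probably the more useful starting point.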

Answer 2

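It sounds like you are trying to do this in an editor on a Mac, but the question is tagged "perl", so here is one way to do it in Perl.

First, it helps to have a complete sample input and output to make sure we understand the desired behavior, so here is a sample input, test.doc: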

<p>https://drive.google.com/file/d/0B3GNg0pNzNCWWdFSXNzd00/view?usp=sharing</p><br /><p>https://drive.google.com/sharing/oSmNg0pNzRjWEFyNDRzam8/view?usp=sharing<br /></p></div>
<p>http://drive.google.com/file/d/0B3GNg0pNzNCWWdFSXNzd00/view?usp=sharing</p><br/><p>https://drive.google.com/file/sharing/view?usp=sharing<br /></p></div>
https://drive.abc.com/file/d/efg/view?usp=sharing
https://drive.apple.com/file/d/abc/efg/view?usp=sharing
https://drive.google.com/file/d/xyz/skipme?usp=sharing https://drive.google.com/file/d/ef/view?usp=sharing

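I'll assume the links are delimited by whitespace or by the angle brackets <> of *ml tags. Here is a Linux one-liner that reads test.doc and prints the matching links. The [^\s<>]+ part captures one or more characters that are not whitespace (\s) or < or > (the leading [^ makes it a negated character class), which stops the match from running on and swallowing several links on the same line: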

perl -ne '@m = $_ =~ m{(https?://drive\.google\.com/[^\s<>]+view\?usp=sharing)}g; print "$_\n" for @m;' test.doc

This gives the following output:

https://drive.google.com/file/d/0B3GNg0pNzNCWWdFSXNzd00/view?usp=sharing
https://drive.google.com/sharing/oSmNg0pNzRjWEFyNDRzam8/view?usp=sharing
http://drive.google.com/file/d/0B3GNg0pNzNCWWdFSXNzd00/view?usp=sharing
https://drive.google.com/file/sharing/view?usp=sharing
https://drive.google.com/file/d/ef/view?usp=sharing

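If the above doesn't quite do what you need, please provide a different snippet of input/output text, and someone can suggest how to change the one-liner to match it.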
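
For readers who would rather not use Perl, the same pattern can be sketched in Python. This is only an illustrative equivalent of the one-liner above, not part of the original answer; the function name extract_links is hypothetical, and the sample text is a subset of the test.doc lines shown earlier.

```python
import re

# Same pattern as the Perl one-liner above.
LINK_RE = re.compile(r'https?://drive\.google\.com/[^\s<>]+view\?usp=sharing')

def extract_links(text):
    """Return every matching Google Drive link in text, in order."""
    return LINK_RE.findall(text)

# A few lines taken from the sample input test.doc.
sample = (
    '<p>https://drive.google.com/file/d/0B3GNg0pNzNCWWdFSXNzd00/view?usp=sharing</p><br />'
    '<p>https://drive.google.com/sharing/oSmNg0pNzRjWEFyNDRzam8/view?usp=sharing<br /></p></div>\n'
    'https://drive.apple.com/file/d/abc/efg/view?usp=sharing\n'
    'https://drive.google.com/file/d/xyz/skipme?usp=sharing '
    'https://drive.google.com/file/d/ef/view?usp=sharing\n'
)

links = extract_links(sample)
print('\n'.join(links))
```

To keep only the links in a separate file, the list could then be written out, for example with pathlib.Path('links.txt').write_text('\n'.join(links)), where links.txt is a hypothetical file name.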
