How do I extract URLs from a huge one-line text file?

Time: 2019-04-12 13:32:58

Tags: url text copy notepad++ line

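I have a text file that I want to extract links from.

The problem is that the file is a single line, and it contains a lot of links!

Or rather, when I open it in Notepad it wraps across many lines, but without any structure.

Sample text: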

  

[{"participants": ["minanageh379", "xcsadc"], "conversation": [{"sender": "minanageh379", "created_at": "2019-04-12T12:51:56.560361+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:51.923138+00:00", "text": "sd"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:41.689524+00:00", "text": "sdsa"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:50:57.283147+00:00", "text": "‍❤️‍‍"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:35.248517+00:00", "text": "czx"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:34.352752+00:00", "text": "dsad"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:30.889023+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:38:54.823472+00:00", "text": "hi hi hi"}]}]

Expected result

  

https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2

     

https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2

Update

{"sender": "ncccy", "created_at": "2019-01-28T17:09:29.216184+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/57c43d748xcasc1abf58c890c5a6df042/5D199AE8/t51.2885-15/e15/p480x480/49913269_2181952555454636_8892094125900591548_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg1NjgsdasdAwNjgxNTk1OTY0OTIwMTA1NTMzNDQ%3D.2"},

5 answers:

Answer 0 (score: 0)

  • Ctrl + H
  • Find what: (?:^|\G).*?"media": "(https://[^"]+)(?:(?!https:).)*
  • Replace with: $1\n
  • Check Wrap around
  • Check Regular expression
  • Uncheck . matches newline
  • Replace all

Explanation:

(?:^|\G)            # beginning of line OR restart from last match position
.*?                 # 0 or more any character but newline, not greedy
"media": "          # literally
(                   # start group 1
  https://[^"]+     # https:// followed by 1 or more non-double-quote characters, the URL
)                   # end group 1
(?:(?!https:).)*    # tempered greedy token: consume characters as long as "https:" does not start here

Replacement:

$1         # content of group 1, the URL

Result for the given example

https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2
https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2
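
Outside Notepad++, the same idea (capture whatever follows "media": " up to the next double quote) can be sketched in JavaScript. This is only a minimal sketch and not part of the original answer: the function name extractMediaUrls and the shortened sample string with example.com URLs are assumptions made for illustration.

```javascript
// Hypothetical helper: collects every URL that follows a "media" key.
// Assumes the text uses straight double quotes, as real JSON does.
function extractMediaUrls(text) {
    // Group 1 is the URL: https:// followed by one or more non-quote characters.
    const pattern = /"media":\s*"(https:\/\/[^"]+)"/g;
    return Array.from(text.matchAll(pattern), m => m[1]);
}

// Shortened stand-in for the real file contents (the URLs are made up).
const mediaSample = '[{"conversation": [' +
    '{"sender": "a", "media": "https://example.com/1.jpg?x=1"}, ' +
    '{"sender": "b", "text": "sd"}, ' +
    '{"sender": "b", "media": "https://example.com/2.jpg"}]}]';

// One URL per line, like the result shown above.
console.log(extractMediaUrls(mediaSample).join("\n"));
```

Unlike the Replace All approach, this leaves the input untouched and simply returns the captured URLs as an array.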


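Answer 1 (score: 0)

Try this:

First, we remove every character that cannot be part of a valid URL, apart from double quotes and whitespace. This also strips emojis, which in some cases seem to cause trouble with the Boost regex engine in Notepad++.

Our first replacement is:

Search: [^a-zA-Z0-9_\-.~:\/?#\[\]@!$&'()*+,;=%"\s]

Replace with: (leave empty)

Replace all

(This step may not be needed in future versions of Notepad++.)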
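
To see what this cleanup step does to a string, here is a small JavaScript sketch that applies the same character class. It is an illustration only: the helper name stripNonUrlChars and the sample string are assumptions, not something from the answer.

```javascript
// Hypothetical helper mirroring the cleanup step: delete every character that
// is not a URL character, a double quote, or whitespace.
function stripNonUrlChars(text) {
    return text.replace(/[^a-zA-Z0-9_\-.~:\/?#\[\]@!$&'()*+,;=%"\s]/g, "");
}

// Made-up sample containing an emoji (U+2764 U+FE0F) and curly braces.
const dirty = '{"text": "hi \u2764\uFE0F {check} https://example.com/a?b=c"}';

console.log(stripNonUrlChars(dirty));
```

Note that curly braces are not in the allowed set, so they are removed along with the emoji.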

After the cleanup, we run the following replacement:

Search: (?i)(?:(?:(?!https?:).(?!https?:))*?"sender"\s*+:\s*+"([^"]*)"|\G)(?:.(?!"sender"\s*+:\s*+))*?(https?:.*?(?=[^a-zA-Z0-9_\-.~:\/?#\[\]@!$&'()*+,;=%]|https?:))|.*

Replace with: (?{1}\n\n\1\t\2:(?{2}\t\2)

Replace all

This should work even for "text" attributes that contain several URLs. Those URLs will be separated by tabs.

After applying the preceding steps to this data:

[{"participants": ["minanageh379", "xcsadc"], "conversation": [{"sender": "minanageh379", "created_at": "2019-04-12T12:51:56.560361+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2   http://foo.barhttps://bar.foo"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:51.923138+00:00", "text": "sd"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:41.689524+00:00", "text": "sdsa"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:50:57.283147+00:00", "text": "‍❤️‍‍"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:35.248517+00:00", "text": "czx"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:34.352752+00:00", "text": "dsad"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:30.889023+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:38:54.823472+00:00", "text": "hi hi hi"}, {"sender": "no_media_no_text", "created_at": "2019-04-12T12:38:54.823472+00:00"}, {"sender": "url_inside_text", "created_at": "2019-04-12T12:38:54.823472+00:00", "text": "Hi! 
{check} this url: \"http://foo.bar\" another url: https://new.url.com/ yet another one: https://google.com/"}, {"sender": "ncccy", "created_at": "2019-01-28T17:09:29.216184+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/57c43d748xcasc1abf58c890c5a6df042/5D199AE8/t51.2885-15/e15/p480x480/49913269_2181952555454636_8892094125900591548_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg1NjgsdasdAwNjgxNTk1OTY0OTIwMTA1NTMzNDQ%3D.2"}, {"sender": "ny", "created_at": "2017-10-22T20:49:50.042588+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/19d94ea45c2102a0f7c97838ef546b93/5D14B3C3/t51.2885-15/e15/22708873_149637425772501_5029503881546039296_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjc4MzA3MDIyMTI3NDE3Njc3NTQxNTM1NTI2MjQyMjIyMDg%3D.2"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:35.248517+00:00", "text": "czx"}]}]

we get:

minanageh379    https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2    http://foo.bar  https://bar.foo

xcsadc  https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2

url_inside_text http://foo.bar  https://new.url.com/    https://google.com/

ncccy   https://scontent-lax3-1.cdninstagram.com/vp/57c43d748xcasc1abf58c890c5a6df042/5D199AE8/t51.2885-15/e15/p480x480/49913269_2181952555454636_8892094125900591548_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg1NjgsdasdAwNjgxNTk1OTY0OTIwMTA1NTMzNDQ%3D.2

ny  https://scontent-lax3-1.cdninstagram.com/vp/19d94ea45c2102a0f7c97838ef546b93/5D14B3C3/t51.2885-15/e15/22708873_149637425772501_5029503881546039296_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjc4MzA3MDIyMTI3NDE3Njc3NTQxNTM1NTI2MjQyMjIyMDg%3D.2

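Duplicate URLs may appear in the output if a URL is repeated in the original input (under the same or a different attribute).

After processing, you can remove the duplicates with this regex:

Search: (?i)\t(https?:\S++)(?=[^\n]+\1)

Replace with: (nothing)

Replace all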
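
If the tab-separated output is post-processed in a program instead, the same deduplication can be sketched in JavaScript with a Set. This is an assumption-laden sketch: dedupeLine and the sample row are hypothetical, and it keeps the first occurrence of each URL, whereas the regex above removes the earlier copies and keeps the last one.

```javascript
// Hypothetical helper: each line is "sender<TAB>url<TAB>url...".
// Keeps the first occurrence of each URL on the line and drops later repeats.
function dedupeLine(line) {
    const [sender, ...urls] = line.split("\t");
    // A Set preserves insertion order, so the URL order stays stable.
    return [sender, ...new Set(urls)].join("\t");
}

// Made-up row with one repeated URL.
const row = "alice\thttps://example.com/a\thttps://example.com/b\thttps://example.com/a";

console.log(dedupeLine(row));
```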

Answer 2 (score: 0)

To extract only the links from the text file, run a regex "Replace All" with the following:

Find what:

.*?(https?:[^"]+)(?(?!.*?https?:).*)

Replace with:

$1\n\n

Demo 1

Note that you need to check Wrap around if the insertion point is not at the start of the text.

Explanation:

.*?(https?:[^"]+)(?(?!.*?https?:).*)
|_||____________||_________________|
 |    ____|               |
 |   |    ________________|
 |   |   |
 |   |  [3] If there are no more following links, grab and discard the rest of the text
 |  [2] Store the link in $1 (starting with http and ending just before the first following")
[1] Grab and discard everything up 'til the first link (i.e. starting with http: or https:)

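With "Replace All", the search and replace continues automatically until the regex can no longer match. Each new search starts where the previous match ended, which in this case is just before the double quote that closes the current link if more links follow, or at the end of the text otherwise.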
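
The "resume where the last match ended" behaviour can be observed in JavaScript, where a global regex exposes that position as lastIndex. The sketch below uses only the link part of the pattern, https?:[^"]+, on a made-up string; the variable names are assumptions for illustration, and it models the scanning loop rather than Notepad++ itself.

```javascript
// Simplified link pattern from the answer: http: or https: up to the next quote.
const linkPattern = /https?:[^"]+/g;
const scanText = '{"media": "https://example.com/1"}, {"text": "x"}, {"media": "http://example.com/2"}';

const found = [];
let match;
// With the g flag, exec() resumes from linkPattern.lastIndex, i.e. from where
// the previous match ended, and returns null once nothing more matches.
while ((match = linkPattern.exec(scanText)) !== null) {
    found.push({ url: match[0], resumeAt: linkPattern.lastIndex });
}

console.log(found);
```

After each match, resumeAt points at the double quote that closes the link, which is the position the paragraph above describes.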



To extract the sender as well, use the following:

Find what:

.*?\{(?:([^"]*)"){4}[^{}]*?(https?:[^"]+)(?(?!.*?https?:).*)

Replace with:

$1 $2\n\n

Demo 2

Explanation:

Coming tomorrow.


Here is an alternative regex that does the same thing but is perhaps a little clearer:

.*?"sender": "([^"]*)[^}]*?(https?:[^"]+)(?(?!.*?https?:).*)

Demo 3

Explanation:

.*?"sender": "([^"]*)[^}]*?(https?:[^"]+)(?(?!.*?https?:).*)
|_||_________||_____||____||____________||_________________|
 |   ___|  ______|  ___|  _______|  _____________|
 |  |   __|  ______|  ___|  _______|
 |  |  |   _|   _____|  ___|
 |  |  |  |   _|  _____|
 |  |  |  |  |   |
 |  |  |  |  |  [6] If there are no more following links, grab and discard the rest of the text
 |  |  |  | [5] Store the link in $2 (starting with http and ending just before the first following")
 |  |  | [4] Grab and discard everything within the current set of braces up 'til the link
 |  | [3] Store the sender name in $1 
 | [2] Grab and discard "sender": " (i.e. up to the opening quote of the sender name)
[1] Grab and discard everything up 'til the first "sender" key which has an associated link

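Step [1] starts at the beginning of the text and grabs everything up to the first "sender" key. Step [2] then grabs the key, [3] grabs the sender name, and [4] grabs everything up to the associated link, if there is one. If there is no associated link, [5] fails and the regex backtracks to step [1], which goes on to grab everything from the first "sender" key up to the second. This loop repeats until a "sender" key with an associated link is found.

At that point step [5] succeeds, and step [6] grabs either the rest of the text or nothing.

Finally, all of the grabbed text is replaced with $1 $2\n\n, i.e. the sender name, a space, the link, and then two newlines.

That completes the first replacement. Because Replace All was selected, the whole process starts again, but this time the text pointer is just before the double quote that closes the previously found link, or at the end of the text, rather than at the start.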
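
For comparison, the sender-plus-link pattern can be adapted to JavaScript. This is a sketch under stated assumptions rather than a port of the answer: JavaScript regular expressions have no conditional groups, so the trailing (?(?!...)...) part is dropped and the matches are collected instead of replaced. The name extractSenderLinks and the sample data are hypothetical.

```javascript
// Adapted from the "sender" regex above; the conditional group is dropped
// because JavaScript regular expressions do not support (?(cond)...).
function extractSenderLinks(text) {
    // Group 1: sender name. [^}]*? stays inside the current pair of braces.
    // Group 2: the link, up to the next double quote.
    const pattern = /"sender": "([^"]*)[^}]*?(https?:[^"]+)/g;
    return Array.from(text.matchAll(pattern), m => m[1] + " " + m[2]);
}

// Made-up data: sender "b" has no link and should be skipped.
const chat = '[{"sender": "a", "created_at": "t1", "media": "https://example.com/1"}, ' +
    '{"sender": "b", "created_at": "t2", "text": "sd"}, ' +
    '{"sender": "c", "created_at": "t3", "media": "https://example.com/2"}]';

console.log(extractSenderLinks(chat).join("\n\n"));
```

Because [^}]*? cannot cross a closing brace, a sender without a link fails to match and the search moves on to the next "sender" key, mirroring the backtracking loop described above.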

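Answer 3 (score: 0)

While the other answers do exactly what you need, one thing to note is that the string you have given is valid JSON. You can verify this with an online JSON validator.

If you are processing this string in a program, you may want to consider using a JSON parser for your language, for example the json module for Python.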
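
As a rough illustration of the parser route, here is a JavaScript sketch using the built-in JSON.parse. It assumes the file has the shape shown in the question (a top-level array of objects, each with a "conversation" array); the function name mediaUrlsFromJson and the sample data are hypothetical.

```javascript
// Hypothetical helper: parse the export and collect every "media" value.
// Assumes the top level is an array of objects with a "conversation" array.
function mediaUrlsFromJson(jsonText) {
    const threads = JSON.parse(jsonText);
    return threads.flatMap(thread =>
        thread.conversation
            // Messages without a media field (plain text messages) are skipped.
            .filter(message => typeof message.media === "string")
            .map(message => message.media));
}

// Made-up stand-in for the file contents.
const exportText = JSON.stringify([{
    participants: ["a", "b"],
    conversation: [
        { sender: "a", created_at: "t1", media: "https://example.com/1.jpg" },
        { sender: "b", created_at: "t2", text: "hi hi hi" },
    ],
}]);

console.log(mediaUrlsFromJson(exportText));
```

A parser handles escaped quotes and unusual characters inside strings, which the regex approaches have to work around.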

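Answer 4 (score: 0)

Another alternative is to parse the JSON data.

You can do this with JavaScript.

The following snippet should work for parsing the data. It should even handle several URLs in the same text message: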

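// yourJSON is the parsed data (the array from the file, not the raw string)
yourJSON[0].conversation
    // keep messages that have media, or text that contains a URL
    .filter(x => x.media !== undefined || (x.text !== undefined && /https?:/i.test(x.text)))
    .map(x => {
        // join only the fields that exist, so "undefined" never enters the string
        const tmp = [x.text, x.media].filter(Boolean).join(' ');
        // fall back to an empty list if no URL is found
        const urls = tmp.match(/https?:[\w\-.~:\/?#\[\]@!$&'()*+,;=%]*/gi) || [];
        return x.sender + ":\n" + urls.join("\n");
    })
    .join("\n\n");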

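You can paste that JavaScript (with yourJSON replaced by your data) into the console of a browser such as Firefox or Chrome. In Firefox you can open the console with Ctrl + Shift + K; in Chrome, press Ctrl + Shift + I and then click "Console".

Alternatively, you can use this jsfiddle.

Edit the JavaScript pane to use your data, then press the "Run" button.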