Question

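I have a JSON file containing metadata for 900 articles. I want to delete everything except the lines that contain URLs, then save the file as .txt. I wrote the following code, but I can't get as far as the saving step: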

import re

# Raw strings keep "\u" in the path and "\w" in the pattern from being read as escapes.
with open(r"path\url_example.json") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        print(urls)

Partial output:

['http://www.google.com.']
['https://www.tutorialspoint.com']

Another problem is that each result is wrapped in [' '] and may end with a ".", which I don't want. My expected result is:

 http://www.google.com
 https://www.tutorialspoint.com

Answer 1

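Without more information about the file you have (txt, JSON?) and the input lines you are looping over, here is a simple attempt that does not use re.findall().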

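with open(r"path\url_example.txt") as handle:
    for line in handle:
        # Skip lines that contain no URL at all.
        if 'http' not in line:
            continue
        spos = line.find('http')
        # The URL is assumed to end at the next space; find() returns -1 if there is none.
        epos = line.find(' ', spos)
        if epos == -1:
            epos = len(line)
        url = line[spos:epos].strip()
        print(url)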

Answer 2

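If you know where in the JSON the URL will be found, you may find it simpler to deserialize the JSON with the json module from the Python standard library and work with dictionaries instead of regular expressions.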
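A minimal sketch of the json approach follows. It assumes the file holds a list of article objects and that each one stores its link under a key named "url". Both the layout and the key name are assumptions, since the question does not show the file, and the sample string and the extract_urls helper are made up for illustration:

```python
import json

# Hypothetical sample standing in for the real file; the "url" key and
# the list-of-objects layout are assumptions, not taken from the question.
sample = (
    '[{"title": "A", "url": "http://www.google.com"},'
    ' {"title": "B", "url": "https://www.tutorialspoint.com"}]'
)

def extract_urls(json_text, key="url"):
    """Return the value stored under `key` for every article that has one."""
    articles = json.loads(json_text)
    return [article[key] for article in articles if key in article]

for url in extract_urls(sample):
    print(url)
```

For a real file, json.load(file) on the open file object would replace json.loads(sample).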

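However, if you do want to use a regex, remember that urls is a list of regex matches. If you know there is definitely only one match per line, just print the first entry and strip the trailing "." from it.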

import re

with open(r"path\url_example.txt") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        # Guard against lines with no match, where urls[0] would raise IndexError.
        if urls:
            print(urls[0].rstrip('.'))

If you expect to see multiple matches per line:

import re

with open(r"path\url_example.txt") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        # Print every match on the line, without a trailing dot.
        for url in urls:
            print(url.rstrip('.'))
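The question also asks about saving the result as .txt, which the loops above do not cover. One possible way to do that is sketched here, assuming UTF-8 files; the function names and the paths passed to save_urls are hypothetical. The pattern is the one from the question, so a match stops at the first character outside its set (a "/" for example), meaning any path after the host name is not captured.

```python
import re

# Same pattern as in the question, compiled once and written as a raw string.
URL_PATTERN = re.compile(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+')

def extract_urls(lines):
    """Collect every URL match from an iterable of lines, minus any trailing dot."""
    found = []
    for line in lines:
        for url in URL_PATTERN.findall(line):
            found.append(url.rstrip('.'))
    return found

def save_urls(src_path, dst_path):
    """Read src_path line by line and write one URL per line to dst_path."""
    # UTF-8 is an assumption; adjust the encoding to match the real file.
    with open(src_path, encoding="utf-8") as src:
        urls = extract_urls(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for url in urls:
            dst.write(url + "\n")
    return urls
```

Calling save_urls(r"path\url_example.json", r"path\urls.txt") would then leave one bare URL per line in the output file, without the [' '] wrapper.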

Remove everything except URLs using Python
