Question

I am new to regular expressions.

I need to parse millions of lines to find URLs. I wrote this regex:

https?:\\*?\/*?([A-Z]|[a-z]|[0-9]|[\\ \/.,;è^!@#$%?&*()_\-\+\[\]\.=])*

When I test it on a regex-checking website, it works fine! Here is the text I am parsing:

commandes/2017-07-31.json:"url_private":"https://files.slack.com/files-pri/T0FF4V85AAL-F6H1FF1RS2J/c-20F170731-1.xlsx?t=xoxe-499418036ds0-369634711108-369165794800-19186566d70d354163357sfdsf8337c086be2",
commandes/2017-07-31.json:"url_private_download":"https://files.slack.com/files-pri/T04VDF85AAL-F6DDH11RS2J/download/c-20170731-1.xlsx?t=xoxe-4994180360-369634711108-369165794800-19186566d70d3541633578fds234337c086be2",

import re
import tkinter as tk
from tkinter import filedialog


def raise_above_all(window):
    window.attributes('-topmost', 1)


a = input("press enter to select the file to parse")
if a == "":
    root = tk.Tk()
    root.withdraw()
    raise_above_all(root)
    file_path = filedialog.askopenfilename()

try:
    # Read the whole file into one string; the with-block closes the handle.
    with open(file_path, 'r') as f:
        fileurl = f.read()
except Exception:
    print("An error has occurred, verify that the path exists or that the file extension is parsable")
    exit(1)

Urls = re.findall(r"https?:[\\/]*([a-z])*", fileurl)
print(Urls)

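Going step by step, I noticed that as soon as I put () around the [] classes, it only returns the last character of each match. When I use the full regex, it returns a list of '2', because that is the last character of all my URLs. I don't understand why I only get the last character and not the whole string. Can you help me? To reproduce, save the text above in a .txt file and run the code; a dialog box will open so you can select the file.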
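The behaviour can be reproduced without the file dialog. Below is a minimal sketch, assuming a shortened, hypothetical sample string in place of the real export. It relies on two documented properties of Python's `re` module: when a pattern contains a capturing group, `re.findall` returns the group's contents instead of the whole match, and a group that repeats keeps only its last iteration. Turning the group into a non-capturing one, `(?:...)`, is one possible way to get the full match back.

```python
import re

# Hypothetical shortened sample, not the real data from the question.
text = '"url_private":"https://files.slack.com/files-pri/abc2",'

# Pattern from the question: a capturing group repeated with *.
# findall returns only the group, and the group holds its last iteration.
captured = re.findall(r"https?:[\\/]*([a-z])*", text)

# Same pattern with a non-capturing group: findall returns the whole match.
whole = re.findall(r"https?:[\\/]*(?:[a-z])*", text)

print(captured)  # ['s']
print(whole)     # ['https://files']
```

The match stops at the first `.` because the simplified class `[a-z]` does not include it. With the full character class from the regex above, the last character matched would be the final `2` of each URL, which seems consistent with the output described.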

Python: regex returns only the last character

0 answers: