How do I extract multiple patterns from a string using Python regular expressions?

Time: 2019-03-14 09:50:34

Tags: python regex

https://epolicy.companyname.co.in/PRODUCTNAME/UI/PremiumCalculation.aspx?utm_source=rtb&utm_medium=display&utm_campaign=dbmew-Category-pros&dclid=CO2g3u7Gy98CFUOgaAodUv4E0w

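I have millions of URLs like this one, and I want to extract two things from each of them.

  1. PRODUCTNAME: always comes right after https://epolicy.companyname.co.in/

  2. *.aspx: the page that was visited

I tried the following regular expression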

re.findall('([a-zA-Z]+\.aspx | https://epolicy\.companyname\.co\.in/(.*?)/UI)', URL)

and a few variations of it, but none of them worked. What is the correct way to do this?

2 answers:

Answer 0: (score: 0)

Try this!

Code:

import re

url = "https://epolicy.companyname.co.in/PRODUCTNAME/UI/PremiumCalculation.aspx?utm_source=rtb&utm_medium=display&utm_campaign=dbmew-Category-pros&dclid=CO2g3u7Gy98CFUOgaAodUv4E0w"
# Group 1 is the path segment before /UI/; group 2 is the page name without ".aspx".
print(re.findall(r'https://[^/]*/(.*)/UI/(.*)\.aspx', url))
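
For what it's worth, the attempt in the question likely fails because the spaces around `|` are literal characters inside the pattern, so neither alternative can match this URL. Below is a minimal sketch of a variant that may suit millions of URLs better: the pattern is compiled once and uses named groups. The names `URL_PATTERN` and `extract` are made up for illustration, and the sketch assumes every URL has the `/PRODUCTNAME/UI/page.aspx` layout shown in the question.

```python
import re

# Hypothetical names; the host and the /UI/ segment are taken from the question.
URL_PATTERN = re.compile(
    r"https://epolicy\.companyname\.co\.in/"
    r"(?P<product>[^/]+)/UI/(?P<page>[^/?#]+)\.aspx"
)


def extract(url):
    """Return (product, page), or None if the URL does not fit the layout."""
    match = URL_PATTERN.search(url)
    if match is None:
        return None
    return match.group("product"), match.group("page")


url = "https://epolicy.companyname.co.in/PRODUCTNAME/UI/PremiumCalculation.aspx?utm_source=rtb"
print(extract(url))
```

Returning `None` for non-matching URLs makes it easy to count or log the ones that do not follow the expected layout.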

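Answer 1: (score: -1)

A regular expression doesn't seem like the right tool here at all. Instead, parse the URL, split the path, and take the first and last elements.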

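from urllib.parse import urlparse
from pathlib import PurePosixPath

# url is the URL string from the question
components = urlparse(url)
path = PurePosixPath(components.path)  # URL paths always use forward slashes
product_name = path.parts[1]  # parts[0] is the root "/"
page = path.stem              # file name without the ".aspx" suffix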
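
The same idea can be wrapped in a small helper so it can be applied to a whole list of URLs. This is only a sketch: the function name `product_and_page` is made up, and the checks assume that a usable URL has at least a product segment and a final `.aspx` page, as in the question.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def product_and_page(url):
    """Return (product_name, page) parsed from the URL path, or None."""
    path = PurePosixPath(urlsplit(url).path)
    # For an absolute path, parts[0] is "/", so we need at least the root,
    # the product name, and a final page component.
    if len(path.parts) < 3 or path.suffix.lower() != ".aspx":
        return None
    return path.parts[1], path.stem


url = "https://epolicy.companyname.co.in/PRODUCTNAME/UI/PremiumCalculation.aspx?utm_source=rtb"
print(product_and_page(url))
```

Because the query string is split off by `urlsplit` before the path is examined, the `utm_*` parameters never interfere with the result.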