Regex for price doesn't work

Time: 2018-06-25 10:18:12

Tags: python regex

I need a regex that matches any digit followed by a string made up of digits, spaces, dots and commas, followed by "Kč" or "Eur".

The problem is that my regex sometimes fails to find all such strings.

For example, for some input strings that contain a price, it returns nothing instead of the expected matches.

Do you know what is wrong with the regex?

2 answers:

Answer 0 (score: 5)

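Your input string contains a decomposed letter, made up of the base letter c followed by the combining caron \u030C, whereas your regex contains the precomposed letter with the Unicode code point \u010D. The two look identical on screen but are different character sequences, so the regex does not match.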
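The difference is easy to check in Python. A minimal sketch using only the standard unicodedata module; the variable names are illustrative and not taken from the question:

```python
import unicodedata

decomposed = "c\u030C"   # base letter c followed by COMBINING CARON
precomposed = "\u010D"   # LATIN SMALL LETTER C WITH CARON

# Both render as the same glyph but are different code point sequences.
print(len(decomposed), len(precomposed))
print(decomposed == precomposed)

# Normalizing to a common form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

The first comparison is false and the two normalized comparisons are true, which is why a pattern typed with one form can miss text stored in the other.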

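You can use

(\d(?:[., \d]*\d)?)\s*(K(?:c\u030C|\u010D)|Eur)

or, written with the literal characters (the first alternative is c followed by the combining caron, the second is the precomposed č):

(\d[., \d]*)\s*(K(?:č|č)|Eur)

See the regex demo (and the second regex demo) and the Python demo.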

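Pattern details

  • \d - a digit
  • (?:[., \d]*\d)? - an optional occurrence of
    • [., \d]* - zero or more digits, spaces, . or ,
    • \d - a digit
  • \s* - 0 or more whitespace characters
  • (K(?:c\u030C|\u010D)|Eur) - K followed by either c\u030C or \u010D, or the literal value Eur.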
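Put together, the first pattern can be tried against both spellings of Kč. This is only a sketch: PRICE_RE and samples are illustrative names, and the Eur sample string is made up.

```python
import re

# Accept both the decomposed (c + U+030C) and the precomposed (U+010D) form.
PRICE_RE = re.compile(r"(\d(?:[., \d]*\d)?)\s*(K(?:c\u030C|\u010D)|Eur)")

samples = [
    "Letenky od 12 932 Kc\u030C",  # decomposed č
    "Letenky od 12 932 K\u010D",   # precomposed č
    "Cena 1.250,50 Eur",           # made-up sample
]

# One list of (amount, currency) tuples per sample string.
results = [PRICE_RE.findall(s) for s in samples]
for sample, found in zip(samples, results):
    print(repr(sample), "->", found)
```

Because the amount group must end in a digit, the captured number is "12 932" without the trailing space.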

When defining the currency regexes, use CZK = ['Czk','K(?:č|č)'] or CZK = ['Czk', r'K(?:c\u030C|\u010D)']
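How such a list feeds into the final pattern is not shown in the question, so the following is an assumption: the alternatives are joined with | and appended to the amount pattern. The EUR list, the price_re name and the sample text are hypothetical.

```python
import re

# The CZK entry comes from the answer; EUR is a hypothetical companion list.
CZK = ['Czk', r'K(?:c\u030C|\u010D)']
EUR = ['Eur']

# Join every alternative into a single capturing group for the currency.
currency = "|".join(CZK + EUR)
price_re = re.compile(r"(\d(?:[., \d]*\d)?)\s*(" + currency + ")", re.IGNORECASE)

# Made-up sample containing all three spellings.
matches = price_re.findall("od 12 932 Kc\u030C, 99 czk nebo 4,50 EUR")
print(matches)
```

With re.IGNORECASE the same list also covers variants such as "czk" and "EUR".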

Answer 1 (score: 3)

As Wiktor Stribiżew said, the Kč in your regex is not the same as the Kč in your text. You can use the unicodedata module to normalize both:

>>> import re
>>> import unicodedata
>>> pattern = r"((\d[., \d]+)(Kč|Eur))"  # raw string; typed with the precomposed č (U+010D)
>>> text = "Letenky od 12 932 Kc\u030c"  # text uses c + combining caron (U+030C)
>>> re.findall(pattern, text, flags=re.IGNORECASE)
[]
>>> re.findall(unicodedata.normalize("NFD", pattern), unicodedata.normalize("NFD", text), flags=re.IGNORECASE)
[('12 932 Kč', '12 932 ', 'Kč')]
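The same idea can be wrapped in a small helper so that the normalization is not repeated at every call site. This is only a sketch: find_prices is a hypothetical name, and NFC is used here instead of NFD so that matches come back with the precomposed č. Either form works as long as the pattern and the text use the same one.

```python
import re
import unicodedata

# \u010D is the precomposed č, which is the form NFC normalization produces.
_PRICE_RE = re.compile(r"((\d[., \d]+)(K\u010D|Eur))", re.IGNORECASE)

def find_prices(text):
    """Return all price matches after normalizing the text to NFC."""
    return _PRICE_RE.findall(unicodedata.normalize("NFC", text))

# Input with the decomposed letter (c + combining caron).
result = find_prices("Letenky od 12 932 Kc\u030C")
print(result)
```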