Question

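I am trying to preprocess text files for NLP, and as part of that we tag various items such as dates, addresses, and sensitive personal information (SPI). The problem is that some of this information is already masked in the text. For example:

Jan 6, xxxx or (xxx)xxx-1234

My question is whether a regular expression in Python can unmask these so that we can tag them correctly. So I need something like:

Jan 6, 1111 or (111)111-1234

so that they can be tagged as #US_DATE and #PHONE.

I tried the simple, obvious solutions, such as: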

re.sub(r'xx', '11', '(xxx)xxx-1234')
re.sub(r'xx+', '11', 'January 9 xxxx')

but neither gives me the right pattern! Thanks in advance.

Answer 1

One option is to use an alternation to match the different formats you have, and to use re.sub with a callback that replaces every x character in the match with 1.
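To illustrate why a callback helps, here is a small sketch comparing the two attempts from the question with a callback-based substitution. The variable names and the simplified phone pattern are only assumptions for this demonstration, not the pattern proposed in this answer.

```python
import re

# A fixed replacement string consumes two (or more) x characters per match,
# so odd-length runs keep a stray x and long runs are shortened.
naive_phone = re.sub(r'xx', '11', '(xxx)xxx-1234')
naive_date = re.sub(r'xx+', '11', 'January 9 xxxx')
print(naive_phone)  # (11x)11x-1234
print(naive_date)   # January 9 11

# A callback receives the whole match and can swap x for 1 one character
# at a time, which keeps the original length.
def unmask(match):
    return match.group().replace('x', '1')

# Simplified, hypothetical phone pattern used only for this comparison.
fixed = re.sub(r'\(x+\)x+-\d{4}', unmask, '(xxx)xxx-1234')
print(fixed)        # (111)111-1234
```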

For the pattern, I used character classes and quantifiers to specify what is allowed to match; you can update it to make it more specific.


For example:

\b[A-Za-z]{3,} [a-zA-Z\d]{1,2},? [a-zA-Z\d]{4}\b|\([a-zA-Z\d]+\)[a-zA-Z\d]{3}-[a-zA-Z\d]{4}\b
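The pattern above may be easier to read and adjust when it is split up with re.VERBOSE. The sketch below is meant to be the same two alternatives with a comment per piece; the name MASKED and the labels in the comments (month, day, area code, and so on) are my reading of the pattern, not something the regex itself enforces.

```python
import re

# In verbose mode unescaped whitespace is ignored, so each literal
# space is written as [ ].
MASKED = re.compile(r"""
    \b[A-Za-z]{3,}        # month name or abbreviation, e.g. Jan, January
    [ ][a-zA-Z\d]{1,2}    # day: 1-2 letters or digits (covers x masks)
    ,?                    # optional comma
    [ ][a-zA-Z\d]{4}\b    # year: 4 letters or digits
  |
    \([a-zA-Z\d]+\)       # area code in parentheses
    [a-zA-Z\d]{3}         # next three characters
    -[a-zA-Z\d]{4}\b      # last four characters
""", re.VERBOSE)

unmasked = MASKED.sub(lambda m: m.group().replace('x', '1'),
                      'Jan 6, xxxx or (xxx)xxx-1234')
print(unmasked)  # Jan 6, 1111 or (111)111-1234
```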

Example usage with re.sub:

import re

regex = r"\b[A-Za-z]{3,} [a-zA-Z\d]{1,2},? [a-zA-Z\d]{4}\b|\([a-zA-Z\d]+\)[a-zA-Z\d]{3}-[a-zA-Z\d]{4}\b"
test_str = ("Jan 6, xxxx or (xxx)xxx-1234\n"
    "Jan 16, xxxx or (xxx)xxx-1234\n"
    "January 9 xxxx\n"
    "(xxx)xxx-1234")
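# Replace each x with 1, but only inside text matched by the pattern.
result = re.sub(regex, lambda m: m.group().replace('x', '1'), test_str)
print(result)
# Jan 6, 1111 or (111)111-1234
# Jan 16, 1111 or (111)111-1234
# January 9 1111
# (111)111-1234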
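If the end goal is only the tags, a possible variation is to skip the unmasking step and let the pattern accept x wherever a digit is expected. This is a sketch of a different technique from the one above, under the assumption that the mask character is always a lowercase x standing in for a single digit. The names TAGGER and tag are hypothetical; the group names are chosen to match the #US_DATE and #PHONE tags from the question.

```python
import re

# Each alternative is a named group, so the group name can double as the tag.
# Assumption: a masked digit is always written as a lowercase x.
TAGGER = re.compile(r"""
    (?P<US_DATE> \b[A-Za-z]{3,} [ ] [\dx]{1,2} ,? [ ] [\dx]{4} \b )
  | (?P<PHONE>   \( [\dx]{3} \) [\dx]{3} - [\dx]{4} \b )
""", re.VERBOSE)

def tag(text):
    # m.lastgroup is the name of the named group that matched.
    return TAGGER.sub(lambda m: '#' + m.lastgroup, text)

tagged = tag('Call (xxx)xxx-1234 before Jan 6, xxxx')
print(tagged)  # Call #PHONE before #US_DATE
```

Because the classes here are [\dx] rather than [a-zA-Z\d], this version is stricter than the pattern above and will not match masks that use any other character.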
