这是我数据框中pyspark列(字符串)的一个小例子。
column | new_column
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Hoy es día de ABC/KE98789T983456 clase. | 98789
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Como ABC/KE 34562Z845673 todas las mañanas | 34562
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Hoy tiene ABC/KE 110330/L63868 clase de matemáticas, | 110330
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Marcos se ABC 898456/L56784 levanta con sueño. | 898456
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Marcos se ABC898456 levanta con sueño. | 898456
------------------------------------------------------------------------------------------------- |--------------------------------------------------
comienza ABC - KE 60014 -T60058 | 60014
------------------------------------------------------------------------------------------------- |--------------------------------------------------
inglés y FOR 102658/L61144 ciencia. Se viste, desayuna | 102658
------------------------------------------------------------------------------------------------- |--------------------------------------------------
y comienza FOR ABC- 72981 / KE T79581: el camino hacia la | 72981
------------------------------------------------------------------------------------------------- |--------------------------------------------------
escuela. Se FOR ABC 101665 - 103035 - 101926 - 105484 - 103036 - 103247 - encuentra con su | [101665,103035,101926,105484,103036,103247]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
escuela ABCS 206048/206049/206050/206051/205225-FG-matemáticas- | [206048,206049,206050,206051,205225]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
encuentra ABCS 111553/L00847 & 111558/L00895 - matemáticas | [111553, 111558]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
ciencia ABC 163278/P20447 AND RETROFIT ABCS 164567/P21000 - 164568/P21001 - desayuna | [163278,164567,164568 ]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
ABC/KE 71729/T81672 - 71781/T81674 71782/T81676 71730/T81673 71783/T81677 71784/T | [71729,71781,71782,71730,71783,71784]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
ciencia ABC/KE2646/L61175:E/F-levanta con sueño L61/62LAV AT Z5CTR/XC D3-1593 | [2646]
-----------------------------------------------------------------------------------------------------------------------------------------------------
escuela ABCS 6048/206049/6050/206051/205225-FG-matemáticas- MSN 2345 | [6048,206049,6050,206051,205225]
-----------------------------------------------------------------------------------------------------------------------------------------------------
FOR ABC/KE 109038_L35674_DEFINE AND DESIGN IMPROVEMENTS OF 1618 FROM 118(PDS4 BRACKETS) | [109038]
-----------------------------------------------------------------------------------------------------------------------------------------------------
y comienza FOR ABC- 2981 / KE T79581: el camino hacia la 9856 | [2981]
我想从此文本中提取所有包含4、5或6位数字的数字。 提取它们的条件和案例:
- Attached to ABC/KE (first line in the example above).
- after ABC/KE + space (second and third line).
- after ABC + space (line 4)
- after ABC without space (line 5)
- after ABC - KE + space
- after for word
- after ABC- + space
- after ABC + space
- after ABCS (line 10 and 11)
失败案例示例:
Column | new_column
------------------------------------------------------------------------------------------------------------------------
FOR ABC/KE 109038_L35674_DEFINE AND DESIGN IMPROVEMENTS OF 1618 FROM 118(PDS4 BRACKETS) | [1618] ==> should be [109038]
------------------------------------------------------------------------------------------------------------------------
ciencia ABC/KE2646/L61175:E/F-levanta con sueño L61/62LAV AT Z5CTR/XC D3-1593 | [1593] ==> should be [2646]
------------------------------------------------------------------------------------------------------------------------
escuela ABCS 6048/206049/6050/206051/205225-FG-matemáticas- MSN 2345 | [6048,206049,6050,206051,205225, 2345] ==> should be [6048,206049,6050,206051,205225]
我希望我能恢复案件,您可以在上面看到我的示例以及期望的输出。 我该怎么做 ? 谢谢
答案 0 :(得分:3)
一种使用正则表达式清除数据并设置值为ABC
的单独锚来标识潜在匹配的开始的方法。在str.split()之后,遍历结果数组以进行标记并检索此锚点之后的连续匹配数字。
编辑:在数据模式(_
)中添加了下划线\b(\d{4,6})(?=[A-Z/_]|$)
,以使下划线作为锚点可以跟随匹配的4到6位数的子字符串。这样就固定了第一行,第2行和第3行应该与现有的regex模式一起使用。
import re
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf
(1)使用正则表达式模式清除原始数据,以便我们只有一个锚点ABC
来标识潜在匹配的开始:
clean1 :使用[-&\s]+
将'&','-'和空格转换为空格' '
,它们用于连接数字链
example: `ABC - KE` --> `ABC KE`
`103035 - 101926 - 105484` -> `103035 101926 105484`
`111553/L00847 & 111558/L00895` -> `111553/L00847 111558/L00895`
clean2 :将与以下三个子模式匹配的文本转换为'ABC '
+ ABCS?(?:[/\s]+KE|(?=\s*\d))
+ ABC followed by an optional `S`
+ followed by at least one slash or whitespace and then `KE` --> `[/\s]+KE`
example: `ABC/KE 110330/L63868` to `ABC 110330/L63868`
+ or followed by optional whitespaces and then at least one digit --> (?=\s*\d)
example: ABC898456 -> `ABC 898456`
+ \bFOR\s+(?:[A-Z]+\s+)*
+ `FOR` words
example: `FOR DEF HJK 12345` -> `ABC 12345`
数据: \ b(\ d {4,6})(?= [AZ / _] | $)是与实际匹配的正则表达式数字:4-6位数字,后跟[AZ /]或end_of_string
(2)创建一个字典以保存所有3种模式:
ptns = {
'clean1': re.compile(r'[-&\s]+', re.UNICODE)
, 'clean2': re.compile(r'\bABCS?(?:[/\s-]+KE|(?=\s*\d))|\bFOR\s+(?:[A-Z]+\s+)*', re.UNICODE)
, 'data' : re.compile(r'\b(\d{4,6})(?=[A-Z/_]|$)', re.UNICODE)
}
(3)创建一个函数以查找匹配的数字并将其保存到数组中
def find_number(s_t_r, ptns, is_debug=0):
try:
arr = re.sub(ptns['clean2'], 'ABC ', re.sub(ptns['clean1'], ' ', s_t_r.upper())).split()
if is_debug: return arr
# f: flag to identify if a chain of matches is started, default is 0(false)
f = 0
new_arr = []
# iterate through the above arr and start checking numbers when anchor is detected and set f=1
for x in arr:
if x == 'ABC':
f = 1
elif f:
new = re.findall(ptns['data'], x)
# if find any matches, else reset the flag
if new:
new_arr.extend(new)
else:
f = 0
return new_arr
except Exception as e:
# only use print in local debugging
print('ERROR:{}:\n [{}]\n'.format(s_t_r, e))
return []
(4)定义udf函数
udf_find_number = udf(lambda x: find_number(x, ptns), ArrayType(StringType()))
(5)获得new_column
df.withColumn('new_column', udf_find_number('column')).show(truncate=False)
+------------------------------------------------------------------------------------------+------------------------------------------------+
|column |new_column |
+------------------------------------------------------------------------------------------+------------------------------------------------+
|Hoy es da de ABC/KE98789T983456 clase. |[98789] |
|Como ABC/KE 34562Z845673 todas las ma?anas |[34562] |
|Hoy tiene ABC/KE 110330/L63868 clase de matem篓垄ticas, |[110330] |
|Marcos se ABC 898456/L56784 levanta con sue?o. |[898456] |
|Marcos se ABC898456 levanta con sue?o. |[898456] |
|comienza ABC - KE 60014 -T60058 |[60014] |
|ingl篓娄s y FOR 102658/L61144 ciencia. Se viste, desayuna |[102658] |
|y comienza FOR ABC- 72981 / KE T79581: el camino hacia la |[72981] |
|escuela. Se FOR ABC 101665 - 103035 - 101926 - 105484 - 103036 - 103247 - encuentra con su|[101665, 103035, 101926, 105484, 103036, 103247]|
|escuela ABCS 206048/206049/206050/206051/205225-FG-matem篓垄ticas- |[206048, 206049, 206050, 206051, 205225] |
|encuentra ABCS 111553/L00847 & 111558/L00895 - matem篓垄ticas |[111553, 111558] |
|ciencia ABC 163278/P20447 AND RETROFIT ABCS 164567/P21000 - 164568/P21001 - desayuna |[163278, 164567, 164568] |
|ABC/KE 71729/T81672 - 71781/T81674 71782/T81676 71730/T81673 71783/T81677 71784/T |[71729, 71781, 71782, 71730, 71783, 71784] |
+------------------------------------------------------------------------------------------+------------------------------------------------+
(6)用于调试的代码,使用find_number(row.column, ptns, 1)
检查前两个正则表达式模式如何/是否按预期工作:
for row in df.limit(10).collect():
print('{}:\n {}\n'.format(row.column, find_number(row.column, ptns)))
一些注意事项:
以clean2
模式,ABCS和ABS的处理方式相同。如果它们不同,则只需删除'S'并在模式末尾添加新的替代项ABCS\s*(?=\d)
re.compile(r'\bABC(?:[/\s-]+KE|(?=\s*\d))|\bFOR\s+(?:[A-Z]+\s+)*|ABCS\s*(?=\d)')
当前模式clean1
仅将'-','&'和空格视为连续的连接符,您可以添加更多字符或单词,例如'and','or',例如:
re.compile(r'[-&\s]+|\b(?:AND|OR)\b')
FOR words
是\ bFOR \ s +(?:[A-Z] + \ s +)*,可能会根据单词中是否允许数字等进行调整。
这已在Python-3上进行了测试。使用Python-2,Unicode可能会出现问题,您可以使用reference