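I have a lot of replacement patterns that I need for text cleaning. For performance reasons, I load the data from the database and compile the regular expressions up front. Unfortunately, with my approach only the last assignment to the variable "text" seems to take effect, while the earlier ones appear to be overwritten: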
# -*- coding: utf-8 -*-
import cx_Oracle
import re
connection = cx_Oracle.connect("SCHEMA", "passWORD", "TNS")
cursor = connection.cursor()
cursor.execute("""select column_1, column_2
from table""")
# Variables for matching
REPLACE_1 = re.compile(r'(sample_pattern_1)')
REPLACE_2 = re.compile(r'(sample_pattern_2)')
# ..
REPLACE_99 = re.compile(r'(sample_pattern_99)')
REPLACE_100 = re.compile(r'(sample_pattern_100)')
def extract_from_db():
    text = ''
    for row in cursor:
        # sidenote: each substitution text has the same name as the corresponding variable, but as a string of course
        text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
        text = REPLACE_2.sub(r'REPLACE_2',str(row[0]))
        # ..
        text = REPLACE_99.sub(r'REPLACE_99',str(row[0]))
        text = REPLACE_100.sub(r'REPLACE_100',str(row[0]))
        print text
extract_from_db()
Does anyone know how to solve this in an elegant way? Or do I have to handle it with a huge if / elif control structure?
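Answer 0: (score: 7)
You keep overwriting the previous result with the result of a substitution on str(row[0]). Use text instead, so that the substitutions accumulate: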
text = REPLACE_1.sub(r'REPLACE_1', str(row[0]))
text = REPLACE_2.sub(r'REPLACE_2', text)
# ..
text = REPLACE_99.sub(r'REPLACE_99', text)
text = REPLACE_100.sub(r'REPLACE_100', text)
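To see the difference without a database, here is a minimal sketch. The two patterns (`foo`, `bar`) and the hard-coded `row` tuple are made up for illustration and stand in for the real patterns and a row fetched from the cursor.

```python
import re

# Hypothetical patterns standing in for sample_pattern_1 / sample_pattern_2
REPLACE_1 = re.compile(r'(foo)')
REPLACE_2 = re.compile(r'(bar)')

row = ('foo and bar',)  # stand-in for a row fetched from the cursor

# Original approach: every call starts again from str(row[0]),
# so only the last substitution survives.
overwritten = REPLACE_1.sub(r'REPLACE_1', str(row[0]))
overwritten = REPLACE_2.sub(r'REPLACE_2', str(row[0]))

# Fixed approach: feed the previous result back into the next call.
accumulated = REPLACE_1.sub(r'REPLACE_1', str(row[0]))
accumulated = REPLACE_2.sub(r'REPLACE_2', accumulated)

print(overwritten)   # foo and REPLACE_2
print(accumulated)   # REPLACE_1 and REPLACE_2
```

Because each call sees the output of the previous one, the order of the patterns matters: a later pattern can match text that an earlier replacement inserted.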
You'd be better off using an actual list:
REPLACEMENTS = [
    (re.compile(r'(sample_pattern_1)'), r'REPLACE_1'),
    (re.compile(r'(sample_pattern_2)'), r'REPLACE_2'),
    # ..
    (re.compile(r'(sample_pattern_99)'), r'REPLACE_99'),
    (re.compile(r'(sample_pattern_100)'), r'REPLACE_100'),
]
and use them in a loop:
text = str(row[0])
for pattern, replacement in REPLACEMENTS:
    text = pattern.sub(replacement, text)
Or use functools.partial() to simplify the loop further:
from functools import partial
REPLACEMENTS = [
    partial(re.compile(r'(sample_pattern_1)').sub, r'REPLACE_1'),
    partial(re.compile(r'(sample_pattern_2)').sub, r'REPLACE_2'),
    # ..
    partial(re.compile(r'(sample_pattern_99)').sub, r'REPLACE_99'),
    partial(re.compile(r'(sample_pattern_100)').sub, r'REPLACE_100'),
]
and the loop:
text = str(row[0])
for replacement in REPLACEMENTS:
    text = replacement(text)
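Or combine the above list of patterns wrapped in partial() objects with reduce():
from functools import reduce  # built in on Python 2; the import is required on Python 3
text = reduce(lambda txt, repl: repl(txt), REPLACEMENTS, str(row[0]))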
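As a quick check that the loop and the reduce() form are equivalent, here is a small self-contained sketch. The patterns `foo` and `bar`, the helper names `clean_loop` and `clean_reduce`, and the sample string are all assumptions made for illustration.

```python
import re
from functools import partial, reduce  # reduce is a builtin only on Python 2

# Hypothetical patterns; each entry is a callable that takes the text
# and returns it with one substitution applied.
REPLACEMENTS = [
    partial(re.compile(r'(foo)').sub, r'REPLACE_1'),
    partial(re.compile(r'(bar)').sub, r'REPLACE_2'),
]

def clean_loop(value):
    """Apply every replacement in order with an explicit loop."""
    text = str(value)
    for replacement in REPLACEMENTS:
        text = replacement(text)
    return text

def clean_reduce(value):
    """Same chain of substitutions, folded with reduce()."""
    return reduce(lambda txt, repl: repl(txt), REPLACEMENTS, str(value))

sample = 'foo, bar, baz'
print(clean_loop(sample))    # REPLACE_1, REPLACE_2, baz
print(clean_reduce(sample))  # REPLACE_1, REPLACE_2, baz
```

Both forms do the same work; the explicit loop is usually easier to read and debug, while reduce() is mainly a matter of taste.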
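Answer 1: (score: 1)
Your approach is fine; however, on every line you are applying the regular expression to the original string. You need to apply it to the result of the previous line, i.e.: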
def extract_from_db():
    text = ''
    for row in cursor:
        # sidenote: each substitution text has the same name as the corresponding variable, but as a string of course
        # This one stays the same - initialize from the row
        text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
        # For these, route text back into it
        text = REPLACE_2.sub(r'REPLACE_2',text)
        # ..
        text = REPLACE_99.sub(r'REPLACE_99',text)
        text = REPLACE_100.sub(r'REPLACE_100',text)
        print text
Answer 2: (score: 1)
It looks like what you need is:
text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
text = REPLACE_2.sub(r'REPLACE_2',text)
# ..
text = REPLACE_99.sub(r'REPLACE_99',text)
text = REPLACE_100.sub(r'REPLACE_100',text)
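Answer 3: (score: 1)
May I suggest building a list of patterns and their replacement values and then iterating over it? That way you don't have to modify the function every time you want to update the patterns: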
import cx_Oracle
import re
connection = cx_Oracle.connect("SCHEMA", "passWORD", "TNS")
cursor = connection.cursor()
cursor.execute("""select column_1, column_2
from table""")
REPLACEMENTS = [
    (re.compile(r'(sample_pattern_1)'), 'REPLACE_1'),
    (re.compile(r'(sample_pattern_2)'), 'REPLACE_2'),
    # ..
    (re.compile(r'(sample_pattern_99)'), 'REPLACE_99'),
    (re.compile(r'(sample_pattern_100)'), 'REPLACE_100'),
]
def extract_from_db():
    for row in cursor:
        text = str(row[0])
        for pattern, replacement in REPLACEMENTS:
            text = pattern.sub(replacement, text)
        print text
extract_from_db()
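One possible variation on the function above, sketched under some assumptions: the cleaning step takes any iterable of rows and yields the cleaned strings instead of printing them, so it can be tried without a database connection. The name `clean_rows`, the patterns `foo` and `bar`, and the `fake_rows` list are hypothetical and only there to make the example runnable.

```python
import re

# Hypothetical patterns standing in for the real list of 100
REPLACEMENTS = [
    (re.compile(r'(foo)'), 'REPLACE_1'),
    (re.compile(r'(bar)'), 'REPLACE_2'),
]

def clean_rows(rows, replacements=REPLACEMENTS):
    """Yield the cleaned first column of each row.

    `rows` can be a database cursor or any iterable of tuples.
    """
    for row in rows:
        text = str(row[0])
        for pattern, replacement in replacements:
            text = pattern.sub(replacement, text)
        yield text

# Stand-in for the rows a cursor would return
fake_rows = [('foo one', 1), ('bar two', 2), ('nothing', 3)]
cleaned = list(clean_rows(fake_rows))
print(cleaned)  # ['REPLACE_1 one', 'REPLACE_2 two', 'nothing']
```

With a real connection, passing the cursor itself (for example `clean_rows(cursor)`) should work the same way, since the function only iterates over it.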