Question

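I am trying to process some data. Specifically, I have to:

Remove the decimal part of every number in the file, e.g. 4.0 -> 4
Add a dash between each date and its time, e.g. 2014-01-01 23:45:52 -> 2014-01-01-23:45:52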

I wrote some regular expressions using the find and replace feature in Sublime Text:

Find: "\.\d", Replace: ""
Find: "(\d{2})\s(\d)", Replace: "$1-$2"

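These work well and give me the correct result. The problem is that I have to process hundreds of CSV files this way. I tried doing it in Python, but it does not behave the way I expect. Here is the code I used: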

for file in csv_list: # csv_list is the list of all the files I need to process
    with open(file, "r") as infile:
        with open("{}EDIT.csv".format(file.split(".")[0]), "w", newline="") as outfile: # Save the processed version
            writer = csv.writer(outfile, delimiter=",")
            reader = csv.reader(infile)
            for line in reader:
                writer.writerow([re.sub("(\d{2})\s(\d)",
                                "$1-$2", re.sub("\.\d", "", string)) for string in line])

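I am not very confident with regular expressions, so I do not understand why this does not do what I expect. If anyone could help me, that would be great. Thanks in advance!

As requested, here is an input line, the output I expect, and the actual output: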

input : 0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active
desired output : 0,2013-01-01-20:59:39,5737,english,2013-01-01-21:01:07,active
actual output : 0, 2013-01-$1-$20:59:39,5737,english,2013-01-$1-$21:01:07

Answer 1

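You can fix the problem by using r"\1-\2" as the replacement string for the date/time pattern. Python's re.sub does not understand the $1 syntax and inserts it literally: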

import re
rx = r"(\d{2})\s(\d)"
s = "0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active"
# Remove the decimal parts first, then join each date and time with a dash.
result = re.sub(rx, r"\1-\2", re.sub(r"\.\d", "", s))
print(result)

See the Python demo. See the re.sub reference:

Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.
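
To make the difference concrete, here is a small sketch (the variable names are illustrative and not from the question). It shows that re.sub treats "$1-$2" as plain text, while r"\1-\2" and the equivalent r"\g<1>-\g<2>" spelling refer to the capture groups:

```python
import re

date_time = "2013-01-01 20:59:39"
pattern = r"(\d{2})\s(\d)"

# "$1" is the Sublime Text style; Python inserts it literally.
literal = re.sub(pattern, "$1-$2", date_time)

# Python's own backreference syntax, in two equivalent spellings.
numbered = re.sub(pattern, r"\1-\2", date_time)
explicit = re.sub(pattern, r"\g<1>-\g<2>", date_time)

print(literal)   # 2013-01-$1-$20:59:39
print(numbered)  # 2013-01-01-20:59:39
print(explicit)  # 2013-01-01-20:59:39
```

The first result reproduces the broken output from the question. The \g<1> form is handy when the group reference is followed directly by a digit in the replacement.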

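Alternatively, to avoid backreferences in the replacement string altogether, use a single regex for the whole task and build the replacement from the match inside a lambda expression: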

import re
pat = r"\.\d|(\d{2})\s(\d)"
s = "0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active"
result = re.sub(pat, lambda m: r"{}-{}".format(m.group(1),m.group(2)) if m.group(1) else "", s)
print (result)

See another Python demo.
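
If the lambda is hard to read, the same single-pattern idea can be sketched with a named callback. The names PAT and fix are made up for this illustration; the point is that group 1 is None whenever the \.\d alternative matched:

```python
import re

PAT = re.compile(r"\.\d|(\d{2})\s(\d)")

def fix(match):
    # Group 1 is None when the "\.\d" alternative matched.
    if match.group(1) is None:
        return ""  # drop the decimal part
    # Otherwise the date/time alternative matched: join with a dash.
    return "{}-{}".format(match.group(1), match.group(2))

s = "0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active"
print(PAT.sub(fix, s))
# 0,2013-01-01-20:59:39,5737,english,2013-01-01-21:01:07,active
```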

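Note that, to be safer, you can use r'\.\d+\b' as the pattern for removing the decimal part (\d+ matches one or more digits, and \b requires that the digits be followed by a character other than a letter, digit, or _, or by the end of the string). The second pattern can be spelled out more strictly as r'(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})' for the same purpose.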
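
Putting the stricter patterns together with the file loop from the question might look like the sketch below. The helper names clean_field and process_file are hypothetical, and the output name is built from the file's stem, which is an assumption that differs slightly from the question's file.split(".")[0] for paths containing more than one dot:

```python
import csv
import re
from pathlib import Path

# The stricter patterns suggested above, compiled once for reuse.
DECIMALS = re.compile(r"\.\d+\b")
DATE_TIME = re.compile(r"(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})")

def clean_field(field):
    """Strip the decimal part, then join date and time with a dash."""
    return DATE_TIME.sub(r"\1-\2", DECIMALS.sub("", field))

def process_file(path):
    """Write a cleaned copy next to `path`, e.g. data.csv -> dataEDIT.csv."""
    src = Path(path)
    dst = src.with_name(src.stem + "EDIT.csv")
    with open(src, newline="") as infile, open(dst, "w", newline="") as outfile:
        writer = csv.writer(outfile)
        for row in csv.reader(infile):
            writer.writerow([clean_field(field) for field in row])
    return dst
```

With the question's csv_list, each file could then be handled by calling process_file(path) in a simple for loop.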
