Question

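I am reading a .csv file that has 5 columns: C1, C2, C3, C4, C5.

C4 contains dates, phone numbers, text, and so on.

I am trying to write a regex that finds dates in 'mm-dd-yy' format in C4 and writes the results to a text file. However, my code does not write anything to the output file. I know the input file contains dates in that format, so something must be wrong. Any suggestions?

My code: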

import re
inputfile = open("train.csv", 'r')
outputfile = open("sample.txt",'w')
for line in inputfile:
    x = re.findall('.*?^([0-9][0-9]-[0-9][0-9]-[0-9][0-9])$.*', line)
    if len(x) != 0:
        print >> outputfile, x

Sample train.csv file format:

sen_id  word_id type        before      after
1       0       text        On          On
1       1       date        12/2/12     december twelve two thousand twelve
1       2       text        there       there
2       0       text        he          he
2       1       text        was         was
2       2       text        born        born
2       3       date        Jan-12      january two thousand twelve

Answer 1

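You can use a regex to match the dates, but if you plan to do any further processing with them, you are better off using the datetime module. I changed your code to use the csv and re modules so that it should run.

Another quality-of-life tip: open files with a with statement. It takes care of closing the file streams for you, which can otherwise be a headache.

The csv module also splits each row into a list of entries, which is why row[3] gets the 4th column without needing a regex.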

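import csv
import re

# Matches two digits, a hyphen, two digits, a hyphen, two digits (e.g. 12-02-12).
date_matcher = re.compile(r'(\d{2}-\d{2}-\d{2})')
with open("sample.txt", 'w') as output_file, open("train.csv", 'r') as input_file:
    reader = csv.reader(input_file, delimiter=',', quotechar='"')
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or short rows that have no 4th column
        for match in date_matcher.finditer(row[3]):
            # One date per line; write() does not add a newline on its own.
            output_file.write(match.group(0) + '\n')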

Edit: changed match to finditer; I had not realized the fourth column could contain more than one date.
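The suggestion to use the datetime module for further processing could look roughly like the sketch below. It assumes the matched strings really are in mm-dd-yy order, and the helper name parse_mmddyy is made up for illustration. A side benefit is that strptime rejects strings that fit the pattern but are not real dates.

```python
import re
from datetime import datetime

date_matcher = re.compile(r'\d{2}-\d{2}-\d{2}')

def parse_mmddyy(text):
    """Return date objects for every valid mm-dd-yy string found in text.

    Hypothetical helper: the name and the mm-dd-yy assumption are not
    taken from the original answer.
    """
    dates = []
    for candidate in date_matcher.findall(text):
        try:
            dates.append(datetime.strptime(candidate, '%m-%d-%y').date())
        except ValueError:
            # Looks like a date but is not one (e.g. month 13), so skip it.
            continue
    return dates

parsed = parse_mmddyy('shipped 12-02-12, returned 13-45-99')
print(parsed)
```

Only the first string survives here, because 13-45-99 is not a valid month and day.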

Answer 2

Use the regex \d{2}-\d{2}-\d{2} in your code:

x = re.findall(r'\d{2}-\d{2}-\d{2}', line)
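A quick check of why the pattern from the question finds nothing while the shorter one does: without the MULTILINE flag, ^ and $ anchor the match to the start and end of the whole line, so the date would have to be the entire line. The sample line below is invented for illustration.

```python
import re

# Invented comma-separated line with a date in the 4th field.
line = '1,1,date,12-02-12,december second twenty twelve\n'

anchored = r'.*?^([0-9][0-9]-[0-9][0-9]-[0-9][0-9])$.*'
unanchored = r'\d{2}-\d{2}-\d{2}'

anchored_hits = re.findall(anchored, line)
unanchored_hits = re.findall(unanchored, line)

print(anchored_hits)    # []
print(unanchored_hits)  # ['12-02-12']
```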

Answer 3

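In this case, the devil is in the details.

Since this answer probably comes too late for the OP, it is meant for anyone looking for a similar answer.

You passed re.findall() an ordinary string literal rather than a raw string.
In Python, the lowercase r prefix marks a raw string literal, in which backslashes are left as-is. It is the usual way to write regex patterns and matters mainly for patterns containing backslashes such as \d or \s, for example:
'string_literal' -> r'string_as_regex'.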
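The difference the r prefix makes can be seen with \b, which is a backspace character in an ordinary string but a word boundary in a regex. This is a minimal sketch with an invented sample string.

```python
import re

# '\b' in an ordinary string is one backspace character;
# in a raw string it stays as backslash + 'b'.
print(len('\b'), len(r'\b'))  # 1 2

text = 'due 12-02-12'
plain_hits = re.findall('\b12-02', text)   # searches for a literal backspace
raw_hits = re.findall(r'\b12-02', text)    # searches for a word boundary

print(plain_hits)  # []
print(raw_hits)    # ['12-02']

# With no backslashes in the pattern, the prefix changes nothing.
same = ('^[0-9]$' == r'^[0-9]$')
print(same)  # True
```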

So (without validating your regex itself, since any objections to it still apply), you can use either of the following:

x = re.findall(r'.*?^([0-9][0-9]-[0-9][0-9]-[0-9][0-9])$.*', line)

Or compile the regex first and then use it like this:

rx = re.compile('regex expression')
match = rx.findall(line)

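As you can see here: https://docs.python.org/3.6/library/re.html#re.compile

Note: this returns a list of matching strings (0 to n of them). You may want to filter empty strings out of the list returned by re.findall().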
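Empty strings show up in the findall() result when the pattern can match nothing at all, for example when its only group is optional. The sketch below uses such a pattern purely to illustrate the filtering step; it is not the pattern from the question.

```python
import re

# The trailing ? makes the group optional, so findall also reports
# empty matches as '' at positions where no date starts.
hits = re.findall(r'(\d{2}-\d{2}-\d{2})?', 'a 01-02-03')

# Keep only the non-empty matches.
non_empty = [h for h in hits if h]
print(non_empty)  # ['01-02-03']
```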

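Now, for the actual answer to your question, "How do I search with a regex on column 4 only?": actually, you don't.
When you read a flat file line by line (as you did), you get one string per "line", so there are no actual "columns". If you need them, you can use a CSV reader to get the columns individually.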
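Getting the columns individually with a CSV reader might look like the sketch below. The tab delimiter is an assumption about the real file, and io.StringIO stands in for open("train.csv") so the example is self-contained.

```python
import csv
import io
import re

# Stand-in for the real file; tab-separated fields are an assumption.
sample = io.StringIO(
    'sen_id\tword_id\ttype\tbefore\tafter\n'
    '1\t0\ttext\tOn\tOn\n'
    '1\t1\tdate\t12-02-12\tdecember second twenty twelve\n'
)

date_matcher = re.compile(r'\d{2}-\d{2}-\d{2}')
found = []
reader = csv.reader(sample, delimiter='\t')
next(reader)  # skip the header row
for row in reader:
    if len(row) > 3:  # guard against short or blank rows
        found.extend(date_matcher.findall(row[3]))

print(found)  # ['12-02-12']
```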

However, there are ways to write a regex that takes the "columns" into account. This is both tricky and brittle. In your regex:

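Use \s for any whitespace that may appear as a separator between columns, and \s+ for one or more (a tab may have been rendered as a series of spaces)
Use grouping (...) around the parts
Where necessary, use the | (OR) operator to match alternatives, e.g. [0-9]{4}(-|/)[0-9]{2} matches both 2019-08 and 2019/08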
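The three building blocks above can be tried in isolation before combining them. This is a small sketch; the variable names and the sample strings are chosen only for illustration.

```python
import re

# Alternation inside a group: either '-' or '/' as the separator.
year_month = re.compile(r'[0-9]{4}(-|/)[0-9]{2}')
dash_ok = bool(year_month.fullmatch('2019-08'))
slash_ok = bool(year_month.fullmatch('2019/08'))
print(dash_ok, slash_ok)  # True True

# \s+ absorbs any run of spaces or tabs between "columns",
# and each (...) group captures one column.
columns = re.compile(r'([0-9]+)\s+([0-9]+)\s+([a-z]+)')
parts = columns.match('1 \t 1    date').groups()
print(parts)  # ('1', '1', 'date')
```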

So consider the following Python regex:

((([0-9])\s+){2}([a-z]{4})\s+){1}([0-9]{2}(/|-)[0-9]{1,2}(/|-)[0-9]{2})

along with the following code example (you can run it directly in the Python console):

import re

inp = ['sen_id  word_id type    before  after', 
    '1   0   text    On  On', 
    '1   1   date    12/2/12 december twelve two thousand twelve', 
    '2   1   date    12-2-12 december twelve two thousand twelve', 
    '3  1   date    12-2-12 december twelve two thousand twelve', 
    '1   2   text    there   there', '2   0   text    he  he',]

rx = re.compile(r'((([0-9])\s+){2}([a-z]{4})\s+){1}([0-9]{2}(/|-)[0-9]{1,2}(/|-)[0-9]{2})')

for line in inp:
    hit = rx.findall(line)
    hit[0] if hit else None

for line in inp:
    hit = rx.match(line)
    hit.groups() if hit else None

for line in inp:
    hit = rx.search(line)
    hit.groups() if hit else None

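Note that the data differs slightly from your sample, in order to demonstrate alternation (|) and whitespace matching (\s).

Each for loop produces the same output: for every line where a match is found, a tuple of 7 values, one for each group ((...)) in the regex: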

('1   1   date    ', '1   ', '1', 'date', '12/2/12', '/', '/')
('2   1   date    ', '1   ', '1', 'date', '12-2-12', '-', '-')
('3  1   date    ', '1   ', '1', 'date', '12-2-12', '-', '-')

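The fifth group (index 4 in the tuple) is the value you are looking for.

Using the same data and the same compiled regex from above (remember, this is written to be run in the Python console):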

for line in inp:
    hit = rx.findall(line)
    hit[0][4] if hit else None

for line in inp:
    hit = rx.match(line)
    hit.group(5) if hit else None

for line in inp:
    hit = rx.search(line)
    hit.group(5) if hit else None

Each for loop will return:

'12/2/12'
'12-2-12'
'12-2-12'
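As a possible variation (not part of the original answer), a named group avoids counting parentheses to find group 5. The group name date is chosen here for illustration, the other groups are made non-capturing with (?:...), and (/|-) is written as the equivalent character class [/-].

```python
import re

# Same column layout as the regex above, but only the date is captured.
rx_named = re.compile(
    r'(?:[0-9]\s+){2}[a-z]{4}\s+'
    r'(?P<date>[0-9]{2}[/-][0-9]{1,2}[/-][0-9]{2})'
)

lines = [
    '1   0   text    On  On',
    '1   1   date    12/2/12 december twelve two thousand twelve',
    '2   1   date    12-2-12 december twelve two thousand twelve',
]

# search() returns None for lines without a date, so filter those out.
dates = [m.group('date') for m in map(rx_named.search, lines) if m]
print(dates)  # ['12/2/12', '12-2-12']
```

Unlike the console snippets above, this version prints its result explicitly, so it also works when run as a script.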

That's it.

I hope this helps.

Answer 4

Try this as your regex:

x = re.findall('([0-9]{2}-[0-9]{2}-[0-9]{2})', line)
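Putting this pattern into a complete Python 3 loop might look like the sketch below. The question's print >> outputfile, x is Python 2 syntax; print(..., file=...) is the Python 3 form. Temporary files stand in for train.csv and sample.txt, and the file contents are invented. Unlike the original, which printed the whole list, this writes one date per line.

```python
import os
import re
import tempfile

pattern = re.compile(r'([0-9]{2}-[0-9]{2}-[0-9]{2})')

# Temporary stand-ins for train.csv and sample.txt (invented contents).
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, 'train.csv')
out_path = os.path.join(workdir, 'sample.txt')
with open(in_path, 'w') as f:
    f.write('1,0,text,On,On\n1,1,date,12-02-12,some words\n')

with open(in_path) as inputfile, open(out_path, 'w') as outputfile:
    for line in inputfile:
        for date in pattern.findall(line):
            # Python 3 replacement for "print >> outputfile, x".
            print(date, file=outputfile)

with open(out_path) as f:
    written = f.read()
print(written)
```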

Regex for "mm-dd-yy" date format does not find any matches in Python
