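Parsing different date formats: regular expressions

Time: 2018-04-17 22:11:27

Tags: python regex python-2.7 date date-parsing

Reposting this question with specific details (because my previous question was downvoted).

I am parsing messy text (tesseract-ocr output) from archive cards and want to recover at least 50% of the information (date1). As the data sample below shows, the rows contain dates in different forms.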

Raw_Text
1   "15957-8 . 3n v g - vw, 1 ekresta . bowker, william e tley n0 .qu v- l. c. 
    s. peteris, forestville, n. y. .mafae date1 june 17,1942 by davis, c. j6 
    l. g. b. jonnis, buffalo, n. y. ngsted decl 17, 1949.3y 7 davis, c. j. 
    date3 by j date4 - by date5 by 6 -.5/, 7/19/l date6 17 jul 1916 salamanca. 
    hf date7 31 dec 1986 buffalo, new york "
2   ".1o2o83n5ddn.. -i ekresta i bowles, albert edwin i made date1 june 9p1909 
    by parker, elm. date2 dec . 18 w date3 . by dep osed by date5 by date7mqm 
    9 ivvld wm 4144, mac, .75 076 eaqlwli "
3   "i naime bowles, charles edward made date1 may 31. 1892 by mclaren, wneoi 
    date2 may 18. 1895 by mclaren, w.e. date3 . i by date4 may 10. 1908 by 
    bip. of chicago. date5 by date7 "
4   "101 557 am l i ekrestaibowles, donald manson ..46 ohio trlnlty cathedral, 
    cleveland, ohio made date1 6/19/76 by burt, ji. h. grace , cleveland, ohio 
   date2 11 jun 77 by bp j h burt date3 . 1 .. by date4 by date5 bv m cuyahoga 
   heights, ohio date6 4/29/27 date7 240000 "
5   "227354 101 575 m68, frederick augustus st. paujjs cathedral, buffalo, 
   n.y. made date1 6/15/63 by scaife. l.i... st. thomas. modia, bath, n.y. 
   date2 1/11/611 by scaife. l.eo date3 by date4 by date5 by bradford, n.y. i 
   . 130m 6/1/18 date7 17 jun 1996 foratvme new york z4uc-xl "
6   "1 95812d ll. il ekresta bowles, harry oscar lmade date14 july 17, 190433, 
    lepnard, w.a. date2 july 25 , 1905 by leonard, w.a. i date3 by date4 by 
   date5 by g- m. /(,,/mr date7 jay /z/,. /357i l /mwi yk/maj. "
7   "5025 ,.. 2.57631 il . - . .. .1 i ekresta bowles , jedwiah hibbafd made 
    deac0n 8., i5-0i1862i13y potter, iih. date2 10. 280 1864 1 biy stevens, w. 
    b. date3 by date4 7 .30 l 1875 by date5 by date7 "
8   "30.611126 ekhq il ekresta bowles, ralph hart made date1 12. 210 i1883 by 
    iwiiiliams, i36 date2 7.. 1. 1885 by williams , j. date3 by i date4 by 
    date5 by g .97) l/am 9- date7 10. 4. 1900 (78) if x/ma 3.4, 154.47.11.73. 
    4,... mya-ix "
9   "2.25678 . 1o14593 ekresta bowles, robert brigham, jr. st. matthew s 
    cathedra1,da11quexas made date1 6/18/65 by mason, c. a. 57 mmzws camp 
    dr7///9s tams date2 12 21 cs by 14.45.42 c a date3 i by date4 by date5 , 
    by houston, texas date6 4/11/30 date7 12 dec 2000 dallas texas 2400-xi "
10  "101 619 34hq woe ekresta bowlin1 howard bruce cathedral modia of saint 
    peter 61 st. paul, washin ton, dc made date1 13 jun 92 bybp r h haines 
   (wdc st. alban1s modia, annandale, vir inia . pdumd 16 jan 93 by r h halnes 
    (wdc) date3 by atas by date4 v by date5 by date6 31 aug 1946 e st. louis. 
   il date7 2400-i "
11  "w k8 8km tm boiling jack dnnmwm q- f grace ch , made dat j 11201). salem 
    mares. stverrett. f. ,w a x st. johms modia. memphis, tenh. date1 apr. 25. 
    1955 - bv barth, t.in.. date3 4 by date4 by date5 by date7 wq iw r 1 w .n 
    . 4.1- 1 date6z1l7i1c. "

I parse date1 in a two-step process:

  1. Extract the text between the names "date1" and "by".
  2. Use a date parser to extract the actual date.

import re
import dateutil.parser as dparser
for lines in Raw_Text:
    lines = lines.lower() #make lower case
    lines = lines.strip() #remove leading and ending spaces
    lines = " ".join(lines.split()) #remove duplicated spaces



    # Step 1
    #Extract data between "date1" and "by"
    deacondt = re.findall(r'date1(.*?)by',lines)

    deacondt = ''.join(deacondt)  #Convert list to a string


    # Step 2
    # use dateutil to parse dates in extracted data

    try:
        deacondt1 = dparser.parse(deacondt)
    except:
        deacondt1 = 'NA'

    print deacondt1

The output of step 1 is:

[' june 17,1942 ']
[' june 9p1909 ']
[' may 31. 1892 ']
[' 6/19/76 ']
[' 6/15/63 ']
['4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 ']
[]
[' 12. 210 i1883 ']
[' 6/18/65 ']
[' 13 jun 92 ']
[]

while step 2 returns the following output:

2018-06-17 00:00:00
1909-06-17 21:00:00
1892-05-31 00:00:00
1976-06-19 00:00:00
2063-06-15 00:00:00
NA
NA
NA
2065-06-18 00:00:00
1992-06-13 00:00:00
NA

Step 2 fails to return all of the dates. Is there a better date parser than Python's "dateutil.parser"?

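3 Answers:

Answer 0: (score: 1)

No parsing module can provide a complete solution for every OCR glitch you may encounter. You will have to build some evaluation/correction framework to discover and fix what can be fixed.

I suggest the following workflow:

  1. Try to parse the date sequences.
  2. Save the sequences that could not be parsed to a special file.
  3. Edit the file, adding regex replacement rules that rewrite the sequences into a salvageable form.
  4. Apply the rules from the file and try parsing again.
  5. Repeat from step 2 until everything is handled.

Here is some sample code:

    parser.py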

    import re
    import csv
    import glob, os
    from datetime import datetime
    import dateutil.parser as dparser
    
    def load_patterns():
        ''' load patterns from existing pat_*.csv 
            return a dict of the form { sequence: (pattern, replace) }
            sequence is an example of the string that should be handled by this pattern
            pattern and replace have the same meaning as for re.sub
        '''
        patterns = {}
        for pattern_file in glob.glob("pat_*.csv"):
            with open(pattern_file, 'r') as fh:
                reader = csv.DictReader(fh, delimiter=',', quotechar='"', skipinitialspace=True)
                reader.fieldnames=[f.strip() for f in reader.fieldnames]
                for row in reader:
                    # skipping empty patterns if there was non-empty one for this sequence
                    if row['sequence'] in patterns and  not row['pattern']:
                        continue
                    patterns[row['sequence']]=(row['pattern'],row['replace'])
        return patterns
    
    def save_nonmatched(patterns, nonmatched):
        ''' saves a new pattern file with the empty pattern field
            supposed to be edited manually afterwards
        '''
        items_to_save = [ key for key in nonmatched if key not in patterns ]
        if not items_to_save:
            return
    
        new_file=datetime.now().strftime('pat_%Y%m%d_%H%M%S.csv')
        with open(new_file, 'w', newline='') as fh:
            writer = csv.DictWriter(fh, fieldnames=['sequence', 'pattern', 'replace'], quoting=csv.QUOTE_ALL)
            writer.writeheader()
            for key in items_to_save:
                writer.writerow({'sequence':key, 'pattern':'', 'replace':''})
    
    def sub_with_patterns(s, patterns):
        ''' try to match each pattern in patterns iterable
            return expanded string if match succeeded
        '''
        for sequence, (pattern, replace) in patterns.items():
            if not pattern:
                continue
            match=re.search(pattern,s,re.X)
            if match:
                return match.expand(replace)
        return None
    
    
    nomatch={}
    patterns = load_patterns()
    Raw_Text = re.sub(r'\s+', ' ' ,open('in.txt','r').read().lower()).strip()
    
    for dt in re.findall(r'date1(.*?)by', Raw_Text, re.S):
        corrected = sub_with_patterns(dt, patterns)
        try:
            parsed = dparser.parse(corrected or dt)
            print ("input:%s parsed:%s" % (dt,parsed))
        except (ValueError, OverflowError):  # raised by dateutil for unparseable input
            nomatch[dt]=1
            print ("input:%s ** not parsed" % (dt))            
    
    save_nonmatched(patterns, nomatch)
    

    Now if we run the script on the input, we get the first correction CSV:

    "sequence","pattern","replace"
    "4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 ","",""
    " 12. 210 i1883 ","",""
    " apr. 25. 1955 - bv barth, t.in.. date3 4 ","",""
    

    and the output:

    input: june 17,1942  parsed:2018-06-17 00:00:00
    ...
    input:4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905  ** not parsed
    ...
    

    We edit the file as follows:

    "sequence","pattern","replace"                                                    
    "4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 ","^
         \s*(?P<day>\d+)
         \s+(?P<month>(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*)
         \s+(?P<year>\d{2})
        ","\g<day> \g<month> 19\g<year>"
    " 12. 210 i1883 ","",""
    " apr. 25. 1955 - bv barth, t.in.. date3 4 ","",""
    

    Run the parser again:

    input: june 17,1942  parsed:2018-06-17 00:00:00
    ...
    input:4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905  parsed:1917-07-04 00:00:00
    ...
    

    Of course, this is far from solving every OCR parsing problem you will run into, but it may be a good start.
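To make the rewrite step easier to check in isolation, here is a minimal sketch that applies the one rule from the edited CSV above to its sample sequence, using only the standard library. The names PATTERN, REPLACE and apply_rule are hypothetical helpers for illustration, not part of parser.py.

```python
import re

# Rule copied from the edited CSV above; re.X lets the pattern span several lines.
PATTERN = r"""^
    \s*(?P<day>\d+)
    \s+(?P<month>(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*)
    \s+(?P<year>\d{2})
"""
REPLACE = r"\g<day> \g<month> 19\g<year>"

def apply_rule(sequence, pattern=PATTERN, replace=REPLACE):
    """Return the rewritten sequence, or None when the rule does not match."""
    match = re.search(pattern, sequence, re.X)
    return match.expand(replace) if match else None

rewritten = apply_rule("4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 ")
print(rewritten)  # 4 july 1917
```

Note that this rule reads the leading "4" as the day and "17" as a two-digit year, which matches the parser output shown above. If the "4" is actually OCR noise left over from "date14" in the raw text, the rule would need to skip it instead, so each rule is worth checking against the original card.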

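Answer 1: (score: 0)

Many of your dates are in different formats, which makes things difficult.

You can use the datetime library to parse dates. Since your data comes in multiple formats, you will need several different format strings.

datetime has two useful functions: datetime.strptime (string PARSE time, returns datetime.datetime) and datetime.strftime (string FROM time, returns str).

If you have enough format strings, here is an example of how to parse: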

import datetime

for lines in Raw_Text:

    ## Do the regex stuff above.
    ## Keep each returned result as a separate string.
    regex_results = get_your_regex_results()


    # Step 2
    # Try each strptime format in turn on the extracted strings.

    date_formats = [ ## You will need several formats to try.
        '%m/%d/%Y',
    ]

    for date_str in regex_results:
        date_str = date_str.strip()

        for fmt in date_formats:
            try:
                deacondt1 = datetime.datetime.strptime(date_str, fmt)
                print(deacondt1)
                break
            except ValueError:
                continue

https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior
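As a rough sketch of the format-string approach applied to the step 1 strings from the question: the format list below is an assumption chosen to cover those samples, and parse_date and latest_year are hypothetical names. The century adjustment assumes no archive date lies after latest_year; it is there because %y maps 00-68 to 2000-2068, which is how "6/15/63" ends up in 2063.

```python
from datetime import datetime

# Hypothetical format list, chosen to cover sample strings from the question.
DATE_FORMATS = [
    "%B %d,%Y",   # june 17,1942
    "%B %d. %Y",  # may 31. 1892
    "%m/%d/%y",   # 6/19/76
    "%d %b %y",   # 13 jun 92
]

def parse_date(text, latest_year=2018):
    """Return a datetime for the first format that fits, or None."""
    text = text.strip()
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        # Assumption: a two-digit year that lands in the future belongs
        # to the previous century.
        if parsed.year > latest_year:
            parsed = parsed.replace(year=parsed.year - 100)
        return parsed
    return None

for sample in [" june 17,1942 ", " 6/15/63 ", " 13 jun 92 ", " 12. 210 i1883 "]:
    print(repr(sample), "->", parse_date(sample))
```

Strings that fit none of the formats, such as " 12. 210 i1883 ", come back as None and still need a cleanup rule before they can be parsed.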

Answer 2: (score: 0)

You can try this:

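deacondt1 = dparser.parse(deacondt, dayfirst=False, fuzzy=True)

  • fuzzy=True allows strings that contain non-date words, such as "Today is January 1, 2047 at 8:21:00AM".
  • dayfirst=False means month-first date formats, as in your input strings.

That alone is not enough to extract the output you want, so you need to pass the parser strings that more closely approximate a date format.

Regex to extract the string for date1:

(?s)date1\d?((?:(?!by|date2|date3).)*)

Demo: here not only 'by' but also 'date2' and 'date3' are used as separators, and date10 through date19 are treated as date1.

Then manipulate the extracted string into acceptable input for the dateutil parser (removing leading and trailing whitespace, etc.) and print each cleaned string with its parsed result: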

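import re
import dateutil.parser as dparser

# Raw_Text is used here as a single string holding all of the OCR'd cards.
regx = re.compile(r'(?s)date1\d?((?:(?!by|date2|date3).)*)')

def clean(m):
    m = re.sub(r'^\s+|\s+$|\n+', '', m)  # trim both ends, drop newlines
    m = re.sub(r'\s+|,|(?<=\d)[^\d\s\/](?=\d)', ' ', m)  # commas and stray chars between digits become spaces
    return re.sub(r'(?i)(?<=\s)[a-z]?(\d{4}|\d{2})\d*', r'\1', m)  # keep 4 or 2 digits, drop the excess

raw_date = [clean(m) for m in regx.findall(Raw_Text)]

for deacondt in raw_date:
    try:
        deacondt1 = dparser.parse(deacondt, dayfirst=False, fuzzy=True)
    except (ValueError, OverflowError):
        deacondt1 = 'NA'
    print(deacondt + "\n" + str(deacondt1))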
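To see what the extraction regex and the cleanup substitutions hand over before dateutil is involved, the following sketch traces them on shortened excerpts of cards 2, 6 and 8 from the question, using only the standard library. The function name extract_date1 and the card variables are hypothetical; the regexes are the ones from the code above.

```python
import re

# Same extraction regex and cleanup substitutions as in the answer above.
regx = re.compile(r'(?s)date1\d?((?:(?!by|date2|date3).)*)')

def extract_date1(card):
    """Return the cleaned date1 candidates found in one OCR'd card."""
    cleaned = []
    for m in regx.findall(card):
        m = re.sub(r'^\s+|\s+$|\n+', '', m)
        m = re.sub(r'\s+|,|(?<=\d)[^\d\s\/](?=\d)', ' ', m)
        m = re.sub(r'(?i)(?<=\s)[a-z]?(\d{4}|\d{2})\d*', r'\1', m)
        cleaned.append(m)
    return cleaned

# Shortened excerpts of cards 2, 6 and 8 from the question.
card2 = "made date1 june 9p1909 by parker, elm. date2 dec . 18 w"
card6 = "lmade date14 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 by leonard, w.a."
card8 = "made date1 12. 210 i1883 by iwiiiliams, i36 date2 7.. 1. 1885 by williams , j."

print(extract_date1(card2))  # ['june 9 1909']
print(extract_date1(card8))  # ['12. 21 1883']
print(extract_date1(card6))  # july 17 and 1904 survive, followed by leftover name text
```

Card 6 shows why fuzzy=True matters: the missing "by" leaves part of a name in the extracted string, and the fuzzy parser is what skips over it.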