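Using sed to remove a period from the end of a string (ZIP code)

Time: 2017-08-14 22:06:15

Tags: bash sed

I have an address file that I am trying to clean up, and I am using sed to remove unwanted characters and formatting. In this case, I have a ZIP code followed by a period: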

Mr. John Doe
Exclusively Stuff, 186 
Caravelle Drive, Ponte Vedra FL
33487. 

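(Ignore the line breaks for now; I am only concerned with the ZIP code and the period.)

I want to remove the period (.) from the ZIP code as the first step in cleaning it up. I tried a capture group in sed, as follows (using "|" as the delimiter, which I find easier to read):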

sed 's|\([0-9]{4}\)\.|\1|g' test.txt

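Unfortunately, it does not remove the period; it just prints it as part of the substring. I based the attempt on this post: Replace period surrounded by characters with sed

A pointer in the right direction would be much appreciated.

2 Answers:

Answer 0 (score: 3)

You specified 4 digits with {4}, but there are 5, and you have to escape the braces {}, e.g.: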

sed 's|^\([0-9]\{5\}\).*|\1|' test.txt

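Note that there is also a space after the period, so you may want to trim everything after the five digits. To be safe, you may also want to require that the digits sit at the start of the line, using ^.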
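To see why the original pattern matched nothing, here is a minimal sketch. The function names and the sample strings fed through printf are made up for illustration. In a basic regular expression an unescaped {4} is literal text, so the original command only matches input that really contains braces.

```shell
# The question's original pattern: in a basic regular expression (BRE)
# unescaped braces are literal, so this looks for one digit followed by
# the text "{4}" and a period.
original_attempt() { sed 's|\([0-9]{4}\)\.|\1|g'; }

# Escaped braces form an interval expression: exactly five digits.
escaped_braces() { sed 's|^\([0-9]\{5\}\)\.|\1|'; }

printf '%s\n' '33487.' | original_attempt   # line comes back unchanged
printf '%s\n' '3{4}.'  | original_attempt   # period removed: the literal braces matched
printf '%s\n' '33487.' | escaped_braces     # period removed
```

The second call is only there to show what the unescaped pattern actually looks for; it is not something you would expect in an address file.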

In my case, if I type info sed, which is more complete than man sed, I find:

'-r'
'--regexp-extended'
     Use extended regular expressions rather than basic regular
     expressions.  Extended regexps are those that 'egrep' accepts; they
     can be clearer because they usually have less backslashes, but are
     a GNU extension and hence scripts that use them are not portable.
     *Note Extended regular expressions: Extended regexps.

Under Appendix A Extended regular expressions, you can read:

The only difference between basic and extended regular expressions is in
the behavior of a few characters: '?', '+', parentheses, braces ('{}'),
and '|'.  While basic regular expressions require these to be escaped if
you want them to behave as special characters, when using extended
regular expressions you must escape them if you want them _to match a
literal character_.  '|' is special here because '\|' is a GNU extension
- standard basic regular expressions do not provide its functionality.

Examples:
'abc?'
     becomes 'abc\?' when using extended regular expressions.  It
     matches the literal string 'abc?'.

'c\+'
     becomes 'c+' when using extended regular expressions.  It matches
     one or more 'c's.

'a\{3,\}'
     becomes 'a{3,}' when using extended regular expressions.  It
     matches three or more 'a's.

 '\(abc\)\{2,3\}'
     becomes '(abc){2,3}' when using extended regular expressions.  It
     matches either 'abcabc' or 'abcabcabc'.

 '\(abc*\)\1'
     becomes '(abc*)\1' when using extended regular expressions.
     Backreferences must still be escaped when using extended regular
     expressions.
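Applied to the ZIP code case, the difference described in the manual excerpt looks roughly like this. This is a sketch with hypothetical function names, and it assumes a sed that accepts -E for extended regular expressions (current GNU and BSD versions do); with GNU sed, -r as documented above is an equivalent spelling.

```shell
# Basic regular expression: grouping and interval braces need backslashes.
zip_bre() { sed 's/^\([0-9]\{5\}\)\./\1/'; }

# Extended regular expression: the same operators, written unescaped.
# The backreference \1 keeps its backslash in both forms.
zip_ere() { sed -E 's/^([0-9]{5})\./\1/'; }

printf '%s\n' '33487.' | zip_bre
printf '%s\n' '33487.' | zip_ere
```

Both commands should print the ZIP code without its trailing period.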

Answer 1 (score: 1)

Basic Solution: Handle the Posted Input Using Range Atoms

A simple (but slightly naive) way to do this with your posted input is to look for:

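  1. The start of a line,
  2. followed by 5 digits (a standard US ZIP code),
  3. followed by zero or more non-period characters (e.g. a ZIP+4 suffix; in the posted input this cannot match the street address, which does not start a line with five digits),
  4. followed by a literal period,
  5. and replace the whole match with only the captured portion. For example:

    • With BSD sed, or without extended expressions: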

      sed 's/^\([[:digit:]]\{5\}[^.]*\)\./\1/'
      
    • With GNU sed and extended regular expressions:

      sed -r 's/^([[:digit:]]{5}[^.]*)\./\1/'
      

Either way, given your posted input, you end up with:

    Mr. John Doe
    Exclusively Stuff, 186 
    Caravelle Drive, Ponte Vedra FL
    33487 
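The result above can be reproduced with a short script. This is a sketch only: the scratch file created with mktemp and the variable names are assumptions for illustration, not part of the original answer.

```shell
# Recreate the sample from the question in a scratch file.
sample=$(mktemp)
printf '%s\n' 'Mr. John Doe' 'Exclusively Stuff, 186 ' \
    'Caravelle Drive, Ponte Vedra FL' '33487. ' > "$sample"

# BRE form from the answer: only a line that starts with five digits
# can match, so the period in "Mr." is left alone.
cleaned=$(sed 's/^\([[:digit:]]\{5\}[^.]*\)\./\1/' "$sample")
printf '%s\n' "$cleaned"

rm -f "$sample"
```

Note that the space after the ZIP code is still present in the output; this expression removes only the period.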
    

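Advanced Solution: Handle ZIP Codes Properly

The main caveat is that the solution above works for the example you posted, but it will not match if the ZIP code sits at the end of the last line of the address, as it should in a standardized USPS address. That is fine if you have a custom format, but it may cause problems with standardized or corrected addresses such as: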

    Mr. John Doe
    12345 Exclusively Stuff, 186 
    Caravelle Drive, Ponte Vedra FL 33487.
    

The following works with both your posted input and more typical USPS addresses, but your mileage may vary on other non-standard input.

    # More reliable, but much harder to read.
    sed -r 's/([[:digit:]]{5}(-[[:digit:]]{4})?[[:space:]]*)\.[[:space:]]*$/\1/'
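As a quick check of the expression above, the sketch below wraps it in a function and runs it on a few lines. The function name and the ZIP+4 sample are hypothetical, and -E is used in place of -r on the assumption that your sed accepts it (current GNU and BSD versions do).

```shell
# Same expression as in the answer, wrapped in a function for reuse.
strip_trailing_zip_period() {
    sed -E 's/([[:digit:]]{5}(-[[:digit:]]{4})?[[:space:]]*)\.[[:space:]]*$/\1/'
}

# ZIP code alone on its own line, as in the posted input.
printf '%s\n' '33487. ' | strip_trailing_zip_period
# ZIP code at the end of the last address line.
printf '%s\n' 'Caravelle Drive, Ponte Vedra FL 33487.' | strip_trailing_zip_period
# ZIP+4 form (made-up extension).
printf '%s\n' 'Ponte Vedra FL 33487-1234.' | strip_trailing_zip_period
# Lines without a trailing ZIP code pass through untouched.
printf '%s\n' 'Mr. John Doe' '12345 Exclusively Stuff, 186 ' | strip_trailing_zip_period
```

Unlike the basic solution, this version also drops any whitespace that follows the period, because the trailing [[:space:]]* sits outside the capture group.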