Removing punctuation from a list of sentences in a file

Asked: 2013-12-13 14:38:55

Tags: python regex list

3   3   how are you doing???
2   5   dear, where abouts!!!!!!........
4   6   don't worry i'll be there for ya///

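I have a file containing sentences like these. I want to strip the punctuation from them. How can I loop over the lines and strip it using a regular expression?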

>>> import re
>>> a="what is. your. name?"
>>> b=re.findall(r'\w+',a)
>>> b
['what', 'is', 'your', 'name']

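I know how to do this for a single sentence, but I get confused when it comes to a list like the one above. I am new to Python and regular expressions. When I do not remove the punctuation from the sentences, I get this kind of error: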

File "/usr/lib/python2.7/re.py", line 137, in match
    return _compile(pattern, flags).match(string)
  File "/usr/lib/python2.7/re.py", line 242, in _compile
    raise error, v # invalid expression
sre_constants.error: multiple repeat

Edit: the sentence is in the third column and the delimiter is a tab, so how do I remove the punctuation from the third column?

3 answers:

Answer 0: (score: 4)

Use a for loop to iterate over the lines:

import re

with open('/path/to/file.txt') as f:
    for line in f:
        words = re.findall(r'\w+', line)  # list of words, punctuation dropped
        # do something with words

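If the sentence is the third tab-separated column, split each line first and strip only that column:

with open('/path/to/file.txt') as f:
    for line in f:
        col1, col2, rest = line.split('\t', 2)  # split into 3 columns
        words = re.findall(r'\w+', rest)
        line = '\t'.join([col1, col2, ' '.join(words)])  # join() takes a single iterable
        # do something with words or line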
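To see what the column-wise approach produces, here is a minimal sketch run against the three sample lines from the question instead of a file. The helper name `strip_third_column` and the in-memory `sample` list are made up for illustration; they are not part of the original answer.

```python
import re

def strip_third_column(line):
    """Remove punctuation from the third tab-separated column of one line."""
    # maxsplit=2 keeps any further tabs inside the sentence column
    col1, col2, sentence = line.rstrip('\n').split('\t', 2)
    words = re.findall(r'\w+', sentence)
    return '\t'.join([col1, col2, ' '.join(words)])

# Sample rows copied from the question, assuming tab delimiters
sample = [
    "3\t3\thow are you doing???\n",
    "2\t5\tdear, where abouts!!!!!!........\n",
    "4\t6\tdon't worry i'll be there for ya///\n",
]

cleaned = [strip_third_column(line) for line in sample]
for row in cleaned:
    print(row)
```

The first two columns pass through untouched, and only the sentence column is rebuilt from the words that `\w+` matches.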

Answer 1: (score: 3)

You can use the following script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re
import sys

with open(sys.argv[1]) as f:
    for line in f:
        print(' '.join(re.findall(r'\w+', line)))

Demo:

$ chmod +x strip_punc.py

$ cat input
how are you doing???
dear, where abouts!!!!!!........
don't worry i'll be there for ya///

$ ./strip_punc.py input
how are you doing
dear where abouts
don t worry i ll be there for ya
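Note that `\w+` also splits contractions, which is why `don't` comes out as `don t` in the demo. If apostrophes should survive, one possible variation (not part of the original answer) is to add the apostrophe to the character class. The function name `strip_keep_apostrophes` below is hypothetical.

```python
import re

def strip_keep_apostrophes(text):
    # Match runs of word characters and apostrophes; everything else is dropped.
    # Caveat: a stray apostrophe standing on its own is kept as a token too.
    return ' '.join(re.findall(r"[\w']+", text))

contraction = strip_keep_apostrophes("don't worry i'll be there for ya///")
plain = strip_keep_apostrophes("dear, where abouts!!!!!!........")
print(contraction)
print(plain)
```

Whether apostrophes count as punctuation depends on what the cleaned text is used for, so treat this as an option rather than a fix.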

Answer 2: (score: 2)

Use this with a text file:

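import re

reg = r"\w+"  # raw string so the backslash reaches the regex engine unchanged
strings = []

with open("s.txt", 'r') as txt:
    for line in txt:  # iterate lazily instead of loading all lines with readlines()
        strings.append(' '.join(re.findall(reg, line)))

print(strings)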

Output:

['how are you doing', 'dear where abouts', 'don t worry i ll be there for ya']
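As for the `multiple repeat` error in the question: the traceback shows `re.match` failing while compiling its pattern. The question does not show the call itself, so the sketch below assumes that a raw sentence was passed as the pattern. In that case a run like `???` is read as stacked regex quantifiers, and `re.escape` is the standard way to match such text literally.

```python
import re

sentence = "how are you doing???"

error_message = None
try:
    # Assumed reproduction: the sentence itself is used as the pattern,
    # so "???" is parsed as quantifiers rather than literal question marks.
    re.match(sentence, sentence)
except re.error as exc:
    error_message = str(exc)
    print(error_message)

# re.escape() backslash-escapes the special characters so they match literally.
safe_pattern = re.escape(sentence)
matched = re.match(safe_pattern, sentence) is not None
print(matched)
```

Stripping the punctuation first avoids the error as well, but escaping is the safer choice whenever arbitrary text has to be used as a pattern.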