Remove timestamps and URLs from a string in Python

Time: 2018-08-21 20:36:17

Tags: string python-3.x timestamp data-cleaning

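I have a string from which I need to remove timestamps and punctuation. I also need to remove all numbers, except that responseCode values (400 in this case) must be kept. Wherever 400 appears, it should not be removed. I also want to remove all URLs and any file names ending in tar.gz.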

mystr = """sun aug 19 13:02:09 2018 I_am.98189:  hello please connect to the local host:8080 
sun aug 19 13:02:10 2018 hey.94289:  hello not able to find the file 
sun aug 19 13:02:10 2018 I_am.94289: Base url for file_transfer is: abc/vd/filename.tar.gz 
mon aug 19 13:02:10 2018 how_94289: $var1={ 
  'responseCode' = '400', 
  'responseDate' = 'Sun, 19 Aug 2018 13:02:08 ET', 
  'responseContent' = 'ABC'  }
mon aug 20 13:02:10 2018 hello!94289: Error performing action, failed with error code [400]
"""

Expected result:

"I_am hello please connect to the local host 
hello not able to find the file 
Base url for file_transfer 
var1 
  responseCode = 400 
  responseDate  
  responseContent = ABC 
Error performing action, failed with error code 400
"

My solution for removing punctuation:

punctuations = '''!=()-[]{};:'"\,<>.?@#$%^&*_~'''
no_punct = ""
for char in mystr:
   if char not in punctuations:
       no_punct = no_punct + char

# display the unpunctuated string
print(no_punct)
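
As a possible alternative to the character-by-character loop, the same idea can be sketched with `str.translate`, which deletes a set of characters in one pass. The helper name `strip_punctuation` and its `keep` parameter are hypothetical, introduced only for illustration. The sketch assumes ASCII punctuation from `string.punctuation` is the set to remove.

```python
import string


def strip_punctuation(text, keep=""):
    """Remove ASCII punctuation from text, except characters listed in keep."""
    # Build the set of characters to delete, leaving out anything in `keep`.
    to_delete = "".join(ch for ch in string.punctuation if ch not in keep)
    # str.maketrans with a third argument maps those characters to None.
    return text.translate(str.maketrans("", "", to_delete))


sample = "failed with error code [400], see 'responseCode' = '400'"
cleaned = strip_punctuation(sample, keep="=_")
print(cleaned)  # failed with error code 400 see responseCode = 400
```

Passing `keep="=_"` preserves the equals sign and underscore, which the expected result in the question retains (for example in `file_transfer`).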

1 answer:

Answer 0: (score: 1)

Maybe:

import re

patterns = [r"\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4}\s*",    #sun aug 19 13:02:10 2018
        r"\w{3}, \d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2} \w{2}\s*", #Sun, 19 Aug 2018 13:02:08 ET
        r":\s*([\da-zA-Z]+\/)+([a-zA-Z0-9\.]+)",                #URL
        r"([a-zA-Z_!]+)[\.!_]\d+:\s*",                          #word[._!]number:>=0space
        r":\d+",                                                #:number, e.g. :8080
        r"[/':,${}\[\]]"                                        #punctuations
        ]

s = mystr

for p in patterns:
    s = re.sub(p,'', s)

s = s.strip()

print(s)

Output:

hello please connect to the local host
hello not able to find the file
Base url for file_transfer is
var1= 
  responseCode = 400 
  responseDate =  
  responseContent = ABC  
Error performing action failed with error code 400
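
In the answer above, 400 survives only because none of the patterns happens to match it. If every other standalone number also has to go, as the question asks, one possible approach is a replacement callback that checks each number against a list of values to keep. This is a minimal sketch, not part of the original answer: the function name `drop_numbers_except` and the `keep` parameter are hypothetical, and it assumes that "numbers" means whole digit runs, so digits inside identifiers such as `var1` are left alone.

```python
import re


def drop_numbers_except(text, keep=("400",)):
    """Delete standalone digit runs unless the run is one of the values in keep."""
    def replace(match):
        number = match.group(0)
        # Return the number unchanged if it is on the keep list, else delete it.
        return number if number in keep else ""

    # \b on both sides means digits embedded in a word (e.g. var1) do not match.
    return re.sub(r"\b\d+\b", replace, text)


print(drop_numbers_except("port 8080 code 400 var1"))
# port  code 400 var1
```

The deleted number leaves its surrounding spaces behind, so a follow-up such as `re.sub(r" {2,}", " ", s)` may be wanted to collapse them.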