I've run into a tricky problem. I know there are plenty of Python "gurus" out there, so please help me. I have a huge log file. The format looks like this:
[text hello world yadda
lines lines lines
exceptions]
[something i'm not interested in]
[text hello world yadda
lines lines lines
exceptions]
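And so on... So blocks 1 and 3 are identical, and there are many such cases. My question is: how can I read this file and write only the unique blocks to an output file? If a block is duplicated, it should be written only once. Sometimes there are several other blocks between two duplicates. I'm actually doing pattern matching, and this is the current code. It only matches the pattern and does nothing about duplicates.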
import re
import sys

try:
    if len(sys.argv) != 3:
        sys.exit("You should enter 3 parameters.")
    elif sys.argv[1] == sys.argv[2]:
        sys.exit("The two file names cannot be the same.")
    else:
        file = open(sys.argv[1], "r")
        file1 = open(sys.argv[2], "w")
        # A group is needed here: [java|javax|org|com] would be a character class, not an alternation.
        java_regex = re.compile(r'\b(?:javax?|org|com)[.:]', re.I)  # java
        at_regex = re.compile(r'at\s', re.I)  # at
        skip_regex = re.compile(r'mdcloginid:|webcontainer|c\.h\.i\.h\.p\.u\.e|threadPoolTaskExecutor|caused\sby', re.I)
        copy = False  # flag that controls whether to copy to the output
        for line in file:
            if java_regex.search(line) and not (at_regex.search(line) or skip_regex.search(line)):
                # start copying if "java" is in the input
                copy = True
            else:
                if copy and not at_regex.search(line):
                    # stop copying if "at" is not in the input
                    copy = False
            if copy:
                file1.write(line)
        file.close()
        file1.close()
except IOError:
    sys.exit("IO error or wrong file name.")
except IndexError:
    sys.exit('\nYou must enter 3 parameters.')  # guards against fewer than 3 inputs, which are mandatory
except SystemExit as e:  # re-raise the exit requested by sys.exit()
    sys.exit(e)
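I don't care whether the duplicate removal has to happen in this code. It could just as well live in a separate .py file. It doesn't matter. Here is an original snippet of the log file: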
javax.xml.ws.soap.SOAPFaultException: Uncaught BPEL fault http://schemas.xmlsoap.org/soap/envelope/:Server
at org.apache.axis2.jaxws.marshaller.impl.alt.MethodMarshallerUtils.createSystemException(MethodMarshallerUtils.java:1326) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.marshaller.impl.alt.MethodMarshallerUtils.demarshalFaultResponse(MethodMarshallerUtils.java:1052) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.marshaller.impl.alt.DocLitBareMethodMarshaller.demarshalFaultResponse(DocLitBareMethodMarshaller.java:415) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.getFaultResponse(JAXWSProxyHandler.java:597) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.createResponse(JAXWSProxyHandler.java:537) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.invokeSEIMethod(JAXWSProxyHandler.java:403) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.invoke(JAXWSProxyHandler.java:188) ~[org.apache.axis2.jar:na]
com.hcentive.utils.exception.HCRuntimeException: Unable to Find User Profile:null
at com.hcentive.agent.service.AgentServiceImpl.getAgentByUserProfile(AgentServiceImpl.java:275) ~[agent-service-core-4.0.0.jar:na]
at com.hcentive.agent.service.AgentServiceImpl$$FastClassByCGLIB$$e3caddab.invoke(<generated>) ~[cglib-2.2.jar:na]
at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:191) ~[cglib-2.2.jar:na]
at org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation.invokeJoinpoint(Cglib2AopProxy.java:689) ~[spring-aop-3.1.2.RELEASE.jar:3.1.2.RELEASE]
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150) ~[spring-aop-3.1.2.RELEASE.jar:3.1.2.RELEASE]
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:110) ~[spring-tx-3.1.2.RELEASE.jar:3.1.2.RELEASE]
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172) ~[spring-aop-3.1.2.RELEASE.jar:3.1.2.RELEASE]
at org.springframework.security.access.intercept.aopalliance.MethodSecurityInterceptor.invoke(MethodSecurityInterceptor.java:64) ~[spring-security-core-3.1.2.RELEASE.jar:3.1.2.RELEASE]
javax.xml.ws.soap.SOAPFaultException: Uncaught BPEL fault http://schemas.xmlsoap.org/soap/envelope/:Server
at org.apache.axis2.jaxws.marshaller.impl.alt.MethodMarshallerUtils.createSystemException(MethodMarshallerUtils.java:1326) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.marshaller.impl.alt.MethodMarshallerUtils.demarshalFaultResponse(MethodMarshallerUtils.java:1052) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.marshaller.impl.alt.DocLitBareMethodMarshaller.demarshalFaultResponse(DocLitBareMethodMarshaller.java:415) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.getFaultResponse(JAXWSProxyHandler.java:597) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.createResponse(JAXWSProxyHandler.java:537) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.invokeSEIMethod(JAXWSProxyHandler.java:403) ~[org.apache.axis2.jar:na]
And so on and on....
Answer 0 (score: 1)
You can remove the duplicate blocks like this:
import re
yourstr = r'''
[text hello world yadda
lines lines lines
exceptions]
[something i'm not interested in]
[text hello world yadda
lines lines lines
exceptions]
'''
pat = re.compile(r'\[([^]]+])(?=.*\[\1)', re.DOTALL)
result = pat.sub('', yourstr)
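Note that only the last occurrence of each block is kept. If you want to keep the first occurrence instead, you have to reverse the string and use this pattern: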
(][^[]+)\[(?=.*\1\[)
Then reverse the string again.
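The reverse, substitute, reverse-back steps could be sketched as follows. This is only a minimal illustration that assumes the whole text fits in memory as one string. The function name drop_duplicate_blocks_keep_first and the sample text are made up for this example.

```python
import re


def drop_duplicate_blocks_keep_first(text):
    """Remove repeated [...] blocks, keeping the first occurrence of each.

    Hypothetical helper: it runs the reversed-string pattern from this
    answer on the reversed text, then reverses the result back.
    """
    pat = re.compile(r'(][^[]+)\[(?=.*\1\[)', re.DOTALL)
    reversed_text = text[::-1]
    # In the reversed text the pattern drops every block that is followed
    # by another copy, i.e. every later copy in the original order.
    return pat.sub('', reversed_text)[::-1]


sample = (
    "[text hello world yadda\nlines lines lines\nexceptions]\n"
    "[something i'm not interested in]\n"
    "[text hello world yadda\nlines lines lines\nexceptions]\n"
)
result = drop_duplicate_blocks_keep_first(sample)
print(result)
```

The newlines that separated the removed blocks stay in the output, so a follow-up pass may be needed if blank lines are unwanted.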
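Answer 1 (score: 0)
You can use a hashing algorithm from hashlib together with a dictionary like this: {123456789: True}. The value doesn't matter, but for a large file a membership test on a dict will be noticeably faster than one on a list.
Either way, hash each block as it appears and, if the hash is not yet in the dictionary, store it there. If it is already in the dictionary, ignore the block. This assumes that your duplicate blocks are exactly identical.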
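A rough sketch of this idea, applied to the stack-trace format from the question, might look like the code below. It rests on an assumption that is not stated in the question: a block is taken to be a header line plus the "at ..." lines that follow it. The helper names split_into_blocks and unique_blocks are made up for the example, and SHA-1 is just one of the algorithms hashlib offers.

```python
import hashlib


def split_into_blocks(lines):
    """Group lines into blocks: one header line plus the 'at ...' lines after it."""
    block = []
    for line in lines:
        # Assumption: any line that does not start with "at " opens a new block.
        if block and not line.lstrip().startswith("at "):
            yield "".join(block)
            block = []
        block.append(line)
    if block:
        yield "".join(block)


def unique_blocks(blocks):
    """Yield each block only the first time its hash is seen."""
    seen = {}  # digest -> True, as suggested in the answer; the value is unused
    for block in blocks:
        digest = hashlib.sha1(block.encode("utf-8")).digest()
        if digest not in seen:
            seen[digest] = True
            yield block


sample_lines = [
    "javax.Foo: boom\n",
    "at a.b.C.run(C.java:1)\n",
    "com.Bar: oops\n",
    "at d.e.F.call(F.java:2)\n",
    "javax.Foo: boom\n",
    "at a.b.C.run(C.java:1)\n",
]
deduplicated = list(unique_blocks(split_into_blocks(sample_lines)))
print("".join(deduplicated), end="")
```

Because both helpers are generators, an open file object can be passed in place of sample_lines and the result handed to writelines() on the output file, so only the digests have to stay in memory rather than the whole log.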