Question

I have a log file similar to the following:

/* BUG: axiom too complex: SubClassOf(ObjectOneOf([NamedIndividual(http://www.sem.org/sina/onto/2015/7/TSB-GCL#t_Xi_xi)]),DataHasValue(DataProperty(http://www.code.org/onto/ont.owl#XoX_type),^^(periodic,http://www.mdos.org/1956/21/2-rdf-syntax-ns#PlainLiteral))) */
/* BUG: axiom too complex: SubClassOf(ObjectOneOf([NamedIndividual(http://www.sem.org/sina/onto/2015/7/TSB-GCL#t_Ziz)]),DataHasValue(DataProperty(http://www.co-ode.org/ontologies/ont.owl#YoY_type),^^(latency,http://www.w3.org/1956/01/11-rdf-syntax-ns#PlainLiteral))) */
....

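I want to extract the fields t_Xi_xi, t_Ziz, XoX_type and YoY_type, as well as the value that follows ^^( (in this case latency and periodic).

Note: every X and Y in the file stands for a different string of letters (e.g. X = "sina", Y = "Boom", so t_Xi_xi -> t_Sina_sina), so I think a regular expression would be the better option.

So the final result must look like this: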

t_Xi_xi    XoX_type    periodic
t_Ziz    YoY_type    latency

I tried the regular expression below to extract them, hoping to replace the rest of the file with "" using sed in the shell, but I failed.

([a-zA-Z]_[a-zA-Z]*_[a-zA-Z]*)|(\#[a-zA-Z]*_[a-zA-Z]*)|(\^\([a-zA-Z]*)+

Any help on how to do this in Python (or even in the shell itself) would be much appreciated.

Answer 1

$ awk -F'#|\\^\\^\\(' '{for (i=2; i<NF; i++) printf "%s%s", gensub(/[^[:alnum:]_].*/,"",1,$i), (i<(NF-1) ? OFS : ORS) }' file
t_Xi_xi XoX_type periodic
t_Ziz YoY_type latency

The above uses GNU awk for gensub(); with other awks you would use sub() and a separate printf statement.
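
Since the question also mentions Python, here is a minimal sketch of the same extraction with the `re` module. It assumes every line of interest looks like the two samples: the individual name follows the first `#`, the data property follows the next `#`, and the value comes right after `^^(`. The names `PATTERN`, `extract_fields` and `main` are made up for this illustration, and the output is joined with four spaces to match the layout requested in the question.

```python
import re
import sys

# Assumption: each line has the individual after the first '#', the data
# property after the next '#', and the literal value right after '^^('.
PATTERN = re.compile(r"#(\w+)\).*?#(\w+)\).*?\^\^\((\w+),")


def extract_fields(line):
    """Return (individual, property, value), or None if the line does not match."""
    match = PATTERN.search(line)
    return match.groups() if match else None


def main(path):
    # Read the log line by line and print one row per matching axiom.
    with open(path, encoding="utf-8") as log:
        for line in log:
            fields = extract_fields(line)
            if fields:  # skip lines such as "...." that carry no axiom
                print("    ".join(fields))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Saved under a name of your choice and run with the log file as its only argument, this should print one row per matching line. Lines that do not fit the assumed shape are silently skipped, so the pattern may need adjusting if the real log has other variants.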
