我的文件看起来像这样
select a,b,c FROM Xtable
select a,b,c FROM Vtable
select a,b,c FROM Atable
select a,b,c FROM Atable
select d,e,f FROM Atable
我想获得一个sortedMap
{
"Atable":["select a,b,c FROM Atable", "select d,e,f FROM Atable"],
"Vtable":["select a,b,c FROM Vtable"],
"Xtable":["select a,b,c FROM Xtable"]
}
sortedMap的键将是tableName,值是列表中的文本行。
我从此开始,但一直坚持对正则表达式匹配的行进行标记化。
import re
f = open('mytext.txt', 'r')
x = f.readlines()
print x
f.close()
for (i in x):
p = re.search(".* FROM ", i)
//now how to tokenize and get the value that follows FROM
答案 0 :(得分:1)
我们很可能不想使用正则表达式执行此任务,但是如果这样做,我们 可以从一个简单的表达式开始,也许类似于:
\"(.+?([a-z]+))\"
我们将其替换为"\2":["\1"],
,然后添加一个{}
。
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"\"(.+?([a-z]+))\""
test_str = ("\"select a,b,c FROM Xtable\"\n"
"\"select a,b,c FROM Vtable\"\n"
"\"select a,b,c FROM Atable\"\n"
"\"select a,b,c FROM Atable\"\n"
"\"select d,e,f FROM Atable\"")
subst = "\"\\2\":[\"\\1\"],"
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE | re.IGNORECASE)
if result:
print (result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
如果不需要此表达式,可以在regex101.com中对其进行修改/更改。
jex.im可视化正则表达式:
答案 1 :(得分:1)
您可以结合使用defaultdict
和正则表达式。假设lines
是您的行列表:
from collections import defaultdict
pattern = "(select .+ from (\S+).*)"
results = defaultdict(list)
for line in lines:
query, table = re.findall(pattern, line.strip(), flags=re.I)[0]
results[table].append(query)
实际上,读取文件的正确方法是:
with open('mytext.txt') as infile:
for line in infile:
query, table = re.findall(pattern, line.strip(), flags=re.I)[0]
results[table].append(query)
结果自然是defaultdict
。如果要将其转换为有序字典,请调用字典构造函数:
from collections import OrderedDict
OrderedDict(sorted(results.items()))
#OrderedDict([('Atable', ['select a,b,c FROM Atable', ...
您可以使pattern
更加健壮,以跟踪逗号,有效标识符等。