Question

I have 1000 SQL select queries in a text file. For example, the following query:

select first_name, title, salary
from employees, salaries
where
    employees.emp_no = salaries.emp_no
    and first_name = 'attila'
    and last_name = '1UB4pqakE3'

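I want to save it in another file in this form:

(first_name, title, salary, employees.emp_no, salaries.emp_no, last_name)

That is, I want to process each query in the original file and keep only its attributes. How can I do this with Python?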

Answer 1

Here is a naive solution to your problem using a regular expression and set/list comprehensions.

First, load the file into a variable:

with open('some_query.sql') as file:
    txt = file.read()

Then tokenize the content, removing text values (tokens starting with ') and SQL keywords:

import re

# Tokenize words
# Raw string so that \w is not treated as an invalid string escape
regex = re.compile(r"([\w._']+)")
tokens = regex.findall(txt)

# Set Comprehension removing text fields:
tokens = { token for token in tokens if not(token.startswith("'")) }

# SQL Keywords:
keywords = {'select', 'from', 'where', 'limit', 'and', 'or', 'not'}

# Identifiers:
identifiers = tokens - keywords

For your sample query, it returns:

{'employees',
 'employees.emp_no',
 'first_name',
 'last_name',
 'salaries',
 'salaries.emp_no',
 'salary',
 'title'}

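This is the set of identifiers without duplicates. It is displayed sorted here, but a set itself has no guaranteed order.

If the order of appearance really matters and duplicates are allowed (as in your output), just change the code above as follows, at a slight cost in efficiency: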

# List comprehension removing text fields (keeps order and duplicates):
tokens = [token for token in tokens if not token.startswith("'")]

# SQL Keywords
keywords = {'select', 'from', 'where', 'limit', 'and', 'or', 'not'}

# Identifiers:
identifiers = tuple(token for token in tokens if token not in keywords)

This results in:

('first_name',
 'title',
 'salary',
 'employees',
 'salaries',
 'employees.emp_no',
 'salaries.emp_no',
 'first_name',
 'last_name')

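Finally, write the identifiers to another file:

with open('some_query.key', 'w') as file:
    # Write the filtered identifiers, not the raw tokens (which still contain keywords)
    file.write("\n".join(identifiers))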

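At this point, it is just a matter of wrapping this code in a function and parameterizing the file names so that it can be applied to all files.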
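As a rough sketch of that wrapping step, the snippet below puts the tokenizing logic in one function and the file handling in another. The names `extract_identifiers` and `convert_file` are made up for illustration, and it assumes the queries in the source file are separated by `;`, which the question does not state. It also drops repeated identifiers while keeping the order of first appearance, since the requested output form lists each one once.

```python
import re

TOKEN_RE = re.compile(r"[\w.']+")
KEYWORDS = {'select', 'from', 'where', 'limit', 'and', 'or', 'not'}


def extract_identifiers(sql):
    """Return the identifiers of one query, in order of first appearance."""
    tokens = TOKEN_RE.findall(sql)
    kept = (
        token for token in tokens
        if not token.startswith("'") and token.lower() not in KEYWORDS
    )
    # dict.fromkeys removes duplicates and preserves insertion order (Python 3.7+)
    return tuple(dict.fromkeys(kept))


def convert_file(source_path, target_path):
    """Write one parenthesized identifier list per query.

    Assumption: queries are separated by ';' and no text value contains ';'.
    """
    with open(source_path) as source:
        queries = [q for q in source.read().split(';') if q.strip()]
    with open(target_path, 'w') as target:
        for query in queries:
            target.write('(' + ', '.join(extract_identifiers(query)) + ')\n')
```

Calling `convert_file('some_query.sql', 'some_query.key')` would then produce one line per query. Like the code above it, this stays naive: a quoted text value containing spaces is split into several tokens, and only the first of them starts with '.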

Note: this solution also captures table identifiers.
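If the table identifiers are unwanted, one possible workaround is to skip every token between `from` and the next keyword. This is only a sketch under a strong assumption: the FROM clause is a plain comma-separated table list, as in the sample query, with no joins, aliases, or subqueries. The function name `column_identifiers` is hypothetical.

```python
import re

TOKEN_RE = re.compile(r"[\w.']+")
KEYWORDS = {'select', 'from', 'where', 'limit', 'and', 'or', 'not'}


def column_identifiers(sql):
    """Identifiers in order of first appearance, skipping tables listed after FROM."""
    columns = []
    in_from = False
    for token in TOKEN_RE.findall(sql):
        lowered = token.lower()
        if lowered in KEYWORDS:
            # Assumption: table names appear only between FROM and the next keyword
            in_from = (lowered == 'from')
            continue
        if in_from or token.startswith("'"):
            continue
        if token not in columns:
            columns.append(token)
    return tuple(columns)
```

For the sample query in the question, this keeps the qualified column names such as employees.emp_no but leaves out the bare table names employees and salaries.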
