Question

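I want to find all columns in certain Hive tables that meet a particular criterion. However, the code I wrote runs very slowly, because Spark is not particularly fond of loops: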

matches = {}
for table in table_list:
    matching_cols = [c for c in spark.read.table(table).columns if substring in c]
    if matching_cols:
        matches[table] = matching_cols

I want something like this:

matches = {'table1': ['column1', 'column2'], 'table2': ['column2']}

How can I get the same result more efficiently?

Answer 1

A colleague just figured it out. Here is the revised solution:

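from pyspark.sql.functions import col

matches = {}
for table in table_list:
    # DESCRIBE reads only the table's metadata. Note that rlike treats
    # `substring` as a regular expression, not as a literal string.
    matching_cols = spark.sql("describe {}".format(table)) \
                         .where(col('col_name').rlike(substring)) \
                         .collect()

    if matching_cols:
        matches[table] = [c.col_name for c in matching_cols]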

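The main difference is that in my earlier example Spark appeared to be caching partition information, which is why each loop iteration got progressively slower. Reading the column names from the metadata, rather than from the table itself, sidesteps that problem.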
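Another way to stay on the metadata side is the catalog API. The following is a minimal sketch, not the colleague's solution: it assumes `spark.catalog.listColumns` is acceptable for your tables (how database-qualified table names are handled may depend on the Spark version), and the helper names `find_matching_columns` and `catalog_column_names` are made up for illustration. The lookup is passed in as a function so the matching logic can be tried without a Spark session; it uses a literal `substring in c` test, as in the question.

```python
def find_matching_columns(table_list, substring, list_column_names):
    """Map each table to the column names that contain `substring`.

    `list_column_names` is any callable that takes a table name and returns
    an iterable of column-name strings (a hypothetical hook, so the lookup
    can be swapped out or faked).
    """
    matches = {}
    for table in table_list:
        matching_cols = [c for c in list_column_names(table) if substring in c]
        if matching_cols:
            matches[table] = matching_cols
    return matches


def catalog_column_names(spark):
    """Build a lookup backed by spark.catalog.listColumns (metadata only)."""
    def lookup(table):
        return [c.name for c in spark.catalog.listColumns(table)]
    return lookup


# Stand-in schemas so the sketch runs without a cluster.
fake_schemas = {
    "table1": ["column1", "column2", "id"],
    "table2": ["column2", "name"],
    "table3": ["id"],
}
matches = find_matching_columns(fake_schemas, "column", fake_schemas.__getitem__)
print(matches)  # {'table1': ['column1', 'column2'], 'table2': ['column2']}
```

With a live session, the call would look like `find_matching_columns(table_list, substring, catalog_column_names(spark))`.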

Answer 2

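If the table's fields have comments, the code above runs into trouble with the extra information (the comments). Also note that HBase-linked tables cause problems as well...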

Example:

create TABLE deck_test (
COLOR string COMMENT 'COLOR Address',
SUIT string COMMENT '4 type Suits',
PIP string)
ROW FORMAT DELIMITED FIELDS TERMINATED by '|'
STORED AS TEXTFILE;

describe deck_test;
color                   string                  COLOR Address
suit                    string                  4 type Suits
pip                     string
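To make the concern concrete, here is a small plain-Python sketch using the three rows from the DESCRIBE output above. The function names are hypothetical, and it only illustrates the difference between matching on the column name and matching on every field of a row. It is not Spark code.

```python
# Rows copied from the DESCRIBE output: (col_name, data_type, comment)
describe_rows = [
    ("color", "string", "COLOR Address"),
    ("suit", "string", "4 type Suits"),
    ("pip", "string", None),
]


def match_on_name(rows, substring):
    """Keep a column only if its name contains the substring."""
    return [name for name, _type, _comment in rows if substring in name]


def match_on_any_field(rows, substring):
    """Keep a column if any field (name, type, or comment) contains the substring."""
    return [row[0] for row in rows if any(substring in (f or "") for f in row)]


print(match_on_name(describe_rows, "type"))       # []
print(match_on_any_field(describe_rows, "type"))  # ['suit']
```

Searching for `type` finds nothing among the column names, but it does hit the comment on `suit`, which is the kind of false match that comments can introduce when more than the name is searched.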

To deal with the comment problem, a small change may help...

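from pyspark.sql.functions import col

matches = {}
for table in table_list:
    columns_df = spark.sql("show columns in {}".format(table))
    # The single output column's name may vary by Spark version, so use its position
    name_col = columns_df.columns[0]
    matching_cols = columns_df.where(col(name_col).rlike(substring)).collect()
    if matching_cols:
        matches[table] = [c[0] for c in matching_cols]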

How to efficiently get the columns of many tables in Spark?
