Question

I have a file named "filelist.txt" whose contents are a list of files that I want to read into my Pig script. For example, it could be organized as:

file1.txt
file2.txt
...
filen.txt

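Some solutions try to use a regular expression, but the file names follow no particular pattern, so the only thing we can do is read the file names from filelist.txt.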

Each file contains the actual data I want to read. For example, file1 might contain:

value1
value2
value3

So how should I read the values from all of these files in a Pig script?

Answer 1

You have to write a custom Pig load function (LoadFunc) and override setLocation:

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // Read the file at `location`, which lists all the input file names,
        // and join those names into a comma-separated string.
        String commaSeparatedPaths = readFileList(location, job); // hypothetical helper, not shown
        FileInputFormat.setInputPaths(job, commaSeparatedPaths);
    }

The location is then a comma-separated list of files.
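To make the "convert that into a comma-separated string" step concrete, here is a minimal sketch of just that conversion, written in Python rather than as part of the Java load function. The function name comma_separated_paths is made up for illustration, and the sketch assumes the list file holds one file name per line, as in the question.

```python
import os
import tempfile


def comma_separated_paths(list_file):
    """Read one file name per line and join the non-empty ones with commas."""
    with open(list_file, "r") as infile:
        names = [line.strip() for line in infile]
    # Skip blank lines so the result has no empty path entries.
    return ",".join(name for name in names if name)


# Small demonstration with a throwaway list file.
with tempfile.TemporaryDirectory() as tmp:
    demo_list = os.path.join(tmp, "filelist.txt")
    with open(demo_list, "w") as out:
        out.write("file1.txt\nfile2.txt\n\nfilen.txt\n")
    joined = comma_separated_paths(demo_list)

print(joined)  # file1.txt,file2.txt,filen.txt
```

A string of this shape is what the String overload of FileInputFormat.setInputPaths expects.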

Answer 2

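There is currently no way to do this in pure Pig. The best you can do in pure Pig is to use the built-in globbing. It is fairly flexible, but it doesn't sound like it is enough for your purposes.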

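The other solution I can think of, provided you can get that file in your local environment, is to use some kind of wrapper (I would recommend Python). In that script you can read the file and generate the Pig statements that load the files it lists. Here is how that logic would work: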

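def addLoads(filesToRead, schema, delim='\\t'):
    # Build one LOAD statement per file name listed in filesToRead.
    newLines = []
    with open(filesToRead, 'r') as infile:
        for n, f in enumerate(infile):
            # strip() removes the trailing newline from each file name
            newLines.append("input{} = LOAD '{}' USING PigStorage('{}') AS {};".format(n, f.strip(), delim, schema))

    # Alias numbers must match the ones generated above (enumerate starts at 0).
    to_union = ['input{}'.format(i) for i in range(len(newLines))]

    # Note: Pig's UNION needs at least two relations.
    newLines.append('loaded_lines = UNION {} ;'.format(', '.join(to_union)))

    return '\n'.join(newLines)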

Prepend this to the Pig script you load from disk, and make sure the rest of the script uses loaded_lines as its starting relation.
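The prepend step could look roughly like the sketch below. Everything in it is an assumption made for illustration: the helper name prepend_loads, the file names analysis.pig and analysis_with_loads.pig, the (value:chararray) schema, and the hard-coded generated text, which stands in for whatever the generator function returns.

```python
import os
import tempfile


def prepend_loads(generated_loads, script_path, out_path):
    """Write the generated LOAD/UNION statements followed by the original script."""
    with open(script_path, "r") as src:
        body = src.read()
    with open(out_path, "w") as dst:
        dst.write(generated_loads.rstrip("\n") + "\n\n" + body)
    return out_path


# Stand-in for the text a generator function would return (assumed schema).
generated = "\n".join([
    "input0 = LOAD 'file1.txt' USING PigStorage('\\t') AS (value:chararray);",
    "input1 = LOAD 'file2.txt' USING PigStorage('\\t') AS (value:chararray);",
    "loaded_lines = UNION input0, input1 ;",
])

with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "analysis.pig")  # hypothetical script name
    with open(script, "w") as f:
        # The rest of the script starts from loaded_lines.
        f.write("grouped = GROUP loaded_lines ALL;\n")
        f.write("counts = FOREACH grouped GENERATE COUNT(loaded_lines);\n")
        f.write("DUMP counts;\n")
    final_path = prepend_loads(generated, script, os.path.join(tmp, "analysis_with_loads.pig"))
    with open(final_path, "r") as f:
        final_script = f.read()

print(final_script)
```

Writing the combined script to a new file leaves the original on disk untouched, so the wrapper can be rerun whenever filelist.txt changes.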

How do I read multiple files listed in a file in Apache Pig?
