I am trying to process an input text file with GCP Dataflow (Python) based on each entry's first character. If an entry's first character is 'A', I want to store it in A.txt, and so on. Similarly, I have a number associated with each character, and I store two dicts for this. Here is my code:
splitHashMap = {'A': 1, 'F': 4, 'J': 4, 'Z': 4, 'G': 10, 'I': 11}
fileHashMap = {'A': 'A.txt', 'B': 'B.txt', 'F': 'F.txt', 'J': 'J.txt',
               'Z': 'Z.txt', 'G': 'G.txt', 'I': 'I.txt'}

def to_table_row(x):
    firstChar = x[0][0]
    global splitHashMap
    global fileHashMap
    print splitHashMap[firstChar]
    x | WriteToText(fileHashMap[firstChar])
    return {firstChar}
The error is related to the WriteToText function, as shown below:
PTransform Create: Refusing to treat string as an iterable. (string=u'AIGLM0012016-02-180000000112016-02-18-12.00.00.123456GB CARMB00132') [while running 'ToTableRows']
Can someone help me fix this?
Edit: the rest of the code, containing the pipeline, is as follows:
parser = argparse.ArgumentParser()
parser.add_argument('--input',
                    dest='input',
                    default='gs://dataflow-samples/shakespeare/kinglear.txt',
                    help='Input file to process.')
parser.add_argument('--output',
                    dest='output',
                    help='Output file to write results to.')
known_args, pipeline_args = parser.parse_known_args(None)
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = True

p = beam.Pipeline(options=pipeline_options)
lines = p | 'read' >> ReadFromText(known_args.input)
lines | 'ToTableRows' >> beam.Map(to_table_row)
result = p.run()
Please help me fix this. The command I use to invoke the Python file is:

python File_parse.py --input temp.txt
Temp.txt looks like this:
Aasadasd asdasd adsad af
Jdsad asdasd asd as
A asdd ad agfsfg sfg
Z afsdfrew320pjpoji
Idadfsd w8480ujfds
The desired output is that every line starting with 'A' goes to A.txt, every line starting with 'B' goes to B.txt, and so on. It would be great if you could include code in your reply.
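For clarity, the intended routing can be sketched in plain Python (no Beam), using the sample lines above. This is only an illustration of the expected output mapping, not the Dataflow solution:

```python
# Plain-Python sketch of the desired routing: group lines by first character.
# The actual pipeline should use Beam transforms; this only shows the mapping.
file_hash_map = {'A': 'A.txt', 'B': 'B.txt', 'F': 'F.txt',
                 'J': 'J.txt', 'Z': 'Z.txt', 'G': 'G.txt', 'I': 'I.txt'}

def route_lines(lines):
    """Return a dict mapping output filename -> list of lines for that file."""
    routed = {}
    for line in lines:
        first_char = line[0]
        if first_char in file_hash_map:
            routed.setdefault(file_hash_map[first_char], []).append(line)
    return routed

sample = [
    'Aasadasd asdasd adsad af',
    'Jdsad asdasd asd as',
    'A asdd ad agfsfg sfg',
    'Z afsdfrew320pjpoji',
    'Idadfsd w8480ujfds',
]
routed = route_lines(sample)
```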
Answer 0 (score: 0)
Your use of WriteToText is not appropriate: you cannot pass a string to a PTransform. Instead, PTransforms consume PCollections. In the code below, a separate output PCollection is created for each possible first character, and each one is passed to its own write transform. What you can do in this case is something like this:
import apache_beam as beam
from apache_beam import pvalue
from apache_beam.io import ReadFromText, WriteToText

file_hash_map = {'A': 'A.txt', 'B': 'B.txt', 'F': 'F.txt',
                 'J': 'J.txt', 'Z': 'Z.txt', 'G': 'G.txt', 'I': 'I.txt'}
existing_chars = list(file_hash_map.keys())

class ToTableRowDoFn(beam.DoFn):
    def process(self, element):
        first_char = element[0]  # element is a line (string); [0] is its first character
        if first_char in file_hash_map:
            yield pvalue.TaggedOutput(first_char, element)
        else:
            # When the first char of the line is not one of the allowed
            # characters, we just send it to the main output.
            yield element

lines = p | 'read' >> ReadFromText(known_args.input)
multiple_outputs = (
    lines
    | 'ToTableRows' >> beam.ParDo(ToTableRowDoFn())
                           .with_outputs(*existing_chars, main='main'))

for pcollection_name in existing_chars:
    char_pcollection = getattr(multiple_outputs, pcollection_name)
    # Each write needs a unique label, otherwise Beam rejects the
    # duplicate 'WriteToText' transform names.
    char_pcollection | 'write_%s' % pcollection_name >> WriteToText(
        file_hash_map[pcollection_name])
The key to this code is the for loop, where we iterate over each of the output PCollections and write their contents to separate files.
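To see what `with_outputs` does conceptually, here is a plain-Python simulation of the tagged dispatch (no Beam required). The bucket names mirror the tags used in the DoFn above, with a 'main' bucket for untagged elements:

```python
# Simulates the tagged-output dispatch performed by the DoFn: each line goes
# to the bucket named after its first character, or to 'main' otherwise.
file_hash_map = {'A': 'A.txt', 'B': 'B.txt', 'F': 'F.txt',
                 'J': 'J.txt', 'Z': 'Z.txt', 'G': 'G.txt', 'I': 'I.txt'}

def dispatch(lines):
    buckets = {tag: [] for tag in file_hash_map}
    buckets['main'] = []
    for line in lines:
        first_char = line[0]
        if first_char in file_hash_map:
            buckets[first_char].append(line)   # tagged output
        else:
            buckets['main'].append(line)       # untagged main output
    return buckets

buckets = dispatch(['Aasadasd asdasd adsad af',
                    'Jdsad asdasd asd as',
                    'Xunknown prefix line'])
```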