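I am trying to learn my way around Cloud Dataflow. As an exercise, I stripped the basic Word Count example down to a simple strip function. I want to create a PCollection of GCS object file names, but I get a message saying that the function ReadFromText() is not iterable.
My understanding of a PCollection is that it is a list of objects to be processed. I could write a loop that handles each object one by one, but that is not what I want. I want to keep that part dynamic and let Apache Beam handle the rest. I only want to supply a list of files in GCS.
So far I have successfully processed single-element PCollections. I also do not want to use a glob such as 'gs://dataflow-samples/shakespeare/*'.
I also looked at the gcsIO module and ReadAllFromText(). With those I am also told that the function is not iterable. Please advise.
Here is what I have done so far: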
"""A word-counting workflow."""
from __future__ import absolute_import

import argparse
import logging
import re

from past.builtins import unicode

import apache_beam as beam
from apache_beam.io import ReadFromText, ReadAllFromText
from apache_beam.io import WriteToText
from apache_beam.metrics import Metrics
from apache_beam.metrics.metric import MetricsFilter
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.io.gcp import gcsio


class WordExtractingDoFn(beam.DoFn):
    """Parse each line of input text into words."""

    def __init__(self):
        super(WordExtractingDoFn, self).__init__()

    def process(self, element):
        text_line = element.strip()
        return text_line


def run(argv=None):
    """Main entry point; defines and runs the wordcount pipeline."""
    p = beam.Pipeline(options=PipelineOptions())

    # Read the text file[pattern] into a PCollection.
    elements = ['gs://dataflow-samples/shakespeare/1kinghenryiv.txt',
                'gs://dataflow-samples/shakespeare/1kinghenryvi.txt',
                'gs://dataflow-samples/shakespeare/2kinghenryiv.txt',
                'gs://dataflow-samples/shakespeare/2kinghenryvi.txt',
                'gs://dataflow-samples/shakespeare/3kinghenryvi.txt',
                'gs://dataflow-samples/shakespeare/allswellthatendswell.txt',
                'gs://dataflow-samples/shakespeare/antonyandcleopatra.txt',
                'gs://dataflow-samples/shakespeare/asyoulikeit.txt',
                'gs://dataflow-samples/shakespeare/comedyoferrors.txt',
                'gs://dataflow-samples/shakespeare/coriolanus.txt',
                'gs://dataflow-samples/shakespeare/cymbeline.txt',
                'gs://dataflow-samples/shakespeare/hamlet.txt',
                'gs://dataflow-samples/shakespeare/juliuscaesar.txt',
                'gs://dataflow-samples/shakespeare/kinghenryv.txt',
                'gs://dataflow-samples/shakespeare/kinghenryviii.txt',
                'gs://dataflow-samples/shakespeare/kingjohn.txt',
                'gs://dataflow-samples/shakespeare/kinglear.txt',
                'gs://dataflow-samples/shakespeare/kingrichardii.txt',
                'gs://dataflow-samples/shakespeare/kingrichardiii.txt',
                'gs://dataflow-samples/shakespeare/loverscomplaint.txt',
                'gs://dataflow-samples/shakespeare/loveslabourslost.txt',
                'gs://dataflow-samples/shakespeare/macbeth.txt',
                'gs://dataflow-samples/shakespeare/measureforemeasure.txt',
                'gs://dataflow-samples/shakespeare/merchantofvenice.txt',
                'gs://dataflow-samples/shakespeare/merrywivesofwindsor.txt',
                'gs://dataflow-samples/shakespeare/midsummersnightsdream.txt',
                'gs://dataflow-samples/shakespeare/muchadoaboutnothing.txt',
                'gs://dataflow-samples/shakespeare/othello.txt',
                'gs://dataflow-samples/shakespeare/periclesprinceoftyre.txt',
                'gs://dataflow-samples/shakespeare/rapeoflucrece.txt',
                'gs://dataflow-samples/shakespeare/romeoandjuliet.txt',
                'gs://dataflow-samples/shakespeare/sonnets.txt',
                'gs://dataflow-samples/shakespeare/tamingoftheshrew.txt',
                'gs://dataflow-samples/shakespeare/tempest.txt',
                'gs://dataflow-samples/shakespeare/timonofathens.txt',
                'gs://dataflow-samples/shakespeare/titusandronicus.txt',
                'gs://dataflow-samples/shakespeare/troilusandcressida.txt',
                'gs://dataflow-samples/shakespeare/twelfthnight.txt',
                'gs://dataflow-samples/shakespeare/twogentlemenofverona.txt',
                'gs://dataflow-samples/shakespeare/various.txt',
                'gs://dataflow-samples/shakespeare/venusandadonis.txt',
                'gs://dataflow-samples/shakespeare/winterstale.txt']

    books = p | beam.Create((elements))
    #print (books)
    lines = p | 'read' >> ReadFromText(books)

    counts = (lines
              | 'split' >> (beam.ParDo(WordExtractingDoFn())
                            .with_output_types(unicode)))

    output = counts | 'write' >> WriteToText('gs://ihopeitworks/Users/see.txt',
                                             shard_name_template='')

    result = p.run()
    result.wait_until_finish()


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
Answer 0 (score: 0)
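You are very close. Do not pass books as an argument to ReadFromText: ReadFromText expects a file pattern string when the pipeline is built, and books is a PCollection. Instead, apply ReadAllFromText to the books PCollection, so that the file names are piped into the read step. Try the following. Hope this helps.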
books = p | beam.Create((elements))
lines = books | 'read' >> ReadAllFromText()