I am trying to use Apache Beam to read multiple files in GCP and apply some subsetting to them. I have prepared two pipelines that work for a single file, but they fail when I try them on multiple files. Beyond that, it would be convenient to merge my pipelines into one, or to have some way to orchestrate them so that they run in order. Right now the pipelines run locally, but the ultimate goal is to run them with Dataflow.
I have tried textio.ReadFromText and textio.ReadAllFromText, but I could not get either of them to work with multiple files.
Both pipelines work fine for a single file, but I have hundreds of files in the same format and would like to take advantage of parallel computation.
Is there a way to make this pipeline work on multiple files under the same directory?
Is it possible to do this within a single pipeline instead of creating two different pipelines? (Copying the files from the bucket onto the worker nodes is not convenient.)
Thanks a lot!
Answer 0 (score: 0)
I figured out how to make it work on multiple files, but could not get it to run within a single pipeline. I use a for loop first, and then the beam.Flatten option.
Here is my solution:
file_list = ["gs://my_bucket/file*.txt.gz"]
res_list = ["/home/subject_test_{}-00000-of-00001.json".format(i) for i in range(len(file_list))]

with beam.Pipeline(options=PipelineOptions()) as p:
    for i, file in enumerate(file_list):
        (p
         | "Read Text {}".format(i) >> beam.io.textio.ReadFromText(file, skip_header_lines=0)
         | "Write Text {}".format(i) >> beam.io.WriteToText("/home/subject_test_{}".format(i),
               file_name_suffix=".json", num_shards=1, append_trailing_newlines=True))

pcols = []
with beam.Pipeline(options=PipelineOptions()) as p:
    for i, res in enumerate(res_list):
        pcol = (p
                | 'read_data_{}'.format(i) >> beam.Create([res])
                | "toJson_{}".format(i) >> beam.Map(toJson)
                | "takeItems_{}".format(i) >> beam.FlatMap(lambda line: line["Items"])
                | "takeSubjects_{}".format(i) >> beam.FlatMap(lambda line: line['data']['subjects']))
        pcols.append(pcol)
    out = (pcols
           | beam.Flatten()
           | beam.combiners.Count.PerElement()
           | beam.io.WriteToText("/home/items",
                 file_name_suffix=".txt", num_shards=1, append_trailing_newlines=True))