Question

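I need to load a text file into an RDD so that I can run tasks on the data it contains. The driver program is written in Scala, and the code executed in each task is available as a native dynamic library accessed through JNI.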

For now, I am creating the RDD this way:

val rddFile : RDD[String] = sc.textFile(path);

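I have the native C code for the tasks, but it uses byte-level operations on real files, i.e. fgetc(). I am trying to emulate the same kind of operations (to minimize code refactoring) while avoiding writing the data fragments to be processed by the native library to disk, which would hurt performance.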

Here is the definition of the native function and how I am calling it:

natFunction(data : Array[String])
rddFile.glom().foreach(elem=>natFunction(elem))

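However, the RDD produced by textFile() contains String objects, which have to be converted to valid C strings on the native side of JNI. I believe the performance impact of applying that conversion to every line of the file could be significant, but still lower than operating on files.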

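I also think a more compatible type would be RDD[Byte], so that I could send byte arrays to the native side, where they can be converted to C strings in a more straightforward way.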

Are these assumptions true? If so, what would be an efficient way to load a text file as an RDD[Byte]?

Any other suggestions for solving this problem are welcome.

Answer 1

You can get an RDD[Byte] from an RDD[String] with rdd.flatMap(s => s.getBytes), but beware: a String may well end up with 2 bytes per character (depending on locale settings, I guess).
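To make the bytes-per-character caveat concrete, here is a minimal sketch in plain Scala (no Spark needed). It relies on the fact that getBytes with no argument uses the platform's default charset, so passing an explicit charset makes the byte layout predictable for the C side. The object and method names are made up for illustration.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object EncodingSizes {
  // Number of bytes the string occupies once encoded with the given charset.
  def byteLength(s: String, cs: Charset): Int = s.getBytes(cs).length

  def main(args: Array[String]): Unit = {
    val accented = "\u00e9" // the single character 'é'
    println(byteLength("a", StandardCharsets.UTF_8))           // 1
    println(byteLength(accented, StandardCharsets.UTF_8))      // 2
    println(byteLength(accented, StandardCharsets.ISO_8859_1)) // 1
  }
}
```

If the native code walks the buffer one byte at a time, as fgetc() does, it should agree with the driver on which charset was used.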

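Also, once you have an RDD[Byte], you will need to call, for instance, mapPartitions, which hands your data to your C code as an Array[Byte]. In that case fairly large arrays are passed to your C code, but the C application is called only once per partition. The other way is to use rdd.map(s => s.getBytes), in which case you get an RDD[Array[Byte]], so the C application runs multiple times per partition.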
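A rough sketch of the once-per-partition variant follows. The helper below is hypothetical and written against a plain Iterator[String] so it can be tried without a cluster; it has the shape that mapPartitions expects. It assumes UTF-8 and re-appends a newline after each line, because textFile drops the line terminators and byte-oriented C code may still expect them.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets.UTF_8

object PartitionBytes {
  // Concatenates every line of one partition into a single byte array,
  // so the native function would be invoked once per partition.
  def concatLines(lines: Iterator[String]): Iterator[Array[Byte]] = {
    val out = new ByteArrayOutputStream()
    lines.foreach { line =>
      out.write(line.getBytes(UTF_8)) // explicit charset, not the platform default
      out.write('\n'.toInt)           // restore the terminator removed by textFile
    }
    Iterator.single(out.toByteArray)
  }
}

// Possible use with Spark, assuming natFunction were changed to accept Array[Byte]:
//   rddFile.mapPartitions(PartitionBytes.concatLines).foreach(natFunction)
// Per-line alternative (many native calls per partition):
//   rddFile.map(_.getBytes(UTF_8)).foreach(natFunction)
```

The per-partition form keeps the number of JNI crossings small at the cost of holding a whole partition in memory as one array.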

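I think you could also try the pipe() API to launch your C code: it simply passes the RDD elements to your C code and gives you back the output of your C application for further processing.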
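As a minimal sketch of the pipe() route: RDD.pipe(command) starts an external process per partition, writes each element to its standard input as one line, and returns the lines the process prints as an RDD[String]. The wrapper object and the executable path are hypothetical; the sketch assumes the C code has been built as a standalone program that reads stdin, and that the binary is present on every worker.

```scala
import org.apache.spark.rdd.RDD

object PipeSketch {
  // Streams each partition through an external executable and
  // collects whatever it writes to stdout, one element per line.
  def runNative(lines: RDD[String], executable: String): RDD[String] =
    lines.pipe(executable)
}

// Possible use (path is an assumption):
//   val results = PipeSketch.runNative(rddFile, "./native_app")
```

With this approach fgetc(stdin) can replace fgetc() on a file handle, so the byte-level logic of the C code may need little change, and no JNI string conversion is involved.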

Spark: text file to RDD[Byte]
