Question

My data has the following format:

"header1","header2","header3",...
"value11","value12","value13",...
"value21","value22","value23",...
....

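What is the best way to parse it in Scalding? I have more than 50 columns in total, but I am only interested in some of them. I tried importing it with Csv("file"), but that doesn't work.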

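The only solution that comes to mind is to parse it manually with TextLine and ignore the line where offset == 0. But I'm sure there must be a better solution.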

Answer 1

Your data set has 88 fields (more than 22), not just 1. Please read:

https://github.com/twitter/scalding/wiki/Frequently-asked-questions#what-if-i-have-more-than-22-fields-in-my-data-set

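To quote the text at the link above:

What if I have more than 22 fields in my data set?

Many of the examples (e.g. in the tutorial/ directory) show the fields argument specified as a Scala Tuple when reading a delimited file. However, Scala Tuples are currently limited to a maximum of 22 elements. To read in a data set with more than 22 fields, you can use a List of Symbols as the field specifier. E.g.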

 val mySchema = List('first, 'last, 'phone, 'age, 'country)
 val input = Csv("/path/to/file.txt", separator = ",", fields = mySchema)
 val output = Tsv("/path/to/out.txt")
 input.read
      .project('age, 'country)
      .write(output)

Another way to specify fields is to use Scala Enumerations, which is available in the develop branch (as of Apr 2, 2013), as demonstrated in Tutorial 6:

object Schema extends Enumeration {
   val first, last, phone, age,country = Value // arbitrary number of fields 
}

import Schema._

Csv("tutorial/data/phones.txt", separator = " ", fields = Schema)  
.read.project(first,age).write(Tsv("tutorial/data/output6.tsv"))

So, when reading the file, supply a schema containing all 88 fields, using either a List or an Enumeration (see the link and quote above).

To skip the header, you can additionally pass skipHeader = true to the Csv constructor.

Csv("tutorial/data/phones.txt", fields = Schema, skipHeader = true)

Answer 2

In the end, I solved it by parsing each line manually:

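def tipPipe = TextLine("tip").read.mapTo('line -> ('field1, 'field5)) {
  line: String =>
    // Split on the "," sequence that separates quoted cells.
    val arr = line.split("\",\"")
    // Strip the leftover quote from the first cell; use "unknown" for short rows.
    (arr(0).replace("\"", ""), if (arr.size >= 88) arr(4) else "unknown")
}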
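
For anyone who wants to try the manual approach outside a job first, here is a minimal plain-Scala sketch of the same idea. It is not part of Scalding: the names QuotedCsv, parseLine, select and parse are hypothetical, and it assumes every cell is wrapped in double quotes and never contains the "," sequence or an escaped quote. Unlike the snippet above, it also drops the header line and treats any missing column as "unknown".

```scala
object QuotedCsv {
  // Split one fully quoted line such as "a","b","c" into its cells.
  // Assumption: no cell contains the "," sequence or an escaped quote.
  def parseLine(line: String): Vector[String] =
    line.trim
      .stripPrefix("\"")
      .stripSuffix("\"")
      .split("\",\"", -1) // limit -1 keeps trailing empty cells
      .toVector

  // Keep only the wanted column indexes; absent columns become `default`.
  def select(cells: Vector[String], wanted: Seq[Int], default: String = "unknown"): Seq[String] =
    wanted.map(i => cells.lift(i).getOrElse(default))

  // Drop the header (first line), then project every remaining row.
  def parse(lines: Seq[String], wanted: Seq[Int]): Seq[Seq[String]] =
    lines.drop(1).map(line => select(parseLine(line), wanted))
}
```

Inside a job, something like parseLine could presumably replace the inline split in the mapTo closure, although the header would still have to be filtered out separately (for example via the offset field mentioned in the question).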

Scalding: parsing comma-separated data with a header
