I have a dataset that looks like this:
https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt
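It is a fixed-width text file.
How can I separate the values in the file so that I can load them into a DataFrame?
What would be a good regular expression to split on?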
var result=data.map(line=>line.split("\\t"))
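I tried using "\t" as the delimiter, but it did not separate the values correctly. I am not sure which delimiter I have to use.
I am trying to split each line into an array of values like this:
arr = [" ACW00011604"," 17.1167"," -61.7833"," 10.1"," ST JOHNS COOLIDGE FLD"]
I am using Spark 1.6, without Databricks.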
+--------------------+---+---+---+
|                   a|  b|  c|  d|e
+--------------------+---+---+---+
|ACW00011604  17.1...|   |   |   |
|ACW00011647  17.1...|   |   |   |
|AE000041196  25.3...|   |   |   |
|AEM00041194  25.2...|   |   |   |
Answer 0 (score: 0)
You need to parse the lines yourself.
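As a rough illustration of why a delimiter or a simple whitespace regex is a poor fit for this file, here is a minimal plain-Scala sketch. It is not part of the original answer. The sample line is rebuilt from the values shown in the question, and its space padding is an assumption; `SplitDemo` and its members are hypothetical names.

```scala
object SplitDemo {
  // Sample line rebuilt from the question's values. The exact space padding
  // is an assumption; the point is that the line contains spaces, not tabs.
  val line: String =
    "ACW00011604" + "  17.1167" + "  -61.7833" + "   10.1" + "    " + "ST JOHNS COOLIDGE FLD"

  // There is no tab character in the line, so splitting on "\t" returns the
  // whole line as a single element.
  val byTab: Array[String] = line.split("\\t")

  // Splitting on runs of whitespace separates the numeric fields, but it also
  // breaks the station name into "ST", "JOHNS", "COOLIDGE", "FLD".
  val byWhitespace: Array[String] = line.split("\\s+")
}
```

Because the station name itself contains spaces, slicing by column position is the more dependable approach for this layout.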
In Scala, I would first create a case class describing the desired output:
case class Measure(val1: Option[String], val2:Option[Double], val3:Option[Double], val4:Option[Double], val5:Option[String], val6:Option[String], val7:Option[Int])
Then do some custom string parsing based on the predetermined column widths:
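import org.apache.spark.sql.Encoders
import scala.util.Try

// `spark` is a SparkSession, so this form needs Spark 2.x or later.
spark
  .read
  .textFile("src/main/resources/fixedcol.txt")
  .map(str => {
    // End offset of each fixed-width column.
    val positions = List(11, 20, 30, 37, 72, 75, 85)
    // Pair consecutive offsets into (from, to) ranges: (0, 11), (11, 20), ...
    val positionsFromTo = (0 :: positions).sliding(2).map(p => (p(0), p(1))).toList
    // substring throws if the line is shorter than `to`; Try turns that into None.
    val subItems = positionsFromTo.map {
      case (from, to) => Try(str.substring(from, to).trim).toOption
    }
    Measure(
      subItems(0),
      subItems(1).map(_.toDouble),
      subItems(2).map(_.toDouble),
      subItems(3).map(_.toDouble),
      subItems(4),
      subItems(5),
      // A blank or non-numeric field becomes None instead of throwing.
      subItems(6).flatMap(s => Try(s.toInt).toOption)
    )
  })(Encoders.product[Measure])
  .show(false)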
This results in:
+-----------+-------+--------+------+-------------------+----+-----+
|val1       |val2   |val3    |val4  |val5               |val6|val7 |
+-----------+-------+--------+------+-------------------+----+-----+
|ACW00011604|17.1167|-61.7833|10.1  |null               |null|null |
|ACW00011647|17.1333|-61.7833|19.2  |null               |null|null |
|AE000041196|25.333 |55.517  |34.0  |SHARJAH INTER. AIRP|GSN |41196|
|AEM00041194|25.255 |55.364  |10.4  |DUBAI INTL         |    |41194|
|AEM00041217|24.433 |54.651  |26.8  |ABU DHABI INTL     |    |41217|
|AEM00041218|24.262 |55.609  |264.9 |AL AIN INTL        |    |41218|
|AF000040930|35.317 |69.017  |3366.0|NORTH-SALANG       |GSN |40930|
|AFM00040938|34.21  |62.228  |977.2 |HERAT              |    |40938|
|AFM00040948|34.566 |69.212  |1791.3|KABUL INTL         |    |40948|
|AFM00040990|31.5   |65.85   |1010.0|KANDAHAR AIRPORT   |    |40990|
|AG000060390|36.7167|3.25    |24.0  |ALGER-DAR EL BEIDA |GSN |60390|
|AG000060590|30.5667|2.8667  |397.0 |EL-GOLEA           |GSN |60590|
|AG000060611|28.05  |9.6331  |561.0 |IN-AMENAS          |GSN |60611|
|AG000060680|22.8   |5.4331  |1362.0|TAMANRASSET        |GSN |60680|
|AGE00135039|35.7297|0.65    |50.0  |null               |null|null |
|AGE00147704|36.97  |7.79    |161.0 |null               |null|null |
+-----------+-------+--------+------+-------------------+----+-----+
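The null values in val5 to val7 for some rows most likely come from lines that are shorter than 72 characters: `substring(37, 72)` then throws, and `Try(...).toOption` turns the failure into None. If you would rather keep whatever is present on a short line, one option is to clamp each range to the line length. The following is a minimal plain-Scala sketch under that assumption; `FixedWidth` and `slice` are hypothetical names, not part of the code above, and the sample line's padding is reconstructed to match the offsets.

```scala
object FixedWidth {
  // End offsets copied from the parsing code in this answer.
  val positions: List[Int] = List(11, 20, 30, 37, 72, 75, 85)

  // Hypothetical helper: slices `line` at the given end offsets, clamping each
  // range to the line length so that short lines still yield their leading
  // fields. Fields that are blank after trimming become None.
  def slice(line: String, ends: List[Int]): List[Option[String]] =
    (0 :: ends).zip(ends).map { case (from, to) =>
      val start = math.min(from, line.length)
      val end   = math.min(to, line.length)
      Some(line.substring(start, end).trim).filter(_.nonEmpty)
    }

  // Sample line rebuilt from the question's values; the space padding is an
  // assumption chosen to match the offsets above. It is 62 characters long,
  // so it ends before offset 72.
  val sample: String =
    "ACW00011604" + "  17.1167" + "  -61.7833" + "   10.1" + "    " + "ST JOHNS COOLIDGE FLD"

  val fields: List[Option[String]] = slice(sample, positions)
}
```

With this helper, `FixedWidth.fields(4)` holds the station name for the short sample line instead of None. Note one difference in behavior from the code above: blank fields come back as None rather than as an empty string.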