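How do I parse data from a file whose values are separated by an arbitrary number of spaces?

Asked: 2018-05-30 19:40:46

Tags: python regex scala apache-spark pyspark

I have a dataset that looks like this: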

https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt

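It is a fixed-width text file.

How can I separate the values in the file so that I can load them into a DataFrame?

What would be a good regular expression to split on?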

var result=data.map(line=>line.split("\\t"))

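I tried using "\t" as the delimiter, but it did not separate the values correctly. I am not sure which delimiter I should use.

I am trying to split each line into an array of values like this:

arr = [" ACW00011604"," 17.1167"," -61.7833"," 10.1"," ST JOHNS COOLIDGE FLD"]

I am using Spark 1.6, without Databricks.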

Data Frame Format:

+--------------------+---+---+---+---+
|  a                 | b | c | d | e |
+--------------------+---+---+---+---+
| ACW00011604 17.1...|   |   |   |   |
| ACW00011647 17.1...|   |   |   |   |
| AE000041196 25.3...|   |   |   |   |
| AEM00041194 25.2...|   |   |   |   |

1 answer:

Answer 0 (score: 0)

You need to parse the lines yourself.
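To illustrate why a plain delimiter regex is not enough here, the sketch below splits a sample line on runs of whitespace. It is plain Scala with no Spark dependency; the object and method names are hypothetical, and the sample line is assembled from the values shown in the question, so its exact spacing is an assumption. Splitting on every run of whitespace also breaks up the station name, while passing a limit to `split` keeps the rest of the line together as the last element.

```scala
object WhitespaceSplit {
  // Split on every run of whitespace: station names containing spaces get broken up.
  def splitAll(line: String): Array[String] = line.trim.split("\\s+")

  // Split at most four times, so everything after the fourth field stays in one piece.
  def splitFirstFour(line: String): Array[String] = line.trim.split("\\s+", 5)

  def main(args: Array[String]): Unit = {
    // Sample line built from the values in the question; the spacing is assumed.
    val line = "ACW00011604  17.1167  -61.7833   10.1    ST JOHNS COOLIDGE FLD"
    println(splitAll(line).length)             // number of whitespace-separated tokens
    println(splitFirstFour(line).mkString("|")) // first four fields plus the remainder
  }
}
```

The limited split only works as long as none of the leading fields is blank. In fixed-width data a blank field would shift every later token, which is why slicing by column position is the more robust approach.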

In Scala, I would first create a case class describing the desired output:

case class Measure(val1: Option[String], val2:Option[Double], val3:Option[Double], val4:Option[Double], val5:Option[String], val6:Option[String], val7:Option[Int])

Then do some custom string parsing based on the predefined column widths:

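import scala.util.Try
import org.apache.spark.sql.Encoders

// Assumes `spark` is an existing SparkSession; `textFile` requires Spark 2.0 or later.
spark
  .read
  .textFile("src/main/resources/fixedcol.txt")
  .map(str => {
    // End offset of each fixed-width column; each column starts where the previous one ends.
    val positions = List(11, 20, 30, 37, 72, 75, 85)
    val positionsFromTo = (0 :: positions).sliding(2).map(p => (p(0), p(1))).toList
    // A line shorter than a column's end offset makes substring throw, so that field becomes None.
    val subItems = positionsFromTo.map {
      case (from, to) => Try(str.substring(from, to).trim).toOption
    }

    // Numeric conversions are wrapped in Try so that a blank or non-numeric field yields None.
    Measure(
      subItems(0),
      subItems(1).flatMap(s => Try(s.toDouble).toOption),
      subItems(2).flatMap(s => Try(s.toDouble).toOption),
      subItems(3).flatMap(s => Try(s.toDouble).toOption),
      subItems(4),
      subItems(5),
      subItems(6).flatMap(s => Try(s.toInt).toOption)
    )
  })(Encoders.product[Measure])
  .show(false)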

This results in:

+-----------+-------+--------+------+-------------------+----+-----+
|val1       |val2   |val3    |val4  |val5               |val6|val7 |
+-----------+-------+--------+------+-------------------+----+-----+
|ACW00011604|17.1167|-61.7833|10.1  |null               |null|null |
|ACW00011647|17.1333|-61.7833|19.2  |null               |null|null |
|AE000041196|25.333 |55.517  |34.0  |SHARJAH INTER. AIRP|GSN |41196|
|AEM00041194|25.255 |55.364  |10.4  |DUBAI INTL         |    |41194|
|AEM00041217|24.433 |54.651  |26.8  |ABU DHABI INTL     |    |41217|
|AEM00041218|24.262 |55.609  |264.9 |AL AIN INTL        |    |41218|
|AF000040930|35.317 |69.017  |3366.0|NORTH-SALANG       |GSN |40930|
|AFM00040938|34.21  |62.228  |977.2 |HERAT              |    |40938|
|AFM00040948|34.566 |69.212  |1791.3|KABUL INTL         |    |40948|
|AFM00040990|31.5   |65.85   |1010.0|KANDAHAR AIRPORT   |    |40990|
|AG000060390|36.7167|3.25    |24.0  |ALGER-DAR EL BEIDA |GSN |60390|
|AG000060590|30.5667|2.8667  |397.0 |EL-GOLEA           |GSN |60590|
|AG000060611|28.05  |9.6331  |561.0 |IN-AMENAS          |GSN |60611|
|AG000060680|22.8   |5.4331  |1362.0|TAMANRASSET        |GSN |60680|
|AGE00135039|35.7297|0.65    |50.0  |null               |null|null |
|AGE00147704|36.97  |7.79    |161.0 |null               |null|null |
+-----------+-------+--------+------+-------------------+----+-----+
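In the output above, some rows show null for val5 through val7. With the parsing shown, that happens whenever a line ends before a column's end offset, because `substring` throws and `Try` turns the failure into `None`. If keeping a partial field is preferable, one option is to clamp the offsets to the line length. The following is a minimal sketch in plain Scala, not part of the original answer; `FixedWidth`, `sliceFields`, and the sample line are hypothetical, and the offsets are the ones used above.

```scala
object FixedWidth {
  // End offsets of each column, taken from the answer above.
  val positions: List[Int] = List(11, 20, 30, 37, 72, 75, 85)

  // Slice a line at the given end offsets, clamping to the line length so that
  // a short line yields a partial field instead of an exception.
  // Unlike the version above, a blank field becomes None rather than Some("").
  def sliceFields(line: String, ends: List[Int] = positions): List[Option[String]] =
    (0 :: ends).zip(ends).map { case (from, to) =>
      val start = math.min(from, line.length)
      val end = math.min(to, line.length)
      val value = line.substring(start, end).trim
      if (value.isEmpty) None else Some(value)
    }

  def main(args: Array[String]): Unit = {
    // Sample line built from the values in the question; the spacing is assumed.
    val line = "ACW00011604  17.1167  -61.7833   10.1    ST JOHNS COOLIDGE FLD"
    sliceFields(line).foreach(println)
  }
}
```

Because this is plain Scala with no Spark dependency, the same function could also be passed to a `map` over an RDD of lines, which may be useful on Spark 1.6 where `spark.read.textFile` is not available.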