Question

I have a file that contains:

user_name     order_id     M_Status
jOHN          1000         married to Emma

Each "column" is separated from the next one by 5 spaces, but the number of spaces can be different in another file. Because the words under the M_Status column are separated by a single space, splitting on " +" didn't work: M_Status needs to stay one string. What I'm trying to do is count the spaces between the words in the first line, then split all the remaining lines by that number of spaces (5 here, but it could change in another file).

UPDATE:

val delimitersList = List(",", ";", ":", "\\|", "\\t", " ")

def findCommonDelimiter(line: String, sep: Option[String], typeToCheck: String): (List[String], String) = {
  val delimiterMap = scala.collection.mutable.LinkedHashMap[String, Int]()// this needs to be changed to find how many times a delimiter is repeated between two columns
  for (a <- delimitersList)
    delimiterMap += a -> (a + "+").r.findAllIn(line).length

  try {
    val sortedMap = (delimiterMap.toList sortWith ((x, y) => x._2 > y._2)).take(3)
    var splitChar = ""
    val firstDelimiter = sortedMap.head._1.toString
    val firstDelimiterCount = sortedMap.head._2
    val secondDelimiter = sortedMap.drop(1).head._1.toString
    val secondDelimiterCount = sortedMap.drop(1).head._2
    val thirdDelimiter=sortedMap.drop(2).head._1.toString
    val lineSplit=line.split("\\r?\\n")
    if (!firstDelimiter.equalsIgnoreCase(",") &&
       secondDelimiter.equalsIgnoreCase(",") &&
       secondDelimiterCount > 0 &&
       !typeToCheck.equalsIgnoreCase("map")) {//(firstDelimiterCount - commaCount) <= 1 && commaCount > 0) {
      splitChar = ","
    } else if (firstDelimiter.equalsIgnoreCase(" ") || firstDelimiter.equalsIgnoreCase("\\t")) {
      if (lineSplit(0).split(thirdDelimiter, 2).length == 2 &&
         typeToCheck.equalsIgnoreCase("map") &&
         ((secondDelimiter.equalsIgnoreCase(",") &&
         secondDelimiterCount > 0) || (secondDelimiter.equalsIgnoreCase(";") && secondDelimiterCount > 0))) {
        splitChar = thirdDelimiter
      } else if (lineSplit(0).split(secondDelimiter,2).length == 2 && typeToCheck.equalsIgnoreCase("map")) {
        splitChar = secondDelimiter
      } else if (typeToCheck.equalsIgnoreCase("header") && firstDelimiter.equalsIgnoreCase("\\t")) {
        splitChar = "\\t"
      } else if (typeToCheck.equalsIgnoreCase("header") &&
                firstDelimiter.equalsIgnoreCase(" ") &&
                secondDelimiterCount > 0) {
        if ((firstDelimiterCount- secondDelimiterCount >= firstDelimiterCount / 2))
          splitChar = secondDelimiter
      } else {
        if (firstDelimiter.equalsIgnoreCase(" ") &&
           secondDelimiterCount > 0 &&
           (firstDelimiterCount - secondDelimiterCount >= firstDelimiterCount / 2))
          splitChar = secondDelimiter
        else
          splitChar = (sortedMap.maxBy(_._2)._1).toString //.take(1)
      }
    } else
      splitChar = (sortedMap.maxBy(_._2)._1).toString //.take(1)

    if (!splitChar.equalsIgnoreCase("""\|""") && !splitChar.equalsIgnoreCase("\\t")) {
      // println("===>"+splitChar)
      // if(!splitChar.equalsIgnoreCase(""))
      (line.split(splitChar, -1).toList, splitChar)
    } else {
      if (splitChar.equalsIgnoreCase("""\|"""))
        (line.split("\\|", -1).toList, splitChar)
      else
        (line.split("\\t", -1).toList, splitChar)
    }
  } catch {
    case e: Exception => {
      e.printStackTrace()
      (List(line), "")
    }
  }
}

Thanks

Answer 1

You can split on \\s+ (one or more whitespace characters) and pass the limit parameter to cap the number of pieces the split returns, like:

scala> "jOHN     1000     married to Emma".split("\\s+", 3)
res5: Array[String] = Array(jOHN, 1000, married to Emma)
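
If the number of columns isn't known up front, the limit can be taken from the header line. This is only a sketch: it assumes the header names themselves contain no spaces and that only the last column can contain single spaces. The names header, row, columnCount and fields are made up for the example.

```scala
// Sample lines copied from the question.
val header = "user_name     order_id     M_Status"
val row    = "jOHN          1000         married to Emma"

// Header names contain no spaces here, so \\s+ separates exactly the columns.
val columnCount = header.trim.split("\\s+").length

// Cap the split at columnCount pieces; the last piece keeps its inner spaces.
val fields = row.trim.split("\\s+", columnCount).toList
// fields == List("jOHN", "1000", "married to Emma")
```

If a middle column can also contain single spaces, this approach no longer works and the gap width has to be taken into account.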

Answer 2

I've put together some code to get your space count. It's a bit verbose, but it works.

Borrowing @Kevin Wright's split function from here:

// Groups consecutive equal elements, e.g. List(a, a, b) => List(List(a, a), List(b))
def split[T](list: List[T]): List[List[T]] = list match {
  case Nil => Nil
  case h :: _ =>
    val segment = list.takeWhile(_ == h)
    segment :: split(list.drop(segment.length))
}

You can then do:

scala> val line = "JOHN     1000     married to Emma"
line: String = JOHN     1000     married to Emma

scala> val lengthOfSpaces = split(line.toCharArray.toList).
     | filter(x => x.head.equals(' ') && x.size > 1).
     | map(y => y.length).
     | distinct.head
lengthOfSpaces: Int = 5

scala> line.split(" " * lengthOfSpaces)
res39: Array[String] = Array(JOHN, 1000, married to Emma)

This also works if you have additional columns:

scala> val line2 = "jOHN     1000     231 any street     married to Emma"
line2: String = jOHN     1000     231 any street     married to Emma

scala> line2.split(" " * lengthOfSpaces)
res47: Array[String] = Array(jOHN, 1000, 231 any street, married to Emma)

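I've assumed that the gap between columns is one uniform value on every line. So you can't have, say, 5 spaces between user_name and order_id and 4 spaces between order_id and the next column.

Also, if you are going to have the same number of spaces between columns as between words, you should probably normalize your data first. @jan0sch previously suggested using tabs instead of spaces.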
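
The same idea can be sketched with a regex instead of grouping characters: measure the runs of two or more spaces in the header and split every row by that width. The helper gapWidth and the second sample row are hypothetical, and the sketch keeps the assumption that every gap has the same width.

```scala
// Hypothetical helper: width of the narrowest run of two or more spaces, if any.
def gapWidth(line: String): Option[Int] = {
  val runs = " {2,}".r.findAllIn(line).map(_.length).toList
  if (runs.isEmpty) None else Some(runs.min)
}

val header = "user_name     order_id     M_Status"
val rows = List(
  "jOHN     1000     married to Emma",
  "aNNA     1001     single" // made-up second row
)

// Split each row on exactly n spaces; leave rows whole if no gap was found.
val parsed: List[List[String]] = gapWidth(header) match {
  case Some(n) => rows.map(_.split(s" {$n}").toList)
  case None    => rows.map(List(_))
}
```

Returning an Option avoids the exception that distinct.head would throw on a line that has no multi-space gap.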

Answer 3

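I edited the solution to reflect your actual problem. It's a bit overkill, but it will solve it. First we analyze the header line to count the spaces. For this we assume that you know the number of columns. Then all that's left is to build the appropriate split argument.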

@ val h = "user_name     order_id     M_Status"
h: String = "user_name     order_id     M_Status"
@ val c = h.count(_ == ' ') / (3 - 1) // total spaces / number of gaps between the 3 columns
c: Int = 5
@ " jOHN     1000     married to Emma".split(s" {$c}")
res18: Array[String] = Array(" jOHN", "1000", "married to Emma")

It would be better still to count the columns as well...
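
Counting the columns could look roughly like this. It is a sketch that assumes the header names contain no spaces, that there are at least two columns, and that all gaps have the same width. The names columns, spaces, width and fields are illustrative only.

```scala
val header = "user_name     order_id     M_Status"

// Header names contain no spaces here, so \\s+ separates exactly the columns.
val columns = header.trim.split("\\s+").length
val spaces  = header.count(_ == ' ')

// N columns have N - 1 gaps between them (requires columns >= 2).
val width = spaces / (columns - 1)

val fields = "jOHN     1000     married to Emma".split(s" {$width}").toList
// fields == List("jOHN", "1000", "married to Emma")
```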

Answer 4

You can use a regular expression to split the line wherever there is more than 1 space. There's no need to count anything, though.

scala> "jOHN Doe     1000     married to Emma".split("""[\s]{2,}""")
res1: Array[String] = Array(jOHN Doe, 1000, married to Emma)
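
Applied to both the header and a data row, the same pattern can pair each column name with its value. A minimal sketch, assuming no value contains two consecutive spaces and no cell is empty; gap, names, values and record are names invented for the example.

```scala
// Two or more whitespace characters mark a column boundary.
val gap = """\s{2,}"""

// Sample lines copied from the question.
val header = "user_name     order_id     M_Status"
val row    = "jOHN          1000         married to Emma"

val names  = header.trim.split(gap)
val values = row.trim.split(gap)

// Pair each header name with the value in the same position.
val record: Map[String, String] = names.zip(values).toMap
```

Because the pattern accepts any gap of two or more spaces, it also copes with rows that are padded to align with the header, as in the question's sample.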

Count spaces between words in a string
