I have a file that contains:
user_name     order_id     M_Status
jOHN     1000     married to Emma
Each "column" is separated from the next one by 5 spaces, but the number of spaces can differ in another file. Because the words under the M_Status column are separated by a single space, splitting by (" +") didn't work: M_Status needs to stay one string. What I'm trying to do is count the spaces between the column names in the first line, then split all the remaining lines by that number of spaces (5 here, but it could change in another file).
UPDATE:
val delimitersList = List(",", ";", ":", "\\|", "\\t", " ")
def findCommonDelimiter(line: String, sep: Option[String], typeToCheck: String): (List[String], String) = {
  val delimiterMap = scala.collection.mutable.LinkedHashMap[String, Int]() // this needs to be changed to find how many times a delimiter is repeated between two columns
  for (a <- delimitersList)
    delimiterMap += a -> (a + "+").r.findAllIn(line).length
  try {
    val sortedMap = (delimiterMap.toList sortWith ((x, y) => x._2 > y._2)).take(3)
    var splitChar = ""
    val firstDelimiter = sortedMap.head._1.toString
    val firstDelimiterCount = sortedMap.head._2
    val secondDelimiter = sortedMap.drop(1).head._1.toString
    val secondDelimiterCount = sortedMap.drop(1).head._2
    val thirdDelimiter = sortedMap.drop(2).head._1.toString
    val lineSplit = line.split("\\r?\\n")
    if (!firstDelimiter.equalsIgnoreCase(",") &&
        secondDelimiter.equalsIgnoreCase(",") &&
        secondDelimiterCount > 0 &&
        !typeToCheck.equalsIgnoreCase("map")) { //(firstDelimiterCount - commaCount) <= 1 && commaCount > 0) {
      splitChar = ","
    } else if (firstDelimiter.equalsIgnoreCase(" ") || firstDelimiter.equalsIgnoreCase("\\t")) {
      if (lineSplit(0).split(thirdDelimiter, 2).length == 2 &&
          typeToCheck.equalsIgnoreCase("map") &&
          ((secondDelimiter.equalsIgnoreCase(",") &&
            secondDelimiterCount > 0) || (secondDelimiter.equalsIgnoreCase(";") && secondDelimiterCount > 0))) {
        splitChar = thirdDelimiter
      } else if (lineSplit(0).split(secondDelimiter, 2).length == 2 && typeToCheck.equalsIgnoreCase("map")) {
        splitChar = secondDelimiter
      } else if (typeToCheck.equalsIgnoreCase("header") && firstDelimiter.equalsIgnoreCase("\\t")) {
        splitChar = "\\t"
      } else if (typeToCheck.equalsIgnoreCase("header") &&
                 firstDelimiter.equalsIgnoreCase(" ") &&
                 secondDelimiterCount > 0) {
        if ((firstDelimiterCount - secondDelimiterCount >= firstDelimiterCount / 2))
          splitChar = secondDelimiter
      } else {
        if (firstDelimiter.equalsIgnoreCase(" ") &&
            secondDelimiterCount > 0 &&
            (firstDelimiterCount - secondDelimiterCount >= firstDelimiterCount / 2))
          splitChar = secondDelimiter
        else
          splitChar = (sortedMap.maxBy(_._2)._1).toString //.take(1)
      }
    } else
      splitChar = (sortedMap.maxBy(_._2)._1).toString //.take(1)
    if (!splitChar.equalsIgnoreCase("""\|""") && !splitChar.equalsIgnoreCase("\\t")) {
      // println("===>"+splitChar)
      // if(!splitChar.equalsIgnoreCase(""))
      (line.split(splitChar, -1).toList, splitChar)
    } else {
      if (splitChar.equalsIgnoreCase("""\|"""))
        (line.split("\\|", -1).toList, splitChar)
      else
        (line.split("\\t", -1).toList, splitChar)
    }
  } catch {
    case e: Exception => {
      e.printStackTrace()
      (List(line), "")
    }
  }
}
Thanks
Answer 0 (score: 1)
You can split on \\s+ to handle runs of multiple spaces, and pass the limit parameter to cap the number of resulting fields, like:
scala> "jOHN 1000 married to Emma".split("\\s+", 3)
res5: Array[String] = Array(jOHN, 1000, married to Emma)
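Rather than hard-coding the limit of 3, the column count could be read from the header line. This is only a rough sketch: it assumes that header names contain no spaces and that the last column is the only one whose values contain spaces. The names `header`, `row`, `columnCount` and `fields` are illustrative and not from the question.

```scala
// Assumption: header names contain no whitespace, so splitting the header
// on runs of whitespace yields exactly one entry per column.
val header = "user_name     order_id     M_Status"
val row    = "jOHN     1000     married to Emma"

val columnCount = header.trim.split("\\s+").length

// The limit caps the number of fields, so the last field keeps its inner spaces.
val fields = row.trim.split("\\s+", columnCount).toList
```

If a column other than the last one can contain spaces, this approach will split it in the wrong place.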
Answer 1 (score: 1)
I've put together some code to work out the number of spaces for you. It's a bit long-winded, but it works.
Borrowing @Kevin Wright's split function from here:
def split[T](list: List[T]): List[List[T]] = list match {
  case Nil => Nil
  case h :: t =>
    // group the leading run of elements equal to the head, then recurse on the rest
    val segment = list takeWhile (_ == h)
    segment :: split(list drop segment.length)
}
Then you can do:
scala> val line = "JOHN     1000     married to Emma"
line: String = JOHN     1000     married to Emma
scala> val lengthOfSpaces = split(line.toCharArray.toList).
| filter(x => x.head.equals(' ') && x.size > 1).
| map(y => y.length).
| distinct.head
lengthOfSpaces: Int = 5
scala> line.split(" " * lengthOfSpaces)
res39: Array[String] = Array(JOHN, 1000, married to Emma)
It also works if you have extra columns:
scala> val line2 = "jOHN     1000     231 any street     married to Emma"
line2: String = jOHN     1000     231 any street     married to Emma
scala> line2.split(" " * lengthOfSpaces)
res47: Array[String] = Array(jOHN, 1000, 231 any street, married to Emma)
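I've assumed that the gap between columns has the same width everywhere in a line. So you can't have 5 spaces between user_name and order_id and 4 spaces between order_id and the next column.
Also, if you are going to use the same number of spaces between columns as between words, you should probably normalize your data first. @jan0sch suggested earlier replacing the spaces with tabs.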
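The uniform-gap assumption can be checked before splitting. The sketch below uses a regular expression in place of the recursive split function; `gapWidth` is a hypothetical helper name, and treating any run of two or more spaces as a column gap is an assumption about the data.

```scala
// Hypothetical helper: returns the gap width only if every run of
// two or more spaces in the line has the same length.
def gapWidth(line: String): Option[Int] = {
  val widths = " {2,}".r.findAllIn(line.trim).map(_.length).toList.distinct
  widths match {
    case w :: Nil => Some(w) // exactly one distinct gap width
    case _        => None    // no gap found, or inconsistent widths
  }
}

val headerLine = "user_name     order_id     M_Status"
val width      = gapWidth(headerLine)

// Split a data row only when the header gave a consistent width.
val dataRow = "jOHN     1000     married to Emma"
val columns = width.map(w => dataRow.trim.split(" " * w).toList)
```

Returning an Option lets the caller decide what to do with a file whose gaps are not uniform, instead of silently splitting it wrongly.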
Answer 2 (score: 0)
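I edited the solution to reflect your actual problem. It's a bit overkill, but it will solve it. First we analyze the header line to count the spaces. For this we assume that you know the number of columns. Then all that's left is to build the appropriate split argument.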
@ val h = "user_name     order_id     M_Status"
h: String = "user_name     order_id     M_Status"
@ val c = h.count(_ == ' ') / (3 - 1) // 3 columns means 2 gaps
c: Int = 5
@ " jOHN     1000     married to Emma".split(s" {$c}")
res18: Array[String] = Array(" jOHN", "1000", "married to Emma")
It would be better to count the columns as well...
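Counting the columns could look roughly like this. It is a sketch under two assumptions: header names contain no spaces, and all gaps in the header have the same width. `headerGap` is a hypothetical name.

```scala
// Hypothetical helper: derives the gap width from the header alone.
// Assumes header names contain no spaces and all gaps are equally wide.
def headerGap(header: String): Option[Int] = {
  val h       = header.trim
  val columns = h.split("\\s+").length
  val spaces  = h.count(_ == ' ')
  // n columns are separated by n - 1 gaps; reject headers where the
  // spaces do not divide evenly among the gaps.
  if (columns > 1 && spaces % (columns - 1) == 0) Some(spaces / (columns - 1))
  else None
}

val gap   = headerGap("user_name     order_id     M_Status")
val cells = gap.map(n => "jOHN     1000     married to Emma".split(s" {$n}").toList)
```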
Answer 3 (score: 0)
You can use a regular expression to split the line wherever there is more than 1 space. There's no need to count.
scala> "jOHN Doe     1000     married to Emma".split("""[\s]{2,}""")
res1: Array[String] = Array(jOHN Doe, 1000, married to Emma)
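Applied to a whole file, the same pattern can pair each value with its header name. This is a minimal sketch: the in-memory `lines` list stands in for the file contents, and it assumes that no value contains two consecutive spaces and that no cell is empty.

```scala
// Stand-in for the file contents; the first line is the header.
val lines = List(
  "user_name     order_id     M_Status",
  "jOHN     1000     married to Emma"
)

// Split every line on runs of two or more whitespace characters.
val rows = lines.map(_.trim.split("""\s{2,}""").toList)

// Pair each data row with the header names.
val headerCols = rows.head
val records    = rows.tail.map(r => headerCols.zip(r).toMap)
```

Here `records.head("M_Status")` is the full string `married to Emma`.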