Question

I want to take a string of options for the spark-submit command and format it with --conf inserted before each option. This

concatConf :: String -> String
concatConf = foldl (\acc c -> acc ++ " --conf " ++ c) "" . words

works for most collections of options, e.g.,

λ => concatConf "spark.yarn.memoryOverhead=3g spark.default.parallelism=1000 spark.yarn.executor.memoryOverhead=2000"
" --conf spark.yarn.memoryOverhead=3g --conf spark.default.parallelism=1000 --conf spark.yarn.executor.memoryOverhead=2000"

But sometimes there is a spark.executor.extraJavaOptions entry, whose value is a space-separated list of additional options wrapped in escaped quotes; e.g.,

"spark.yarn.memoryOverhead=3g spark.executor.extraJavaOptions=\"-verbose:gc -XX:+UseSerialGC -XX:+PrintGCDetails -XX:+PrintAdaptiveSizePolicy\" spark.default.parallelism=1000 spark.yarn.executor.memoryOverhead=2000"

and the concatConf function above obviously falls apart, because words also splits on the spaces inside the quoted value.
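To make the failure concrete, here is a minimal sketch with a shortened input (the name brokenExample is only illustrative). Because words splits inside the quoted value, a stray --conf lands between the JVM flags:

```haskell
-- The original words-based version from above.
concatConf :: String -> String
concatConf = foldl (\acc c -> acc ++ " --conf " ++ c) "" . words

-- Shortened input with a quoted, space-separated value (illustrative only).
brokenExample :: String
brokenExample =
  concatConf "spark.executor.extraJavaOptions=\"-verbose:gc -XX:+UseSerialGC\" spark.default.parallelism=1000"
-- " --conf spark.executor.extraJavaOptions=\"-verbose:gc --conf -XX:+UseSerialGC\" --conf spark.default.parallelism=1000"
```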

The following function, which uses the regex-compat library, works for this example

import Data.Monoid ((<>))
import Text.Regex (mkRegex, matchRegexAll)

concatConf :: String -> String
concatConf conf = let regex = mkRegex "(\\ *.*extraJavaOptions=\\\".*\\\")"
                  in case matchRegexAll regex conf of
                    Just (x, y, z, _) -> (insConf x) <> " --conf " <> y <> (insConf z)
                    Nothing           -> ""
                  where insConf = foldl (\acc c -> acc ++ " --conf " ++ c) "" . words

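until you discover that there is also spark.driver.extraJavaOptions, which has a similar format. In any case, this function does not work when there is no such option at all. Now I am struggling with the many cases: none of them, one of them, or both, and if they are present, which one comes first in the string, and so on.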

This makes me feel that regex is not the right tool for the job, hence my question: what is the right tool for this job?

Answer 1

"This makes me feel that regex is not the right tool for the job, hence my question: what is the right tool for this job?"

The right tool for this job is a monadic parser.

{-# LANGUAGE TypeFamilies #-}

import Text.Megaparsec
import Text.Megaparsec.Char
import Replace.Megaparsec
import Data.Void
import Data.Either

-- | Invert a single-token parser “character class”.
-- | For example, match any single token except a letter or whitespace: 
-- |
-- |     anySingleExcept (letterChar <|> spaceChar)
-- |
anySingleExcept :: (MonadParsec e s m, Token s ~ Char) => m (Token s) -> m (Token s)
anySingleExcept p = notFollowedBy p *> anySingle

nonSpaceQuoted :: Parsec Void String String
nonSpaceQuoted = 
  ((chunk "\\\"") *> manyTill anySingle (chunk "\\\"")) -- match anything between escaped quotes
  <|> -- or
  (pure <$> anySingleExcept spaceChar) -- match anything that's not a space

wordsQuoted :: Parsec Void String String
wordsQuoted = fst <$> match (some nonSpaceQuoted)

input = "spark.yarn.memoryOverhead=3g spark.executor.extraJavaOptions=\\\"-verbose:gc -XX:+UseSerialGC -XX:+PrintGCDetails -XX:+PrintAdaptiveSizePolicy\\\" spark.default.parallelism=1000 spark.yarn.executor.memoryOverhead=2000"

main :: IO ()
main = putStrLn $ unlines $ fmap ("--conf " <>) $ rights $ splitCap wordsQuoted input

For clarity, here is the output, printed with unlines instead of unwords:

--conf spark.yarn.memoryOverhead=3g
--conf spark.executor.extraJavaOptions=\"-verbose:gc -XX:+UseSerialGC -XX:+PrintGCDetails -XX:+PrintAdaptiveSizePolicy\"
--conf spark.default.parallelism=1000
--conf spark.yarn.executor.memoryOverhead=2000
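If adding Megaparsec and replace-megaparsec as dependencies is not an option, the same idea can be sketched with Text.ParserCombinators.ReadP, a monadic parser that ships with base. This is only a sketch under stated assumptions, not the method shown above: the names piece, confWord, greedy, splitConf and concatConfQuoted are hypothetical, a quoted run is assumed to start and end at a double quote (a preceding backslash is treated as an ordinary character, so both "..." and \"...\" values stay in one piece), and an unbalanced quote is assumed to be an error reported as Nothing.

```haskell
import Data.Char (isSpace)
import Text.ParserCombinators.ReadP
  (ReadP, char, eof, munch, readP_to_S, satisfy, skipSpaces, (<++))

-- Greedy, deterministic repetition built on the left-biased choice (<++).
greedy, greedy1 :: ReadP a -> ReadP [a]
greedy p = greedy1 p <++ pure []
greedy1 p = (:) <$> p <*> greedy p

-- One piece of an option: a double-quoted run (spaces allowed inside),
-- or a single character that is neither whitespace nor a double quote.
piece :: ReadP String
piece = quoted <++ plain
  where
    quoted = do
      _ <- char '"'
      body <- munch (/= '"')
      _ <- char '"'
      pure ("\"" ++ body ++ "\"")
    plain = (: []) <$> satisfy (\c -> not (isSpace c) && c /= '"')

-- A whole option such as key=value, possibly containing quoted runs.
confWord :: ReadP String
confWord = concat <$> greedy1 piece

-- Split the input on unquoted whitespace; Nothing if a quote is unbalanced.
splitConf :: String -> Maybe [String]
splitConf s =
  case readP_to_S (skipSpaces *> greedy (confWord <* skipSpaces) <* eof) s of
    [(ws, "")] -> Just ws
    _          -> Nothing

-- Same output shape as the original concatConf: " --conf a --conf b ...".
concatConfQuoted :: String -> Maybe String
concatConfQuoted = fmap (concatMap (" --conf " ++)) . splitConf
```

Because splitting only ever looks at quotes and whitespace, this sketch does not need to know about extraJavaOptions by name, so zero, one, or several quoted options in any order go through the same code path.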
