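Haskell text encoder

Time: 2017-08-02 06:09:07

Tags: haskell encoding

I'm new to Haskell and would like help with a problem. I want a function that encodes a list of texts so that each word is represented by its index. For example, given: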

["The more I like, the more I love.","The more I love, the more I hate."]

the output might be

   (["The", "more", "I", "like", "the", "love.", "love,", "hate."],
   [1, 2, 3, 4, 5, 2, 3, 6, 1, 2, 3, 7, 1, 2, 3, 8])

I have already written the part that removes duplicates:

removeDuplicates :: Eq a => [a] -> [a]
removeDuplicates = rdHelper []
  where rdHelper seen [] = seen
        rdHelper seen (x:xs)
          | x `elem` seen = rdHelper seen xs
          | otherwise = rdHelper (seen ++ [x]) xs

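2 answers:

Answer 0: (score: 1)

You can iterate over the list of words and accumulate the unique words together with their indexes. If a word is already in the accumulated word list, append its index to the accumulated index list. If it is not, append the word to the word list and append a new index (the length of the word list + 1) to the index list.

To be honest, the Haskell code is easier to understand than my description: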

import Data.List (findIndex)

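-- One fold step: the accumulator holds the unique words seen so far
-- and the indexes emitted so far.
build :: ([String], [Int]) -> String -> ([String], [Int])
build (seen, indexes) word =
  case findIndex (== word) seen of
    -- Known word: reuse its 1-based position.
    Just index ->
      (seen, indexes ++ [index + 1])
    -- New word: append it and give it the next index.
    Nothing ->
      (seen ++ [word], indexes ++ [length seen + 1])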

buildIndexes =
  let
    listOfWords = words "The more I like, the more I love. The more I love, the more I hate."
  in
    foldl build ([], []) listOfWords

Here I use a single concatenated string as the input:

"The more I like, the more I love. The more I love, the more I hate."

Feel free to adapt the code to your needs.
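One such adaptation is taking a list of sentences, as in the question, instead of one concatenated string. The following is a minimal sketch under that assumption: it reuses the fold step described above, and `encodeSentences` is a hypothetical name introduced here. The accumulated word list is called `seen` so that it does not shadow Prelude's `words`.

```haskell
import Data.List (findIndex, foldl')

-- One fold step: look the word up among the unique words seen so far.
build :: ([String], [Int]) -> String -> ([String], [Int])
build (seen, indexes) word =
  case findIndex (== word) seen of
    -- Known word: reuse its 1-based position.
    Just i  -> (seen, indexes ++ [i + 1])
    -- New word: append it and give it the next index.
    Nothing -> (seen ++ [word], indexes ++ [length seen + 1])

-- Hypothetical wrapper: split every sentence on whitespace and
-- encode all words into one flat index list.
encodeSentences :: [String] -> ([String], [Int])
encodeSentences = foldl' build ([], []) . concatMap words
```

For the two sentences from the question this gives `(["The","more","I","like,","the","love.","love,","hate."],[1,2,3,4,5,2,3,6,1,2,3,7,5,2,3,8])`. Punctuation stays attached to the words because `words` splits on whitespace only.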

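By the way, it may be more efficient to prepend elements to the lists and reverse the results at the end, since prepending with `:` takes constant time while appending with `++` copies the whole left-hand list.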

import Data.List (findIndex)

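build :: ([String], [Int]) -> String -> ([String], [Int])
build (seen, indexes) word =
  case findIndex (== word) seen of
    -- seen is stored newest-first, so position i from the front
    -- corresponds to the 1-based index (length seen - i).
    Just i ->
      (seen, (length seen - i) : indexes)
    -- New word: prepend it and give it the next index.
    Nothing ->
      (word : seen, (length seen + 1) : indexes)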

buildIndexes =
  let
    listOfWords = words "The more I like, the more I love. The more I love, the more I hate."
    (listOfUniqueWords, listOfIndexes) = foldl build ([], []) listOfWords
  in
    (reverse listOfUniqueWords, reverse listOfIndexes)

Answer 1: (score: 1)

I think the Data.Map and Data.Set modules are ideal tools for solving this problem efficiently. My implementation is as follows:

import qualified Data.Map.Lazy as Map
import qualified Data.Set as Set

encode :: [String] -> ([String],[[Int]])
encode wss = let dict = Map.fromList . zip (Set.toList . Set.unions . map (Set.fromList . words) $ wss) $ [1..]
             in (map fst $ Map.toList dict, map (map (flip (Map.findWithDefault 0) dict) . words) wss)

*Main> encode ["Are you allright", "Hey there how are you", "Hello there", "Do you like coffee"]
(["Are","Do","Hello","Hey","allright","are","coffee","how","like","there","you"],[[1,11,5],[4,10,8,6,11],[3,10],[2,11,9,7]])
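Note that this numbers the words in sorted order and returns one index list per sentence. If first-appearance order and a single flat index list are wanted instead, as in the question's example, a variant could look like the sketch below. It is not part of the answer above: `Acc`, `step` and `encodeInOrder` are hypothetical names, and the map is used only for lookups while a separate list records the order of appearance.

```haskell
import qualified Data.Map.Strict as Map
import Data.List (foldl')

-- Accumulator: dictionary from word to index, plus the unique words
-- and the emitted indexes, both stored newest-first.
type Acc = (Map.Map String Int, [String], [Int])

step :: Acc -> String -> Acc
step (dict, uniq, idxs) w =
  case Map.lookup w dict of
    -- Known word: reuse the index stored in the dictionary.
    Just i  -> (dict, uniq, i : idxs)
    -- New word: its index is the number of words seen so far plus one.
    Nothing ->
      let i = Map.size dict + 1
      in (Map.insert w i dict, w : uniq, i : idxs)

encodeInOrder :: [String] -> ([String], [Int])
encodeInOrder sentences =
  let (_, uniq, idxs) = foldl' step (Map.empty, [], []) (concatMap words sentences)
  in (reverse uniq, reverse idxs)  -- restore first-appearance order
```

For example, `encodeInOrder ["Are you allright", "Hey there how are you"]` yields `(["Are","you","allright","Hey","there","how","are"],[1,2,3,4,5,6,7,2])`.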