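I'm new to Haskell and am trying to solve the following problem. I want a function that encodes a list of texts so that every word in the text is represented by its index. For example, given: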
["The more I like, the more I love.","The more I love, the more I hate."]
the output might be
(["The", "more", "I", "like", "the", "love.", "love,", "hate."],
[1, 2, 3, 4, 5, 2, 3, 6, 1, 2, 3, 7, 1, 2, 3, 8])
I have already finished the part that removes duplicates:
removeDuplicates :: Eq a => [a] -> [a]
removeDuplicates = rdHelper []
    where rdHelper seen [] = seen
          rdHelper seen (x:xs)
              | x `elem` seen = rdHelper seen xs
              | otherwise = rdHelper (seen ++ [x]) xs
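For comparison, `nub` from Data.List does the same job as removeDuplicates: it keeps the first occurrence of each element, in order (and is likewise quadratic in the list length). A minimal sketch, where the name `uniqueWords` is made up for illustration:

```haskell
import Data.List (nub)

-- Hypothetical helper (not from the question): split every text into
-- words and keep only the first occurrence of each word, in order.
uniqueWords :: [String] -> [String]
uniqueWords = nub . concatMap words
```

Note that `words` splits on whitespace only, so punctuation stays attached ("like," and "love." are separate words from "like" and "love").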
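Answer 0 (score: 1)
You can iterate over the list of words and accumulate the unique words together with their indexes. If a word is already in the accumulated word list, append its index to the accumulated index list. If it is not in the list yet, append the word and a new index (the length of the word list + 1).
To be honest, the Haskell code is easier to understand than my description: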
import Data.List (findIndex)

build :: ([String], [Int]) -> String -> ([String], [Int])
build (words, indexes) word =
    let
        maybeIndex = findIndex (== word) words
    in
        case maybeIndex of
            Just index ->
                (words, indexes ++ [index + 1])
            Nothing ->
                (words ++ [word], indexes ++ [(+1) . length $ words])

buildIndexes =
    let
        listOfWords = words "The more I like, the more I love. The more I love, the more I hate."
    in
        foldl build ([], []) listOfWords
Here I use a single concatenated string as input:
"The more I like, the more I love. The more I love, the more I hate."
Feel free to adapt the code to your needs.
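One possible adaptation, since the question passes a list of sentences rather than one string: flatten the list with `concatMap words` before folding. This is only a sketch; `step` and `encodeTexts` are names invented here, and `step` repeats the accumulation logic described above using `elemIndex`.

```haskell
import Data.List (elemIndex, foldl')

-- Accumulation step: reuse the index of a known word, or append the
-- word and give it the next free 1-based index.
step :: ([String], [Int]) -> String -> ([String], [Int])
step (seen, idxs) w =
    case elemIndex w seen of
        Just i  -> (seen, idxs ++ [i + 1])
        Nothing -> (seen ++ [w], idxs ++ [length seen + 1])

-- Hypothetical wrapper: takes the list of sentences from the question.
encodeTexts :: [String] -> ([String], [Int])
encodeTexts = foldl' step ([], []) . concatMap words
```

For the two sentences in the question this yields the eight unique words in order of first appearance, with "the" and "The" counted as different words.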
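By the way, it may be more efficient to insert elements at the head of the lists and then reverse the resulting lists. Because the word list is then stored in reverse, the position found by findIndex has to be counted from the end of the list: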
import Data.List (findIndex)

build :: ([String], [Int]) -> String -> ([String], [Int])
build (words, indexes) word =
    let
        maybeIndex = findIndex (== word) words
    in
        case maybeIndex of
            Just index ->
                -- words is kept in reverse order, so position `index` from
                -- the head corresponds to 1-based index (length words - index)
                (words, (length words - index) : indexes)
            Nothing ->
                (word : words, ((+1) . length $ words) : indexes)

buildIndexes =
    let
        listOfWords = words "The more I like, the more I love. The more I love, the more I hate."
        (listOfUniqueWords, listOfIndexes) = foldl build ([], []) listOfWords
    in
        (reverse listOfUniqueWords, reverse listOfIndexes)
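Answer 1 (score: 1)
I think the Data.Map and Data.Set modules (from the containers package) are ideal tools for solving this problem efficiently. My implementation is as follows: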
import qualified Data.Map.Lazy as Map
import qualified Data.Set as Set

encode :: [String] -> ([String],[[Int]])
encode wss = let dict = Map.fromList . zip (Set.toList . Set.unions . map (Set.fromList . words) $ wss) $ [1..]
             in (map fst $ Map.toList dict, map (map (flip (Map.findWithDefault 0) dict) . words) wss)
*Main> encode ["Are you allright", "Hey there how are you", "Hello there", "Do you like coffee"]
(["Are","Do","Hello","Hey","allright","are","coffee","how","like","there","you"],[[1,11,5],[4,10,8,6,11],[3,10],[2,11,9,7]])
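Note that the result above lists the words in sorted order (Set.toList returns its elements in ascending order) and gives one index list per input string, whereas the question numbers words by first appearance and uses a single flat index list. If that matters, a Map can still be used for the lookups while the order of appearance is tracked separately. A rough sketch, assuming the containers package is available; `encodeInOrder` is a name invented here:

```haskell
import qualified Data.Map.Strict as Map
import Data.List (foldl')

-- Hypothetical variant: numbers words by first appearance and returns
-- one flat index list, as in the question's example.
encodeInOrder :: [String] -> ([String], [Int])
encodeInOrder texts = (reverse revWords, reverse revIdxs)
  where
    (_, revWords, revIdxs) =
        foldl' step (Map.empty, [], []) (concatMap words texts)
    -- dict maps each known word to its 1-based index; the word and index
    -- lists are built in reverse and flipped once at the end.
    step (dict, ws, is) w =
        case Map.lookup w dict of
            Just i  -> (dict, ws, i : is)
            Nothing ->
                let i = Map.size dict + 1
                in (Map.insert w i dict, w : ws, i : is)
```

Each lookup and insert is logarithmic in the number of unique words, so the linear scans of the list-based versions are avoided.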