Haskell: Reading a binary file backwards

Date: 2015-11-17 13:27:09

Tags: file haskell binary reverse

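I want to find the last 32-bit word matching a particular pattern in a binary dump of uInt32 values, using Haskell. I can do the job with last, but the code has to traverse the entire file, so it is inefficient.

Is there a simple way to make readFile work through the file in reverse? I believe that would solve the problem with the fewest changes to my current code.

Here is my current code, for reference. I only started with Haskell this weekend, so I'm sure it is pretty ugly. It looks for the last 32-bit word whose two most significant bits are 0b10.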

import System.Environment(getArgs)
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Internal as BL
import qualified Data.ByteString as BS
import Data.Binary.Get
import Data.Word
import Data.Bits
import Text.Printf(printf)

main = do
  args <- getArgs
  let file = args!!0
  putStrLn $ "Find last 0xCXXXXXXX in " ++ file

  content <- BL.readFile file

  let packets = getPackets content
  putStrLn . show . getValue . last . filterTimes $ packets

-- Data

type Packet = Word32

-- filter where first 2 bits are 10
filterTimes :: [Packet] -> [Packet]
filterTimes = filter ((== 0x2) . tag)

-- get the first 2 bits
tag :: Packet -> Packet
tag rp =
  let tagSize = 2
  in  shiftR rp (finiteBitSize rp - tagSize)

-- remove the tag bits
getValue :: Packet -> Packet
getValue =
  let tagSize = 2
      mask    = complement $ rotateR (2^tagSize - 1) tagSize
  in (.&.) mask

-- Input
-- Based on https://hackage.haskell.org/package/binary/docs/Data-Binary-Get.html

getPacket :: Get Packet
getPacket = do
  packet <- getWord32le
  return $! packet

getPackets :: BL.ByteString -> [Packet]
getPackets input0 = go decoder input0
  where
    decoder = runGetIncremental getPacket
    go :: Decoder Packet -> BL.ByteString -> [Packet]
    go (Done leftover _consumed packet) input =
      packet : go decoder (BL.chunk leftover input)
    go (Partial k) input                     =
      go (k . takeHeadChunk $ input) (dropHeadChunk input)
    go (Fail _leftover _consumed msg) _input =
      []

takeHeadChunk :: BL.ByteString -> Maybe BS.ByteString
takeHeadChunk lbs =
  case lbs of
    (BL.Chunk bs _) -> Just bs
    _ -> Nothing

dropHeadChunk :: BL.ByteString -> BL.ByteString
dropHeadChunk lbs =
  case lbs of
    (BL.Chunk _ lbs') -> lbs'
    _ -> BL.Empty
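
As a small worked example of the bit test described above, the sketch below applies it to a few sample words. The names tagOf, valueOf, samples and taggedValues are made up for illustration and are not part of the code above; the constants 30 and 0x3FFFFFFF assume a 2-bit tag on a 32-bit word.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Word (Word32)

-- Top two bits of a 32-bit word (the "tag").
tagOf :: Word32 -> Word32
tagOf w = w `shiftR` 30

-- Low 30 bits of a 32-bit word (the payload with the tag cleared).
valueOf :: Word32 -> Word32
valueOf w = w .&. 0x3FFFFFFF

-- Hypothetical sample words, chosen only for illustration.
samples :: [Word32]
samples = [0x80000005, 0xC0000001, 0xBFFFFFFF, 0x00000007]

-- Keep only the words tagged 0b10 and strip their tags.
taggedValues :: [Word32]
taggedValues = map valueOf (filter ((== 2) . tagOf) samples)
```

Under these assumptions a word matches when it lies between 0x80000000 and 0xBFFFFFFF, so 0xC0000001 (tag 0b11) is filtered out.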

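3 Answers:

Answer 0 (score: 1)

Some comments on your code:

  1. last throws an exception on an empty list, which happens here when no word matches. You should use lastMay from the safe package, which returns a Maybe.

  2. Since you are treating the file simply as a vector of Word32s, I don't think Data.Binary.Get is worth the overhead and complexity it brings. Just treat the file as a (possibly lazy) ByteString and either look at every 4th byte or split it into 4-byte substrings.

  3. The code below uses ByteStrings and implements the following approaches to the problem: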

    • Read the whole file as a lazy ByteString and produce a (lazy) list of 4-byte substrings. Return the last substring that satisfies the criterion.

      intoWords :: BL.ByteString -> [ BL.ByteString ]
      intoWords bs
        | BL.null a = []
        | otherwise = a : intoWords b
        where (a,b) = BL.splitAt 4 bs
      
      -- find by breaking the file into 4-byte words
      find_C0_v1 :: FilePath -> IO (Maybe BL.ByteString)
      find_C0_v1 path = do
        contents <- BL.readFile path
        return $ lastMay . filter (\bs -> BL.index bs 0 == 0xC0) . intoWords $ contents
      
    • Read the whole file as a lazy ByteString and visit every 4th byte, looking for 0xC0. Return the last occurrence.

      -- find by looking at every 4th byte
      find_C0_v2 :: FilePath -> IO (Maybe BL.ByteString)
      find_C0_v2 path = do
        contents <- BL.readFile path
        size <- fmap fromIntegral $ withFile path ReadMode hFileSize
        let wordAt i = BL.take 4 . BL.drop i $ contents
        return $ fmap wordAt $ lastMay $ filter (\i -> BL.index contents i == 0xC0) [0,4..size-1]
      
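    • Read the file backwards in 64K blocks. Within each block (a strict ByteString), visit every 4th byte starting from the end of the block, looking for 0xC0. Return the first occurrence.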

      -- read a file backwards, block by block, until the check returns a Just value
      loopBlocks :: Int -> Handle -> Integer -> (BS.ByteString -> Integer -> Maybe a) -> IO (Maybe a)
      loopBlocks blksize h top check
        | top <= 0  = return Nothing
        | otherwise = do
              -- clamp at 0 so the block nearest the start of the file may be short
              let offset = max 0 (top - fromIntegral blksize)
              hSeek h AbsoluteSeek offset
              -- read only the bytes between offset and top
              blk <- BS.hGet h (fromIntegral (top - offset))
              case check blk offset of
                Nothing -> loopBlocks blksize h offset check
                x       -> return x
      
      -- find by reading backwards, looking at every 4th byte
      find_C0_v3 :: FilePath -> IO (Maybe Integer)
      find_C0_v3 path =
        withFile path ReadMode $ \h -> do
          size <- hFileSize h
          let top = size - (mod size 4)
              blksize = 64*1024 :: Int
          loopBlocks blksize h top $ \blk offset ->
                -- index from the actual block length, since the last block read may be short
                let len = BS.length blk
                in fmap ( (+offset) . fromIntegral ) $ headMay $ filter (\i -> BS.index blk i == 0xC0) [len-4,len-8..0]
      

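    The third approach is the fastest, even when it ends up reading the entire file. The first approach actually works quite well. I don't recommend the second at all: BL.index walks the chunk list from the start on every call, so its performance degrades sharply as the file size grows.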
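
The snippets above test a single byte of each word against 0xC0. To apply the question's own criterion (top two bits equal to 0b10 in a little-endian word) inside a strict block, something like the following sketch could be used. The names word32leAt, hasTag10 and lastTaggedInBlock are hypothetical, and the little-endian layout is assumed from the question's use of getWord32le.

```haskell
import qualified Data.ByteString as BS
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word32)

-- Decode the little-endian 32-bit word that starts at byte index i.
-- Assumes i + 3 is a valid index into the block.
word32leAt :: BS.ByteString -> Int -> Word32
word32leAt blk i =
      byte 0
  .|. (byte 1 `shiftL` 8)
  .|. (byte 2 `shiftL` 16)
  .|. (byte 3 `shiftL` 24)
  where byte k = fromIntegral (BS.index blk (i + k)) :: Word32

-- True when the two most significant bits of the word are 0b10.
hasTag10 :: Word32 -> Bool
hasTag10 w = w `shiftR` 30 == 2

-- Scan a block from its end and return the last matching word, if any.
-- Trailing bytes that do not fill a whole word are ignored.
lastTaggedInBlock :: BS.ByteString -> Maybe Word32
lastTaggedInBlock blk =
  case filter hasTag10 (map (word32leAt blk) [start, start - 4 .. 0]) of
    (w : _) -> Just w
    []      -> Nothing
  where start = (BS.length blk `div` 4) * 4 - 4
```

Because the list of candidate words is built lazily, the scan stops decoding as soon as the first match from the end is found.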

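Answer 1 (score: 0)

For anyone who may be interested, I have adapted @ErikR's answer. This solution follows his approach 3, but reuses my existing code by stepping through the blocks in reverse order.

This requires a few extra imports: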

import System.IO
import Safe
import Data.Maybe

main becomes:

main = do
  args <- getArgs
  let file = args!!0
  putStrLn $ "Find last 0xCXXXXXXX in " ++ file

  -- forward
  withFile file ReadMode $ \h -> do
    content <- BL.hGetContents h
    let packets = getPackets content
    putStrLn . show . getValue . last . filterTimes $ packets

  -- reverse
  withFile file ReadMode $ \h -> do
    size <- hFileSize h
    let blksize = 64*1024 :: Int
        -- round down to a whole number of 32-bit words so every block stays word-aligned
        top     = fromIntegral (size - size `mod` 4) :: Int
    chunks <- makeReverseChunks blksize h top
    putStrLn . show . getValue . (fromMaybe 0) . headMay . catMaybes . (map $ lastMay . filterTimes . getPackets) $ chunks

Add the helper function:

-- create list of data chunks, backwards in order through the file
makeReverseChunks :: Int -> Handle -> Int -> IO [BL.ByteString]
makeReverseChunks blksize h top
  | top == 0 = return []
  | top < 0  = error "negative file index"
  | otherwise   = do
        let offset = max (top - blksize) 0
        hSeek h AbsoluteSeek (fromIntegral offset)
        -- read only up to top, so the block nearest the start of the file
        -- does not overlap the block read before it
        blk <- BL.hGet h (top - offset)
        rest <- makeReverseChunks blksize h offset
        return $ blk : rest
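
To make the block boundaries concrete, here is a pure sketch of the offset arithmetic only, with no file I/O. The names alignedTop and reverseRanges are hypothetical; the sketch assumes the block size is positive and a multiple of 4, and that reading should stop at the last whole 32-bit word.

```haskell
-- Largest multiple of 4 that does not exceed the file size.
alignedTop :: Int -> Int
alignedTop size = size - size `mod` 4

-- (offset, length) of each block to read, last block first.
-- `top` is the end position (assumed to be a multiple of 4) and `blksize`
-- the block size (assumed positive and a multiple of 4).
reverseRanges :: Int -> Int -> [(Int, Int)]
reverseRanges blksize top
  | top <= 0  = []
  | otherwise = (offset, top - offset) : reverseRanges blksize offset
  where offset = max 0 (top - blksize)
```

For a 22-byte file and 8-byte blocks, reverseRanges 8 (alignedTop 22) gives [(12,8),(4,8),(0,4)]: the block nearest the start of the file is shorter, and no byte is read twice.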

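Answer 2 (score: 0)

Here is a variant of the function makeReverseChunks. As it stands, that function is quite strict: it reads every block before returning. Using lazy bytestrings does not help either if blksize is kept low enough. To read lazily, you have to use unsafeInterleaveIO. Here is a solution using strict bytestrings and lazy IO: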

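import qualified Data.ByteString as SBS
import System.IO.Unsafe (unsafeInterleaveIO)

-- create list of data chunks, backwards in order through the file
makeReverseChunks :: Int -> Handle -> Int -> IO [SBS.ByteString]
makeReverseChunks blksize h top
  | top == 0 = return []
  | top < 0  = error "negative file index"
  | otherwise   = do
    let offset = max (top - blksize) 0
    hSeek h AbsoluteSeek (fromIntegral offset)
    -- read only up to top, so the block nearest the start of the file is not over-read
    blk <- SBS.hGet h (top - offset)
    -- defer reading the earlier blocks until the rest of the list is demanded
    rest <- unsafeInterleaveIO $ makeReverseChunks blksize h offset
    return $ blk : rest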
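
A consumer only benefits from the lazy list if it stops at the first hit. The sketch below shows one such consumer over strict chunks that are already ordered from the end of the file toward its start. The names lastC0 and firstHit are hypothetical, and the per-chunk test (first byte of a 4-byte word equal to 0xC0) is borrowed from find_C0_v3 purely for illustration.

```haskell
import qualified Data.ByteString as BS
import Data.Maybe (listToMaybe, mapMaybe)

-- Index, within one chunk, of the last 4-byte word whose first byte is 0xC0.
-- Trailing bytes that do not fill a whole word are ignored.
lastC0 :: BS.ByteString -> Maybe Int
lastC0 blk = listToMaybe (filter (\i -> BS.index blk i == 0xC0) [start, start - 4 .. 0])
  where start = (BS.length blk `div` 4) * 4 - 4

-- First hit over the chunks, in list order. Only the chunks up to and
-- including the first hit are forced; the tail of the list is never touched.
firstHit :: [BS.ByteString] -> Maybe Int
firstHit = listToMaybe . mapMaybe lastC0
```

With a list built through unsafeInterleaveIO, forcing fewer list cells means fewer hSeek and hGet calls. The handle has to stay open for as long as the list is being consumed, so the search should run inside the withFile callback.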