Question

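I receive a multipart file from a server and need to extract the PDF part from it. I tried removing the first x lines and the last 2 lines with: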

$content=Get-Content $originalfile
$content[0..($content.Length - 3)] | Out-File $outfile

But this corrupts the binary data. What is the right way to get the binary part out of the file?

MIME-Version: 1.0
Content-Type: multipart/related; boundary=MIME_Boundary; 
    start="<6624867311297537120--4d6a31bb.16a77205e4d.3282>"; 
    type="text/xml"

--MIME_Boundary
Content-ID: <6624867311297537120--4d6a31bb.16a77205e4d.3282>
Content-Type: text/xml; charset=utf-8
Content-Transfer-Encoding: 8bit

<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Body xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"/>
--MIME_Boundary
Content-ID: 
Content-Type: application/xml
Content-Disposition: form-data; name="metadata"

<?xml version="1.0" encoding="ISO-8859-1"?>
<metadata><contentLength>64288</contentLength><etag>7e3da21f7ed1b434def94f4b</etag><contentType>application/octet-stream</contentType><properties><property><key>Account</key><value>finance</value></property><property><key>Business Unit</key><value>EU DEBMfg</value></property><property><key>Document Type</key><value>PAYABLES</value></property><property><key>Filename</key><value>test-pdf.pdf</value></property></properties></metadata>
--MIME_Boundary
Content-ID: 
Content-Type: application/octet-stream
Content-Disposition: form-data; name="content"

%PDF-1.6
%âãÏÓ
37 0 obj <</Linearized 1/L 20597/O 40/E 14115/N 1/T 19795/H [ 1005 215]>>
endobj

xref
37 34
0000000016 00000 n
0000001386 00000 n
0000001522 00000 n
0000001787 00000 n
0000002250 00000 n
.
.
.
0000062787 00000 n
0000063242 00000 n
trailer
<<
    /Size 76
    /Prev 116
    /Root 74 0 R
    /Encrypt 38 0 R
    /Info 75 0 R
    /ID [ <C21F21EA44C1E2ED2581435FA5A2DCCE> <3B7296EB948466CB53FB76CC134E3E76> ]
>>
startxref
63926
%%EOF

--MIME_Boundary-

Answer 1

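You need to read the file as a series of bytes and treat it as binary. Then, to locate the PDF part, read the file a second time as a string so that you can run a regular expression against it.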
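To see why a line-based text read is unsafe for this kind of file, here is a minimal sketch that round-trips a few bytes through `Get-Content`/`Set-Content` and compares the result with a raw byte read. The sample bytes and the temp files are made up for illustration, and the exact damage depends on the PowerShell version and its default encoding.

```powershell
# Hypothetical sample: '%', two non-ASCII bytes, a lone LF, then 'A'
$original = [byte[]](0x25, 0xE2, 0xE3, 0x0A, 0x41)

$src = [System.IO.Path]::GetTempFileName()
$dst = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllBytes($src, $original)

# Text round trip: bytes are decoded to characters, split into lines, then re-encoded
Get-Content -Path $src | Set-Content -Path $dst
$viaText = [System.IO.File]::ReadAllBytes($dst)

# Byte round trip: no decoding involved
$viaBytes = [System.IO.File]::ReadAllBytes($src)

$textIntact  = ($viaText  -join ',') -eq ($original -join ',')
$bytesIntact = ($viaBytes -join ',') -eq ($original -join ',')

Remove-Item -Path $src, $dst
```

Here `$bytesIntact` is `$true`, while `$textIntact` is `$false`: at the very least `Set-Content` appends a line terminator after each line, and the non-ASCII bytes may be re-encoded as well.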

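The string must use an encoding that does not alter the bytes in any way. Codepage 28591 (ISO 8859-1) does exactly that: every byte of the original file becomes one character with the same value.

I wrote the following helper function for this: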

function ConvertTo-BinaryString {
    # converts the bytes of a file to a string that has a
    # 1-to-1 mapping back to the file's original bytes. 
    # Useful for performing binary regular expressions.
    Param (
        [Parameter(Mandatory = $True, ValueFromPipeline = $True, Position = 0)]
        [ValidateScript( { Test-Path $_ -PathType Leaf } )]
        [String]$Path
    )

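    # Resolve to a full path: .NET does not follow PowerShell's current location
    $FullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $Stream   = New-Object System.IO.FileStream -ArgumentList $FullPath, 'Open', 'Read'

    # Note: Codepage 28591 (ISO 8859-1) returns a 1-to-1 char to byte mapping
    $Encoding     = [Text.Encoding]::GetEncoding(28591)
    # $false turns off byte-order-mark detection, which could otherwise switch the encoding
    $StreamReader = New-Object System.IO.StreamReader -ArgumentList $Stream, $Encoding, $false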
    $BinaryText   = $StreamReader.ReadToEnd()

    $StreamReader.Close()
    $Stream.Close()

    return $BinaryText
}
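The 1-to-1 mapping claim can be checked without any file. The following is a small sketch, not part of the helper itself: it pushes all 256 byte values through code page 28591 and, for contrast, through UTF-8. The variable names are chosen here for illustration only.

```powershell
$latin1 = [System.Text.Encoding]::GetEncoding(28591)
$utf8   = [System.Text.Encoding]::UTF8

# Every possible byte value, 0x00 through 0xFF
$allBytes = [byte[]](0..255)

# bytes -> string -> bytes with ISO 8859-1
$text        = $latin1.GetString($allBytes)
$latin1Bytes = $latin1.GetBytes($text)

# The same round trip with UTF-8, where bytes 0x80..0xFF are not valid on their own
$utf8Bytes = $utf8.GetBytes($utf8.GetString($allBytes))

$latin1Intact = ($latin1Bytes -join ',') -eq ($allBytes -join ',')
$utf8Intact   = ($utf8Bytes   -join ',') -eq ($allBytes -join ',')
```

With ISO 8859-1 the string has one character per byte, so a character index found by a regex can be used directly as a byte offset. With UTF-8 the round trip does not return the original bytes, which is why that encoding cannot be used for this trick.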

With the function above, you should be able to extract the binary part from the multipart file like this:

$inputFile  = 'D:\blah.txt'
$outputFile = 'D:\blah.pdf'

# read the file as byte array
$fileBytes = [System.IO.File]::ReadAllBytes($inputFile)
# and again as string where every byte has a 1-to-1 mapping to the file's original bytes
$binString = ConvertTo-BinaryString -Path $inputFile

# create your regex, all as ASCII byte characters: '%PDF.*%%EOF[\r\n]{0,2}'
$regex = [Regex]'(?s)(\x25\x50\x44\x46[\x00-\xFF]*\x25\x25\x45\x4F\x46[\x0D\x0A]{0,2})'
$match = $regex.Match($binString)

# use a MemoryStream object to store the result
$stream = New-Object System.IO.MemoryStream
$stream.Write($fileBytes, $match.Index, $match.Length)

# save the binary data of the match as a series of bytes
[System.IO.File]::WriteAllBytes($outputFile, $stream.ToArray())

# clean up
$stream.Dispose()
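To try the technique without the real server response, here is a self-contained sketch that runs the same regex against a tiny payload built in memory. The payload, the variable names, and the use of `[Array]::Copy` instead of a `MemoryStream` are assumptions made for illustration; the pattern is the one from the code above.

```powershell
$latin1 = [System.Text.Encoding]::GetEncoding(28591)

# Hypothetical miniature payload (not the real server response)
$head = $latin1.GetBytes("--MIME_Boundary`r`nContent-Type: application/octet-stream`r`n`r`n")
$pdf  = [byte[]]($latin1.GetBytes("%PDF-1.6`n%") + [byte[]](0xE2, 0xE3, 0xCF, 0xD3) + $latin1.GetBytes("`n%%EOF"))
$tail = $latin1.GetBytes("`r`n--MIME_Boundary--`r`n")
$fileBytes = [byte[]]($head + $pdf + $tail)

# Same bytes as a string with a 1-to-1 char to byte mapping
$binString = $latin1.GetString($fileBytes)

# Same pattern as above: '%PDF', anything, '%%EOF', up to two CR/LF characters
$regex = [Regex]'(?s)(\x25\x50\x44\x46[\x00-\xFF]*\x25\x25\x45\x4F\x46[\x0D\x0A]{0,2})'
$match = $regex.Match($binString)

# Character index and length double as byte offset and byte count
$pdfOut = New-Object byte[] ($match.Length)
[Array]::Copy($fileBytes, $match.Index, $pdfOut, 0, $match.Length)
```

In this sample the match starts right after the part headers and `$pdfOut` is two bytes longer than `$pdf`, because `[\x0D\x0A]{0,2}` also picks up the CR LF that precedes the closing boundary. If the extracted file must match a known length exactly, the trailing bytes may need trimming.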

Regex details:

(                 Match the regular expression below and capture its match into backreference number 1
   \x25           Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
   \x50           Match the ASCII or ANSI character with position 0x50 (80 decimal => P) in the character set
   \x44           Match the ASCII or ANSI character with position 0x44 (68 decimal => D) in the character set
   \x46           Match the ASCII or ANSI character with position 0x46 (70 decimal => F) in the character set
   [\x00-\xFF]    Match a single character in the range between ASCII character 0x00 (0 decimal) and ASCII character 0xFF (255 decimal)
      *           Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   \x25           Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
   \x25           Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
   \x45           Match the ASCII or ANSI character with position 0x45 (69 decimal => E) in the character set
   \x4F           Match the ASCII or ANSI character with position 0x4F (79 decimal => O) in the character set
   \x46           Match the ASCII or ANSI character with position 0x46 (70 decimal => F) in the character set
   [\x0D\x0A]     Match a single character present in the list below
                      ASCII character 0x0D (13 decimal)
                      ASCII character 0x0A (10 decimal)
      {0,2}       Between zero and 2 times, as many times as possible, giving back as needed (greedy)
)
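The greedy quantifier matters because a PDF can contain more than one `%%EOF` marker, for example a linearized file or one with an appended update section. The sketch below uses a made-up string to compare the greedy `*` from the pattern above with a lazy `*?`; only the greedy form runs through to the last marker.

```powershell
# Hypothetical text with two %%EOF markers followed by a closing boundary
$sample = '%PDF-1.6 body %%EOF update %%EOF' + "`r`n--MIME_Boundary--"

# Greedy: extends to the last %%EOF in the input
$greedy = [regex]::Match($sample, '%PDF[\x00-\xFF]*%%EOF')

# Lazy: stops at the first %%EOF and would truncate the document
$lazy = [regex]::Match($sample, '%PDF[\x00-\xFF]*?%%EOF')

$greedy.Value   # %PDF-1.6 body %%EOF update %%EOF
$lazy.Value     # %PDF-1.6 body %%EOF
```

The flip side of the greedy form is that it assumes the file holds only one PDF part; with several PDF parts in one response it would span from the first `%PDF` to the last `%%EOF`.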
