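I'm trying to write an application that reads 30,000+ PDF files, masks any Social Security numbers it finds, and writes the results to an output directory. The process runs without errors, but the output files are corrupted. When I open one in Notepad++, I can see that some characters are invalid; my guess is that it has to do with the encoding of VB.Net strings versus the encoding of the PDF files. (PS: the regular expression came from Notepad++, so I'm not sure the syntax is the same in VB.Net.)
Is there an easier way to do what I'm trying to do, or is there a way to convert the text I'm reading into a format that preserves the unrecognized characters?
Here is a sample of the characters that get garbled:
Correct: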
%âãÏÓ
IDÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿüÿÿðÿÿ€ÿþ
Becomes:
%����
IDï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï ¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½
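    Imports System.IO
    Imports System.Text.RegularExpressions

    Private Sub ProcessFile(ByVal strFilePath As String, ByVal strFileName As String)
        ' Write to an "out" subfolder next to the source file.
        Dim strOutFile As String = Replace(strFilePath, strFileName, "") & "out\" & strFileName

        ' Read the whole PDF as text (StreamReader's default encoding).
        Dim read As New StreamReader(strFilePath)
        Dim contents As String = read.ReadToEnd()
        read.Close()

        ' Mask the first five digits of the SSN (pattern copied from Notepad++).
        contents = Regex.Replace(contents, "(.*)([0-9][0-9][0-9])-([0-9][0-9])-(.*)", "\1XXX-XX-\4")

        ' Write the masked text to the output file.
        Dim objWriter As New StreamWriter(strOutFile, True)
        objWriter.WriteLine(contents)
        objWriter.Close()
    End Sub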
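
For comparison, here is a minimal sketch of one approach that might avoid the corruption, assuming the cause is that StreamReader decodes the file as UTF-8 by default and replaces every invalid byte sequence with U+FFFD (which would match the "ï¿½" runs in the sample). It reads raw bytes and decodes them as ISO-8859-1, which maps each byte to exactly one character and back. The module name SsnMasker and the helper MaskSsn are hypothetical names made up for illustration, and the simplified pattern is my own guess at the intent, not a tested rule.

```vbnet
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module SsnMasker
    ' ISO-8859-1 maps every byte 0-255 to one Char and back again,
    ' so a decode/encode round trip leaves binary data unchanged.
    Private ReadOnly Latin1 As Encoding = Encoding.GetEncoding("ISO-8859-1")

    ' Hypothetical helper: masks the first five digits of NNN-NN-NNNN.
    ' The replacement has the same length as the match, so byte offsets
    ' elsewhere in the file do not shift.
    Public Function MaskSsn(ByVal data As Byte()) As Byte()
        Dim content As String = Latin1.GetString(data)
        ' .NET replacement strings use $1 rather than Notepad++'s \1.
        Dim masked As String = Regex.Replace(content, "[0-9]{3}-[0-9]{2}-([0-9]{4})", "XXX-XX-$1")
        Return Latin1.GetBytes(masked)
    End Function

    ' Reads the input as raw bytes and writes the masked bytes back out.
    ' Assumes the output directory already exists.
    Public Sub ProcessFile(ByVal inPath As String, ByVal outPath As String)
        File.WriteAllBytes(outPath, MaskSsn(File.ReadAllBytes(inPath)))
    End Sub
End Module
```

This would only catch numbers stored as plain, uncompressed text. PDF content streams are often compressed, in which case the digits never appear literally in the file and a PDF library would be needed to get at the text.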