How do I remove junk characters when using C# to read a Word document stored in an "OLE Object" field of an Access database?

Asked: 2012-04-03 11:12:10

Tags: ms-access ms-access-2007

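I am accessing an Ms Access database from C#, and I am able to read all the fields. The problem is that when I read a .txt or .doc file stored in an OLE Object field of the table, many extra junk characters are read before and after the actual text, for example:
ÿÿÿÿ‡€ ÿÿÿÿÿÿÿÿˆ ÿÿÿÿÿÿÿÿ€ ˆˆˆˆˆˆˆˆ€ ÿÿÿÿÿÿÿÿþ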
i 8 @ñÿ 8 N o r m a l CJ _H aJ mH sH tH < A@òÿ¡ <
D e f a u l t P a r a g r a p h F o n t … ÿÿÿÿ ( f p ³ ú ÿ A Ä M • À ' n ­ î 0 q Œ Ï

My C# code looks like this:

/*Read from the query and write in a temporary file*/
var oleBytes = (Byte[])Cmd.ExecuteScalar();
MemoryStream ms = new MemoryStream();
ms.Write(oleBytes, 0, oleBytes.Length - 0);
var file = Path.GetTempFileName();
using (var fileStream = File.OpenWrite(file))
 {
    var buffer = ms.GetBuffer();
    fileStream.Write(buffer, 0, (int)ms.Length);
 }

Then I read this temporary file as a Word document:

Microsoft.Office.Interop.Word.ApplicationClass wordObject = new ApplicationClass();
object fpath = file; //this is the path
object nullobject = System.Reflection.Missing.Value;
Microsoft.Office.Interop.Word.Document docs = wordObject.Documents.Open
(ref fpath, ref nullobject, ref nullobject, ref nullobject,
ref nullobject, ref nullobject, ref nullobject, ref nullobject,
ref nullobject, ref nullobject, ref nullobject, ref nullobject,
ref nullobject, ref nullobject, ref nullobject, ref nullobject);

docs.ActiveWindow.Selection.WholeStory();

docs.ActiveWindow.Selection.Copy();

IDataObject iData = Clipboard.GetDataObject();

if (iData != null)
  data = iData.GetData(DataFormats.Text).ToString();

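I am not sure what is going wrong. Am I also reading the field's metadata from the table? If so, how can I avoid it? What is an effective way to read OLE Object fields that store files other than images?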

2 Answers:

Answer 0: (score: 3)

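I found a solution for Word documents (.doc files). The OLE Object storage in Ms Access places some header information before the actual data, so simply extracting the field contents as a byte array and saving it to disk does not work. Every OLE object file has a standard signature. For Word documents the OLEheaderLength is 85 bytes, so I stripped the first 85 bytes from the byte array (the code skips only the leading header; whatever follows the document is left in place):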

Con.Open(); // Con is an already configured OleDbConnection
string _query = "select licenseDoc from Products where ID=56";
// Column licenseDoc contains Word and text documents as OLE Objects
OleDbCommand Cmd = new OleDbCommand(_query, Con);

const int offset = 85; // length of the OLE header that precedes the .doc data
var oleBytes = (Byte[])Cmd.ExecuteScalar();

// Write everything after the OLE header to a temporary file
var file = Path.GetTempFileName();
using (var fileStream = File.OpenWrite(file))
{
  fileStream.Write(oleBytes, offset, oleBytes.Length - offset);
}

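The variable file will contain the path of a .tmp file holding the data read from the Word document stored as an OLE Object in Ms Access. This file can be opened directly as a Word document, or you can change its extension to .doc.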

The OLEheaderLength values for other formats are as follows:

1] JPEG/JPG=224
2] BMP=78
3] PDF=85
4] SNP=74
5] DOC=85/90
6] DOCX=87

I don't know the OLEheaderLength for .txt (Simple Text) files. Unfortunately, the solution above works only for .doc files; when it comes to .txt files and any other file format, it fails.

To find the length of the OLE header, you can simply use the library that is explained and available for download here: http://jvdveen.blogspot.in/2009/02/ole-and-accessing-files-embedded-in.html
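Since the header lengths listed above vary (DOC is given as 85/90), one alternative to a hard-coded offset is to search the field for the embedded file's own leading signature and keep everything from there on. The following is only a minimal sketch under that assumption: the class and method names (OlePayload, IndexOf, StripHeader) are made up for illustration, and it assumes the signature bytes do not also occur by chance inside the OLE header.

```csharp
using System;

static class OlePayload
{
    // Well-known leading signatures ("magic numbers") of a few file formats.
    public static readonly byte[] DocSignature = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 }; // OLE2 compound file (.doc)
    public static readonly byte[] PdfSignature = { 0x25, 0x50, 0x44, 0x46 };                         // "%PDF"
    public static readonly byte[] ZipSignature = { 0x50, 0x4B, 0x03, 0x04 };                         // "PK\x03\x04" (.docx)

    // Returns the index of the first occurrence of pattern in data, or -1.
    public static int IndexOf(byte[] data, byte[] pattern)
    {
        for (int i = 0; i <= data.Length - pattern.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }

    // Returns a copy of data that starts at the signature; throws if it is absent.
    public static byte[] StripHeader(byte[] data, byte[] signature)
    {
        int start = IndexOf(data, signature);
        if (start < 0)
            throw new InvalidOperationException("Signature not found in OLE field.");
        byte[] payload = new byte[data.Length - start];
        Array.Copy(data, start, payload, 0, payload.Length);
        return payload;
    }

    static void Main()
    {
        // Fake field contents: 85 header bytes, the .doc signature, then 2 data bytes.
        byte[] field = new byte[85 + DocSignature.Length + 2];
        Array.Copy(DocSignature, 0, field, 85, DocSignature.Length);

        Console.WriteLine(IndexOf(field, DocSignature));            // 85
        Console.WriteLine(StripHeader(field, DocSignature).Length); // 10
    }
}
```

With real data, the array returned by StripHeader would take the place of the offset-based write, for example File.WriteAllBytes(file, OlePayload.StripHeader(oleBytes, OlePayload.DocSignature)). Like the offset approach, this removes only the leading header.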

Answer 1: (score: 0)

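I tried opening DOCX (.docx) and PDF files in Notepad++ and found strange but standard BOF (Beginning Of File) and EOF (End Of File) string patterns. I then found a solution for extracting DOCX (.docx) files from an Ms Access DB. For .docx files, the OLEheaderLength is 87 bytes.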

Con.Open();
string _query="select licenseDoc from Products where ID=56";
//Column licenseDoc contains Word documents as OLE Objects
OleDbCommand Cmd = new OleDbCommand(_query, Con);

var oleBytes = (Byte[])Cmd.ExecuteScalar();

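const string START_BLOCK = "PK"; // DOCX files (ZIP archives) start with "PK"
const string END_BLOCK = "PK";   // The last "PK" marker is followed by 20 more bytes; in a ZIP archive without a comment this is the 22-byte end-of-central-directory record, not blank characters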
int startPos = -1;
int endpos = -1;

Encoding ascii = Encoding.ASCII;
string strEncoding = ascii.GetString(oleBytes);
if (strEncoding.IndexOf(START_BLOCK) != -1 && strEncoding.LastIndexOf(END_BLOCK) != -1)
{
     startPos = strEncoding.IndexOf(START_BLOCK);
     endpos = strEncoding.LastIndexOf(END_BLOCK) + END_BLOCK.Length + 20;
}
if (startPos == -1)
{
     throw new Exception("Could not find DOCX Header");
}

byte[] retByte = new byte[endpos - startPos];

Array.Copy(oleBytes , startPos, retByte, 0, endpos - startPos);

MemoryStream ms = new MemoryStream();
ms.Write(retByte, 0, retByte.Length);

var file = Path.GetTempFileName();
using (var fileStream = File.OpenWrite(file))
{
  var buffer = ms.GetBuffer();
  fileStream.Write(buffer, 0, (int)ms.Length);
}

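The variable file will contain the path of a .tmp file holding the data read from the Word document stored as an OLE Object in Ms Access. This file can be opened directly as a Word document, or you can change its extension to .docx.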
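The END_BLOCK search in the code above adds a fixed 20 bytes after the last "PK". In the ZIP format that .docx uses, the last record is the end-of-central-directory record, which begins with "PK\x05\x06" and is 22 bytes long plus an optional comment whose length is stored in its last two bytes. The sketch below works on the raw bytes and reads that comment length instead of assuming 20; the names DocxExtractor, Extract and the helper methods are hypothetical, and it assumes the trailing OLE data does not itself contain the "PK\x05\x06" sequence.

```csharp
using System;

static class DocxExtractor
{
    static readonly byte[] LocalHeader = { 0x50, 0x4B, 0x03, 0x04 }; // "PK\x03\x04": first ZIP entry
    static readonly byte[] Eocd = { 0x50, 0x4B, 0x05, 0x06 };        // "PK\x05\x06": end of central directory
    const int EocdFixedSize = 22;                                    // EOCD size without the comment

    static bool MatchesAt(byte[] data, int pos, byte[] pattern)
    {
        if (pos < 0 || pos + pattern.Length > data.Length) return false;
        for (int j = 0; j < pattern.Length; j++)
            if (data[pos + j] != pattern[j]) return false;
        return true;
    }

    static int IndexOf(byte[] data, byte[] pattern)
    {
        for (int i = 0; i <= data.Length - pattern.Length; i++)
            if (MatchesAt(data, i, pattern)) return i;
        return -1;
    }

    static int LastIndexOf(byte[] data, byte[] pattern)
    {
        for (int i = data.Length - pattern.Length; i >= 0; i--)
            if (MatchesAt(data, i, pattern)) return i;
        return -1;
    }

    // Returns the bytes from the first local file header through the end of the EOCD record.
    public static byte[] Extract(byte[] oleBytes)
    {
        int start = IndexOf(oleBytes, LocalHeader);
        int eocd = LastIndexOf(oleBytes, Eocd);
        if (start < 0 || eocd < start || eocd + EocdFixedSize > oleBytes.Length)
            throw new InvalidOperationException("Could not find a complete ZIP archive.");

        // The comment length is a little-endian 16-bit value at offset 20 of the EOCD record.
        int commentLength = oleBytes[eocd + 20] | (oleBytes[eocd + 21] << 8);
        int end = Math.Min(eocd + EocdFixedSize + commentLength, oleBytes.Length);

        byte[] result = new byte[end - start];
        Array.Copy(oleBytes, start, result, 0, result.Length);
        return result;
    }

    static void Main()
    {
        // Fake field: 87 header bytes, local header, 10 data bytes, EOCD record, 5 trailing bytes.
        byte[] field = new byte[87 + 4 + 10 + EocdFixedSize + 5];
        Array.Copy(LocalHeader, 0, field, 87, 4);
        Array.Copy(Eocd, 0, field, 87 + 4 + 10, 4);
        for (int i = field.Length - 5; i < field.Length; i++) field[i] = 0xFF;

        Console.WriteLine(Extract(field).Length); // 36
    }
}
```

Working on bytes also avoids converting the whole field to an ASCII string just to locate the markers.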

For PDF files, the OLEheaderLength was found to be 85 or 90. I have not tried this approach for PDF, but you could try using:

const string START_BLOCK = "%PDF"; // PDF files start with "%PDF"
const string END_BLOCK = "%EOF";   // PDF files end with "%%EOF", which contains "%EOF"; the extra 20 bytes added for DOCX are ZIP-specific and are unlikely to be needed here

To find the length of the OLE header, you can simply use the library that is explained and available for download here: http://jvdveen.blogspot.in/2009/02/ole-and-accessing-files-embedded-in.html
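The PDF variant suggested above could be sketched as follows. It has not been tested against real Access data, just as the answer's own PDF suggestion is untested. It assumes the embedded PDF begins at the first "%PDF" and ends at the last "%%EOF" plus any end-of-line bytes that follow; PdfExtractor, Find and Extract are illustrative names only.

```csharp
using System;
using System.Text;

static class PdfExtractor
{
    // Returns the first (or, if last is true, the last) index of pattern in data, or -1.
    static int Find(byte[] data, byte[] pattern, bool last)
    {
        int found = -1;
        for (int i = 0; i <= data.Length - pattern.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length)
            {
                if (!last) return i;
                found = i;
            }
        }
        return found;
    }

    public static byte[] Extract(byte[] oleBytes)
    {
        byte[] startMarker = Encoding.ASCII.GetBytes("%PDF");
        byte[] endMarker = Encoding.ASCII.GetBytes("%%EOF");

        int start = Find(oleBytes, startMarker, false);
        int eof = Find(oleBytes, endMarker, true);
        if (start < 0 || eof < start)
            throw new InvalidOperationException("Could not find PDF markers.");

        // Keep the end-of-line bytes that may follow the final %%EOF.
        int end = eof + endMarker.Length;
        while (end < oleBytes.Length && (oleBytes[end] == '\r' || oleBytes[end] == '\n')) end++;

        byte[] result = new byte[end - start];
        Array.Copy(oleBytes, start, result, 0, result.Length);
        return result;
    }

    static void Main()
    {
        // Fake field: 85 header bytes, a tiny PDF-like body, then 3 trailing bytes.
        string body = "%PDF-1.4\nhello\n%%EOF\n";
        byte[] bodyBytes = Encoding.ASCII.GetBytes(body);
        byte[] field = new byte[85 + bodyBytes.Length + 3];
        Array.Copy(bodyBytes, 0, field, 85, bodyBytes.Length);
        for (int i = field.Length - 3; i < field.Length; i++) field[i] = 0xFF;

        Console.WriteLine(Encoding.ASCII.GetString(Extract(field)) == body); // True
    }
}
```

Searching for the last "%%EOF" matters because a PDF that has been updated incrementally can contain more than one such marker.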