Extracting the text content of a web-hosted PDF with C#?

Time: 2017-11-08 21:48:16

Tags: c# pdf asp.net-mvc-5

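In C# (ASP.NET MVC5), I just need to be able to extract the text content from a web-hosted PDF and return it as a string.

I see a lot of (possibly old) examples of how to do this with local files, but none for web-hosted files.

Does anyone have any ideas?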

1 answer:

Answer 0 (score: 1)

The thing about web-hosted files is that you cannot see their contents unless your machine has a copy of that file. Even when you open a PDF file in your browser, it still downloads it to your machine, even if temporarily.

Therefore, a program cannot read a file it does not have.

So, you need to download the file first, for example into your filesystem, and then reference the local copy.

You could use the WebClient class to accomplish this:

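using System.Net;
//...
// WebClient implements IDisposable, so wrap it in a using block.
using (var client = new WebClient())
{
    // Download the remote PDF to a local file.
    client.DownloadFile("http://website.com/mypdf.pdf", @"filepath.pdf");
}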

From there, you can run one of those local-file algorithms on "filepath.pdf", display or return the text, then delete that file.
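The download, extract, delete sequence could be sketched roughly as follows. This is only a sketch under some assumptions: RemotePdfText, WithTempFile and GetText are names made up for illustration, and extractText stands in for whichever local-file PDF routine you pick (it is not implemented here).

```csharp
using System;
using System.IO;
using System.Net;

public static class RemotePdfText
{
    // Runs `work` against a fresh temporary file and deletes that file
    // afterwards, even if `work` throws.
    public static string WithTempFile(Func<string, string> work)
    {
        string tempPath = Path.GetTempFileName();
        try
        {
            return work(tempPath);
        }
        finally
        {
            File.Delete(tempPath);
        }
    }

    // Downloads the PDF at `url` to a temporary file and returns whatever
    // `extractText` produces for that local copy. `extractText` is a
    // placeholder for the PDF library call you choose.
    public static string GetText(string url, Func<string, string> extractText)
    {
        return WithTempFile(path =>
        {
            using (var client = new WebClient())
            {
                client.DownloadFile(url, path);
            }
            return extractText(path);
        });
    }
}
```

A call might look like RemotePdfText.GetText("http://website.com/mypdf.pdf", path => ReadPdfText(path)), where ReadPdfText is a hypothetical wrapper around your chosen PDF library. Putting the delete in a finally block means the temporary copy is removed even when the download or the extraction fails.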

Note: WebClient implements IDisposable. Make sure to dispose of it, either by calling Dispose or by wrapping it in a using statement.

Fair warning: I'm not a security expert, but I would look for ways to ensure the files aren't malicious. Either make sure your PDF reading algorithm accounts for this, or restrict your application to websites you know don't host malware.
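As a rough illustration of that kind of precaution, here is a minimal sketch of two cheap checks: an allowlist of hosts, and a look at the first bytes of the download. The class name, method names and the TrustedHosts list are assumptions made up for this example, and passing both checks does not prove a file is safe; it only filters out URLs and content that are obviously wrong.

```csharp
using System;
using System.Linq;
using System.Text;

public static class PdfDownloadChecks
{
    // Hypothetical allowlist; replace with the hosts you actually trust.
    private static readonly string[] TrustedHosts = { "website.com" };

    // True only for absolute http/https URLs whose host is on the allowlist.
    public static bool IsTrustedUrl(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri)) return false;
        bool isWeb = uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps;
        return isWeb && TrustedHosts.Contains(uri.Host, StringComparer.OrdinalIgnoreCase);
    }

    // True if the data starts with the "%PDF-" header that PDF files
    // normally begin with.
    public static bool LooksLikePdf(byte[] data)
    {
        byte[] magic = Encoding.ASCII.GetBytes("%PDF-");
        return data != null
            && data.Length >= magic.Length
            && data.Take(magic.Length).SequenceEqual(magic);
    }
}
```

You could call IsTrustedUrl before downloading and LooksLikePdf on the downloaded bytes (for example from File.ReadAllBytes) before handing the file to a PDF parser.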