PDF/TIFF document text detection: gcsDestinationBucketName

Time: 2019-05-22 12:39:34

Tags: c# asp.net google-cloud-vision google-cloud-visualstudio

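I am using the Google Cloud Vision API to convert PDF files to text.

I got the initial sample code from there, and with the JSON key I obtained by registering and activating the API, image-to-text conversion works fine.

Here is my code for converting a PDF to text: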

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Google.Cloud.Storage.V1;
using Google.Cloud.Vision.V1;
using Google.Protobuf;

private static object DetectDocument(string gcsSourceUri,
    string gcsDestinationBucketName, string gcsDestinationPrefixName)
{
    var client = ImageAnnotatorClient.Create();

    var asyncRequest = new AsyncAnnotateFileRequest
    {
        InputConfig = new InputConfig
        {
            GcsSource = new GcsSource
            {
                Uri = gcsSourceUri
            },
            // Supported mime_types are: 'application/pdf' and 'image/tiff'
            MimeType = "application/pdf"
        },
        OutputConfig = new OutputConfig
        {
            // How many pages should be grouped into each json output file.
            BatchSize = 2,
            GcsDestination = new GcsDestination
            {
                Uri = $"gs://{gcsDestinationBucketName}/{gcsDestinationPrefixName}"
            }
        }
    };

    asyncRequest.Features.Add(new Feature
    {
        Type = Feature.Types.Type.DocumentTextDetection
    });

    List<AsyncAnnotateFileRequest> requests =
        new List<AsyncAnnotateFileRequest>();
    requests.Add(asyncRequest);

    var operation = client.AsyncBatchAnnotateFiles(requests);

    Console.WriteLine("Waiting for the operation to finish");

    operation.PollUntilCompleted();

    // Once the request has completed and the output has been
    // written to GCS, we can list all the output files.
    var storageClient = StorageClient.Create();

    // List objects with the given prefix.
    var blobList = storageClient.ListObjects(gcsDestinationBucketName,
        gcsDestinationPrefixName);
    Console.WriteLine("Output files:");
    foreach (var blob in blobList)
    {
        Console.WriteLine(blob.Name);
    }

    // Process the first output file from GCS.
    // Select the first JSON file from the objects in the list.
    var output = blobList.Where(x => x.Name.Contains(".json")).First();

    var jsonString = "";
    using (var stream = new MemoryStream())
    {
        storageClient.DownloadObject(output, stream);
        jsonString = System.Text.Encoding.UTF8.GetString(stream.ToArray());
    }

    var response = JsonParser.Default
        .Parse<AnnotateFileResponse>(jsonString);

    // The actual response for the first page of the input file.
    var firstPageResponses = response.Responses[0];
    var annotation = firstPageResponses.FullTextAnnotation;

    // Here we print the full text from the first page.
    // The response contains more information:
    // annotation/pages/blocks/paragraphs/words/symbols
    // including confidence scores and bounding boxes
    Console.WriteLine($"Full text: \n {annotation.Text}");

    return 0;
}

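This function takes 3 parameters: string gcsSourceUri, string gcsDestinationBucketName, and string gcsDestinationPrefixName.

I don't know which values I should set for these 3 parameters. I have never worked with a third-party API before, so I'm a bit confused.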

1 Answer:

Answer 0: (score: 1)

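Suppose you own a GCS bucket named "giri_bucket" and you have put a PDF file, "test.pdf", at the root of that bucket. If you want to write the results of the operation to the same bucket, you can set the arguments to:

  • gcsSourceUri: 'gs://giri_bucket/test.pdf'
  • gcsDestinationBucketName: 'giri_bucket'
  • gcsDestinationPrefixName: 'async_test'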

When the operation finishes, there will be one or more output files in your GCS bucket under giri_bucket/async_test.
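To make the mapping concrete, here is a minimal sketch of how the bucket name and prefix combine into the destination URI. It mirrors the string interpolation used for GcsDestination.Uri in the question's code; the class name GcsArguments and the method BuildDestinationUri are hypothetical names used only for illustration.

```csharp
using System;

// Hypothetical helper (not part of any Google library): shows how the
// bucket name and prefix are joined into the GcsDestination URI.
public static class GcsArguments
{
    public static string BuildDestinationUri(string bucketName, string prefixName)
    {
        if (string.IsNullOrEmpty(bucketName))
        {
            throw new ArgumentException("Bucket name is required.", nameof(bucketName));
        }

        // Same format as the interpolated string in the question's code.
        return $"gs://{bucketName}/{prefixName}";
    }
}
```

With the values above, GcsArguments.BuildDestinationUri("giri_bucket", "async_test") gives gs://giri_bucket/async_test, and the question's function would be called as DetectDocument("gs://giri_bucket/test.pdf", "giri_bucket", "async_test").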

If you like, you can even write the output to a different bucket. You just need to make sure that the combination gcsDestinationBucketName + gcsDestinationPrefixName is unique.
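One possible way to keep that combination unique is to derive a fresh prefix for every run. The sketch below is only an assumption about how you might do it: the class OutputPrefix, the method MakeUnique, and the timestamp-plus-GUID naming scheme are all hypothetical, not something the Vision API requires.

```csharp
using System;

// Hypothetical helper: builds a per-run prefix so that repeated runs
// do not write their output under the same bucket/prefix combination.
public static class OutputPrefix
{
    public static string MakeUnique(string basePrefix)
    {
        // UTC timestamp keeps the prefixes sortable by run time.
        string stamp = DateTime.UtcNow.ToString("yyyyMMdd-HHmmss");

        // A GUID avoids collisions between runs started in the same second.
        string id = Guid.NewGuid().ToString("N");

        return $"{basePrefix}/{stamp}-{id}/";
    }
}
```

You could then pass OutputPrefix.MakeUnique("async_test") as gcsDestinationPrefixName, and use the same value when listing the output objects afterwards.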

You can read more about the request format in the documentation: AsyncAnnotateFileRequest