
It is possible to extract the text contents of PDFs and images using Optical Character Recognition (OCR). The contents can then be included in searches.

...

  • OcrKey: The key from the Computer Vision resource created above (one of the “KEY” entries in the image above).

  • OcrServerUri: The URI from the Computer Vision resource created above (the “Endpoint” entry in the image above).

  • OcrExtractFromPdf: If true, the text contents of new PDF files are extracted when they are uploaded to the DAM.

  • OcrExtractFromImage: If true, the text contents of new images are extracted when they are uploaded to the DAM.

  • OcrLetAzureRequestFiles: If false, files are explicitly uploaded to the Computer Vision client. Otherwise, Azure requests the files from the DAM Center. Setting this to true is expected to be more efficient, but it requires that the DAM Center can be accessed by Azure. Thus, ensure that the DAM Center is not behind a strict firewall if this is set to true.

  • OcrTaskDelayLength: We regularly check the status of ongoing content extractions in the Computer Vision client. This gives the time interval between each check. The larger the time interval is, the fewer requests are made to Azure. However, it then also takes more time for the extracted contents of files to become available in the “Asset content” metafield. You most likely don’t have to change this.
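Taken together, the parameters above might look something like the following. This is a minimal sketch assuming a JSON-style settings file; the actual configuration location, key names, and value formats (in particular for OcrTaskDelayLength) depend on the DAM Center installation, and all values shown are placeholders:

```json
{
  "OcrKey": "<key from the Computer Vision resource>",
  "OcrServerUri": "https://<resource-name>.cognitiveservices.azure.com/",
  "OcrExtractFromPdf": true,
  "OcrExtractFromImage": true,
  "OcrLetAzureRequestFiles": false,
  "OcrTaskDelayLength": "<interval between status checks>"
}
```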

...

The extracted contents of assets can be included in freetext searches by adding the metafield “Asset content” as a freetext input parameter to the search “DigiZuite_System_Framework_Search“. This can be done as follows:

  1. Find “DigiZuite_System_Framework_Search“ in the ConfigManager for the product version to enable this feature for, e.g. the product version of the MM5.

  2. Add a new input parameter.

    1. Locate and choose the metafield group “Content”.

    2. Choose the metafield “Asset content”, and choose the “FreeText” comparison type. Create the input parameter.

  3. Save the modified search and populate the search cache.

...

The contents of the asset types you selected in step 1 should now be included in freetext searches when new assets are uploaded.

Info

The contents of existing assets can be extracted by republishing the assets.

Microsoft Azure Cognitive Service - Information

...

  • The content extraction has some limitations (see https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text). In particular, be aware of the following limitations:

    • Supported file formats: JPEG, PNG, BMP, PDF, and TIFF.

    • For PDF and TIFF files, up to 2000 pages (only first two pages for the free tier) are processed.

    • The file size must be less than 50 MB (4 MB for the free tier), and the image dimensions must be at least 50 x 50 pixels and at most 10000 x 10000 pixels.

    • The PDF dimensions must be at most 17 x 17 inches, corresponding to legal or A3 paper sizes and smaller.

  • If an error occurs, an error message will end up in the CognitiveServices.error queue in RabbitMQ. You can retry the content extraction by moving the failed message to the CognitiveServices queue.

  • There is a cost associated with extracting contents from files. Please see https://azure.microsoft.com/en-us/pricing/details/cognitive-services/computer-vision/ for more details.

  • If the extracted contents are included in freetext searches, they are searched on equal terms with the asset titles. Thus, it might be more difficult to find an asset by title.

  • Repopulating search caches and assets might take significantly more time if content extraction has been enabled.

  • Freetext searches might be slower when asset contents are included. However, we do not expect this to be significant.
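The file-format and size limitations above can be checked before a file is handed to the Computer Vision client. The following is a minimal sketch of such a preflight check; the function name `check_ocr_limits` and its interface are hypothetical, not part of the DAM or the Azure SDK, and the page-count and PDF paper-size limits are omitted since they are not available from the file path alone:

```python
import os

# Limits as documented for the Computer Vision OCR API (paid tier).
SUPPORTED_EXTENSIONS = {".jpeg", ".jpg", ".png", ".bmp", ".pdf", ".tiff", ".tif"}
MAX_FILE_SIZE = 50 * 1024 * 1024  # files must be less than 50 MB
MIN_DIMENSION = 50                # pixels, per side
MAX_DIMENSION = 10000             # pixels, per side

def check_ocr_limits(path, width=None, height=None):
    """Return a list of reasons why the file would be rejected (empty if none)."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or 'none'}")
    # Size can only be checked if the file exists locally.
    if os.path.exists(path) and os.path.getsize(path) >= MAX_FILE_SIZE:
        problems.append("file is 50 MB or larger")
    # Dimensions are only checked when the caller supplies them.
    if width is not None and height is not None:
        if min(width, height) < MIN_DIMENSION:
            problems.append("dimensions below 50 x 50 pixels")
        if max(width, height) > MAX_DIMENSION:
            problems.append("dimensions above 10000 x 10000 pixels")
    return problems
```

Running such a check before upload avoids paying for requests that Azure would reject, and gives a clearer error message than a failed message in the CognitiveServices.error queue.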