...
It is possible to extract the text content of PDFs and images using Optical Character Recognition (OCR). The extracted content can then be included in searches.
This page describes how to enable search in asset content. The setup consists of two steps:
...
Please be aware of the important information at the bottom of this page.
...
Content extraction with Microsoft Azure Cognitive Services
The content extraction of PDFs and images relies on Microsoft Azure Cognitive Services, so a Computer Vision resource in Azure is required.
...
The Computer Vision resource has a key and a server URL, which will be needed shortly. These can be found by navigating to the Computer Vision resource and locating “Keys and Endpoints”.
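To illustrate how the key and the server URL fit together, here is a minimal Python sketch that builds (but does not send) an OCR request. It is not how the Digizuite service is implemented. It assumes the Read API path `/vision/v3.2/read/analyze` and the `Ocp-Apim-Subscription-Key` header of Azure Computer Vision; the function name `build_read_request` and the example URL are hypothetical.

```python
import urllib.request


def build_read_request(endpoint: str, key: str, data: bytes) -> urllib.request.Request:
    """Build a POST request for the Computer Vision Read API.

    endpoint: the server URL from "Keys and Endpoints"
    key:      one of the keys from "Keys and Endpoints"
    data:     raw bytes of the PDF or image to analyze
    """
    # Assumption: Read API v3.2 path; adjust if your resource uses another version.
    url = endpoint.rstrip("/") + "/vision/v3.2/read/analyze"
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/octet-stream",
        },
    )


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    req = build_read_request("https://example.cognitiveservices.azure.com/", "my-key", b"%PDF-")
    print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would submit the file. This can be a quick way to confirm that the key and URL are valid before entering them in the configuration, keeping in mind that each call may be billed.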
...
When the contents of an asset have been extracted with Microsoft Azure Cognitive Services, they are automatically written to the metafield “Asset content”.
Info |
---|
The metafields “Asset content” and “Asset content concurrency token” are predefined and must not be modified manually if asset content is to be made searchable. |
...
is automatically created when installing or upgrading to 5.5. The field is created in metagroup “Content”, and it is very important that this exact field is used in the configuration as the GUID of the metadata field is used as a dependency in the system. |
Including asset content in searches
The extracted contents of assets can be included in freetext searches by adding the metafield “Asset content” to the search “DigiZuite_System_Framework_Search” as a freetext input parameter. To do so:
...
Info |
---|
The contents of existing assets can be extracted by republishing the assets. |
Important Information
Please be aware of the following when using the Computer Vision resource:
The content extraction has some limitations (see https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text). In particular, be aware of the following limitations:
Supported file formats: JPEG, PNG, BMP, PDF, and TIFF.
For PDF and TIFF files, up to 2000 pages (only first two pages for the free tier) are processed.
The file size must be less than 50 MB (4 MB for the free tier), and image dimensions must be at least 50 x 50 pixels and at most 10000 x 10000 pixels.
The PDF dimensions must be at most 17 x 17 inches, corresponding to legal or A3 paper sizes and smaller.
If an error occurs, an error message will end up in the CognitiveServices.error queue in RabbitMQ. You can retry the content extraction by moving the failed message to the CognitiveServices queue. Please be aware of the limitations above before retrying.
There is a cost associated with extracting contents from files. Please see https://azure.microsoft.com/en-us/pricing/details/cognitive-services/computer-vision/ for more details.
If the extracted contents are included in freetext searches, they are searched on equal terms with the asset titles. Thus, it might be more difficult to find an asset by title.
Each time an asset is published or republished, its contents are extracted again.
Repopulating search caches and assets might take significantly more time if content extraction has been enabled.
Freetext searches might be slower when asset contents are included. However, we do not expect this to be noticeable.
The DigizuiteCore service CognitiveService must be able to make HTTPS calls to Azure. If you run into issues, please ensure that outgoing HTTPS requests are not blocked by a firewall.
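The file limitations listed above can be checked before republishing an asset or retrying a failed message. The following Python sketch encodes the format, size, and pixel-dimension limits from this page. All names (`TierLimits`, `check_asset`, `pages_processed`) are hypothetical and not part of Digizuite or Azure, and 1 MB is assumed to mean 1024 x 1024 bytes.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Tuple

SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".pdf", ".tif", ".tiff"}
MB = 1024 * 1024  # assumption: binary megabytes


@dataclass(frozen=True)
class TierLimits:
    max_bytes: int  # file size must be strictly less than this
    max_pages: int  # pages processed for PDF and TIFF files


PAID_TIER = TierLimits(max_bytes=50 * MB, max_pages=2000)
FREE_TIER = TierLimits(max_bytes=4 * MB, max_pages=2)


def check_asset(
    filename: str,
    size_bytes: int,
    dimensions: Optional[Tuple[int, int]] = None,
    limits: TierLimits = PAID_TIER,
) -> List[str]:
    """Return a list of problems; an empty list means no limit is violated."""
    problems = []
    if Path(filename).suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append("unsupported file format")
    if size_bytes >= limits.max_bytes:
        problems.append("file too large")
    if dimensions is not None:
        width, height = dimensions
        if min(width, height) < 50 or max(width, height) > 10000:
            problems.append("pixel dimensions out of range")
    return problems


def pages_processed(page_count: int, limits: TierLimits = PAID_TIER) -> int:
    """Number of pages that would actually be processed for a PDF or TIFF."""
    return min(page_count, limits.max_pages)


if __name__ == "__main__":
    print(check_asset("scan.pdf", 10 * MB))
    print(check_asset("photo.png", 5 * MB, dimensions=(40, 800), limits=FREE_TIER))
```

The PDF page-size limit (17 x 17 inches) is not covered, since checking it requires reading the PDF itself.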