DC 5.9 Search in asset content (OCR)
It is possible to extract the text content of PDFs and images using Optical Character Recognition (OCR). The content can then be included in searches.
This page describes how to enable search in asset content. The setup consists of two steps:
Content extraction with Microsoft Azure Cognitive Services.
Including asset content in searches.
Please be aware of the important information at the bottom of the page.
Content extraction with Microsoft Azure Cognitive Services
The content extraction of PDFs and images relies on Microsoft Azure Cognitive Services. Thus, a Computer Vision resource in Azure must be used.
A new Computer Vision client can be created with the following steps:
Log in to the Azure portal (https://portal.azure.com/).
Search for "Cognitive Services".
Click "Add" to add a new Cognitive Service.
Search for "Computer Vision" and create a new client.
The Computer Vision resource has a key and a server URL, which will be needed shortly. These can be found by navigating to the Computer Vision resource and locating "Keys and Endpoints".
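Before the key and endpoint are entered in the DAM configuration, it can be useful to confirm that Azure accepts them. The following is a minimal sketch that builds a request against the Computer Vision Read operation. It assumes the resource supports version 3.2 of the Read API (`/vision/v3.2/read/analyze`); `build_read_request` is a hypothetical helper for illustration and is not part of Digizuite.

```python
import json
import urllib.request


def build_read_request(endpoint: str, key: str, image_url: str) -> urllib.request.Request:
    """Build (but do not send) a Read API request for a publicly reachable image.

    Assumption: the resource supports Read API v3.2. Adjust the path if the
    resource uses a different version.
    """
    url = endpoint.rstrip("/") + "/vision/v3.2/read/analyze"
    body = json.dumps({"url": image_url}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # One of the keys listed under "Keys and Endpoints".
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
    )
```

Passing the returned request to `urllib.request.urlopen` sends it. If the key and endpoint are accepted, the response is expected to have status 202 and an `Operation-Location` header that points to the pending extraction result.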
Content extraction can now be enabled in the Cognitive Service, which is part of Digizuite Core.
On the server where the DAM Center is installed, navigate to the Cognitive Service directory (typically "Webs/<yourDAM>/DigizuiteCore/cognitiveservice").
Edit the "appsettings.json" file. The following parameters in the "ComputerVisionDetails" section are relevant:
Parameter | Description |
---|---|
OcrKey | The key from the Computer Vision resource created above. (One of the "KEY" entries under "Keys and Endpoints") |
OcrServerUri | The URI from the Computer Vision resource created above. (The "Endpoint" entry under "Keys and Endpoints") |
OcrExtractFromPdf | If true, the text content of PDF files is extracted when new PDF files are uploaded to the DAM. |
OcrExtractFromImage | If true, the text content of images is extracted when new images are uploaded to the DAM. |
OcrLetAzureRequestFiles | If false, files are explicitly uploaded to the Computer Vision client. Otherwise, Azure requests the files from the DAM Center. Setting this to true is expected to be more efficient, but it requires that Azure can reach the DAM Center. If this is set to true, ensure that the DAM Center is not behind a strict firewall. |
OcrTaskDelayLength | The status of ongoing content extractions in the Computer Vision client is checked regularly. This parameter sets the time interval between checks. The larger the interval, the fewer requests are made to Azure, but the longer it takes for the extracted content to become available in the "Asset content" metafield. You most likely don't have to change this. |
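The parameters in the table above can be sanity-checked before the service relies on them. The sketch below is a hypothetical helper, not a Digizuite tool. It assumes that "ComputerVisionDetails" is a top-level section of "appsettings.json" and that the three true/false parameters are stored as JSON booleans. Neither assumption is confirmed by this page. OcrTaskDelayLength is left out because its unit and type are not stated here.

```python
from typing import List

# Parameter names come from the table above. The expected types are an
# assumption based on the descriptions, not a documented schema.
EXPECTED_TYPES = {
    "OcrKey": str,
    "OcrServerUri": str,
    "OcrExtractFromPdf": bool,
    "OcrExtractFromImage": bool,
    "OcrLetAzureRequestFiles": bool,
}


def check_computer_vision_details(settings: dict) -> List[str]:
    """Return a list of human-readable problems found in the OCR settings."""
    section = settings.get("ComputerVisionDetails")
    if not isinstance(section, dict):
        return ["Section 'ComputerVisionDetails' is missing."]

    problems = []
    for name, expected in EXPECTED_TYPES.items():
        value = section.get(name)
        if not isinstance(value, expected):
            problems.append(f"{name} should be a {expected.__name__}.")
        elif expected is str and not value.strip():
            problems.append(f"{name} is empty.")

    # Azure endpoints are served over HTTPS, so flag anything else.
    uri = section.get("OcrServerUri")
    if isinstance(uri, str) and uri.strip() and not uri.startswith("https://"):
        problems.append("OcrServerUri should start with https://.")
    return problems
```

To use it, load the file with `json.load` and pass the resulting dictionary to `check_computer_vision_details`. This assumes the file is plain JSON without comments. An empty list means no problems were found.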
Once the information has been provided and the "appsettings.json" file has been saved, the content of PDFs and/or images is extracted when PDFs/images are uploaded.
When the content of an asset has been extracted with Microsoft Azure Cognitive Services, it is automatically written to the metafield "Asset content".
The metafield "Asset content" is automatically created when installing or upgrading to 5.5. The field is created in the metagroup "Content". It is very important that this exact field is used in the configuration, because the system depends on the GUID of this metadata field.
GUID: 4A8ED71B-574A-43BB-A35E-8826598CF36F
Including asset content in searches (when using Solr)
The extracted content of assets can be included in freetext searches by adding the metafield "Asset content" to the search DigiZuite_System_Framework_Search as a freetext input parameter. The "Asset content" metafield can be added as a freetext input parameter by following these steps:
Find DigiZuite_System_Framework_Search in the ConfigManager for the correct version of the product the feature should be enabled for.
Add a new input parameter.
Locate and choose the metafield group "Content".
Choose the metafield "Asset content", and choose the "FreeText" comparison type. Create the input parameter.
Save the modified search and populate the search cache.
The content of the asset types you selected in Step 1 should now be included in freetext searches when new assets are uploaded.
The content of existing assets can be extracted by republishing the assets.
Including asset content in searches (when using ElasticSearch in MM 5.6+)
In Media Manager, go to "General settings" → "Asset search", and then add the field "Asset content" to "Freetext search fields".
If the "Asset content" field is not available, make sure it's available from the metadata editor in Media Manager and readable for all the people that need to search for it. Otherwise, the Search Engine will not index it.
Important information
Please be aware of the following when using the Computer Vision resource:
The content extraction has some limitations (see https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text). In particular, be aware of the following limitations:
Supported file formats: JPEG, PNG, BMP, PDF and TIFF.
For PDF and TIFF files, up to 2000 pages (only first two pages for the free tier) are processed.
The file size must be less than 50 MB (4 MB for the free tier) and dimensions at least 50 x 50 pixels and at most 10000 x 10000 pixels.
The PDF dimensions must be at most 17 x 17 inches, corresponding to legal or A3 paper sizes and smaller.
If an error occurs, an error message will end up in the CognitiveServices.error queue in RabbitMQ. You can retry the content extraction by moving the failed message to the CognitiveServices queue. Please be aware of the limitations above before retrying.
There is a cost associated with extracting content from files. Please see https://azure.microsoft.com/en-us/pricing/details/cognitive-services/computer-vision/ for more details.
If the extracted content is included in freetext searches, it is searched on equal terms with the asset titles. Thus, it might be more difficult to find an asset by title.
Each time an asset is republished, its content is extracted again.
Repopulating search caches might take significantly more time, if content extraction has been enabled.
Freetext searches might be slower when asset content is included. However, we do not expect this to be noticeable.
The DigizuiteCore service CognitiveService must be able to make HTTPS calls to Azure. If you run into issues, please ensure that outgoing HTTPS requests are not blocked by a firewall.
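The file limits listed above can be checked before a failed message is moved back to the CognitiveServices queue. The following is a minimal sketch of such a check. `check_ocr_limits` is a hypothetical helper, the mapping from formats to file extensions is an assumption, and 1 MB is assumed to mean 1024 x 1024 bytes. Page counts and PDF page dimensions are not checked.

```python
from pathlib import PurePath
from typing import List, Optional

# Assumed file extensions for the supported formats (JPEG, PNG, BMP, PDF, TIFF).
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".pdf", ".tif", ".tiff"}
MB = 1024 * 1024  # Assumption: the limits use binary megabytes.


def check_ocr_limits(
    filename: str,
    size_bytes: int,
    width_px: Optional[int] = None,
    height_px: Optional[int] = None,
    free_tier: bool = False,
) -> List[str]:
    """Return the reasons why a file is likely to be rejected, if any."""
    problems = []
    if PurePath(filename).suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append("Unsupported file format.")

    max_mb = 4 if free_tier else 50
    if size_bytes >= max_mb * MB:
        problems.append(f"File must be smaller than {max_mb} MB.")

    # Pixel dimensions are only checked when both values are known.
    if width_px is not None and height_px is not None:
        if min(width_px, height_px) < 50 or max(width_px, height_px) > 10000:
            problems.append(
                "Image dimensions must be between 50 x 50 and 10000 x 10000 pixels."
            )
    return problems
```

An empty list suggests that a retry is worth attempting. Any returned message names a limit that would most likely cause the extraction to fail again.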