How can you reduce storage costs by automatically categorising documents?
Discover how Coexya’s experts helped a major energy player optimise its document management and storage costs.
Automatic document categorisation — a concrete lever for reducing costs
Data volumes in organisations grew roughly fivefold between 2020 and 2025, with an average annual growth rate of 35%. This exponential growth creates three major problems: obsolete data from files that are never purged, growing complexity around GDPR and regulatory compliance, and accumulating storage costs (hardware, backups, licences, and duplicate management). Automatically categorising documents, whether structured or unstructured, by type and content directly addresses all three challenges.
The client case: a major player in the energy sector
An energy sector organisation approached Coexya to design a solution for optimising its document storage costs. The principle adopted was straightforward: document type determines retention period. The project therefore automated the classification of documents against a ten-category classification plan, for example identity documents (10 years), contracts (15 years) and training materials (5 years), so that the applicable retention period can be derived automatically for each file.
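The principle "document type determines retention period" can be illustrated with a minimal Python sketch. It is not Coexya's implementation: the function names, the category keys and the choice of a reference date (for example the document's creation date) are assumptions made for illustration, and only the three categories named above are included because the other seven are not listed in this article.

```python
from datetime import date

# Retention periods in years per document category.
# Only these three categories and durations appear in the case study;
# the keys themselves are hypothetical identifiers.
RETENTION_YEARS = {
    "identity_document": 10,
    "contract": 15,
    "training_material": 5,
}


def retention_end(category: str, reference_date: date) -> date:
    """Return the date on which a document's retention period ends."""
    years = RETENTION_YEARS[category]  # raises KeyError for an unknown category
    try:
        return reference_date.replace(year=reference_date.year + years)
    except ValueError:
        # 29 February in a non-leap target year: fall back to 28 February.
        return reference_date.replace(year=reference_date.year + years, day=28)


def is_purgeable(category: str, reference_date: date, today: date) -> bool:
    """True once the retention period has fully elapsed."""
    return today >= retention_end(category, reference_date)
```

Once the classifier has assigned a category, a call such as is_purgeable("contract", date(2012, 6, 1), date.today()) would tell a purge job whether the file can be deleted. Which date starts the retention clock is a business decision that the sketch leaves open.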
The Coexya approach in 6 steps
Coexya’s Search & Semantics experts deployed a supervised machine learning model, using the Sinequa platform for OCR, model training and model application. The initial corpus comprised approximately 1,000 manually annotated documents, split into a training corpus (70%) and an evaluation corpus (30%), alongside an application corpus. The project was completed in under two months (one month for implementation and three weeks for the evaluation phase), for a total effort of approximately 35 person-days.
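A 70/30 split of an annotated corpus can be sketched in plain Python as follows. This is an illustration only, not the Sinequa platform's API: the stratified_split function, the toy corpus and the decision to stratify by category (so that each category keeps roughly the same share in both sets) are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(documents, train_ratio=0.7, seed=42):
    """Split (doc_id, label) pairs into training and evaluation sets,
    keeping roughly the same label proportions in both."""
    by_label = defaultdict(list)
    for doc_id, label in documents:
        by_label[label].append((doc_id, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, evaluation = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * train_ratio)
        train.extend(group[:cut])
        evaluation.extend(group[cut:])
    return train, evaluation


# Toy annotated corpus: 1,000 documents over three hypothetical categories.
labels = ["contract", "identity_document", "training_material"]
corpus = [(f"doc-{i}", labels[i % 3]) for i in range(1000)]
train, evaluation = stratified_split(corpus)
print(len(train), len(evaluation))  # 700 300
```

Stratifying matters when some categories are rare: a purely random split could leave a small category with too few evaluation documents to measure its F1-score reliably.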
Measurable results: 80% of categories achieving an F1-score above 80%
Evaluated on a corpus of 273 documents, the model achieves an F1-score above 80% for 80% of categories. For documents classified with a confidence level above 30%, which account for 77% of the total volume, precision reaches 91%. The model improves continuously: documents classified with insufficient confidence are routed to manual annotation and then added back into the training corpus.
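The metrics and the feedback loop described above can be sketched in Python. The code below is a generic illustration under assumptions, not the project's evaluation code: the function names, the tuple layout (doc_id, category, confidence) and the toy labels are hypothetical, and the 0.30 threshold simply mirrors the 30% confidence level quoted in the text. Per-category F1 is computed as 2·TP / (2·TP + FP + FN).

```python
from collections import Counter


def per_category_f1(y_true, y_pred):
    """Return {category: F1} from parallel lists of true and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted category gains a false positive
            fn[truth] += 1  # true category gains a false negative
    scores = {}
    for cat in set(y_true) | set(y_pred):
        denom = 2 * tp[cat] + fp[cat] + fn[cat]
        scores[cat] = 2 * tp[cat] / denom if denom else 0.0
    return scores


def share_above(scores, bar=0.80):
    """Fraction of categories whose F1 strictly exceeds the bar."""
    return sum(s > bar for s in scores.values()) / len(scores)


def route(predictions, threshold=0.30):
    """Split (doc_id, category, confidence) triples into auto-accepted
    predictions and documents sent to manual annotation."""
    accepted = [p for p in predictions if p[2] > threshold]
    to_review = [p for p in predictions if p[2] <= threshold]
    return accepted, to_review


# Toy evaluation data (hypothetical labels, not the project's results).
y_true = ["contract", "contract", "contract", "identity", "identity", "training"]
y_pred = ["contract", "contract", "identity", "identity", "identity", "training"]
scores = per_category_f1(y_true, y_pred)

predictions = [("doc-1", "contract", 0.92), ("doc-2", "identity", 0.25)]
accepted, to_review = route(predictions)
```

In this toy data, "contract" and "identity" each score an F1 of 0.8 and "training" scores 1.0, and doc-2 falls below the threshold and is sent for manual annotation. Once annotated, such documents can be appended to the training corpus before the next training run, which is the continuous-improvement loop the article describes.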
A publication by the Search & Semantics experts at Coexya:
Jean-Louis Vila, CTO — Gaël Yvrard, Project Director — Pierre Martin, Sales Engineer