Text in image documents must be recognized before it can be used. Tasks such as plagiarism and error checking, language analysis, and information capture rely on the accuracy of this text conversion. OCR systems convert document images to their text equivalents, but they are prone to introducing errors during the recognition process. This work reports a system developed to ingest image documents, which are converted to text using available OCR technologies. The recognized text is subsequently processed with deep language models to enhance its accuracy. The system follows a client-server architecture, with the user interface available as both a web application and a mobile app. For the language models, the encoder-decoder based BART and MarianMT are used. The results obtained demonstrate a 35% reduction in word error rate (WER) using the BART language model.
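WER, the evaluation metric cited above, is conventionally computed as the word-level Levenshtein distance between the recognized text and the reference, normalized by the reference length. The function below is an illustrative sketch of this standard definition, not the paper's own implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("the quick brown fox", "the qu1ck brown fax")` yields 0.5, since two of the four reference words are substituted.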
Text is essential for communication, information sharing, knowledge acquisition, and analysis. It shapes public opinion, supports education, and drives online content, making it crucial in various domains. While various language models are utilized for text analysis and text correction, little to no survey has been conducted on these models' behavior and limitations. This work studies the behavior of the BART and MarianMT language models on an input dataset consisting of two types of errors, Synthetic and Natural. Synthetic errors are efficient to create and test, whereas Natural errors are more common and closer to errors occurring in the real world. The models were trained and tested with the generated data; the results highlighted that BART exhibited consistent output for both Synthetic and Natural errors, revealing a break-even point in the vicinity of 26% Synthetic error introduction. Conversely, the performance of MarianMT was comparatively diminished on Synthetic errors in contrast to Natural errors. These findings provide valuable insights into the behavior and capabilities of the models.
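Synthetic errors of the kind described above can be produced by perturbing clean text at a controlled rate, which is what makes them efficient to create and test. The sketch below is one assumed generation procedure (random character substitutions, deletions, and insertions); the exact corruption scheme used in the study may differ:

```python
import random
import string

def inject_errors(text: str, rate: float, seed: int = 0) -> str:
    """Corrupt roughly `rate` of the non-space characters with a random
    substitution, deletion, or insertion (illustrative sketch only)."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch != " " and rng.random() < rate:
            op = rng.choice(["sub", "del", "ins"])
            if op == "sub":
                out.append(rng.choice(string.ascii_lowercase))
            elif op == "ins":
                out.append(ch)
                out.append(rng.choice(string.ascii_lowercase))
            # "del": drop the character by appending nothing
        else:
            out.append(ch)
    return "".join(out)
```

Calling `inject_errors(clean_sentence, 0.26)` would then approximate the 26% Synthetic error level at which the break-even behavior was observed.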
Text remains a relevant form of representing information. Text documents are created either on digital-native platforms or through conversion of other media such as images and speech. While digital-native text is invariably entered through physical or virtual keyboards, technologies such as OCR and speech recognition are used to transform images and speech signals into text content. All of these text-generation mechanisms also introduce errors into the captured text.
This project aims at analyzing the different kinds of errors that occur in text documents. The work employs two advanced deep neural network based language models, namely BART and MarianMT, to rectify the anomalies present in text. Transfer learning of these models with an available dataset is performed to fine-tune their capacity for error correction. A comparative study is conducted to investigate the effectiveness of these models in handling each of the defined error categories. It is observed that while both models reduce the number of erroneous sentences by more than 20%, BART handles spelling errors far better (24.6%) than grammatical errors (8.8%).
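The reduction in erroneous sentences reported above can be measured by counting sentences that do not exactly match their references before and after model correction. A minimal sketch of such a metric follows; the function names and interfaces are assumptions for illustration, not the project's actual code:

```python
def erroneous_fraction(hypotheses: list[str], references: list[str]) -> float:
    """Fraction of sentences that do not exactly match their reference."""
    assert len(hypotheses) == len(references) and references
    wrong = sum(h.strip() != r.strip() for h, r in zip(hypotheses, references))
    return wrong / len(references)

def reduction_points(before: list[str], after: list[str],
                     refs: list[str]) -> float:
    """Percentage-point drop in erroneous sentences after correction."""
    return 100 * (erroneous_fraction(before, refs)
                  - erroneous_fraction(after, refs))
```

Under this definition, a corpus in which 75% of sentences are erroneous before correction and 25% after yields a reduction of 50 percentage points; the per-category figures (24.6% for spelling, 8.8% for grammar) would be obtained by restricting the sentence lists to each error category.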