Implementation of Tesseract-based OCR for UMKLA student card data extraction
Abstract
Manual data entry from student ID cards (Kartu Tanda Mahasiswa, KTM) is inefficient and error-prone, so automating it can improve both the accuracy and the speed of administrative services at educational institutions. This research designs and implements an Optical Character Recognition (OCR) system that automatically extracts information from images of student ID cards issued by Universitas Muhammadiyah Klaten (UMKLA). Images are first pre-processed with the OpenCV library, using grayscale conversion and Otsu's binarization to improve image quality. The Tesseract OCR engine then converts each image into raw text, which is parsed with regular expressions (regex) into data fields such as Name, Student ID Number (NIM), and Program of Study. Test results indicate that the system extracts information with a good success rate, although accuracy depends heavily on image quality factors such as lighting and text clarity. Fields printed in a standard format were extracted more accurately. The Tesseract-based system thus demonstrates the feasibility of local automation of student ID card data extraction, but further development of the post-processing stage is required to handle more complex variations in OCR output.