Text-to-speech device for visually impaired patients

Abstract
With a maximum viewing distance of six meters and a maximum visual field of 20 degrees, people with low vision are unable to see the words and letters in ordinary newsprint. This makes reading difficult, which can disturb the learning process and slow the development of the patient's intelligence, so a device is needed to help them read more easily. One class of device being developed today relies on another sense, namely hearing. Text-to-speech is a device that scans Indonesian reading material and reads it aloud. The purpose of the device is to turn an image input into speech output. This article describes the design, implementation, and experimental results of the device. The device consists of three modules: an image processing module, a word correction module, and a speech processing module. It was developed on the Raspberry Pi 2 with a 900 MHz processor. The audio output is easily understandable, with a total error rate of less than 2% and a processing time of almost two minutes for an A4-sized page of text. The device guides visually impaired users by voice and can play and stop the output while reading.

Introduction
According to Thylefors in Gianini (2004), impaired vision can negatively affect learning and social interaction, and it can influence the natural development of intelligence and of academic, social, and professional skills [1]. Based on Riskesdas data from 2013, the total number of visually impaired people in Indonesia was 2,133,017 [2]. Low vision cannot be corrected with glasses. The maximum viewing distance of these patients is 6 meters, with a maximum visual field of 20 degrees, so they cannot read normally printed pages; they can read only when the characters are large enough. This lengthens the reading process and tires the eyes. To improve the quality of life of people with low vision, a reading aid is needed.
The degree of visual impairment varies from one person to another, so the device developed in this work uses another sense to convey the information in a text. It converts printed text into synthesized speech and is designed specifically for Indonesians with low vision, so that they can use it without asking others for help and can use it to understand Indonesian-language literature.
The device consists of three main modules: the image processing module, the word correction module, and the speech processing module. The image processing module sets the object position, focus, and lighting of the camera, takes the photograph, and converts the image to text. The word correction module corrects the output of the image processing module to improve accuracy by matching it against an Indonesian dictionary. The speech processing module transforms the text into sound and processes it with specific physical features so that the sound can be understood. A key element of the image processing module is OCR.
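Before the individual modules are described, the skeleton below shows how the three stages fit together on the device. It is only an illustrative outline: every function and file name in it (capture_page, run_ocr, correct_text, synthesize, page.jpg, and so on) is a stand-in rather than the authors' actual code, and each stage is expanded in the sections that follow.

```python
# Illustrative three-stage skeleton of the reading device; all names are stand-ins.

def capture_page() -> str:
    """Image processing module: set position, focus, and lighting, then photograph the page."""
    return "page.jpg"                 # placeholder path for the captured image

def run_ocr(image_path: str) -> str:
    """Image processing module: convert the photographed page into raw text with an OCR engine."""
    return "contoh teks hasil OCR"    # placeholder OCR output

def correct_text(raw_text: str) -> str:
    """Word correction module: match each word against the Indonesian dictionary and fix errors."""
    return raw_text

def synthesize(text: str) -> str:
    """Speech processing module: turn the corrected text into an audio file and return its path."""
    return "page.wav"                 # placeholder path for the synthesized speech

if __name__ == "__main__":
    audio = synthesize(correct_text(run_ocr(capture_page())))
    print("speech output ready:", audio)
```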
When using an OCR engine, preprocessing steps are required to give the engine the best possible input and to compensate for its weaknesses. The capture setup is matched to the specifications of the device so that the result has a minimum error rate and a short processing time. This module does not change the OCR algorithm; it adds conditioning around the engine so that the OCR receives the best possible input.
OCR (optical character recognition) is a technology that automatically recognizes characters through an optical mechanism. It imitates the human sense of sight: the camera replaces the eye, and image processing performed in the computer substitutes for the human brain [3]. Tesseract OCR is a matrix-matching OCR engine [4]. Tesseract was chosen because it is widely accepted, flexible, and extensible, and because an active community of researchers continues to develop it. OCR engines still have weaknesses, such as edge distortion and the effects of dim light, so it remains difficult for most engines to recognize text with high precision [5]. Supporting conditions are therefore needed to keep such flaws to a minimum.

System Specifications
The device is designed within the following constraints:
a. The reading distance is 38-42 cm.
b. The maximum thickness of the material to be read is 3 cm.
c. The minimum illumination is 250 lumen/m2 (environment class: office with light work).
d. The maximum inclination of a text line is 5 degrees from the vertical.
e. The maximum size of the reading material is A4 (210 x 297 mm).
f. The minimum font size is 10 pt.
g. Character types include Roman, Egyptian, or Sans Serif fonts.

Hardware system design
The holder in Fig. 2 is designed so that a sheet up to A4 size can be captured entirely by the camera. The distance from the camera to the object is 40 cm, and a 15 cm pole positions the camera above the center of the object. The Raspberry Pi camera module uses manual focus, so the lens must be adjusted during initial setup. Good lighting is required to make the input image sharper, so a row of LEDs is added to provide extra light when the ambient light intensity is low.

Tesseract OCR implementation
The input image captured by the camera is 5 MP (2592 x 1944 pixels), or 215 ppi (pixels per inch). According to the specifications of the Tesseract OCR engine, the minimum character height that can be read is 20 pixels (for uppercase letters), and Tesseract OCR accuracy decreases when the font size falls below 10 pt.

Software design of the image processing module
The software processes the input image and converts it to text. The capture is triggered by the user through GPIO pins connected to a touch key, using an interrupt routine. The photograph is then taken with the raspistill program with the sharpness option enabled to sharpen the image. The resulting image is a .jpg file with a resolution of 2592 x 1944 pixels.
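The capture-and-recognize step could be scripted roughly as follows. This is a hedged sketch rather than the authors' code: it assumes the raspistill and tesseract command-line tools are installed together with Tesseract's Indonesian language data (ind), and the file names and sharpness value are illustrative.

```python
import subprocess

IMAGE_PATH = "page.jpg"   # illustrative file name
TEXT_BASE = "page"        # tesseract appends .txt to this base name

# Capture a 2592 x 1944 photo with the Pi camera; --sharpness raises edge contrast.
subprocess.run(
    ["raspistill", "-o", IMAGE_PATH,
     "--width", "2592", "--height", "1944",
     "--sharpness", "100"],
    check=True,
)

# Recognize the text with Tesseract's Indonesian language model; output goes to page.txt.
subprocess.run(["tesseract", IMAGE_PATH, TEXT_BASE, "-l", "ind"], check=True)

with open(TEXT_BASE + ".txt", encoding="utf-8") as f:
    raw_text = f.read()
print(raw_text)
```

On the device itself this sequence is triggered from the GPIO touch key via an interrupt routine, as described above, rather than run directly from the command line.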
The word correction module

Spell checking
Spell checking is the task of predicting which words in a document are misspelled. The prediction can be presented to the user in various ways. Correction is the task of replacing an incorrectly spelled word with its assumed correct spelling. One approach is to model the direct causes of the errors and encode them in the correction algorithm or error model. The Damerau-Levenshtein edit distance was introduced as a way to detect spelling errors (Damerau, 1964). Phonetic indexing algorithms such as Metaphone, used by GNU Aspell (Atkinson, 2009), index words by similar pronunciation ("sounds-like" keys) and allow corrections even when the misspelling looks quite different from the correctly spelled word; Metaphone is based on a data file containing phonetic information. Linguistic intuition about the different causes of spelling errors can also be represented explicitly in the spelling system (Deorowicz and Ciura, 2005). Almost all current spelling systems use a lexicon (dictionary). Dictionary-based systems have difficulty with items that do not appear in the dictionary, such as proper nouns, foreign terms, and neologisms, which increase the proportion of terms missing from the dictionary (Ahmad and Kondrak, 2005) [6].

Word correction
The module receives text input from the image processing module. The image processing module cannot determine whether an output word is correct, so a module is needed to correct every word it produces; the word correction module is designed to improve the accuracy of the image processing output. The module consists of several functions. The main function is the correct function; the others are helper functions that adapt the input to Indonesian grammar. The correct function matches the input against an Indonesian dictionary (word list) and corrects it. Helper functions handle numbers, proper names, and the other constraints described in the literature, such as:
1. A function to split the text into words.
2. A function to check for numbers in the text.
3. A function to check the capital letter at the beginning of a sentence.
4. A function to check the punctuation mark at the end of a sentence.
5. A function to check for names (capitalized words) within a sentence.
6. A function to recombine all the words produced by the previous steps.

Project implementation
The implementation of the word correction module proceeds as follows. The first step is to compile the Indonesian words used in the dictionary, which serves as the reference against which each input word is compared. The words come from KBBI (Kamus Besar Bahasa Indonesia); after reduction the dictionary contains 50,850 words, a combination of base words, conjunctions, reduplicated words, loanwords, numbers, question words, pronouns, affixes, prefixes, and suffixes.

Compilation of the word correction function
The word correction function was completed by adapting the corrector created by Peter Norvig. Because the common errors in the image processing output usually occur in individual letters rather than in word length, the correction function simply substitutes letters in the erroneous word: it replaces a word only with dictionary words whose length equals the length of the input. Using this kind of substitution also takes the computational load into account: when only same-length edits are considered for a word of length n at edit distance one, only the letter substitutions and the n-1 possible transpositions need to be generated. The spelling correction literature states that 80% to 95% of spelling errors lie within an edit distance of one from the target. Research by Peter Norvig on 270 spelling errors found that only 76% of them had an edit distance of one, but only three of the 270 test cases had an edit distance greater than two, which means a correction covering up to two letters handles 98.9% of cases. Since corrections do not exceed an edit distance of two, the optimization that can be made is to keep only replacement candidates that are known dictionary words [7]. There is no general rule limiting how many character differences may be corrected; however, based on the research results above and considering the computational load, a two-character limit is used for this correction function. The correction function uses a probability-based method trained on word frequencies, so the word chosen as the correct replacement depends on how frequently that word occurs.
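The sketch below shows what such a length-preserving, frequency-ranked correct function might look like. It is not the authors' implementation: the dictionary file name (kbbi.txt), the restriction to single-letter substitutions applied at most twice, and the frequency ranking are assumptions drawn from the description above and from Norvig's published corrector.

```python
import re
from collections import Counter

# Word frequencies taken from an Indonesian word list / corpus (file name is illustrative).
WORDS = Counter(re.findall(r"\w+", open("kbbi.txt", encoding="utf-8").read().lower()))
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def substitutions(word):
    """All strings obtained by replacing exactly one letter, so the length is preserved."""
    return {word[:i] + c + word[i + 1:]
            for i in range(len(word)) for c in ALPHABET if c != word[i]}

def correct(word):
    """Return the most frequent dictionary word within two letter substitutions of the input."""
    if word in WORDS:                                    # already a valid Indonesian word
        return word
    once = substitutions(word)
    candidates = {w for w in once if w in WORDS}         # edit distance one (substitution only)
    if not candidates:
        twice = {w2 for w1 in once for w2 in substitutions(w1)}
        candidates = {w for w in twice if w in WORDS}    # edit distance two (substitution only)
    return max(candidates, key=WORDS.get) if candidates else word

# Example: an OCR error such as "memhaca" maps back to "membaca" (one letter substituted).
```

Restricting candidates to same-length substitutions keeps the search space far smaller than the full deletion, transposition, and insertion edits of the original Norvig corrector, which matches the computational-load argument above.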
The speech processing module (Text-to-Speech)
TTS (Text-to-Speech) is a system that converts text input into speech. In principle, speech synthesis consists of two subsystems:
Text-to-phoneme converter. Converts the input sentence, written as text in a particular language, into a series of phoneme codes together with their duration and pitch. This part is language dependent.
Phoneme-to-speech converter. Accepts the phoneme codes, together with the pitch and duration produced by the previous part, and generates the speech signal.

System design
Fig. 5 shows the level-0 design of the speech processing module. Considering the use of the Linux platform, the availability of an Indonesian voice, and the TTS simulation results, eSpeak and Google TTS were selected as the TTS software. The general system specifications to be met are as follows: the output voice is in Indonesian with a reading-intelligibility error tolerance of 0.02%, and there are additional features for playing, stopping, and pausing the sound.

Project implementation
Python's standard library covers a wide range of modules. The speech processing module uses the os package, which provides file and process operations; the pygame package, which provides functions for playing sounds; the RPi.GPIO package, which provides a class to control the GPIO pins on a Raspberry Pi; and the subprocess package, which allows new processes to be spawned, connected to their input/output/error pipes, and queried for their return codes. isPause and isStop are variables used by the audio player; they are initialized to False, meaning neither has been activated.

Setup
The GPIO pin numbering is set to match the breakout board. The main program provides functions to capture and process the input image, convert it to an audio signal, and play, stop, pause, or exit the speech output. It starts by importing the pygame and subprocess modules, converts the text file (.txt) into an audio file (.wav), and sets up GPIO pins 2-3.
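A minimal sketch of this speech stage is given below. It is not the authors' program: it assumes eSpeak is installed with its Indonesian voice (id), wires the pause and stop buttons to GPIO pins 2 and 3 (the text mentions GPIO 2-3, but the exact button wiring here is an assumption), and replaces the real control flow with a simple polling loop.

```python
import subprocess
import pygame
import RPi.GPIO as GPIO

PAUSE_PIN, STOP_PIN = 2, 3          # assumed button wiring
isPause, isStop = False, False      # player state flags, inactive at start

# Convert the corrected text file into a WAV file with eSpeak's Indonesian voice.
subprocess.run(["espeak", "-v", "id", "-f", "page.txt", "-w", "page.wav"], check=True)

# Configure the GPIO buttons (BCM numbering) with pull-up resistors.
GPIO.setmode(GPIO.BCM)
GPIO.setup([PAUSE_PIN, STOP_PIN], GPIO.IN, pull_up_down=GPIO.PUD_UP)

# Play the audio and poll the buttons to pause/resume or stop playback.
pygame.mixer.init()
pygame.mixer.music.load("page.wav")
pygame.mixer.music.play()
try:
    while pygame.mixer.music.get_busy() or isPause:
        if GPIO.input(PAUSE_PIN) == GPIO.LOW:        # pause button pressed
            isPause = not isPause
            (pygame.mixer.music.pause if isPause else pygame.mixer.music.unpause)()
            pygame.time.wait(300)                    # crude debounce
        if GPIO.input(STOP_PIN) == GPIO.LOW:         # stop button pressed
            isStop = True
            pygame.mixer.music.stop()
            break
        pygame.time.wait(50)
finally:
    GPIO.cleanup()
```

In the actual device the buttons would more likely be serviced through GPIO interrupts, as the image capture key is, rather than by polling as in this sketch.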