Computer Science & Information Engineering
190011 Taiwan
A Multimodal AI System for Aphasia Patients Enabling Fluent Expression, Featuring Simulated Data Generation and Optimized Real-Time Performance
This research develops a multimodal communication system for individuals with aphasia, leveraging multiple artificial intelligence technologies to enhance their ability to express themselves and engage in social interaction. The core of the system is an iOS application that processes multimodal inputs, including environmental images, speech, lip movements, gestures, and emotions. These inputs are integrated and interpreted by a large language model, which generates a complete narrative that is then vocalized using speech synthesis.
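The abstract does not include implementation details for this fusion step. The following is a minimal sketch in Python, assuming Google's google-genai SDK for the Gemini call; the recognizer outputs passed in as plain strings (scene, speech, lip, gesture, emotion) and the function name reconstruct_utterance are hypothetical placeholders, not the authors' implementation.

# Sketch: fuse per-modality cues into one prompt and ask the LLM to
# reconstruct a fluent sentence. The cue values below are placeholders
# standing in for the upstream recognizers described in the abstract.
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

def reconstruct_utterance(scene, speech, lip, gesture, emotion):
    prompt = (
        "A person with aphasia is trying to speak. Reconstruct one "
        "fluent, first-person sentence expressing their intent.\n"
        f"Scene description: {scene}\n"
        f"Fragmentary speech transcript: {speech}\n"
        f"Lip-reading hypothesis: {lip}\n"
        f"Recognized gesture: {gesture}\n"
        f"Detected emotion: {emotion}"
    )
    response = client.models.generate_content(
        model="gemini-2.5-flash", contents=prompt)
    return response.text  # then handed to a TTS engine for vocalization

print(reconstruct_utterance(
    scene="kitchen, glass of water on the table",
    speech="wa... water... wan",
    lip="want water",
    gesture="pointing",
    emotion="frustrated"))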
A key innovation of this study is AphasiaSim-LLM, a novel method for generating highly realistic simulated aphasic speech corpora. Furthermore, this research replaces traditional subjective scoring with quantitative evaluation metrics, which show that the Gemini 2.5 Flash model achieves superior performance in sentence reconstruction. Additionally, a lightweight gesture recognition model was constructed, using the ORB algorithm for efficient keyframe extraction.
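The abstract does not detail how ORB drives keyframe extraction. Below is a minimal sketch of one plausible scheme using OpenCV's ORB detector and brute-force matcher: a frame is kept as a keyframe only when its descriptors match the previous keyframe poorly, i.e., the visual content has changed. The function name select_keyframes and the 0.5 match-ratio threshold are assumptions.

# Sketch of ORB-based keyframe selection (assumed interpretation):
# a frame becomes a keyframe when few of its ORB descriptors match
# the previous keyframe, signaling that the gesture has changed.
import cv2

def select_keyframes(frames, match_ratio_threshold=0.5):
    orb = cv2.ORB_create(nfeatures=500)
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    keyframes, last_des = [], None
    for i, frame in enumerate(frames):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        _, des = orb.detectAndCompute(gray, None)
        if des is None:
            continue  # no features found in this frame
        if last_des is None:
            keyframes.append(i)  # always keep the first usable frame
            last_des = des
            continue
        matches = bf.match(last_des, des)
        ratio = len(matches) / max(len(last_des), len(des))
        if ratio < match_ratio_threshold:  # content changed enough
            keyframes.append(i)
            last_des = des
    return keyframes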
System performance and responsiveness are enhanced through several optimization strategies, including asynchronous processing, FFmpeg-based video frame extraction, and the lightweight Flux text-to-image model. The result is a system that effectively assists patients with aphasia, enabling more fluent and natural communication.
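A hedged sketch of how two of these strategies could combine, invoking the standard ffmpeg command line from Python's asyncio so that frame extraction does not block the rest of the pipeline; the file paths and the 5 fps sampling rate are illustrative assumptions, not values from the study.

# Sketch: run ffmpeg frame extraction as an async subprocess so other
# pipeline stages (e.g., speech transcription) can proceed concurrently.
import asyncio
import os

async def extract_frames(video, out_dir, fps=5):
    os.makedirs(out_dir, exist_ok=True)
    proc = await asyncio.create_subprocess_exec(
        "ffmpeg", "-y", "-i", video,
        "-vf", f"fps={fps}",          # sample frames at a fixed rate
        f"{out_dir}/frame_%04d.jpg",
    )
    await proc.wait()

async def main():
    await asyncio.gather(
        extract_frames("input.mp4", "frames"),
        asyncio.sleep(0),  # placeholder for other async pipeline stages
    )

asyncio.run(main())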