Analysis of the Android TextToSpeech (TTS) source code and customizing a TTS engine

Posted by Renlok on Sat, 11 Jan 2020 06:18:46 +0100

TextToSpeech is the text-to-speech service, a native interface provided by the Android system. The stock TTS engine application can detect the system language and download the resource files for that language, giving it the ability to speak text in the specified voice. But all of this assumes a Google services environment. Google services are disabled on Android devices sold in China, and the capability that matters most there is Chinese text playback. How can that be achieved?

TextToSpeech source code analysis

As usual, I start by reading the class comment. Its main points: TextToSpeech can convert text to speech for playback or write it to an audio file, and it can only be used after initialization has completed. The initialization callback is the TextToSpeech.OnInitListener interface. When you are done with a TextToSpeech instance, remember to call shutdown() to release the native resources used by the engine.

/**
 * Synthesizes speech from text for immediate playback or to create a sound file.
 * <p>A TextToSpeech instance can only be used to synthesize text once it has completed its
 * initialization. Implement the {@link TextToSpeech.OnInitListener} to be
 * notified of the completion of the initialization.<br>
 * When you are done using the TextToSpeech instance, call the {@link #shutdown()} method
 * to release the native resources used by the TextToSpeech engine.
 */
public class TextToSpeech {

Next, look at the initialization callback interface. When the status parameter of onInit() is SUCCESS, initialization has succeeded; anything else, such as setting parameters or calling the playback interface, must wait until after this point, otherwise it will not work.
It is also worth learning from Google's documentation style here: the comment lists every possible value of the parameter, which makes it very clear.

    /**
     * Interface definition of a callback to be invoked indicating the completion of the
     * TextToSpeech engine initialization.
     */
    public interface OnInitListener {
        /**
         * Called to signal the completion of the TextToSpeech engine initialization.
         *
         * @param status {@link TextToSpeech#SUCCESS} or {@link TextToSpeech#ERROR}.
         */
        void onInit(int status);
    }

Before going further, here is a short demo fragment showing how TextToSpeech is used.

TextToSpeech usage example

// ...... fields and other code omitted ......
textToSpeech = new TextToSpeech(this, this); // Parameters: Context, TextToSpeech.OnInitListener

/**
 * Called when the TextToSpeech engine finishes initializing.
 * status: SUCCESS or ERROR
 * setLanguage: selects the language to synthesize
 */
@Override
public void onInit(int status) {
    if (status == TextToSpeech.SUCCESS) {
        int result = textToSpeech.setLanguage(Locale.CHINA);
        if (result == TextToSpeech.LANG_MISSING_DATA
                || result == TextToSpeech.LANG_NOT_SUPPORTED) {
            Toast.makeText(this, "Data missing or not supported", Toast.LENGTH_SHORT).show();
        }
    }
}

@Override
public void onClick(View v) {
    if (textToSpeech != null && !textToSpeech.isSpeaking()) {
        textToSpeech.setPitch(1.0f); // Set the pitch; it must be > 0, and 1.0f is the default
        textToSpeech.speak("I'm the text to play",
                TextToSpeech.QUEUE_FLUSH, null);
    }
}

@Override
protected void onStop() {
    super.onStop();
    textToSpeech.stop();     // Stop speaking
    textToSpeech.shutdown(); // Release engine resources
}

With this demo in hand, we have a basic understanding of how TextToSpeech is used. Next, we follow the demo's call sequence through the source code.
First, of course, we create a new TextToSpeech object, so let's look at its constructors. There are three, but only the first two are visible to users of the API; the last one is used internally by the system. The difference between the first two is that the former uses the system's default TTS engine, while the latter can specify a TTS engine by package name through the String engine parameter.

public TextToSpeech(Context context, OnInitListener listener) {
    this(context, listener, null);
}

public TextToSpeech(Context context, OnInitListener listener, String engine) {
    this(context, listener, engine, null, true);
}

public TextToSpeech(Context context, OnInitListener listener, String engine,
        String packageName, boolean useFallback) {
    mContext = context;
    mInitListener = listener;
    mRequestedEngine = engine;
    mUseFallback = useFallback;

    mEarcons = new HashMap<String, Uri>();
    mUtterances = new HashMap<CharSequence, Uri>();
    mUtteranceProgressListener = null;

    mEnginesHelper = new TtsEngines(mContext);

    initTts();
}

Although apps usually call the simple constructors, the real work still happens in the internal one, which ends by calling initTts().
initTts() is the key function in TextToSpeech: it reveals how the system selects and connects to a TTS engine. The code is a bit long, but it has to be listed here.

private int initTts() {
    // Step 1: Try connecting to the engine that was requested.
    if (mRequestedEngine != null) {
        if (mEnginesHelper.isEngineInstalled(mRequestedEngine)) {
            if (connectToEngine(mRequestedEngine)) {
                mCurrentEngine = mRequestedEngine;
                return SUCCESS;
            } else if (!mUseFallback) {
                mCurrentEngine = null;
                return ERROR;
            }
        } else if (!mUseFallback) {
            Log.i(TAG, "Requested engine not installed: " + mRequestedEngine);
            mCurrentEngine = null;
            return ERROR;
        }
    }

    // Step 2: Try connecting to the user's default engine.
    final String defaultEngine = getDefaultEngine();
    if (defaultEngine != null && !defaultEngine.equals(mRequestedEngine)) {
        if (connectToEngine(defaultEngine)) {
            mCurrentEngine = defaultEngine;
            return SUCCESS;
        }
    }

    // Step 3: Try connecting to the highest ranked engine in the
    // system.
    final String highestRanked = mEnginesHelper.getHighestRankedEngineName();
    if (highestRanked != null && !highestRanked.equals(mRequestedEngine) &&
            !highestRanked.equals(defaultEngine)) {
        if (connectToEngine(highestRanked)) {
            mCurrentEngine = highestRanked;
            return SUCCESS;
        }
    }

    // NOTE: The API currently does not allow the caller to query whether
    // they are actually connected to any engine. This might fail for various
    // reasons like if the user disables all her TTS engines.

    mCurrentEngine = null;
    return ERROR;
}

The comments in this code already spell out the three steps:
Step 1: Try connecting to the engine that was requested.
Step 2: Try connecting to the user's default engine.
Step 3: Try connecting to the highest ranked engine in the system.
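Ignoring connection failures, this three-step priority can be sketched as plain Java. The EngineStore interface below is a stand-in for the real TtsEngines helper, introduced only for illustration:

```java
import java.util.*;

/**
 * Sketch of TextToSpeech's engine-selection order (requested -> user default
 * -> highest ranked), mirroring the three steps of initTts(). EngineStore is
 * a hypothetical stand-in for the TtsEngines helper class.
 */
public class EngineSelector {
    interface EngineStore {
        boolean isInstalled(String engine);
        String defaultEngine();  // may be null
        String highestRanked();  // may be null
    }

    /** Returns the engine to use, or null (which maps to ERROR in initTts()). */
    static String select(String requested, boolean useFallback, EngineStore store) {
        // Step 1: the engine the caller asked for in the constructor.
        if (requested != null) {
            if (store.isInstalled(requested)) return requested;
            if (!useFallback) return null; // no fallback allowed -> ERROR
        }
        // Step 2: the user's default engine from system settings.
        String def = store.defaultEngine();
        if (def != null && store.isInstalled(def)) return def;
        // Step 3: the highest ranked installed engine.
        return store.highestRanked();
    }
}
```

Note how useFallback only matters when a specific engine was requested: it controls whether a missing requested engine aborts the search or lets it fall through to steps 2 and 3.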
So what is the requested engine? Going back to the second and third constructors of TextToSpeech, we can see that a String engine parameter can be passed. If this parameter is non-null, the system will look for and connect to the TTS engine with that package name.
The default engine is then obtained through getDefaultEngine().
As the comments explain, "default" here corresponds to the system settings. If several engines are available to choose from — the system's built-in one, the phone vendor's own playback engine (such as Xiaomi's or Huawei's on current Chinese phones), or a manually installed one such as the iFLYTEK voice input method, which also provides speech playback — the user can set the default among them. On a stock system the only default engine is "com.svox.pico".

    /**
     * @return the default TTS engine. If the user has set a default, and the engine
     *         is available on the device, the default is returned. Otherwise,
     *         the highest ranked engine is returned as per {@link EngineInfoComparator}.
     */
    public String getDefaultEngine() {
        String engine = getString(mContext.getContentResolver(),
                Settings.Secure.TTS_DEFAULT_SYNTH);
        return isEngineInstalled(engine) ? engine : getHighestRankedEngineName();
    }

Finally, step three connects to the highest ranked engine.
getHighestRankedEngineName() in turn calls getEngines():

    /**
     * Gets a list of all installed TTS engines.
     *
     * @return A list of engine info objects. The list can be empty, but never {@code null}.
     */
    public List<EngineInfo> getEngines() {
        PackageManager pm = mContext.getPackageManager();
        Intent intent = new Intent(Engine.INTENT_ACTION_TTS_SERVICE);
        List<ResolveInfo> resolveInfos =
                pm.queryIntentServices(intent, PackageManager.MATCH_DEFAULT_ONLY);
        if (resolveInfos == null) return Collections.emptyList();

        List<EngineInfo> engines = new ArrayList<EngineInfo>(resolveInfos.size());

        for (ResolveInfo resolveInfo : resolveInfos) {
            EngineInfo engine = getEngineInfo(resolveInfo, pm);
            if (engine != null) {
                engines.add(engine);
            }
        }
        Collections.sort(engines, EngineInfoComparator.INSTANCE);

        return engines;
    }

It is clear that the system uses PackageManager to query all applications whose intent filter declares the Engine.INTENT_ACTION_TTS_SERVICE action; any such application is treated as a TTS engine.
After those three steps, if any TTS engine exists in the system, all that searching was worthwhile: we have found an engine that can be connected to and obtained its name. TextToSpeech then binds to its service, again using an Intent with the Engine.INTENT_ACTION_TTS_SERVICE action. This is ordinary service-binding code:

private boolean connectToEngine(String engine) {
    Connection connection = new Connection();
    Intent intent = new Intent(Engine.INTENT_ACTION_TTS_SERVICE);
    intent.setPackage(engine);
    boolean bound = mContext.bindService(intent, connection, Context.BIND_AUTO_CREATE);
    if (!bound) {
        Log.e(TAG, "Failed to bind to " + engine);
        return false;
    } else {
        Log.i(TAG, "Successfully bound to " + engine);
        mConnectingServiceConnection = connection;
        return true;
    }
}
The private class Connection implements the ServiceConnection interface and provides the AIDL callback methods. It also contains an inner class, SetupConnectionAsyncTask, which does a lot of work; it is this asynchronous task that invokes the onInit() callback from our demo, via dispatchOnInit(result). If the connection is lost, dispatchOnInit(ERROR) is delivered to the user. If bindService() succeeds, onServiceConnected() runs mService = ITextToSpeechService.Stub.asInterface(service); this mService is the Binder interface to the TTS engine, and the actual engine methods are invoked through it.
At this point, once the connection has succeeded, we can use the methods TextToSpeech provides, such as speak() and stop().
To sum up this long chain: the service being connected to is a TextToSpeechService. It, too, is part of the Android system source code, and the system's native TTS engine inherits from it.

Native TTS engine analysis

We know that TextToSpeech obtains its TTS capability by binding to TextToSpeechService, so how does the TTS engine hook into it?
Looking at external/svox/pico/compat/src/com/android/tts/compat/ in the system source, we find a class that inherits the system service: public abstract class CompatTtsService extends TextToSpeechService. It implements a number of the interface methods, and this abstract class is in turn inherited by the real engine service:
public class PicoService extends CompatTtsService
So the actual work is done in CompatTtsService.
It holds a field private SynthProxy mNativeSynth = null; the SynthProxy class implements getLanguage, isLanguageAvailable, setLanguage, speak, stop, shutdown and other methods, so SynthProxy is the next level of implementation.

/**
 * The SpeechSynthesis class provides a high-level api to create and play
 * synthesized speech. This class is used internally to talk to a native
 * TTS library that implements the interface defined in
 * frameworks/base/include/tts/TtsEngine.h
 */
public class SynthProxy {

    static {
        System.loadLibrary("ttscompat");
    }

As the comment shows, the final implementation is native code reached through JNI, provided by the ttscompat shared library.

public int speak(SynthesisRequest request, SynthesisCallback callback) {
    return native_speak(mJniData, request.getText(), callback);
}

public void shutdown() {
    mJniData = 0;
}

That covers the implementation principle and the modules behind TextToSpeech. Now, what about customizing a TTS engine?

Custom TTS engine

Because the native TextToSpeech setup does not provide Chinese playback — and even where it does, the required network access is hard to come by in the domestic environment — many manufacturers integrate their own speech engine into the system. So how can we build a customized TTS engine ourselves?

First approach: inherit the system's TextToSpeechService class and implement its methods.

The system even provides an example:
public class RobotSpeakTtsService extends TextToSpeechService
You need to implement the abstract methods of TextToSpeechService:

protected abstract int onIsLanguageAvailable(String lang, String country, String variant);

protected abstract String[] onGetLanguage();

protected abstract int onLoadLanguage(String lang, String country, String variant);

protected abstract void onStop();

/**
 * Tells the service to synthesize speech from the given text. This method should block until
 * the synthesis is finished. Called on the synthesis thread.
 *
 * @param request The synthesis request.
 * @param callback The callback that the engine must use to make data available for playback or
 *     for writing to a file.
 */
protected abstract void onSynthesizeText(SynthesisRequest request, SynthesisCallback callback);

The most important method is onSynthesizeText(). As its comment says, it generates audio from the provided text and blocks until generation finishes. It reads the playback parameters from the SynthesisRequest argument, and reports status and audio data back to the system through the SynthesisCallback argument.
Below is the implementation from the sample engine just mentioned. Since the source is not available locally, the snippet keeps the line numbers from the online source, which should not hinder reading.

156    protected synchronized void onSynthesizeText(SynthesisRequest request,
157            SynthesisCallback callback) {
158        // Note that we call onLoadLanguage here since there is no guarantee
159        // that there would have been a prior call to this function.
160        int load = onLoadLanguage(request.getLanguage(), request.getCountry(),
161                request.getVariant());
163        // We might get requests for a language we don't support - in which case
164        // we error out early before wasting too much time.
165        if (load == TextToSpeech.LANG_NOT_SUPPORTED) {
166            callback.error();
167            return;
168        }
170        // At this point, we have loaded the language we need for synthesis and
171        // it is guaranteed that we support it so we proceed with synthesis.
173        // We denote that we are ready to start sending audio across to the
174        // framework. We use a fixed sampling rate (16khz), and send data across
175        // in 16bit PCM mono.
176        callback.start(SAMPLING_RATE_HZ,
177                AudioFormat.ENCODING_PCM_16BIT, 1 /* Number of channels. */);
179        // We then scan through each character of the request string and
180        // generate audio for it.
181        final String text = request.getText().toLowerCase();
182        for (int i = 0; i < text.length(); ++i) {
183            char value = normalize(text.charAt(i));
184            // It is crucial to call either of callback.error() or callback.done() to ensure
185            // that audio / other resources are released as soon as possible.
186            if (!generateOneSecondOfAudio(value, callback)) {
187                callback.error();
188                return;
189            }
190        }
192        // Alright, we're done with our synthesis - yay!
193        callback.done();
194    }

As you can see, before the engine starts producing audio it must call callback.start(SAMPLING_RATE_HZ, AudioFormat.ENCODING_PCM_16BIT, 1 /* number of channels */) to tell the system the sample rate of the generated audio (16 kHz here), its encoding (16-bit PCM), and that it is mono. After receiving this callback, the system starts waiting for audio data and begins TTS playback.
generateOneSecondOfAudio() fakes the audio a real engine would synthesize; once generation is complete, callback.done() is called.
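To make the audio side of such an engine concrete, here is a plain-Java sketch (not Android code: the sine generator stands in for real synthesis, and the audioAvailable() push appears only as a comment) of producing one second of 16 kHz, 16-bit mono PCM and splitting it into chunks, the way an engine would before handing them to callback.audioAvailable():

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/**
 * Illustration of the PCM bookkeeping inside an onSynthesizeText()-style
 * implementation. SAMPLING_RATE_HZ matches the sample engine's fixed rate;
 * everything else here is a stand-in for real synthesis.
 */
public class PcmSketch {
    static final int SAMPLING_RATE_HZ = 16000;

    /** One second of a sine tone as little-endian 16-bit mono PCM. */
    static byte[] oneSecondOfTone(double freqHz) {
        // 16000 samples/s * 2 bytes/sample * 1 channel = 32000 bytes per second.
        ByteBuffer buf = ByteBuffer.allocate(SAMPLING_RATE_HZ * 2)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < SAMPLING_RATE_HZ; i++) {
            double t = (double) i / SAMPLING_RATE_HZ;
            // Half-amplitude sine sample, quantized to 16 bits.
            buf.putShort((short) (Math.sin(2 * Math.PI * freqHz * t)
                    * Short.MAX_VALUE * 0.5));
        }
        return buf.array();
    }

    /**
     * Splits the PCM into chunks no larger than maxBufferSize, as an engine
     * does before each callback.audioAvailable() call. Returns the chunk count.
     */
    static int pushInChunks(byte[] pcm, int maxBufferSize) {
        int chunks = 0;
        for (int offset = 0; offset < pcm.length; offset += maxBufferSize) {
            int len = Math.min(maxBufferSize, pcm.length - offset);
            // On Android this would be: callback.audioAvailable(pcm, offset, len);
            chunks++;
        }
        return chunks;
    }
}
```

On a real device the chunk limit comes from callback.getMaxBufferSize(), and the engine must finish with callback.done() (or callback.error()) so the framework can release the audio resources.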
The advantage of this approach is that only a few methods need implementing and there is no per-platform divergence to handle; all other interfaces behave exactly as in the native system.
The disadvantage is that it places high demands on the engine and debugging is troublesome: without the Android source code for the target system it is hard to diagnose problems, because the system's internal logs are not printed and internal failures are difficult to locate.

Second approach: take the AIDL interface that TextToSpeech uses and implement it directly.

From the earlier analysis we know that TextToSpeech connects to the engine via bindService(), and that the service exposes an AIDL interface. We can take that AIDL, implement the server side in our custom engine, and leave the client side untouched. The server's AIDL interface must, of course, remain identical to the system's.
How to implement AIDL is not explained in detail here; readers with some background will already see the idea.
The advantage of this approach is that it is highly customizable: the exposed interfaces can be implemented to fit the actual situation.
The disadvantage is that many methods must be implemented, and because the AIDL interface is upgraded across Android system versions, the engine will not be very portable.

Of course, either implementation requires removing the system's native TTS engine; if you cannot obtain the system source, you can only specify the engine name explicitly as described above.
Most importantly, for the system's TextToSpeech to discover the custom engine, the service's intent filter in AndroidManifest.xml mentioned above is essential; without it the application is not recognized as a TTS engine.
Here is the configuration of PicoService in the system:

<service android:name=".PicoService"
         android:label="@string/app_name">
    <intent-filter>
        <action android:name="android.intent.action.TTS_SERVICE" />
        <category android:name="android.intent.category.DEFAULT" />
    </intent-filter>
    <meta-data android:name="android.speech.tts" android:resource="@xml/tts_engine" />
</service>
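The @xml/tts_engine resource referenced by the meta-data entry is a small XML file describing the engine. A minimal version looks like the following; the settingsActivity attribute is optional, and the activity name shown here is illustrative:

```xml
<?xml version="1.0" encoding="utf-8"?>
<tts-engine xmlns:android="http://schemas.android.com/apk/res/android"
    android:settingsActivity="com.example.tts.EngineSettings" />
```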

That is the end of this article. If you have any questions, feel free to leave a comment for discussion.


Topics: Android Java Google Mobile