Analysis of audio and video synchronization in ffplay

Posted by dcjones on Fri, 28 Jan 2022 05:10:11 +0100

By default, ffplay synchronizes video to audio, that is, it uses the audio clock as the master clock.

Main flow

ffplay's main scheme for synchronizing video to audio is this: if the video is running ahead, the previous frame is shown again to wait for the audio; if the video is lagging behind, frames are dropped to catch up with the audio.

This logic is implemented in the video output function video_refresh. Before analyzing the code, let's review the flow chart of this function:

In this process, the step of "calculating the display duration of the previous frame" is very important. Let's first look at the code:

static void video_refresh(void *opaque, double *remaining_time)
{
    //......
    // lastvp: previous frame (currently on screen), vp: current frame, nextvp: next frame

    last_duration = vp_duration(is, lastvp, vp); // nominal duration of the previous frame
    delay = compute_target_delay(last_duration, is); // actual display duration of the previous frame, adjusted against the audio clock

    time = av_gettime_relative() / 1000000.0; // current system time in seconds
    if (time < is->frame_timer + delay) { // the previous frame has not been shown long enough yet: keep displaying it
        *remaining_time = FFMIN(is->frame_timer + delay - time, *remaining_time);
        goto display;
    }

    is->frame_timer += delay; // frame_timer now marks the end of the previous frame, which is also the start of the current frame
    if (delay > 0 && time - is->frame_timer > AV_SYNC_THRESHOLD_MAX)
        is->frame_timer = time; // frame_timer lags the system time by too much: reset it to the system time

    // Update the video clock
    // It is not the master clock when video is synced to audio; compute_target_delay() compares it with the audio clock
    SDL_LockMutex(is->pictq.mutex);
    if (!isnan(vp->pts))
        update_video_pts(is, vp->pts, vp->pos, vp->serial);
    SDL_UnlockMutex(is->pictq.mutex);

    //......

    //Frame dropping logic
    if (frame_queue_nb_remaining(&is->pictq) > 1) {
        Frame *nextvp = frame_queue_peek_next(&is->pictq);
        duration = vp_duration(is, vp, nextvp); // display duration of the current frame
        if (time > is->frame_timer + duration) { // the system time is already past the end of the current frame: drop it
            is->frame_drops_late++;
            frame_queue_next(&is->pictq);
            goto retry; // jump back to the start of the function and retry (a while loop that drops frames right here would not do, because the audio clock has most likely been updated in the meantime, so delay has to be recalculated)
        }
    }
}

The logic of this code matches the flow chart above. The main idea, as stated at the beginning, is that if the video is running ahead, the previous frame is repeated to wait for the audio, and if the video is lagging behind, frames are dropped to catch up with the audio. This is implemented by consulting the audio clock to calculate how long the previous frame (the picture currently on screen) should be displayed in total, including the frame's own duration, and then comparing the result with the system time to decide whether the next frame should be displayed.
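The excerpt calls vp_duration without showing it. The sketch below is modeled on my reading of that function in ffplay.c, but it is not the original: SimpleFrame and frame_duration are hypothetical, simplified stand-ins, and max_frame_duration is passed in as a plain parameter instead of being read from VideoState.

```c
#include <math.h>

/* Hypothetical, simplified stand-in for ffplay's Frame. */
typedef struct {
    double pts;      /* presentation timestamp, in seconds */
    double duration; /* nominal duration taken from the stream, in seconds */
    int    serial;   /* play sequence number, changes after a seek */
} SimpleFrame;

/* Duration of vp, estimated from the pts of the frame that follows it. */
static double frame_duration(const SimpleFrame *vp, const SimpleFrame *nextvp,
                             double max_frame_duration)
{
    double d;

    if (vp->serial != nextvp->serial)
        return 0.0;              /* frames belong to different play sequences */

    d = nextvp->pts - vp->pts;
    if (isnan(d) || d <= 0 || d > max_frame_duration)
        return vp->duration;     /* implausible gap: fall back to the nominal duration */
    return d;
}
```

For two consecutive 25 fps frames with pts 1.00 and 1.04 this yields roughly 0.04 seconds. If the pts difference is missing, negative, or too large, the nominal duration of vp is used instead.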

For the comparison with the system time, another concept is introduced: frame_timer. It can be understood as the time at which a frame starts being displayed. Before the update it is the display time of the previous frame; after the update (is->frame_timer += delay) it is the display time of the current frame.
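The frame_timer update can be looked at in isolation. This is only a minimal sketch and not ffplay's code: advance_frame_timer is a hypothetical helper that mirrors the two statements in the excerpt above, and the value 0.1 for AV_SYNC_THRESHOLD_MAX is the one defined in ffplay.c.

```c
/* Value of AV_SYNC_THRESHOLD_MAX in ffplay.c, in seconds. */
#define AV_SYNC_THRESHOLD_MAX 0.1

/* Hypothetical helper: returns the new frame_timer once the previous frame
 * has been displayed for `delay` seconds. `now` is the system time.
 * All values are in seconds. */
static double advance_frame_timer(double frame_timer, double delay, double now)
{
    frame_timer += delay;        /* end of lastvp, which is also the start of vp */
    if (delay > 0 && now - frame_timer > AV_SYNC_THRESHOLD_MAX)
        frame_timer = now;       /* fell too far behind: resync to the system time */
    return frame_timer;
}
```

With frame_timer = 10.0 and delay = 0.04, a system time of 10.05 gives a new frame_timer of about 10.04. A system time of 10.5 is more than 0.1 seconds past 10.04, so frame_timer is reset to 10.5.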

The display time of the previous frame plus delay (the total time it should stay on screen, including the frame's own duration) is the time at which the previous frame should stop being displayed. See the following schematic diagram for the principle:


The diagram illustrates three cases:

  • time1: the system time is earlier than the time at which lastvp should stop being displayed (frame_timer + delay), i.e. the position of the dotted circle. lastvp should continue to be displayed.
  • time2: the system time is later than the end of lastvp's display time but earlier than the end of vp's (vp's display time runs from the dotted circle to the black circle). lastvp is not repeated and vp is not dropped; in other words, vp should be displayed.
  • time3: the system time is later than the end of vp's display time (the black circle, which is also the expected start of nextvp's display time). vp should be dropped.
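The three cases can be written down as a small decision function. This is a sketch under simplifying assumptions, not ffplay's code: RefreshAction and classify are hypothetical names, and the function leaves out the frame_timer resync as well as the check that more than one frame is queued.

```c
/* Hypothetical names for the three outcomes in the list above. */
typedef enum { REPEAT_LAST, SHOW_CURRENT, DROP_CURRENT } RefreshAction;

/* frame_timer: time at which lastvp started being displayed
 * delay:       how long lastvp should stay on screen
 * duration:    display duration of vp
 * now:         current system time
 * All values are in seconds. */
static RefreshAction classify(double frame_timer, double delay,
                              double duration, double now)
{
    if (now < frame_timer + delay)
        return REPEAT_LAST;      /* time1: lastvp has not been shown long enough */

    frame_timer += delay;        /* frame_timer is now the start time of vp */
    if (now > frame_timer + duration)
        return DROP_CURRENT;     /* time3: vp's display slot is already over */

    return SHOW_CURRENT;         /* time2: now falls inside vp's display slot */
}
```

With frame_timer = 10.0 and delay = duration = 0.04, a system time of 10.02 means lastvp is repeated, 10.05 means vp is shown, and 10.10 means vp is dropped.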


Calculation of delay

The next step is to see how the most critical value, the display duration delay of lastvp, is calculated.

This is implemented in the function compute_target_delay:

static double compute_target_delay(double delay, VideoState *is)
{
    double sync_threshold, diff = 0;

    /* update delay to follow master synchronisation source */
    if (get_master_sync_type(is) != AV_SYNC_VIDEO_MASTER) {
        /* if video is slave, we try to correct big delays by
           duplicating or deleting a frame */
        diff = get_clock(&is->vidclk) - get_master_clock(is);

        /* skip or repeat frame. We take into account the
           delay to compute the threshold. I still don't know
           if it is the best guess */
        sync_threshold = FFMAX(AV_SYNC_THRESHOLD_MIN, FFMIN(AV_SYNC_THRESHOLD_MAX, delay));
        if (!isnan(diff) && fabs(diff) < is->max_frame_duration) {
            if (diff <= -sync_threshold)
                delay = FFMAX(0, delay + diff);
            else if (diff >= sync_threshold && delay > AV_SYNC_FRAMEDUP_THRESHOLD)
                delay = delay + diff;
            else if (diff >= sync_threshold)
                delay = 2 * delay;
        }
    }

    av_log(NULL, AV_LOG_TRACE, "video: delay=%0.3f A-V=%f\n",
            delay, -diff);

    return delay;
}

The comments in the code above all come from the original source. The function is short, and comments make up about half of it, which shows how important this piece of code is.

The hardest part of this code to understand is sync_threshold. A diagram helps:

The axis in the figure is the value of diff. diff = 0 means that the video clock and the audio clock agree exactly, i.e. perfect synchronization. The colored blocks at the bottom of the figure indicate the value that is returned. In those blocks, delay refers to the incoming parameter which, as seen in the previous section, is the display duration of lastvp.

As the figure shows, sync_threshold defines a zone in which delay is returned unchanged, with no adjustment to the display duration of lastvp. In other words, within this zone the two streams are considered "quasi-synchronized".
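How wide this zone is depends on the frame duration, because sync_threshold is delay clamped between two constants. A minimal sketch: the constant values are the ones defined in ffplay.c, while sync_threshold_for is a hypothetical helper that spells out the FFMAX/FFMIN expression.

```c
/* Values defined in ffplay.c, in seconds. */
#define AV_SYNC_THRESHOLD_MIN 0.04
#define AV_SYNC_THRESHOLD_MAX 0.1

/* Hypothetical helper: clamp delay into
 * [AV_SYNC_THRESHOLD_MIN, AV_SYNC_THRESHOLD_MAX]. */
static double sync_threshold_for(double delay)
{
    double t = delay < AV_SYNC_THRESHOLD_MAX ? delay : AV_SYNC_THRESHOLD_MAX;
    return t > AV_SYNC_THRESHOLD_MIN ? t : AV_SYNC_THRESHOLD_MIN;
}
```

For a 60 fps video (delay of about 0.0167) the threshold is the lower bound 0.04; for 15 fps (about 0.0667) it equals delay; for 5 fps (0.2) it is capped at 0.1.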

If diff is less than -sync_threshold, the video is lagging behind and has to catch up, dropping frames if necessary. The function returns delay + diff with a lower bound of 0, i.e. FFMAX(0, delay + diff). In terms of the earlier frame_timer diagram, a return value of 0 means lastvp ends immediately and the display advances at least to vp.

If diff is greater than sync_threshold, the video is running ahead and lastvp needs to be held longer. The function returns 2 * delay, i.e. twice the display duration of lastvp, so that lastvp is shown for one more frame period.

If diff is greater than sync_threshold and, in addition, delay is greater than AV_SYNC_FRAMEDUP_THRESHOLD, the function returns delay + diff instead, so diff determines how much longer lastvp is displayed. Note that this second condition is on delay, not on diff. The two cases are treated differently because AV_SYNC_FRAMEDUP_THRESHOLD is 0.1: when delay is already longer than 0.1 seconds, 2 * delay would hold lastvp for a rather long time, so delay + diff is used instead.
One more remark on this function: all it does is adjust the display duration of video frames so that the video stays in sync with the audio in real time, or at least imperceptibly so. Other implementations are possible. As the author of the code says in the comment: "I still don't know if it is the best guess".
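To make the branches concrete, here is a standalone sketch of the same decision logic. It is not the ffplay function itself: target_delay is a hypothetical rewrite that takes diff and max_frame_duration as plain parameters instead of reading them from VideoState, while the three constants carry the values defined in ffplay.c.

```c
#include <math.h>

/* Values defined in ffplay.c, in seconds. */
#define AV_SYNC_THRESHOLD_MIN      0.04
#define AV_SYNC_THRESHOLD_MAX      0.1
#define AV_SYNC_FRAMEDUP_THRESHOLD 0.1

/* delay: nominal duration of lastvp
 * diff:  video clock minus master (audio) clock
 * Returns how long lastvp should actually stay on screen. */
static double target_delay(double delay, double diff, double max_frame_duration)
{
    /* sync_threshold is delay clamped to [MIN, MAX] */
    double sync_threshold = delay;
    if (sync_threshold > AV_SYNC_THRESHOLD_MAX) sync_threshold = AV_SYNC_THRESHOLD_MAX;
    if (sync_threshold < AV_SYNC_THRESHOLD_MIN) sync_threshold = AV_SYNC_THRESHOLD_MIN;

    if (isnan(diff) || fabs(diff) >= max_frame_duration)
        return delay;                    /* diff is unusable: no correction */
    if (diff <= -sync_threshold)         /* video is behind: shorten, but not below 0 */
        return delay + diff > 0 ? delay + diff : 0;
    if (diff >= sync_threshold && delay > AV_SYNC_FRAMEDUP_THRESHOLD)
        return delay + diff;             /* video is ahead, long frame: extend by diff */
    if (diff >= sync_threshold)
        return 2 * delay;                /* video is ahead, short frame: show it twice */
    return delay;                        /* inside the quasi-synchronized zone */
}
```

For a 25 fps video (delay = 0.04, so sync_threshold = 0.04): diff = 0.01 leaves delay at 0.04, diff = -0.1 gives 0, and diff = 0.05 gives 0.08. For a 5 fps video (delay = 0.2, so sync_threshold = 0.1), diff = 0.15 gives about 0.35, where doubling would have given 0.4. The value 10.0 used for max_frame_duration when trying these numbers is only an example.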

That completes the analysis of how video is synchronized to audio. In summary:

  • The basic strategy: if the video is running ahead, repeat the previous frame to wait for the audio; if the video is lagging behind, drop frames to catch up with the audio.
  • The strategy is implemented by introducing frame_timer, which marks when a frame starts being displayed and, together with delay, when its display should end. This is then compared with the system time to decide whether to repeat or drop a frame.
  • The time at which lastvp should stop being displayed depends not only on the frame's own duration but also on the difference between the video clock and the audio clock.
  • Synchronization is not enforced at every instant; there is a "quasi-synchronized" zone in which no correction is made.

Topics: C++