FFmpeg data structure AVFrame

Posted by monke on Mon, 03 Jan 2022 17:15:13 +0100

This article is based on FFmpeg 4.1.

1. Data structure definition

1.1 related basic concepts

Before reading the AVFrame data structure, you need to understand a few basic concepts related to it (only video-related concepts are considered here):

pixel_format: pixel format, the arrangement of image pixels in memory. A pixel format implies information such as color space, sampling mode, storage mode and bit depth.

bit_depth: bit depth, the bit width occupied by a single sample of each component (Y, U, V, R, G, B, etc.).

plane: an area of memory that stores one or more components of an image. In planar storage mode, each component (or group of components) occupies its own plane, so there are at least two planes: yuv420p has three planes (Y, U and V), nv12 has two planes (Y and interleaved UV), and gbrap has four planes (G, B, R and A). In packed storage mode the pixels of all components are interleaved, so there is only one plane.

slice: a group of consecutive rows in an image. The rows must be contiguous, ordered either from top to bottom or from bottom to top.

stride/pitch: the number of bytes occupied by one line of a component (such as the luma or chroma component), i.e. the width in bytes of one row of data in a plane. Because of alignment requirements, it is calculated as:

stride = image width * number of components * single-sample width / horizontal subsampling factor / 8

Here, image width is the width of the image in pixels; number of components is how many components the current plane contains (for example, in rgb24 format the single plane has three components: R, G and B); single-sample width is the number of bits a sample of one component actually occupies in memory after alignment (for example, bit depth 8 occupies 8 bits, while bit depth 10 typically occupies 16 bits; the alignment value is platform dependent); horizontal subsampling factor is how many pixels share one sample in the horizontal direction (luma is never subsampled, so its factor is always 1).

For a detailed description of the above concepts, refer to Section 4 of "Color space and pixel format".

1.2 AVFrame data structure

struct AVFrame is defined in <libavutil/frame.h>:

struct AVFrame frame;

AVFrame holds decoded raw data: in decoding, AVFrame is the output of the decoder; in encoding, AVFrame is the input of the encoder. In the following figure, the data type of "decoded frames" is AVFrame:

 _______              ______________
|       |            |              |
| input |  demuxer   | encoded data |   decoder
| file  | ---------> | packets      | -----+
|_______|            |______________|      |
                                           v
                                       _________
                                      |         |
                                      | decoded |
                                      | frames  |
                                      |_________|
 ________             ______________       |
|        |           |              |      |
| output | <-------- | encoded data | <----+
| file   |   muxer   | packets      |   encoder
|________|           |______________|

The AVFrame data structure is very important and has many members, which makes its definition long. In the excerpt below, lengthy comments and most members are omitted. First the general usage of AVFrame is described, then some important members are explained separately:

 * This structure describes decoded (raw) audio or video data.
 * AVFrame must be allocated using av_frame_alloc(). Note that this only
 * allocates the AVFrame itself, the buffers for the data must be managed
 * through other means (see below).
 * AVFrame must be freed with av_frame_free().
 * AVFrame is typically allocated once and then reused multiple times to hold
 * different data (e.g. a single AVFrame to hold frames received from a
 * decoder). In such a case, av_frame_unref() will free any references held by
 * the frame and reset it to its original clean state before it
 * is reused again.
 * The data described by an AVFrame is usually reference counted through the
 * AVBuffer API. The underlying buffer references are stored in AVFrame.buf /
 * AVFrame.extended_buf. An AVFrame is considered to be reference counted if at
 * least one reference is set, i.e. if AVFrame.buf[0] != NULL. In such a case,
 * every single data plane must be contained in one of the buffers in
 * AVFrame.buf or AVFrame.extended_buf.
 * There may be a single buffer for all the data, or one separate buffer for
 * each plane, or anything in between.
 * sizeof(AVFrame) is not a part of the public ABI, so new fields may be added
 * to the end with a minor bump.
 * Fields can be accessed through AVOptions, the name string used, matches the
 * C structure field name for fields accessible through AVOptions. The AVClass
 * for AVFrame can be obtained from avcodec_get_frame_class()
typedef struct AVFrame {
    uint8_t *data[AV_NUM_DATA_POINTERS];
    int linesize[AV_NUM_DATA_POINTERS];
    uint8_t **extended_data;
    int width, height;
    int nb_samples;
    int format;
    int key_frame;
    enum AVPictureType pict_type;
    AVRational sample_aspect_ratio;
    int64_t pts;
} AVFrame;

AVFrame usage:

  1. An AVFrame object must be allocated on the heap with av_frame_alloc(). Note that this allocates only the AVFrame object itself. The AVFrame object must be destroyed with av_frame_free().
  2. The data buffers referenced by an AVFrame must be managed by other means, e.g. allocated with av_frame_get_buffer() or attached through the AVBuffer API.
  3. An AVFrame usually only needs to be allocated once, and can then be reused many times. av_frame_unref() should be called before each reuse to reset the frame to its original clean, usable state.

Some important members are excerpted below for explanation:

     * pointer to the picture/channel planes.
     * This might be different from the first allocated byte
     * Some decoders access areas outside 0,0 - width,height, please
     * see avcodec_align_dimensions2(). Some filters and swscale can read
     * up to 16 bytes beyond the planes, if these filters are to be used,
     * then 16 extra bytes must be allocated.
     * NOTE: Except for hwaccel formats, pointers not needed by the format
     * MUST be set to NULL.
    uint8_t *data[AV_NUM_DATA_POINTERS];

Stores the raw frame data (an uncompressed image or uncompressed audio samples, as the output of a decoder or the input of an encoder).

data is a pointer array; each element points to one plane of a video image, or to one plane (channel) of audio.

For a detailed description of image planes, refer to "Color space and pixel format"; for audio planes, refer to Section 6.1 of "ffplay source code analysis 6 - audio resampling". Briefly:
For packed formats, the Y, U and V samples of a YUV image are interleaved in a single plane, e.g. YUVYUVYUV..., and data[0] points to this plane;
for a two-channel audio frame, the left-channel (L) and right-channel (R) samples are interleaved in a single plane, e.g. LRLRLR..., and data[0] points to this plane.
For planar formats, a YUV image has three planes (Y, U and V): data[0] points to the Y plane, data[1] to the U plane and data[2] to the V plane;
a two-channel audio frame has two planes (L and R): data[0] points to the L plane and data[1] to the R plane.


     * For video, size in bytes of each picture line.
     * For audio, size in bytes of each plane.
     * For audio, only linesize[0] may be set. For planar audio, each channel
     * plane must be the same size.
     * For video the linesizes should be multiples of the CPUs alignment
     * preference, this is 16 or 32 for modern desktop CPUs.
     * Some code requires such alignment other code can be slower without
     * correct alignment, for yet other it makes no difference.
     * @note The linesize may be larger than the size of usable data -- there
     * may be extra padding present for performance reasons.
    int linesize[AV_NUM_DATA_POINTERS];

linesize is an array.

For video, each element of linesize is the size in bytes of one row of a plane, including any alignment padding. Its value matches the stride described in Section 1.1. For planar video there are multiple planes, and each plane's linesize is the storage occupied by one row of that plane; for packed video there is only one plane, and linesize[0] is the storage occupied by one full row of the image.

For audio, each element of linesize is the size in bytes of a plane. Packed multichannel audio has only one plane, while planar multichannel audio has one plane per channel. Only linesize[0] may be set, even when there are multiple planes; for planar audio, every channel plane must therefore be the same size.

For performance reasons, linesize may include extra padding, so it may be larger than the size of the usable data.


     * pointers to the data planes/channels.
     * For video, this should simply point to data[].
     * For planar audio, each channel has a separate data pointer, and
     * linesize[0] contains the size of each channel buffer.
     * For packed audio, there is just one data pointer, and linesize[0]
     * contains the total size of the buffer for all channels.
     * Note: Both data and extended_data should always be set in a valid frame,
     * but for planar audio with more channels that can fit in data,
     * extended_data must be used in order to access all channels.
    uint8_t **extended_data;

What is extended_data for?
For video, it simply points to the data[] member.
For audio, packed-format audio has only one plane, in which the samples of all channels are interleaved; planar-format audio has one plane per channel. Because data[] has only AV_NUM_DATA_POINTERS (8) elements, planar audio with more channels than that cannot expose all of its planes through data[]; extended_data always contains pointers to all the planes, so it must be used to access every channel in that case.
In a valid video/audio frame, both data and extended_data must be set.

width, height

     * @name Video dimensions
     * Video frames only. The coded dimensions (in pixels) of the video frame,
     * i.e. the size of the rectangle that contains some well-defined values.
     * @note The part of the frame intended for display/presentation is further
     * restricted by the @ref cropping "Cropping rectangle".
     * @{
    int width, height;

The coded width and height of the video frame, in pixels. The region intended for display may be further restricted by the cropping rectangle.


     * number of audio samples (per channel) described by this frame
    int nb_samples;

The number of sampling points contained in a single channel in an audio frame.


     * format of the frame, -1 if unknown or unset
     * Values correspond to enum AVPixelFormat for video frames,
     * enum AVSampleFormat for audio)
    int format;

Frame format; -1 if unknown or unset.
For video frames, this value corresponds to enum AVPixelFormat:

enum AVPixelFormat {
    AV_PIX_FMT_NONE = -1,
    AV_PIX_FMT_YUV420P,   ///< planar YUV 4:2:0, 12bpp, (1 Cr & Cb sample per 2x2 Y samples)
    AV_PIX_FMT_YUYV422,   ///< packed YUV 4:2:2, 16bpp, Y0 Cb Y1 Cr
    AV_PIX_FMT_RGB24,     ///< packed RGB 8:8:8, 24bpp, RGBRGB...
    AV_PIX_FMT_BGR24,     ///< packed RGB 8:8:8, 24bpp, BGRBGR...

For audio frames, this value corresponds to enum AVSampleFormat:

enum AVSampleFormat {
    AV_SAMPLE_FMT_U8,          ///< unsigned 8 bits
    AV_SAMPLE_FMT_S16,         ///< signed 16 bits
    AV_SAMPLE_FMT_S32,         ///< signed 32 bits
    AV_SAMPLE_FMT_FLT,         ///< float
    AV_SAMPLE_FMT_DBL,         ///< double

    AV_SAMPLE_FMT_U8P,         ///< unsigned 8 bits, planar
    AV_SAMPLE_FMT_S16P,        ///< signed 16 bits, planar
    AV_SAMPLE_FMT_S32P,        ///< signed 32 bits, planar
    AV_SAMPLE_FMT_FLTP,        ///< float, planar
    AV_SAMPLE_FMT_DBLP,        ///< double, planar
    AV_SAMPLE_FMT_S64,         ///< signed 64 bits
    AV_SAMPLE_FMT_S64P,        ///< signed 64 bits, planar

    AV_SAMPLE_FMT_NB           ///< Number of sample formats. DO NOT USE if linking dynamically


     * 1 -> keyframe, 0-> not
    int key_frame;

Flag telling whether the video frame is a key frame: 1 -> key frame, 0 -> not a key frame.


     * Picture type of the frame.
    enum AVPictureType pict_type;

Video frame type (I, B, P, etc.). As follows:

 * @defgroup lavu_picture Image related
 * AVPicture types, pixel formats and basic image planes manipulation.
 * @{

enum AVPictureType {
    AV_PICTURE_TYPE_NONE = 0, ///< Undefined
    AV_PICTURE_TYPE_I,     ///< Intra
    AV_PICTURE_TYPE_P,     ///< Predicted
    AV_PICTURE_TYPE_B,     ///< Bi-dir predicted
    AV_PICTURE_TYPE_SI,    ///< Switching Intra
    AV_PICTURE_TYPE_SP,    ///< Switching Predicted
    AV_PICTURE_TYPE_BI,    ///< BI type


     * Sample aspect ratio for the video frame, 0/1 if unknown/unspecified.
    AVRational sample_aspect_ratio;

The sample (pixel) aspect ratio of the video frame; 0/1 if unknown or unspecified.


     * Presentation timestamp in time_base units (time when frame should be shown to user).
    int64_t pts;

The presentation timestamp (PTS), in time_base units: the time at which the frame should be shown to the user.


     * PTS copied from the AVPacket that was decoded to produce this frame.
     * @deprecated use the pts field instead
    int64_t pkt_pts;

The presentation timestamp copied from the AVPacket that was decoded to produce this frame. Deprecated: use the pts field instead.


     * DTS copied from the AVPacket that triggered returning this frame. (if frame threading isn't used)
     * This is also the Presentation time of this AVFrame calculated from
     * only AVPacket.dts values without pts values.
    int64_t pkt_dts;

The decoding timestamp copied from the AVPacket that was decoded to produce this frame.
If the corresponding packets carry only dts and no pts, this value is also the presentation time of this frame.


     * picture number in bitstream order
    int coded_picture_number;

The sequence number of the current image in the encoded stream.


     * picture number in display order
    int display_picture_number;

The sequence number of the current picture in display order.


     * The content of the picture is interlaced.
    int interlaced_frame;

Flag indicating whether the picture content is interlaced (as opposed to progressive).


     * Sample rate of the audio data.
    int sample_rate;

Audio sampling rate.


     * Channel layout of the audio data.
    uint64_t channel_layout;

Audio channel layout. Each bit represents a specific channel; the definitions in channel_layout.h make this clear at a glance:

 * @defgroup channel_masks Audio channel masks
 * A channel layout is a 64-bits integer with a bit set for every channel.
 * The number of bits set must be equal to the number of channels.
 * The value 0 means that the channel layout is not known.
 * @note this data structure is not powerful enough to handle channels
 * combinations that have the same channel multiple times, such as
 * dual-mono.
 * @{
#define AV_CH_FRONT_LEFT             0x00000001
#define AV_CH_FRONT_RIGHT            0x00000002
#define AV_CH_FRONT_CENTER           0x00000004
#define AV_CH_LOW_FREQUENCY          0x00000008

 * @}
 * @defgroup channel_mask_c Audio channel layouts
 * @{
 * */
#define AV_CH_LAYOUT_MONO              (AV_CH_FRONT_CENTER)


     * AVBuffer references backing the data for this frame. If all elements of
     * this array are NULL, then this frame is not reference counted. This array
     * must be filled contiguously -- if buf[i] is non-NULL then buf[j] must
     * also be non-NULL for all j < i.
     * There may be at most one AVBuffer per data plane, so for video this array
     * always contains all the references. For planar audio with more than
     * AV_NUM_DATA_POINTERS channels, there may be more buffers than can fit in
     * this array. Then the extra AVBufferRef pointers are stored in the
     * extended_buf array.
    AVBufferRef *buf[AV_NUM_DATA_POINTERS];

The data of this frame may be managed through AVBufferRef, which provides a reference-counting mechanism on top of AVBuffer:
AVBuffer is the buffer type commonly used in FFmpeg; it is reference counted.
AVBufferRef is a wrapper around an AVBuffer, i.e. a reference to it. Users should not access an AVBuffer directly, but only through an AVBufferRef, so that the reference count stays correct.
Many basic data structures in FFmpeg contain AVBufferRef members and thereby use AVBuffer buffers indirectly.
For details, refer to "FFmpeg data structure AVBuffer". In a reference-counted frame, the data[] pointers point into the buffers referenced by buf[], so the lifetime of the frame data is managed through buf[] rather than by freeing data[] directly.

If all elements of buf[] are NULL, this frame is not reference counted. buf[] must be filled contiguously: if buf[i] is non-NULL, then buf[j] must also be non-NULL for all j < i.
There is at most one AVBuffer per data plane; an AVBufferRef pointer refers to one AVBuffer.
For video, buf[] therefore always contains all the references. For planar audio with more than AV_NUM_DATA_POINTERS channels, not all AVBufferRef pointers fit into buf[]; the extra ones are stored in the extended_buf array.


     * For planar audio which requires more than AV_NUM_DATA_POINTERS
     * AVBufferRef pointers, this array will hold all the references which
     * cannot fit into AVFrame.buf.
     * Note that this is different from AVFrame.extended_data, which always
     * contains all the pointers. This array only contains the extra pointers,
     * which cannot fit into AVFrame.buf.
     * This array is always allocated using av_malloc() by whoever constructs
     * the frame. It is freed in av_frame_unref().
    AVBufferRef **extended_buf;
     * Number of elements in extended_buf.
    int        nb_extended_buf;

For planar audio with more than AV_NUM_DATA_POINTERS channels, not all AVBufferRef pointers fit into buf[]; the extra ones are stored in the extended_buf array.
Note that extended_buf differs from AVFrame.extended_data: extended_data always contains pointers to all the planes, while extended_buf contains only the extra AVBufferRef pointers that do not fit into AVFrame.buf.
extended_buf is allocated with av_malloc() by whoever constructs the frame, and is freed in av_frame_unref().
nb_extended_buf is the number of elements in extended_buf.


     * frame timestamp estimated using various heuristics, in stream time base
     * - encoding: unused
     * - decoding: set by libavcodec, read by user.
    int64_t best_effort_timestamp;

The frame timestamp estimated using various heuristics, in stream time_base units; set by libavcodec during decoding and read by the user.



     * reordered pos from the last AVPacket that has been input into the decoder
     * - encoding: unused
     * - decoding: Read by user.
    int64_t pkt_pos;

The byte position in the input file of the last packet that was fed into the decoder.


     * duration of the corresponding packet, expressed in
     * AVStream->time_base units, 0 if unknown.
     * - encoding: unused
     * - decoding: Read by user.
    int64_t pkt_duration;

The duration of the corresponding packet, in AVStream->time_base units; 0 if unknown.


     * number of audio channels, only used for audio.
     * - encoding: unused
     * - decoding: Read by user.
    int channels;

Number of audio channels.


     * size of the corresponding packet containing the compressed
     * frame.
     * It is set to a negative value if unknown.
     * - encoding: unused
     * - decoding: set by libavcodec, read by user.
    int pkt_size;

The size of the corresponding packet; set to a negative value if unknown.


     * @anchor cropping
     * @name Cropping
     * Video frames only. The number of pixels to discard from the
     * top/bottom/left/right border of the frame to obtain the sub-rectangle of
     * the frame intended for presentation.
     * @{
    size_t crop_top;
    size_t crop_bottom;
    size_t crop_left;
    size_t crop_right;
     * @}

Used for cropping the video frame: the four values are the number of pixels to discard from the top/bottom/left/right border of the frame.

2. Instructions for related functions

2.1 av_frame_alloc()

 * Allocate an AVFrame and set its fields to default values.  The resulting
 * struct must be freed using av_frame_free().
 * @return An AVFrame filled with default values or NULL on failure.
 * @note this only allocates the AVFrame itself, not the data buffers. Those
 * must be allocated through other means, e.g. with av_frame_get_buffer() or
 * manually.
AVFrame *av_frame_alloc(void);

Constructs a frame, with each member of the object set to its default value.
This function allocates only the AVFrame object itself, not the data buffers in the AVFrame.

2.2 av_frame_free()

 * Free the frame and any dynamically allocated objects in it,
 * e.g. extended_data. If the frame is reference counted, it will be
 * unreferenced first.
 * @param frame frame to be freed. The pointer will be set to NULL.
void av_frame_free(AVFrame **frame);

Release a frame.

2.3 av_frame_ref()

 * Set up a new reference to the data described by the source frame.
 * Copy frame properties from src to dst and create a new reference for each
 * AVBufferRef from src.
 * If src is not reference counted, new buffers are allocated and the data is
 * copied.
 * @warning: dst MUST have been either unreferenced with av_frame_unref(dst),
 *           or newly allocated with av_frame_alloc() before calling this
 *           function, or undefined behavior will occur.
 * @return 0 on success, a negative AVERROR on error
int av_frame_ref(AVFrame *dst, const AVFrame *src);

Create a new reference to the data in src.
Copy the attributes of frames in src to dst, and create a new reference for each AVBufferRef in src.
If src does not use reference counting, a new data buffer will be allocated in dst, and the data from the buffer in src will be copied to the buffer in dst.

2.4 av_frame_clone()

 * Create a new frame that references the same data as src.
 * This is a shortcut for av_frame_alloc()+av_frame_ref().
 * @return newly created AVFrame on success, NULL on error.
AVFrame *av_frame_clone(const AVFrame *src);

Creates a new frame that references the same data buffers as src; the buffers are managed through the reference-counting mechanism.
This function is equivalent to av_frame_alloc() + av_frame_ref().

2.5 av_frame_unref()

 * Unreference all the buffers referenced by frame and reset the frame fields.
void av_frame_unref(AVFrame *frame);

Release the reference of this frame to all buffers in this frame and reset all members in the frame.

2.6 av_frame_move_ref()

 * Move everything contained in src to dst and reset src.
 * @warning: dst is not unreferenced, but directly overwritten without reading
 *           or deallocating its contents. Call av_frame_unref(dst) manually
 *           before calling this function to ensure that no memory is leaked.
void av_frame_move_ref(AVFrame *dst, AVFrame *src);

Moves everything contained in src to dst and resets src.
To avoid memory leaks, av_frame_unref(dst) should be called before calling av_frame_move_ref(dst, src).

2.7 av_frame_get_buffer()

 * Allocate new buffer(s) for audio or video data.
 * The following fields must be set on frame before calling this function:
 * - format (pixel format for video, sample format for audio)
 * - width and height for video
 * - nb_samples and channel_layout for audio
 * This function will fill AVFrame.data and AVFrame.buf arrays and, if
 * necessary, allocate and fill AVFrame.extended_data and AVFrame.extended_buf.
 * For planar formats, one buffer will be allocated for each plane.
 * @warning: if frame already has been allocated, calling this function will
 *           leak memory. In addition, undefined behavior can occur in certain
 *           cases.
 * @param frame frame in which to store the new buffers.
 * @param align Required buffer size alignment. If equal to 0, alignment will be
 *              chosen automatically for the current CPU. It is highly
 *              recommended to pass 0 here unless you know what you are doing.
 * @return 0 on success, a negative AVERROR on error.
int av_frame_get_buffer(AVFrame *frame, int align);

Allocates new buffers for audio or video data.
Before calling this function, the following members of the frame must be set:

  • format (pixel format for video, sample format for audio)
  • width and height for video
  • nb_samples and channel_layout for audio (samples per channel, and channel layout)

This function fills the AVFrame.data and AVFrame.buf arrays and, if necessary, allocates and fills AVFrame.extended_data and AVFrame.extended_buf.
For planar formats, one buffer is allocated for each plane.

2.8 av_frame_copy()

 * Copy the frame data from src to dst.
 * This function does not allocate anything, dst must be already initialized and
 * allocated with the same parameters as src.
 * This function only copies the frame data (i.e. the contents of the data /
 * extended data arrays), not any other properties.
 * @return >= 0 on success, a negative AVERROR on error.
int av_frame_copy(AVFrame *dst, const AVFrame *src);

Copy the frame data in src to dst.
This function does not allocate any buffers; before calling it, dst must already be initialized and allocated with the same parameters as src.
This function only copies the contents of the data buffer in the frame (the contents of the data/extended_data array), and does not involve any other attributes in the frame.

3. References

[1] FFMPEG structure analysis: AVFrame, https://blog.csdn.net/leixiaohua1020/article/details/14214577
[2] Color space and pixel format, https://www.cnblogs.com/leisure_chn/p/10290575.html
[3] https://www.cnblogs.com/leisure_chn/p/10404502.html

4. Modify records

2019-01-13 V1.0 First draft
2021-01-06 V1.1 Added Section 1.1 to fix the unclear description of linesize
2021-01-16 V1.2 Updated Section 1.1, moving the detailed explanation into "Color space and pixel format"