Android FFmpeg Video Decoding Process and Practical Analysis

Overview#

This article focuses on FFmpeg video decoding. It first introduces the main process and basic principles of decoding video with FFmpeg; it then discusses simple applications built on top of that process, including how to play a video in timeline order on top of the basic decoding flow and how to add seek logic during playback; in addition, it highlights details that are easy to overlook during decoding, and finally it briefly explains how to encapsulate a VideoDecoder with basic video decoding functionality.

Introduction#

FFmpeg#

FFmpeg is an open-source computer program that can record, convert, and stream digital audio and video. It provides libraries for processing and manipulating multimedia data, including the audio/video codec library libavcodec and the audio/video container (de)muxing library libavformat.

Six Common Functional Modules of FFmpeg#

  • libavformat: Library for muxing and demuxing (encapsulating and de-encapsulating) multimedia files and protocols, e.g. container formats such as mp4 and flv, and network protocols such as rtmp and rtsp;
  • libavcodec: Core library for audio and video decoding;
  • libavfilter: Audio, video, and subtitle filter library;
  • libswscale: Image format conversion library;
  • libswresample: Audio resampling library;
  • libavutil: Utility library

Basics of Video Decoding#

  1. Demuxing: Demuxing can also be called de-encapsulation. An encapsulation (container) format describes how audio and video are combined; common examples are mp4, flv, and mkv. In simple terms, encapsulation packs audio streams, video streams, subtitle streams, and other attachments into one file according to certain rules. De-encapsulation does the opposite, splitting a media file back into audio data, video data, and so on. The split-out data is still compressed; a common compressed video format is H.264.

  2. Decoding: Simply put, this is the process of decompressing the encoded data back into raw video pixel data; a common raw pixel format is YUV.

  3. Color Space Conversion: Displays typically use the RGB model to show images, while the YUV model saves bandwidth when transmitting image data. Therefore, before rendering, the YUV pixel data has to be converted to RGB pixel data (a reference formula is sketched right after this list).
  4. Rendering: Sending each decoded and color-converted video frame to the graphics card to be drawn on the screen.
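For reference, the commonly used BT.601 full-range mapping from YUV to RGB looks like the sketch below; it is only meant to illustrate what the conversion computes, since the decoder in this article delegates the actual conversion to libswscale (sws_scale):

#include <cstdint>

// BT.601 full-range YUV -> RGB for a single pixel (all values in [0, 255]).
static void Yuv2Rgb(uint8_t y, uint8_t u, uint8_t v,
                    uint8_t* r, uint8_t* g, uint8_t* b) {
    auto clamp = [](float x) -> uint8_t {
        return x < 0.f ? 0 : (x > 255.f ? 255 : static_cast<uint8_t>(x));
    };
    const float fy = y, fu = u - 128.f, fv = v - 128.f;
    *r = clamp(fy + 1.402f * fv);
    *g = clamp(fy - 0.344136f * fu - 0.714136f * fv);
    *b = clamp(fy + 1.772f * fu);
}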

I. Preparations Before Introducing FFmpeg#

1.1 Compiling the FFmpeg .so Libraries#

  • Download the source library from the FFmpeg official website and unzip it;
  • Download the NDK library and unzip it;
  • Edit the configure script in the unzipped FFmpeg source directory, changing the highlighted build settings to the content below, mainly so that the generated libraries use the name-version.so naming format that Android can load;
# ······
# build settings
SHFLAGS='-shared -Wl,-soname,$$(@F)'
LIBPREF="lib"
LIBSUF=".a"
FULLNAME='$(NAME)$(BUILDSUF)'
LIBNAME='$(LIBPREF)$(FULLNAME)$(LIBSUF)'
SLIBPREF="lib"
SLIBSUF=".so"
SLIBNAME='$(SLIBPREF)$(FULLNAME)$(SLIBSUF)'
SLIBNAME_WITH_VERSION='$(SLIBNAME).$(LIBVERSION)'

# Modified configuration
SLIBNAME_WITH_MAJOR='$(SLIBPREF)$(FULLNAME)-$(LIBMAJOR)$(SLIBSUF)'
LIB_INSTALL_EXTRA_CMD='$$(RANLIB) "$(LIBDIR)/$(LIBNAME)"'
SLIB_INSTALL_NAME='$(SLIBNAME_WITH_MAJOR)'
SLIB_INSTALL_LINKS='$(SLIBNAME)'
# ······
  • Create a script file build_android_arm_v8a.sh in the FFmpeg source library directory, configure the NDK path in the file, and input the following content;
# Clear previous compilation
make clean
# Configure your NDK path here
export NDK=/Users/bytedance/Library/Android/sdk/ndk/21.4.7075529
TOOLCHAIN=$NDK/toolchains/llvm/prebuilt/darwin-x86_64


function build_android
{

./configure \
--prefix=$PREFIX \
--disable-postproc \
--disable-debug \
--disable-doc \
--enable-ffmpeg \
--disable-symver \
--disable-static \
--enable-shared \
--cross-prefix=$CROSS_PREFIX \
--target-os=android \
--arch=$ARCH \
--cpu=$CPU \
--cc=$CC \
--cxx=$CXX \
--enable-cross-compile \
--sysroot=$SYSROOT \
--extra-cflags="-Os -fpic $OPTIMIZE_CFLAGS" \
--extra-ldflags="$ADDI_LDFLAGS"

make clean
make -j16
make install

echo "============================ build android arm64-v8a success =========================="

}

# arm64-v8a
ARCH=arm64
CPU=armv8-a
API=21
CC=$TOOLCHAIN/bin/aarch64-linux-android$API-clang
CXX=$TOOLCHAIN/bin/aarch64-linux-android$API-clang++
SYSROOT=$NDK/toolchains/llvm/prebuilt/darwin-x86_64/sysroot
CROSS_PREFIX=$TOOLCHAIN/bin/aarch64-linux-android-
PREFIX=$(pwd)/android/$CPU
OPTIMIZE_CFLAGS="-march=$CPU"

echo $CC

build_android
  • Grant permissions to all files in the NDK folder: chmod -R 777 NDK;
  • Execute the script in the terminal ./build_android_arm_v8a.sh to start compiling FFmpeg. The compiled files will be in the android directory under FFmpeg, where multiple .so files will appear;

  • To compile for armeabi-v7a, copy the script above as build_android_arm_v7a.sh and change the architecture-related variables to the following.
#armv7-a
ARCH=arm
CPU=armv7-a
API=21
CC=$TOOLCHAIN/bin/armv7a-linux-androideabi$API-clang
CXX=$TOOLCHAIN/bin/armv7a-linux-androideabi$API-clang++
SYSROOT=$NDK/toolchains/llvm/prebuilt/darwin-x86_64/sysroot
CROSS_PREFIX=$TOOLCHAIN/bin/arm-linux-androideabi-
PREFIX=$(pwd)/android/$CPU
OPTIMIZE_CFLAGS="-mfloat-abi=softfp -mfpu=vfp -marm -march=$CPU "

1.2 Introducing FFmpeg's so Library in Android#

  • Prepare the NDK environment, the CMake build tool, and LLDB (the C/C++ debugging tool);
  • Create a C++ module, which will generally generate the following important files: CMakeLists.txt, native-lib.cpp, MainActivity;
  • In the app/src/main/ directory, create a directory named jniLibs, the default location where Android Studio looks for .so dynamic libraries; create an arm64-v8a directory under jniLibs and copy the compiled .so files into it; then copy the generated .h header files (the interfaces exposed by FFmpeg) into the include folder under the cpp directory. The .so directory and the header directory will be explicitly declared and linked in CMakeLists.txt;
  • In MainActivity, load the compiled C/C++ library native-lib. Since native-lib.cpp is built into a library named "ffmpeg" in CMakeLists.txt, "ffmpeg" is the name passed to System.loadLibrary();
class MainActivity : AppCompatActivity() {

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)

        // Example of a call to a native method
        sample_text.text = stringFromJNI()
    }

    // Declare an external reference method, which corresponds to the C/C++ layer code.
    external fun stringFromJNI(): String

    companion object {

        // Load the C/C++ compiled library: ffmpeg in init{}
        // The definition and addition of the library name are completed in CMakeLists.txt
        init {
            System.loadLibrary("ffmpeg")
        }
    }
}
  • native-lib.cpp is a C++ interface file, where the external method declared in the Java layer is implemented;
#include <jni.h>
#include <string>
extern "C" JNIEXPORT jstring JNICALL
Java_com_bytedance_example_MainActivity_stringFromJNI(
        JNIEnv *env,
        jobject /* this */) {
    std::string hello = "Hello from C++";
    return env->NewStringUTF(hello.c_str());
}
  • CMakeLists.txt is a build script that configures the build information to compile the native-lib so library;
# For more information about using CMake with Android Studio, read the
# documentation: https://d.android.com/studio/projects/add-native-code.html

# Sets the minimum version of CMake required to build the native library.

cmake_minimum_required(VERSION 3.10.2)

# Declares and names the project.

project("ffmpeg")

# Creates and names a library, sets it as either STATIC
# or SHARED, and provides the relative paths to its source code.
# You can define multiple libraries, and CMake builds them for you.
# Gradle automatically packages shared libraries with your APK.

# Define the directory for so libraries and header files for later use
set(FFmpeg_lib_dir ${CMAKE_SOURCE_DIR}/../jniLibs/${ANDROID_ABI})
set(FFmpeg_head_dir ${CMAKE_SOURCE_DIR}/FFmpeg)

# Add header file directory
include_directories(
        FFmpeg/include
)

add_library( # Sets the name of the library.
        ffmpeg

        # Sets the library as a shared library.
        SHARED

        # Provides a relative path to your source file(s).
        native-lib.cpp
        )

# Searches for a specified prebuilt library and stores the path as a
# variable. Because CMake includes system libraries in the search path by
# default, you only need to specify the name of the public NDK library
# you want to add. CMake verifies that the library exists before
# completing its build.

# Add FFmpeg related so libraries
add_library( avutil
        SHARED
        IMPORTED )
set_target_properties( avutil
        PROPERTIES IMPORTED_LOCATION
        ${FFmpeg_lib_dir}/libavutil.so )
add_library( swresample
        SHARED
        IMPORTED )
set_target_properties( swresample
        PROPERTIES IMPORTED_LOCATION
        ${FFmpeg_lib_dir}/libswresample.so )

add_library( avcodec
        SHARED
        IMPORTED )
set_target_properties( avcodec
        PROPERTIES IMPORTED_LOCATION
        ${FFmpeg_lib_dir}/libavcodec.so )


find_library( # Sets the name of the path variable.
        log-lib

        # Specifies the name of the NDK library that
        # you want CMake to locate.
        log)

# Specifies libraries CMake should link to your target library. You
# can link multiple libraries, such as libraries you define in this
# build script, prebuilt third-party libraries, or system libraries.

target_link_libraries( # Specifies the target library.
        ffmpeg

        # Links the imported FFmpeg libraries and the log library
        # included in the NDK to the target library.
        avutil
        swresample
        avcodec
        ${log-lib})
  • The above operations will introduce FFmpeg into the Android project.

II. Principles and Details of FFmpeg Video Decoding#

2.1 Main Process#

(Figure: the main flow of FFmpeg video decoding)

2.2 Basic Principles#

2.2.1 Common FFmpeg Interfaces#

// 1 Allocate AVFormatContext
avformat_alloc_context();
// 2 Open file input stream
avformat_open_input(AVFormatContext **ps, const char *url,
                        const AVInputFormat *fmt, AVDictionary **options);
// 3 Extract data stream information from the input file
avformat_find_stream_info(AVFormatContext *ic, AVDictionary **options);
// 4 Allocate codec context
avcodec_alloc_context3(const AVCodec *codec);
// 5 Fill codec context based on codec parameters related to the data stream
avcodec_parameters_to_context(AVCodecContext *codec,
                                  const AVCodecParameters *par);
// 6 Find the corresponding registered codec
avcodec_find_decoder(enum AVCodecID id);
// 7 Open codec
avcodec_open2(AVCodecContext *avctx, const AVCodec *codec, AVDictionary **options);
// 8 Continuously extract compressed frame data from the stream, obtaining compressed data for one frame of video
av_read_frame(AVFormatContext *s, AVPacket *pkt);
// 9 Send raw compressed data to the decoder (compressed data)
avcodec_send_packet(AVCodecContext *avctx, const AVPacket *avpkt);
// 10 Receive decoded data output from the decoder
avcodec_receive_frame(AVCodecContext *avctx, AVFrame *frame);

2.2.2 Overall Idea of Video Decoding#

  • First, register libavformat and all codecs, demuxers/muxers, protocols, etc. av_register_all() is the first function called in FFmpeg-based applications; only after calling it can the various FFmpeg features be used normally. Note that since FFmpeg 4.0 this call is deprecated and no longer necessary (a version guard is sketched below);
av_register_all();
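If the same code needs to build against both older and newer FFmpeg versions, the call can be guarded with the libavformat version macros (assuming libavformat/avformat.h is already included, as elsewhere in this article); a minimal sketch:
#if LIBAVFORMAT_VERSION_INT < AV_VERSION_INT(58, 9, 100)
    // Only FFmpeg builds older than 4.0 still need the global registration call.
    av_register_all();
#endif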
  • Open the video file and extract the data stream information from the file;
auto av_format_context = avformat_alloc_context();
avformat_open_input(&av_format_context, path_.c_str(), nullptr, nullptr);
avformat_find_stream_info(av_format_context, nullptr);
  • Then obtain the index of the video media stream to find the video media stream in the file;
int video_stream_index = -1;
for (int i = 0; i < av_format_context->nb_streams; i++) {
    // Match and find the index of the video media stream,
    if (av_format_context->streams[i]->codecpar->codec_type == AVMEDIA_TYPE_VIDEO) {
        video_stream_index = i;
        LOGD(TAG, "find video stream index = %d", video_stream_index);
        break;
    }
}
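Alternatively, libavformat provides av_find_best_stream, which can replace the manual loop above; a minimal sketch using the same av_format_context (not the article's original code):
// Ask FFmpeg to pick the best video stream; a negative return value means none was found.
int video_stream_index = av_find_best_stream(av_format_context, AVMEDIA_TYPE_VIDEO,
                                              -1 /* wanted stream */, -1 /* related stream */,
                                              nullptr /* decoder */, 0 /* flags */);
if (video_stream_index < 0) {
    LOGE(TAG, "Player Error : Can not find video stream");
}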
  • Get the video media stream, obtain the decoder context, configure the parameters of the decoder context, and open the decoder;
// Get the video media stream
auto stream = av_format_context->streams[video_stream_index];
// Find the registered decoder
auto codec = avcodec_find_decoder(stream->codecpar->codec_id);
// Get the decoder context
AVCodecContext* codec_ctx = avcodec_alloc_context3(codec);
// Configure the video media stream parameters to the decoder context
auto ret = avcodec_parameters_to_context(codec_ctx, stream->codecpar);

if (ret >= 0) {
    // Open the decoder
    avcodec_open2(codec_ctx, codec, nullptr);
    // ······
}
  • Calculate the buffer size required for the given pixel format, image width, and height, then allocate and set the output buffer. Since the frames are drawn to the screen, we also need an ANativeWindow, and ANativeWindow_setBuffersGeometry is used to set the properties of this drawing window;
video_width_ = codec_ctx->width;
video_height_ = codec_ctx->height;

int buffer_size = av_image_get_buffer_size(AV_PIX_FMT_RGBA,
                                           video_width_, video_height_, 1);
// Output buffer
out_buffer_ = (uint8_t*) av_malloc(buffer_size * sizeof(uint8_t));
// Limit the number of pixels in the buffer by setting the width and height, rather than the physical display size of the screen.
// If the buffer does not match the display size of the screen, the actual display may be stretched or compressed.
int result = ANativeWindow_setBuffersGeometry(native_window_, video_width_,
                                              video_height_, WINDOW_FORMAT_RGBA_8888);
  • Allocate memory space for an AVFrame with pixel format RGBA to store the frame data converted to RGBA; set the rgba_frame buffer to associate it with out_buffer_;
auto rgba_frame = av_frame_alloc();
av_image_fill_arrays(rgba_frame->data, rgba_frame->linesize,
                     out_buffer_,
                     AV_PIX_FMT_RGBA,
                     video_width_, video_height_, 1);
  • Obtain the SwsContext, which is used when calling sws_scale() for image format conversion and scaling. When converting YUV420P to RGBA, sws_scale may fail to return the correct height; this is caused by the flags passed to sws_getContext, which then need to be changed from SWS_BICUBIC to SWS_FULL_CHR_H_INT | SWS_ACCURATE_RND (a variant with these flags is sketched after the code below);
struct SwsContext* data_convert_context = sws_getContext(
                    video_width_, video_height_, codec_ctx->pix_fmt,
                    video_width_, video_height_, AV_PIX_FMT_RGBA,
                    SWS_BICUBIC, nullptr, nullptr, nullptr);
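If sws_scale later returns a wrong height for some YUV420P sources, the variant below swaps in the flags mentioned above; everything else stays the same (a sketch, not the original code):
struct SwsContext* data_convert_context = sws_getContext(
                    video_width_, video_height_, codec_ctx->pix_fmt,
                    video_width_, video_height_, AV_PIX_FMT_RGBA,
                    SWS_FULL_CHR_H_INT | SWS_ACCURATE_RND, nullptr, nullptr, nullptr);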
  • Allocate memory space for AVFrame to store raw data, pointing to the raw frame data; and allocate memory space for AVPacket to store data before video decoding;
auto frame = av_frame_alloc();
auto packet = av_packet_alloc();
  • Loop to read compressed frame data from the video stream, then start decoding;
ret = av_read_frame(av_format_context, packet);
if (packet->size) {
    Decode(codec_ctx, packet, frame, stream, lock, data_convert_context, rgba_frame);
}
  • In the Decode() function, send the packet containing the raw compressed data as input to the decoder;
/* send the packet with the compressed data to the decoder */
ret = avcodec_send_packet(codec_ctx, pkt);
  • The decoder writes the decoded data into the given frame; the frame's pts can then be converted into a timestamp so that each frame is drawn in display order along the timeline;
while (ret >= 0 && !is_stop_) {
    // Return the decoded data to frame
    ret = avcodec_receive_frame(codec_ctx, frame);
    if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF) {
        return;
    } else if (ret < 0) {
        return;
    }
    // Get the current decoded frame, convert its pts into a timestamp for comparison with the specified timestamp
    auto decode_time_ms = frame->pts * 1000 / stream->time_base.den;
    if (decode_time_ms >= time_ms_) {
        last_decode_time_ms_ = decode_time_ms;
        is_seeking_ = false;
        // ······
        // Image data format conversion
        // ······
        // Draw the converted data on the screen
    }
    av_packet_unref(pkt);
}
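Note that frame->pts * 1000 / stream->time_base.den implicitly assumes time_base.num == 1, which holds for typical mp4 streams; a more general conversion could use FFmpeg's rescaling helper (a sketch, assuming the FFmpeg headers used above plus libavutil/mathematics.h are included):
// Convert a pts given in the stream's time_base into milliseconds without
// assuming time_base.num == 1.
static int64_t PtsToMs(int64_t pts, AVRational time_base) {
    const AVRational ms_base = {1, 1000};
    return av_rescale_q(pts, time_base, ms_base);
}
// Usage: auto decode_time_ms = PtsToMs(frame->pts, stream->time_base);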
  • Before drawing the frame, perform image data format conversion, which requires the previously obtained SwsContext;
// Image data format conversion
int result = sws_scale(
        sws_context,
        (const uint8_t* const*) frame->data, frame->linesize,
        0, video_height_,
        rgba_frame->data, rgba_frame->linesize);

if (result <= 0) {
    LOGE(TAG, "Player Error : data convert fail");
    return;
}
  • Since it is for screen drawing, ANativeWindow and ANativeWindow_Buffer are used. Before drawing the frame, lock the next drawing surface of the window for drawing, then write the frame data to the buffer, and finally unlock the drawing surface of the window to publish the buffer data to the screen;
// Playing
result = ANativeWindow_lock(native_window_, &window_buffer_, nullptr);
if (result < 0) {
    LOGE(TAG, "Player Error : Can not lock native window");
} else {
    // Draw the image on the interface
    // Note: The length of pixels in one row of rgba_frame and window_buffer may not match
    // It needs to be converted properly, otherwise it may cause screen artifacts
    auto bits = (uint8_t*) window_buffer_.bits;
    for (int h = 0; h < video_height_; h++) {
        memcpy(bits + h * window_buffer_.stride * 4,
               out_buffer_ + h * rgba_frame->linesize[0],
               rgba_frame->linesize[0]);
    }
    ANativeWindow_unlockAndPost(native_window_);
}
  • The above is the main decoding process. Additionally, since C++ requires manual resource and memory management, after decoding ends, it is necessary to call the release interface to free resources to avoid memory leaks.
sws_freeContext(data_convert_context);
av_free(out_buffer_);
av_frame_free(&rgba_frame);
av_frame_free(&frame);
av_packet_free(&packet);

avcodec_close(codec_ctx);
avcodec_free_context(&codec_ctx);

avformat_close_input(&av_format_context);
avformat_free_context(av_format_context);
ANativeWindow_release(native_window_);

2.3 Simple Applications#

To better understand the video decoding process, here we encapsulate a video decoder VideoDecoder, which will initially have the following functions:

VideoDecoder(const char* path, std::function<void(long timestamp)> on_decode_frame);

void Prepare(ANativeWindow* window);

bool DecodeFrame(long time_ms);

void Release();

In this video decoder, inputting a specified timestamp will return the decoded frame data. Among them, the DecodeFrame(long time_ms) function is particularly important, as it can be called by the user to pass in the timestamp of the specified frame, thus decoding the corresponding frame data. Additionally, synchronization locks can be added to achieve separation between the decoding thread and the usage thread.

2.3.1 Adding Synchronization Lock for Video Playback#

If the goal is only to decode the video, synchronization waiting is not necessary;

However, if video playback is to be implemented, a lock must be used for synchronization waiting after decoding and drawing each frame, as video playback requires separation of decoding and drawing, and decoding and drawing must proceed in a certain timeline order and speed.

condition_.wait(lock);

When the upper layer calls DecodeFrame with a decoding timestamp, it notifies the condition variable, waking the decoding and drawing loop so it can continue executing.

bool VideoDecoder::DecodeFrame(long time_ms) {
    // ······
    time_ms_ = time_ms;
    condition_.notify_all();
    return true;
}
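One possible refinement, not present in the original code, is to wait on an explicit flag so that a notify_all() issued before the decoding thread reaches wait() is not lost; a minimal standalone sketch with a hypothetical has_new_request flag (in VideoDecoder it would be a member guarded by work_queue_mtx):

#include <condition_variable>
#include <mutex>

std::mutex work_queue_mtx;
std::condition_variable condition_;
bool has_new_request = false;   // hypothetical flag, not part of the original VideoDecoder

// Decoding thread: block until a new timestamp has been published.
void WaitForNextRequest() {
    std::unique_lock<std::mutex> lock(work_queue_mtx);
    condition_.wait(lock, [] { return has_new_request; });  // predicate avoids lost wake-ups
    has_new_request = false;
}

// Caller (e.g. DecodeFrame): publish the request under the lock, then wake the decoder.
void PublishRequest() {
    {
        std::lock_guard<std::mutex> guard(work_queue_mtx);
        has_new_request = true;
    }
    condition_.notify_all();
}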

2.3.2 Adding seek_frame During Playback#

In normal playback, the video is decoded frame by frame; however, when the progress bar is dragged to a specified seek point, decoding frame by frame all the way to that point would be inefficient. In this case, the timestamp of the seek point needs to be checked against certain rules, and if the conditions are met, the decoder seeks directly to the specified timestamp.

av_seek_frame in FFmpeg#
  • av_seek_frame can locate both keyframes and non-keyframes, depending on the selected flag value. Since video decoding relies on keyframes, we generally need to locate the keyframe;
int av_seek_frame(AVFormatContext *s, int stream_index, int64_t timestamp,
                  int flags);
  • The flag in av_seek_frame specifies the positional relationship between the I-frame to be located and the input timestamp. The requested timestamp usually does not fall exactly on an I-frame, but since decoding depends on I-frames, an I-frame near that timestamp must be found first; the flag indicates whether to pick the I-frame before or after the given timestamp;
  • The flag has four options:
  • AVSEEK_FLAG_BACKWARD: Seek to the nearest keyframe before the requested timestamp. Seeking is normally done in ms, and if the specified ms timestamp is not a keyframe (which is likely), it automatically falls back to the nearest preceding keyframe. Although this positioning is not very precise, it handles mosaic issues well, since BACKWARD falls back to a keyframe position.
  • AVSEEK_FLAG_BYTE: Seek to the corresponding position in the file (as a byte offset). It is consistent with AVSEEK_FLAG_FRAME, but the search algorithm is different.
  • AVSEEK_FLAG_ANY: Seek to any frame, not necessarily a keyframe, so using it may produce screen artifacts (mosaic), but the progress matches manual sliding exactly.
  • AVSEEK_FLAG_FRAME: Seek to the frame number corresponding to the timestamp; it can be understood as finding the nearest keyframe in the opposite direction of BACKWARD.
  • The flag may contain multiple values simultaneously. For example, AVSEEK_FLAG_BACKWARD | AVSEEK_FLAG_BYTE;
  • FRAME and BACKWARD estimate the seek target from the interval between frames and are suitable for fast forward and rewind; BYTE is suitable for large jumps within the file.
Seek Scenarios#
  • If the timestamp passed during decoding is moving forward and exceeds a certain distance from the previous frame timestamp, seeking is required; this "certain distance" is estimated through multiple experiments and is not always the 1000ms used in the following code;
  • If moving backward and less than the last decoded timestamp but with a considerable distance (e.g., more than 50ms), seek to the previous keyframe;
  • The use of the boolean variable is_seeking_ is to prevent other operations from interfering with the current seeking operation, ensuring that only one seek operation is in progress at a time.
if (!is_seeking_ && (time_ms_ > last_decode_time_ms_ + 1000 ||
                     time_ms_ < last_decode_time_ms_ - 50)) {
    is_seeking_ = true;
    // The timestamp passed to av_seek_frame must be in the stream's time_base units, so time_ms_ has to be converted accordingly
    LOGD(TAG, "seek frame time_ms_ = %ld, last_decode_time_ms_ = %ld", time_ms_,
         last_decode_time_ms_);
    av_seek_frame(av_format_context,
                  video_stream_index,
                  time_ms_ * stream->time_base.den / 1000,
                  AVSEEK_FLAG_BACKWARD);
}
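As with the pts-to-milliseconds conversion earlier, the seek target can be computed without assuming time_base.num == 1; a sketch of the same seek using av_rescale_q:
// Rescale the requested millisecond position into the stream's time_base,
// then seek backwards to the nearest preceding keyframe.
int64_t seek_target = av_rescale_q(time_ms_, AVRational{1, 1000}, stream->time_base);
av_seek_frame(av_format_context, video_stream_index, seek_target, AVSEEK_FLAG_BACKWARD);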
Inserting Seek Logic#

Since the seek check must be performed before decoding, the seek logic should be inserted before the av_read_frame function (which returns the next frame of the video media stream). If the seek conditions are met, use av_seek_frame to reach the specified I-frame, and then continue decoding to the target timestamp after av_read_frame.

// Insert seek logic here
// Next is to read the next frame of the video stream
int ret = av_read_frame(av_format_context, packet);

2.4 Details During Decoding Process#

2.4.1 Conditions for Seeking During DecodeFrame#

When using the av_seek_frame function, it is necessary to specify the correct flag, and also to agree on the conditions for performing the seek operation; otherwise, the video may exhibit screen artifacts (mosaic).

if (!is_seeking_ && (time_ms_ > last_decode_time_ms_ + 1000 ||
                     time_ms_ < last_decode_time_ms_ - 50)) {
    is_seeking_ = true;
    av_seek_frame(···,···,···,AVSEEK_FLAG_BACKWARD);
}

2.4.2 Reducing the Number of Decodes#

During video decoding, under certain conditions, it is possible to avoid decoding the frame data corresponding to the input timestamp. For example:

  1. If the requested decoding timestamp equals the last decoded timestamp, or equals the timestamp currently being processed, there is no need to decode;
  2. If the requested timestamp is not greater than the last decoded timestamp and is close to it (e.g., within 50ms), there is no need to decode.
bool VideoDecoder::DecodeFrame(long time_ms) {
    LOGD(TAG, "DecodeFrame time_ms = %ld", time_ms);
    if (last_decode_time_ms_ == time_ms || time_ms_ == time_ms) {
        LOGD(TAG, "DecodeFrame last_decode_time_ms_ == time_ms");
        return false;
    }
    if (time_ms <= last_decode_time_ms_ &&
        time_ms + 50 >= last_decode_time_ms_) {
        return false;
    }
    time_ms_ = time_ms;
    condition_.notify_all();
    return true;
}

With these constraints, unnecessary decoding operations can be reduced.

2.4.3 Using AVFrame's pts#

  1. AVPacket stores data before decoding (encoded data: H264/AAC, etc.), which is still compressed data after demuxing; AVFrame stores data after decoding (pixel data: YUV/RGB/PCM, etc.);
  2. The pts of AVPacket and pts of AVFrame have different meanings. The former indicates when this decompressed packet will be presented, while the latter indicates when the frame data should be displayed;
// pts of AVPacket
   /**
    * Presentation timestamp in AVStream->time_base units; the time at which
    * the decompressed packet will be presented to the user.
    * Can be AV_NOPTS_VALUE if it is not stored in the file.
    * pts MUST be larger or equal to dts as presentation cannot happen before
    * decompression, unless one wants to view hex dumps. Some formats misuse
    * the terms dts and pts/cts to mean something different. Such timestamps
    * must be converted to true pts/dts before they are stored in AVPacket.
    */
   int64_t pts;

   // pts of AVFrame
   /**
    * Presentation timestamp in time_base units (time when frame should be shown to user).
    */
   int64_t pts;
  3. Whether the currently decoded frame is drawn on the screen depends on comparing the requested decoding timestamp with the timestamp of the frame just returned by the decoder. The pts of AVPacket should not be used here, since it may not be monotonically increasing;
  4. The frame is drawn only when the requested decoding timestamp is not greater than the timestamp converted from the current frame's pts.
auto decode_time_ms = frame->pts * 1000 / stream->time_base.den;
LOGD(TAG, "decode_time_ms = %ld", decode_time_ms);
if (decode_time_ms >= time_ms_) {
    last_decode_time_ms_ = decode_time_ms;
    is_seeking_ = false;
    // Draw the screen
    // ····
}

2.4.4 No Data When Decoding the Last Frame#

Using av_read_frame(av_format_context, packet) returns the next frame of the video media stream into AVPacket. A return value of 0 indicates success, while a value less than 0 indicates an error or EOF.

Therefore, if a value less than 0 is returned while playing the video, call the avcodec_flush_buffers function to reset the decoder's state, flush the contents in the buffer, then seek to the current input timestamp, complete the decoding callback, and let the synchronization lock wait.

// Read several audio frames or one video frame from the stream,
// Here is reading one video frame (a complete frame), obtaining the compressed data for one frame of video, which can then be decoded.
ret = av_read_frame(av_format_context, packet);
if (ret < 0) {
    avcodec_flush_buffers(codec_ctx);
    av_seek_frame(av_format_context, video_stream_index,
                  time_ms_ * stream->time_base.den / 1000, AVSEEK_FLAG_BACKWARD);
    LOGD(TAG, "ret < 0, condition_.wait(lock)");
    // Prevent decoding the last frame when there is no data left in the video
    on_decode_frame_(last_decode_time_ms_);
    condition_.wait(lock);
}
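For completeness, FFmpeg's documented way to drain the frames still buffered inside the decoder at end of stream is to send a null packet and receive until AVERROR_EOF; a minimal sketch of that drain step (not part of the VideoDecoder above):
// Enter draining mode: a null packet signals that no more input will be sent.
avcodec_send_packet(codec_ctx, nullptr);

// Receive whatever frames are still buffered; AVERROR_EOF ends the loop.
while (avcodec_receive_frame(codec_ctx, frame) >= 0) {
    // ... convert and draw the frame exactly as in the normal decode path ...
}

// After draining, reset the decoder so it can be fed again after a seek.
avcodec_flush_buffers(codec_ctx);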

2.5 Upper Layer Encapsulation of Decoder VideoDecoder#

If you want to encapsulate a VideoDecoder at the upper layer, simply expose the C++ layer VideoDecoder interface in native-lib.cpp, and then the upper layer can call the C++ interface through JNI.

For example, when the upper layer wants to pass in a specified decoding timestamp for decoding, write a decodeFrame method, and then pass the timestamp to the C++ layer's nativeDecodeFrame for decoding, while the implementation of the nativeDecodeFrame method is written in native-lib.cpp.

// FFmpegVideoDecoder.kt
class FFmpegVideoDecoder(
    path: String,
    val onDecodeFrame: (timestamp: Long, texture: SurfaceTexture, needRender: Boolean) -> Unit
){
    // Extract the timeMs frame, based on whether sync is synchronous waiting
    fun decodeFrame(timeMS: Long, sync: Boolean = false) {
        // If no frame extraction is needed, do not wait
        if (nativeDecodeFrame(decoderPtr, timeMS) && sync) {
            // ······
        } else {
            // ······
        }
    }

    private external fun nativeDecodeFrame(decoder: Long, timeMS: Long): Boolean

    companion object {
        const val TAG = "FFmpegVideoDecoder"

        init {
            System.loadLibrary("ffmmpeg")

        }
    }
}

Then in native-lib.cpp, call the C++ layer VideoDecoder interface DecodeFrame, thus establishing a connection between the upper layer and the C++ lower layer through JNI.

// native-lib.cpp
extern "C"
JNIEXPORT jboolean JNICALL
Java_com_example_decoder_video_FFmpegVideoDecoder_nativeDecodeFrame(JNIEnv* env,
                                                               jobject thiz,
                                                               jlong decoder,
                                                               jlong time_ms) {
    auto videoDecoder = (codec::VideoDecoder*)decoder;
    return videoDecoder->DecodeFrame(time_ms);
}
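The decoderPtr that the Kotlin class passes to nativeDecodeFrame would come from a similar creation method; the sketch below is hypothetical (nativeCreateDecoder is not shown in the article) and only illustrates how the native pointer could be produced:

// native-lib.cpp -- hypothetical companion method (not from the original article)
extern "C"
JNIEXPORT jlong JNICALL
Java_com_example_decoder_video_FFmpegVideoDecoder_nativeCreateDecoder(JNIEnv* env,
                                                                      jobject thiz,
                                                                      jstring path) {
    const char* c_path = env->GetStringUTFChars(path, nullptr);
    // The frame callback here is left empty; a real integration would call back into Java.
    auto* decoder = new codec::VideoDecoder(c_path, [](long timestamp) {});
    env->ReleaseStringUTFChars(path, c_path);
    // Hand the native pointer to Kotlin as a jlong; it is cast back in nativeDecodeFrame.
    return reinterpret_cast<jlong>(decoder);
}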

III. Insights#

Technical Experience

  • Compiling FFmpeg and integrating it into an Android project makes video decoding and playback quite convenient.
  • Since the specific decoding process is implemented in C++, there may be a learning curve; it is best to have a certain foundation in C++.

IV. Appendix#

C++ Encapsulated VideoDecoder

  • VideoDecoder.h
#include <jni.h>
#include <mutex>
#include <condition_variable>
#include <functional>
#include <android/native_window.h>
#include <android/native_window_jni.h>
#include <time.h>
#include <sys/time.h>

extern "C" {
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>
#include <libswresample/swresample.h>
#include <libswscale/swscale.h>
}
#include <string>
/*
 * VideoDecoder can be used to decode video media stream data from a certain audio and video file (e.g., .mp4).
 * After passing in the specified file path from the Java layer, it can loop through the specified timestamp for decoding (frame extraction) at a certain fps, which is implemented by the DecodeFrame provided by C++.
 * At the end of each decoding, the timestamp of the decoded frame is called back to the upper layer decoder for other operations.
 */
namespace codec {
class VideoDecoder {

private:
    std::string path_;
    long time_ms_ = -1;
    long last_decode_time_ms_ = -1;
    bool is_seeking_ = false;
    ANativeWindow* native_window_ = nullptr;
    ANativeWindow_Buffer window_buffer_{};
    // Video width and height properties
    int video_width_ = 0;
    int video_height_ = 0;
    uint8_t* out_buffer_ = nullptr;
    // on_decode_frame is used to callback the extracted timestamp of the specified frame to the upper layer decoder for other operations.
    std::function<void(long timestamp)> on_decode_frame_ = nullptr;
    bool is_stop_ = false;

    // Will be used in conjunction with the lock "std::unique_lock<std::mutex>"
    std::mutex work_queue_mtx;
    // The property that actually performs synchronization waiting and waking
    std::condition_variable condition_;
    // The function that actually performs decoding
    void Decode(AVCodecContext* codec_ctx, AVPacket* pkt, AVFrame* frame, AVStream* stream,
                std::unique_lock<std::mutex>& lock, SwsContext* sws_context, AVFrame* pFrame);

public:
    // When creating a decoder, you need to pass in the media file path and a callback on_decode_frame after decoding.
    VideoDecoder(const char* path, std::function<void(long timestamp)> on_decode_frame);
    // Wrap the Surface passed in from the upper layer in the JNI layer to create a new ANativeWindow, which will be used when drawing frame data after decoding
    void Prepare(ANativeWindow* window);
    // Extract the specified timestamp video frame, can be called by the upper layer
    bool DecodeFrame(long time_ms);
    // Release decoder resources
    void Release();
    // Get the current system millisecond time
    static int64_t GetCurrentMilliTime(void);
};

}
  • VideoDecoder.cpp
#include "VideoDecoder.h"
#include "../log/Logger.h"
#include <thread>
#include <utility>

extern "C" {
#include <libavutil/imgutils.h>
}

#define TAG "VideoDecoder"
namespace codec {

VideoDecoder::VideoDecoder(const char* path, std::function<void(long timestamp)> on_decode_frame)
        : on_decode_frame_(std::move(on_decode_frame)) {
    path_ = std::string(path);
}

void VideoDecoder::Decode(AVCodecContext* codec_ctx, AVPacket* pkt, AVFrame* frame, AVStream* stream,
                     std::unique_lock<std::mutex>& lock, SwsContext* sws_context,
                     AVFrame* rgba_frame) {

    int ret;
    /* send the packet with the compressed data to the decoder */
    ret = avcodec_send_packet(codec_ctx, pkt);
    if (ret == AVERROR(EAGAIN)) {
        LOGE(TAG,
             "Decode: Receive_frame and send_packet both returned EAGAIN, which is an API violation.");
    } else if (ret < 0) {
        return;
    }

    // Read all the output frames (in general there may be any number of them)
    while (ret >= 0 && !is_stop_) {
        // avcodec_receive_frame internally unrefs the frame each time before filling it
        ret = avcodec_receive_frame(codec_ctx, frame);
        if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF) {
            return;
        } else if (ret < 0) {
            return;
        }
        int64_t startTime = GetCurrentMilliTime();
        LOGD(TAG, "decodeStartTime: %ld", startTime);
        // Calculate the timestamp of the currently decoded frame
        auto decode_time_ms = frame->pts * 1000 / stream->time_base.den;
        LOGD(TAG, "decode_time_ms = %ld", decode_time_ms);
        if (decode_time_ms >= time_ms_) {
            LOGD(TAG, "decode decode_time_ms = %ld, time_ms_ = %ld", decode_time_ms, time_ms_);
            last_decode_time_ms_ = decode_time_ms;
            is_seeking_ = false;

            // Data format conversion
            int result = sws_scale(
                    sws_context,
                    (const uint8_t* const*) frame->data, frame->linesize,
                    0, video_height_,
                    rgba_frame->data, rgba_frame->linesize);

            if (result <= 0) {
                LOGE(TAG, "Player Error : data convert fail");
                return;
            }

            // Playing
            result = ANativeWindow_lock(native_window_, &window_buffer_, nullptr);
            if (result < 0) {
                LOGE(TAG, "Player Error : Can not lock native window");
            } else {
                // Draw the image on the interface
                auto bits = (uint8_t*) window_buffer_.bits;
                for (int h = 0; h < video_height_; h++) {
                    memcpy(bits + h * window_buffer_.stride * 4,
                           out_buffer_ + h * rgba_frame->linesize[0],
                           rgba_frame->linesize[0]);
                }
                ANativeWindow_unlockAndPost(native_window_);
            }
            on_decode_frame_(decode_time_ms);
            int64_t endTime = GetCurrentMilliTime();
            LOGD(TAG, "decodeEndTime - decodeStartTime: %ld", endTime - startTime);
            LOGD(TAG, "finish decode frame");
            condition_.wait(lock);
        }
        // av_packet_unref un-references the packet's buffer and resets the remaining
        // fields (data and size become 0), so the packet can be reused on the next read.
        av_packet_unref(pkt);
    }
}

void VideoDecoder::Prepare(ANativeWindow* window) {
    native_window_ = window;
    av_register_all();
    auto av_format_context = avformat_alloc_context();
    avformat_open_input(&av_format_context, path_.c_str(), nullptr, nullptr);
    avformat_find_stream_info(av_format_context, nullptr);
    int video_stream_index = -1;
    for (int i = 0; i < av_format_context->nb_streams; i++) {
        // Find the index of the video media stream
        if (av_format_context->streams[i]->codecpar->codec_type == AVMEDIA_TYPE_VIDEO) {
            video_stream_index = i;
            LOGD(TAG, "find video stream index = %d", video_stream_index);
            break;
        }
    }

    // Run once
    do {
        if (video_stream_index == -1) {
            codec::LOGE(TAG, "Player Error : Can not find video stream");
            break;
        }
        std::unique_lock<std::mutex> lock(work_queue_mtx);

        // Get the video media stream
        auto stream = av_format_context->streams[video_stream_index];
        // Find the registered decoder
        auto codec = avcodec_find_decoder(stream->codecpar->codec_id);
        // Get the decoder context
        AVCodecContext* codec_ctx = avcodec_alloc_context3(codec);
        auto ret = avcodec_parameters_to_context(codec_ctx, stream->codecpar);

        if (ret >= 0) {
            // Open
            avcodec_open2(codec_ctx, codec, nullptr);
            // The decoder has width and height values only after opening
            video_width_ = codec_ctx->width;
            video_height_ = codec_ctx->height;

            AVFrame* rgba_frame = av_frame_alloc();
            int buffer_size = av_image_get_buffer_size(AV_PIX_FMT_RGBA, video_width_, video_height_,
                                                       1);
            // Allocate memory space for output buffer
            out_buffer_ = (uint8_t*) av_malloc(buffer_size * sizeof(uint8_t));
            av_image_fill_arrays(rgba_frame->data, rgba_frame->linesize, out_buffer_,
                                 AV_PIX_FMT_RGBA,
                                 video_width_, video_height_, 1);

            // Limit the number of pixels in the buffer by setting width and height, rather than the physical display size of the screen.
            // If the buffer does not match the physical display size of the screen, the actual display may be stretched or compressed.
            int result = ANativeWindow_setBuffersGeometry(native_window_, video_width_,
                                                          video_height_, WINDOW_FORMAT_RGBA_8888);
            if (result < 0) {
                LOGE(TAG, "Player Error : Can not set native window buffer");
                avcodec_close(codec_ctx);
                avcodec_free_context(&codec_ctx);
                av_free(out_buffer_);
                break;
            }

            auto frame = av_frame_alloc();
            auto packet = av_packet_alloc();

            struct SwsContext* data_convert_context = sws_getContext(
                    video_width_, video_height_, codec_ctx->pix_fmt,
                    video_width_, video_height_, AV_PIX_FMT_RGBA,
                    SWS_BICUBIC, nullptr, nullptr, nullptr);
            while (!is_stop_) {
                LOGD(TAG, "front seek time_ms_ = %ld, last_decode_time_ms_ = %ld", time_ms_,
                     last_decode_time_ms_);
                if (!is_seeking_ && (time_ms_ > last_decode_time_ms_ + 1000 ||
                                     time_ms_ < last_decode_time_ms_ - 50)) {
                    is_seeking_ = true;
                    LOGD(TAG, "seek frame time_ms_ = %ld, last_decode_time_ms_ = %ld", time_ms_,
                         last_decode_time_ms_);
                    // The timestamp passed to av_seek_frame must be in the stream's time_base units, so time_ms_ has to be converted accordingly
                    av_seek_frame(av_format_context, video_stream_index,
                                  time_ms_ * stream->time_base.den / 1000, AVSEEK_FLAG_BACKWARD);
                }
                // Read one video frame (a complete frame), obtaining the compressed data for one frame of video, which can then be decoded.
                ret = av_read_frame(av_format_context, packet);
                if (ret < 0) {
                    avcodec_flush_buffers(codec_ctx);
                    av_seek_frame(av_format_context, video_stream_index,
                                  time_ms_ * stream->time_base.den / 1000, AVSEEK_FLAG_BACKWARD);
                    LOGD(TAG, "ret < 0, condition_.wait(lock)");
                    // Prevent decoding the last frame when there is no data left in the video
                    on_decode_frame_(last_decode_time_ms_);
                    condition_.wait(lock);
                }
                if (packet->size) {
                    Decode(codec_ctx, packet, frame, stream, lock, data_convert_context,
                           rgba_frame);
                }
            }
            // Release resources
            sws_freeContext(data_convert_context);
            av_free(out_buffer_);
            av_frame_free(&rgba_frame);
            av_frame_free(&frame);
            av_packet_free(&packet);

        }
        avcodec_close(codec_ctx);
        avcodec_free_context(&codec_ctx);

    } while (false);
    avformat_close_input(&av_format_context);
    avformat_free_context(av_format_context);
    ANativeWindow_release(native_window_);
    delete this;
}

bool VideoDecoder::DecodeFrame(long time_ms) {
    LOGD(TAG, "DecodeFrame time_ms = %ld", time_ms);
    if (last_decode_time_ms_ == time_ms || time_ms_ == time_ms) {
        LOGD(TAG, "DecodeFrame last_decode_time_ms_ == time_ms");
        return false;
    }
    if (last_decode_time_ms_ >= time_ms && last_decode_time_ms_ <= time_ms + 50) {
        return false;
    }
    time_ms_ = time_ms;
    condition_.notify_all();
    return true;
}

void VideoDecoder::Release() {
    is_stop_ = true;
    condition_.notify_all();
}

/**
 * Get the current millisecond-level time
 */
int64_t VideoDecoder::GetCurrentMilliTime(void) {
    struct timeval tv{};
    gettimeofday(&tv, nullptr);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

}