A Video Summary Generation Model Based on Hybrid Attention Using Multimodal Features

In the process of producing digital media art works, it is usually necessary to select suitable segments from a large amount of raw material. Video summarization classifies and extracts key segments from source videos based on image content, audio features, and other cues. By analyzing the video content, redundant information is eliminated to form a short video summary.

Video summarization is a key technology for distributed storage and search of Internet video in the we-media era. The encoder of a typical video summarization model often ignores all modalities other than the video frames, and its decoder has structural weaknesses when processing long video sequences, which easily leads to excessive self-attention weight variance and low accuracy. To address this, we propose a Video Summary Generation Model based on Multimodal Feature Fusion and Hybrid Attention (VSMHA), which contains a multimodal feature fusion encoder and a hybrid attention decoder.

For the encoder, image features, audio features, and optical flow features are fused to construct a multimodal feature fusion encoder.
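The article does not include code, but a minimal sketch can illustrate what such a fusion encoder might look like. The snippet below assumes per-frame image, audio, and optical-flow features have already been extracted by pretrained backbones; the feature dimensions, projection layers, and concatenation-based fusion are illustrative assumptions, not the model's actual configuration.

import torch
import torch.nn as nn

class MultimodalFusionEncoder(nn.Module):
    """Hypothetical multimodal fusion encoder: project each modality into a
    shared space, concatenate, and fuse into one per-frame representation."""

    def __init__(self, img_dim=1024, aud_dim=128, flow_dim=1024, hidden_dim=512):
        super().__init__()
        # Per-modality projections into a shared embedding space.
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.aud_proj = nn.Linear(aud_dim, hidden_dim)
        self.flow_proj = nn.Linear(flow_dim, hidden_dim)
        # Fuse the concatenated projections into a single frame feature.
        self.fuse = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, img_feat, aud_feat, flow_feat):
        # Each input: (batch, seq_len, modality_dim)
        fused = torch.cat(
            [self.img_proj(img_feat), self.aud_proj(aud_feat), self.flow_proj(flow_feat)],
            dim=-1,
        )
        return self.fuse(fused)  # (batch, seq_len, hidden_dim)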

For the decoder, a hybrid attention mechanism (local attention modules, global attention modules, and residual inputs) is used to reduce the variance of the attention weights on long videos, and the decoder fuses, with learned weights, the context-dependent temporal information produced by a Bidirectional Gated Recurrent Unit (BiGRU) network.
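The hybrid attention decoder can likewise be sketched, under assumptions: a windowed (local) attention branch, a global attention branch, a residual connection back to the encoder features, and a BiGRU whose output is fused with the attention features by a learned weight before per-frame scoring. The window size, head count, scoring head, and exact fusion scheme below are illustrative choices, not the paper's reported design.

import torch
import torch.nn as nn

class HybridAttentionDecoder(nn.Module):
    """Hypothetical hybrid attention decoder producing per-frame importance scores."""

    def __init__(self, hidden_dim=512, num_heads=8, window=32):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.bigru = nn.GRU(hidden_dim, hidden_dim // 2, bidirectional=True, batch_first=True)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned fusion weight
        self.score = nn.Linear(hidden_dim, 1)         # per-frame importance score

    def forward(self, x):
        # x: (batch, seq_len, hidden_dim) fused frame features from the encoder
        seq_len = x.size(1)
        # Local attention: mask positions outside a fixed window, which keeps
        # the attention weight variance low on long sequences.
        idx = torch.arange(seq_len, device=x.device)
        local_mask = (idx[None, :] - idx[:, None]).abs() > self.window
        local_out, _ = self.local_attn(x, x, x, attn_mask=local_mask)
        global_out, _ = self.global_attn(x, x, x)
        # Residual input: add the encoder features back to both attention branches.
        attn_out = local_out + global_out + x
        # BiGRU captures context-dependent temporal information.
        gru_out, _ = self.bigru(attn_out)
        # Weighted fusion of the attention and recurrent representations.
        fused = self.alpha * attn_out + (1.0 - self.alpha) * gru_out
        return torch.sigmoid(self.score(fused)).squeeze(-1)  # (batch, seq_len)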

We conduct experiments on four datasets (TVSum, SumMe, OVP, and YouTube); compared with other baseline models, the proposed VSMHA achieves the best results. Specifically, when evaluated on the augmented TVSum and SumMe datasets, VSMHA surpasses the previous models' highest scores by 2.18% and 0.27%, respectively.

Through in-depth analysis, we find that VSMHA outperforms other two-modal fusion models in terms of robustness. This indicates that our model maintains stable performance and adapts better to varied data characteristics and application scenarios, further highlighting its superiority and practical value in video summarization tasks.
