Vision-text time series correlation for visual-to-language story generation

R.S. Perdana, - (2021) Vision-text time series correlation for visual-to-language story generation. IEICE Transactions on Information and Systems.

Official URL: https://doi.org/10.1587/transinf.2020EDP7131

Abstract

Automatic generation of textual stories from visual data, known as visual storytelling, is a recent advancement in the image-to-text problem. Instead of taking a single image as input, visual storytelling turns an ordered sequence of images into coherent sentences. A story contains non-visual concepts as well as descriptions of literal objects. While previous approaches have applied external knowledge, our approach regards the non-visual concept as the semantic correlation between the visual modality and the textual modality. This paper therefore presents a new feature representation based on canonical correlation analysis between the two modalities. An attention mechanism, rather than a standard encoder-decoder model, is adopted as the underlying architecture for the image-to-text problem. The proposed end-to-end architecture, the Canonical Correlation Attention Mechanism (CAAM), extracts time series correlation by maximizing the cross-modal correlation. Extensive experiments on the VIST dataset ( http://visionandlanguage.net/VIST/dataset.html ) were conducted to demonstrate the effectiveness of the architecture in terms of automatic metrics, and additional experiments show the impact of the modality fusion strategy.
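
As an illustration of the canonical correlation analysis the abstract refers to, the following is a minimal sketch of classical linear CCA in NumPy. It is not the CAAM architecture and is not taken from the paper: the function name first_canonical_correlation, the regularization term reg, and the synthetic "visual" and "textual" feature matrices are all assumptions made for this example. It only shows what "maximizing the cross-modal correlation" means for two sets of feature vectors.

```python
import numpy as np


def first_canonical_correlation(X, Y, reg=1e-6):
    """Return the largest canonical correlation between X (n, p) and Y (n, q).

    Hypothetical helper for illustration; `reg` is a small ridge term
    (an assumption) that keeps the covariance matrices invertible.
    """
    # Center each modality.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Within-modality and cross-modality covariance matrices.
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # The singular values of the whitened cross-covariance are the
    # canonical correlations, sorted in descending order.
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return float(np.linalg.svd(T, compute_uv=False)[0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for visual and textual features that share
    # one latent factor z (purely illustrative data).
    z = rng.standard_normal(500)
    visual = np.outer(z, [1.0, 0.5, -0.3, 0.2]) + 0.1 * rng.standard_normal((500, 4))
    textual = np.outer(z, [0.7, -1.0, 0.4]) + 0.1 * rng.standard_normal((500, 3))
    print(first_canonical_correlation(visual, textual))
```

When the two feature sets share a latent factor, as in the synthetic data above, the returned value is close to 1; for unrelated features it is much smaller.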

Item Type:	Article
Depositing User:	Bambang Septiawan
Date Deposited:	16 Dec 2021 04:14
Last Modified:	16 Dec 2021 04:14
URI:	http://repository.ub.ac.id/id/eprint/187333

Full text not available from this repository.
