Abstract
The exponential growth of multimedia content in the digital age has necessitated the development of advanced cross-lingual systems capable of understanding and interpreting visual information across different languages. However, current efforts have focused predominantly on monolingual tasks, leaving a substantial gap in cross-lingual multimedia analysis, particularly for non-English languages. To address this gap, AraTraditions10k, a comprehensive and culturally rich dataset, is introduced to enhance cross-lingual image annotation, retrieval, and tagging, with a specific focus on Arabic and English. The dataset consists of 10,000 carefully curated images representing diverse aspects of Arabic culture, each annotated with five captions in Modern Standard Arabic (MSA) and professionally translated into English. To maximize the utility of the dataset, advanced machine learning models, including a Multi-Layer Perceptron (MLP) for tag recommendation and an enhanced Word2VisualVec (W2VV) model for sentence recommendation, have been developed. These models have been augmented with attention mechanisms and contrastive loss functions, resulting in measurable performance improvements. Notably, the tag recommendation system achieved an overall top-1 accuracy of 93%, while the sentence recommendation system for English attained BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE scores of 78.2, 68.3, 75.8, 136.7, and 52.0, respectively. By addressing the linguistic and cultural gaps in existing datasets, AraTraditions10k establishes a new benchmark for the quality and inclusivity of multilingual datasets, contributing to the broader field of cross-lingual multimedia analysis and facilitating the development of more accessible and culturally sensitive multimedia technologies.
Introduction
The rapid advancement of multimedia technology and increasing globalization necessitate the development of systems that can understand and interpret multimedia content in multiple languages1. Automated multimedia content description, particularly for images, is crucial for multimedia analysis and retrieval2. Traditionally, efforts in this area have focused predominantly on monolingual tasks, especially in English, due to the abundance of English-language data and the widespread use of English in the scientific community3. However, considering the linguistic diversity of the global population, there is a pressing need to extend these capabilities to a cross-lingual context, enabling systems to understand and interpret multimedia content across multiple languages4.
One of the primary motivations for cross-lingual multimedia analysis is the accessibility and usability of technology for non-English speaking populations. According to recent statistics, only about 20% of the world’s population speaks English, and even fewer possess the proficiency required to effectively engage with English-only digital content5. This linguistic gap presents a barrier to the widespread adoption and application of advanced multimedia technology6. Therefore, developing systems capable of cross-lingual image annotation, retrieval, and tagging is not only a technical challenge but also essential for ensuring inclusivity and broader accessibility7.
The introduction of multilingual datasets is a crucial step towards developing robust cross-lingual multimedia systems. These datasets must be carefully curated to ensure high-quality annotations that accurately reflect the content in multiple languages8. Existing multilingual datasets, such as COCO-CN9, Flickr30k-CN10, Flickr8k-CN11, and AIC-ICC12, have contributed to the field. For example, COCO-CN expands the widely used MS-COCO dataset by adding Chinese annotations, providing a platform for research in Chinese-English image tagging, captioning, and retrieval. Similarly, Flickr8k-CN and AIC-ICC include Chinese captions for images, supporting the development of image captioning models in Chinese13.
Despite the progress made with these datasets, there are notable limitations. Many of these datasets are relatively small in scale or exhibit biases toward specific types of visual content, such as human activities14. This bias can restrict the generalizability of models trained on these datasets to a wider range of images15. Additionally, the quality of annotations can vary, especially when relying on crowd-sourced data without rigorous quality control measures16. These challenges underscore the need for more comprehensive and high-quality multilingual datasets17.
This study introduces AraTraditions10k, a unique multilingual dataset developed to address these limitations by providing a balanced representation of Arabic and English visual content. AraTraditions10k is designed to bridge the gap between Arabic and English visual content representation, enabling advanced image tagging, captioning, and retrieval across these languages. This dataset is particularly relevant due to the cultural and linguistic richness of the Arabic language, which is spoken by over 420 million people worldwide18.
The development of AraTraditions10k involved several key steps to ensure the quality and relevance of the dataset. The first step was the collection of 10,000 images representing various aspects of Arabic culture, sourced from reliable websites. These images were carefully curated to include individuals in traditional clothing and places unique to Arab countries. Each image was annotated with five captions in Modern Standard Arabic (MSA), a widely understood form of Arabic used in formal communication. These captions were then professionally translated into English to preserve semantic accuracy and consistency.
The methodologies employed in this research extend beyond conventional annotation techniques. For instance, the tag recommendation system uses a Multi-Layer Perceptron (MLP) model to predict multiple tags based on visual content, while the sentence recommendation system leverages the Word2VisualVec (W2VV) model to retrieve relevant sentences. These models are further enhanced with attention mechanisms and contrastive loss functions to improve their performance in cross-lingual tasks.
The AraTraditions10k dataset, combined with advanced annotation strategies and model enhancements, provides a robust platform for cross-lingual image annotation, retrieval, and tagging. The dataset addresses the linguistic and cultural gap and sets a new standard for the quality and comprehensiveness of multilingual datasets. In conclusion, the development of the AraTraditions10k dataset represents a key advancement in the field of cross-lingual multimedia analysis. By offering a high-quality, balanced, and culturally rich dataset, this research facilitates the development of more inclusive and accessible multimedia technologies. The methodologies and findings presented in this paper contribute to the broader objective of making advanced multimedia systems usable and beneficial for a global audience, regardless of linguistic background.
The structure of this paper is as follows: Section “Related work” provides an in-depth review of related work in cross-lingual image captioning, focusing on Arabic and English datasets, and techniques used for image annotation and retrieval. Section “Research methodology” details the methodology employed in the development of the AraTraditions10k dataset, including the data collection, image annotation process, and the models used for automatic captioning and tagging. Section “Results and analysis” presents the experimental setup, evaluation metrics, and results obtained from testing the models on the dataset. Finally, Section “Conclusion” concludes the paper and highlights the main contributions of this study.
Related work
In the study of cross-lingual image tagging and retrieval, various methodologies and datasets have been explored to improve the performance and applicability of multimedia systems across different languages. The development of datasets like COCO-CN, Flickr8k-CN, and AIC-ICC has contributed to progress in this field. These datasets provide a foundation for developing and evaluating algorithms that can understand and generate descriptions in multiple languages.
COCO-CN dataset
The COCO-CN dataset is a notable contribution to cross-lingual image annotation resources. This dataset extends the Microsoft COCO dataset19 by adding Chinese annotations. It comprises 20,342 images annotated with 27,218 Chinese sentences and 70,993 tags. The authors demonstrated the dataset’s application for cross-lingual image tagging, captioning, and retrieval using a recommendation-assisted collective annotation system. This system provided annotators with relevant tags and sentences to reduce workload and improve annotation quality. Experiments showed that models trained on COCO-CN outperformed those trained on monolingual datasets, highlighting the importance of high-quality cross-lingual annotations.
Flickr8K-CN and Flickr30K-CN datasets
Flickr8k-CN and Flickr30k-CN are smaller-scale datasets focusing on providing Chinese captions for images. To create these datasets, researchers translated the English captions from Flickr8k and Flickr30k into Chinese. Several studies have used these datasets to develop models for Chinese and German image captioning20,21,22. Although the Flickr8k-CN dataset is limited in size, it was among the first to offer Chinese annotations and served as a valuable resource for initial research in this area. However, its small size and limited range of visual content posed challenges for training robust models.
AIC-ICC dataset
The AIC-ICC12 dataset represents a more extensive collection of images with Chinese captions, consisting of 240,000 images annotated with 1.2 million crowd-sourced sentences. Despite its large scale, the dataset is heavily biased towards human activities, as the images were primarily collected from internet search engines focusing on human-centric scenarios. This bias limits the generalizability of models trained on AIC-ICC to a broader range of visual content. Nonetheless, the dataset’s size and detailed annotations provide a valuable resource for training and evaluating image captioning models in Chinese.
Multi30k dataset
The Multi30k dataset provides multilingual English-German image descriptions, facilitating research in multilingual image captioning. It extends the Flickr30k dataset with 31,783 images, each with five English captions and one German translation. The dataset supports research in multiple languages and has been used to train multilingual models that improve cross-lingual image captioning and retrieval tasks22.
AI challenger dataset
The AI Challenger dataset is a large-scale resource for image understanding, focusing on Chinese captions. It consists of 240,000 images with 2.4 million Chinese captions. This extensive dataset supports deep learning research in image captioning and has been used to develop models that enhance performance in generating Chinese image descriptions12. However, its primary focus on Chinese captions may limit its coverage of diverse cultural aspects, potentially hindering its applicability for broader multilingual contexts.
Recent advances in cross-lingual annotation systems
Researchers have developed various systems to leverage multilingual data effectively in cross-lingual image tagging. For instance, the work in23 proposed a cross-lingual image caption generation model using a transfer learning approach. They initialized the visual embedding matrix of a Japanese captioning model using its counterpart from a trained English captioning model, demonstrating the potential of transfer learning in cross-lingual tasks. Similarly, the work in24 introduced the use of artificial tokens to control languages for multilingual image caption generation, further advancing the state of the art in this ___domain.
Enhancements in annotation quality and model performance
Quality control in annotation techniques is crucial for developing high-quality datasets. The work in9 highlighted the importance of meticulous annotation and professional translation to maintain semantic accuracy across languages. They proposed enhancements to existing models, such as incorporating attention mechanisms and contrastive loss functions, to improve the performance of cross-lingual image tagging and retrieval systems. These improvements enable models to better capture the nuances of different languages and provide more accurate and contextually relevant annotations.
Recent studies
Recent studies have continued to advance the field of cross-lingual image annotation and retrieval. Li et al.25 introduced a model called BLIP (Bootstrapping Language-Image Pre-training), which focuses on unified vision-language understanding and generation. This model incorporates advanced pre-training techniques, leveraging a combination of vision and language data to achieve state-of-the-art results in tasks such as image-text retrieval and image captioning. The model demonstrated improvements in recall@1 and CIDEr scores, emphasizing the effectiveness of their approach in enhancing the quality and contextual relevance of generated captions. Similarly, Lee et al.26 proposed an approach to vision-language pre-training for multilingual tasks. Their model, which uses a triple contrastive loss, is pre-trained on large-scale multilingual image-text datasets and fine-tuned for specific languages. This method achieved state-of-the-art results across several cross-lingual benchmarks, improving image-text matching and retrieval tasks in multiple languages.
Gu et al.27 explored zero-shot learning techniques for cross-lingual image retrieval using their Wukong dataset, which includes 100 million Chinese image-text pairs. By employing advanced pre-training techniques, such as locked-image text tuning and reduced-token interaction, their model achieved a mean recall of 71.6% on the AIC-ICC benchmark, surpassing previous results by a notable margin. This study highlights the potential for truly multilingual multimedia systems capable of effectively retrieving images based on queries in previously unseen languages. Additionally, Al-Buraihy and Wang28 presented a multimodal approach to enhance cross-lingual image descriptions, emphasizing semantic relevance and stylistic alignment. The study demonstrated improvements in semantic accuracy and stylistic coherence, providing a robust framework for enhancing cross-lingual image descriptions. In a separate study, Al-Buraihy and Wang29 employed transformer-based architecture to develop a method for bilingual image caption generation. This approach, which generates high-quality captions in both Arabic and English, addresses the linguistic and cultural nuances of these languages. The study achieved improvements in BLEU and METEOR scores, further illustrating the effectiveness of transformer models in cross-lingual tasks.
Comparative analysis of AraTraditions10k with state-of-the-art datasets
A detailed comparative analysis of AraTraditions10k against prominent multimodal datasets, including LAION-5B30, MS-COCO19, and Google’s Gemini dataset31, highlights the distinctions in dataset composition, linguistic diversity, cultural representation, and the presence of multimodal embeddings. Unlike LAION-5B and MS-COCO, which derive captions from large-scale web data with an emphasis on quantity, AraTraditions10k prioritizes high-quality, culturally specific Arabic-English captions curated through expert annotation and professional translation. While Google’s Gemini dataset integrates advanced multimodal embeddings, it lacks explicit alignment for Arabic cultural contexts, making AraTraditions10k a unique resource in this aspect. The dataset’s focus on detailed, human-verified captions ensures a nuanced representation of Arabic traditions, which is not explicitly addressed in existing large-scale datasets. Table 1 presents a structured comparison of these datasets:
Comparative analysis
The field of cross-lingual image annotation and retrieval has seen substantial advancements in recent years, driven by the development of various datasets that have contributed to progress in this area. This section presents a comparative analysis of the AraTraditions10k dataset against five notable recent datasets, emphasizing the key differences and improvements.
The COCO-CN dataset, an extension of the Microsoft COCO dataset, includes Chinese annotations for 20,342 images, accompanied by 27,218 Chinese sentences and 70,993 tags. While COCO-CN is recognized for its high-quality annotations achieved through recommendation-assisted collective annotation systems, its coverage is restricted to the Chinese and English languages. In contrast, AraTraditions10k offers a dataset of 10,000 images with annotations in Modern Standard Arabic (MSA) and English, addressing a distinct linguistic and cultural context, and employing advanced annotation strategies to ensure high-quality outcomes.
The Flickr30k-CN dataset, which translates English captions from the Flickr30k dataset into Chinese, functions as a valuable resource for initial studies in Chinese image captioning. However, its limited size and visual content diversity pose challenges for training robust models. AraTraditions10k, on the other hand, provides a more diverse and extensive dataset, with images representing various aspects of Arabic culture, thereby enriching the visual content available for cross-lingual research.
The AIC-ICC dataset contains 240,000 images annotated with 1.2 million crowd-sourced sentences, though it is heavily biased towards human activities, which limits the generalizability of models trained on this dataset. Additionally, the variability in annotation quality due to crowd-sourcing presents another limitation. AraTraditions10k, while smaller in scale, ensures high-quality annotations through professional translation and recommendation-assisted tools, and addresses a broader range of visual content relevant to Arabic culture.
The Multi30k dataset provides multilingual English-German image descriptions, facilitating research in multilingual image captioning. Despite its detailed annotations, the dataset is constrained by its focus on the English and German languages. AraTraditions10k addresses this gap by focusing on Arabic and English, offering a culturally rich dataset that is not confined to European languages.
Lastly, the AI Challenger dataset, a large-scale resource for image understanding, comprises 240,000 images with Chinese captions. Although it supports deep learning research in image captioning, its primary focus on Chinese captions may not cover diverse cultural aspects as comprehensively as AraTraditions10k. In contrast, AraTraditions10k provides detailed annotations in both Arabic and English, specifically targeting the cultural richness of Arabic traditions, and incorporates advanced techniques for enhancing annotation quality. Table 2 shows the literature summary.
In conclusion, the development and use of multilingual datasets like COCO-CN, Flickr8k-CN, and AIC-ICC have contributed to advancements in cross-lingual image tagging and retrieval. These datasets, combined with innovative annotation systems and model enhancements, provide a solid foundation for future research in this area. The findings from these works highlight the importance of high-quality annotations, effective translation strategies, and advanced model architectures in achieving improved performance in cross-lingual multimedia tasks. The literature review discusses these advancements while identifying limitations, such as small dataset size, bias towards specific visual content, and insufficient cultural diversity. Recent studies have introduced advanced techniques, such as dual-attention mechanisms and Transformer-based models, to enhance performance. Addressing these gaps, our work introduces AraTraditions10k, a comprehensive and culturally rich dataset with high-quality Arabic and English annotations, and proposes enhanced models incorporating attention mechanisms and contrastive loss functions to improve cross-lingual image tagging and captioning performance.
The dual contribution of the AraTraditions10k dataset and the advanced models developed in this research sets a new standard for inclusivity and effectiveness in multilingual multimedia analysis. This claim is substantiated by comparing AraTraditions10k to several well-established datasets in the field. While datasets like COCO-CN, Flickr30k-CN, and AIC-ICC have made significant strides in cross-lingual image captioning, they each have notable limitations in cultural representation and dataset scope. For instance, COCO-CN primarily focuses on Chinese and English, while AIC-ICC is heavily biased toward human-centric activities. Furthermore, the datasets often rely on crowdsourced annotations, which can lead to variability in quality and a lack of consistency.
In contrast, AraTraditions10k distinguishes itself by offering high-quality, culturally specific annotations in both Arabic and English, curated through professional translation and recommendation-assisted annotation systems. This dual-lingual and culturally focused approach enhances the semantic accuracy and relevance of captions, addressing a significant gap in the representation of Arabic culture, which is often underrepresented in large-scale datasets. Moreover, the dataset’s size (10,000 images) ensures sufficient diversity of visual content to train robust models while maintaining high-quality annotations.
The enhanced model architecture, incorporating attention mechanisms and contrastive loss functions, further amplifies the dataset’s utility, surpassing the performance of models trained on other datasets. By focusing on Arabic-English image captioning and incorporating advanced annotation techniques, AraTraditions10k provides a more inclusive and culturally representative resource compared to existing datasets, establishing a new benchmark for multilingual and cross-cultural multimedia analysis.
Research methodology
The methodology for developing and utilizing the AraTraditions10k dataset involves several key components aimed at maximizing the utility and quality of the dataset for cross-lingual image annotation, retrieval, and tagging tasks, as shown in Fig. 1. The dataset will be made available online upon acceptance. This section details the techniques and strategies employed in image collection, annotation, model development, and evaluation. To enhance this study, the Arabic version of the Flickr8K dataset introduced by27 has been included. This dataset, recognized for its accuracy, offers extra images that depict English culture.
Image collection
The initial phase in the development of the AraTraditions10k dataset involved a systematic and rigorous process of collecting images that authentically represent various aspects of Arabic traditions. Images were meticulously sourced from reputable websites and cultural archives dedicated to the study and preservation of Arabic culture and history. These sources included academic platforms, cultural organizations, and specialized websites committed to promoting and maintaining Arabic traditions. Each image underwent a stringent evaluation based on specific curation criteria to ensure its accurate representation of traditional Arabic culture. The criteria included the depiction of traditional attire, such as thobes, abayas, and headwear; the representation of customary activities and practices, including falconry, henna application, and traditional dance; and the inclusion of traditional Arabic architecture and landmarks, such as souks, mosques, and desert landscapes. To uphold the highest standards of quality, each image was critically reviewed for clarity, relevance, and cultural accuracy, and any image that failed to meet these standards was excluded from the dataset.
The process of collecting images for the AraTraditions10k dataset involved developing and implementing a specific algorithm designed to download images from multiple URLs efficiently. This algorithm ensured that the images were systematically collected, stored, and organized, allowing for a streamlined image acquisition process, as shown in Algorithm 1.
Algorithm 1: Download images from multiple URLs

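Because the algorithm listing is provided as a figure, a minimal Python sketch of the described download procedure is given below. The output directory, file-naming pattern, and request timeout are illustrative assumptions rather than the authors' exact implementation.

```python
import os
import requests

def download_images(urls, out_dir="aratraditions_images"):
    """Download images from a list of URLs and store them sequentially."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for idx, url in enumerate(urls, start=1):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
        except requests.RequestException as err:
            # Failed downloads are skipped in this sketch rather than retried.
            print(f"Skipping {url}: {err}")
            continue
        # Keep the original extension when available, defaulting to .jpg.
        ext = os.path.splitext(url.split("?")[0])[1] or ".jpg"
        path = os.path.join(out_dir, f"image_{idx}{ext}")
        with open(path, "wb") as fh:
            fh.write(response.content)
        saved.append(path)
    return saved
```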
Once the images were collected, an additional step was taken to rename them systematically to ensure proper organization and consistent naming conventions for ease of processing, as shown in Algorithm 2. This renaming algorithm assigned a unique number to each image file, ranging from 1 to 10,000, while maintaining the original file extension.
Algorithm 2: Rename images in a folder sequentially

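Likewise, a minimal Python sketch of the sequential renaming step is shown below. It assumes the images already reside in a single folder and does not guard against pre-existing numeric file names.

```python
import os

def rename_images_sequentially(folder):
    """Rename every file in `folder` to 1.<ext>, 2.<ext>, ... keeping extensions."""
    files = sorted(
        f for f in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, f))
    )
    for number, filename in enumerate(files, start=1):
        ext = os.path.splitext(filename)[1]  # preserve the original extension
        os.rename(
            os.path.join(folder, filename),
            os.path.join(folder, f"{number}{ext}"),
        )
```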
In addition, Table 3 illustrates the cultural richness and diversity of AraTraditions10k by providing a breakdown of the cultural representation of images in the dataset, showing the variety of themes and elements captured across different categories.
Annotation process
The annotation process for AraTraditions10k involved providing captions in both Arabic and English. Each image was annotated with five captions in Modern Standard Arabic (MSA) and translated into English by professional translators. The detailed process, broken down into key phases, is presented in Table 4. This table summarizes the steps, tools, and objectives for each phase of the annotation process.
Each phase of the annotation process was essential to ensure that the dataset upheld high standards of linguistic and cultural accuracy, facilitating the development of reliable bilingual models. The use of a recommendation-assisted system effectively streamlined the workflow while maintaining the quality of the generated captions.
The overall pipeline for the image collection and annotation process is illustrated in Fig. 2, which shows the key stages from image sourcing to the final annotation phase.
The annotation process was facilitated by a recommendation-assisted collective annotation system, as shown in Fig. 3.
Caption annotation and validation process
The caption annotation process involved a combination of human annotation and machine-assisted validation to enhance the accuracy, fluency, and cultural appropriateness of the captions. Initially, human annotators and professional translators provided captions in both Arabic and English. To further improve reliability, an LLM-based validation step was introduced to refine the captions automatically.
To ensure high-quality annotations, an automated validation process using a fine-tuned large language model (LLM) was implemented. This process followed several key steps. First, after human annotation, captions were processed through a fine-tuned GPT model trained for text refinement. The model identified unnatural phrasing, grammatical inconsistencies, and potential errors in translation. Next, using a CLIP-based similarity scoring approach, the captions were compared against image embeddings to verify alignment between textual descriptions and visual content. In the refinement and correction phase, the LLM suggested alternative phrasings where necessary, ensuring enhanced fluency and cultural appropriateness while preserving the original meaning. Finally, the refined captions were reviewed by human annotators to ensure that AI-generated modifications did not introduce errors or distortions.
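The CLIP-based similarity check can be sketched as follows, assuming the Hugging Face transformers implementation of CLIP and an illustrative checkpoint; the paper does not specify the exact CLIP variant or the acceptance threshold.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; the actual model used for validation is not specified.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def caption_image_similarity(image_path, captions):
    """Return an image-text similarity score for each candidate caption."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds the scaled image-text similarity per caption.
    return outputs.logits_per_image.squeeze(0).tolist()

# Captions whose score falls below a chosen threshold would be flagged for
# LLM refinement and human review (the threshold value is an assumption).
```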
Tag and sentence recommendation systems
To facilitate the annotation process and enhance the quality of the dataset, a tag and sentence recommendation system was developed, as shown in Fig. 4.
Figure 5 shows example images from the AraTraditions10k dataset, and Fig. 6 shows sample captions from the dataset.
Tag recommendation system
The tag recommendation system employed a Multi-Layer Perceptron (MLP) model, specifically designed to predict relevant tags for each image. The architecture of the model included several key components. Initially, an input layer was used to process visual features extracted from the images using a pre-trained Convolutional Neural Network (CNN), specifically the ResNet-50 architecture32. These features were then passed through multiple hidden layers, each utilizing rectified linear unit (ReLU) activations, to capture complex patterns within the visual data. The final output layer employed a sigmoid activation function to predict the probability of each tag being relevant to the image. The MLP model was trained on a subset of the AraTraditions10k dataset, where manually annotated tags acted as the ground truth. The model’s performance was rigorously evaluated using Precision, Recall, and F-measure metrics, focusing on the top five predicted tags for each image.
The tag prediction probability for a given tag \({t}_{i}\) and image \(I\) is computed using Eq. 1:

$$P\left({t}_{i}\mid I\right)=\sigma \left(W f(I) + b\right)$$ (1)

where \(P\left({t}_{i}\mid I\right)\) represents the probability of tag \({t}_{i}\) being relevant to the image \(I\), \(\sigma\) refers to the sigmoid activation function, which squashes the output to a range between 0 and 1, making it suitable for probability interpretation, \(W\) denotes the weight matrix associated with the visual features, \(f(I)\) refers to the visual features extracted from image \(I\) using a pre-trained CNN model, and \(b\) represents the bias term added to the output to adjust the activation.
To better illustrate the implementation steps of the tag recommendation system, Algorithm 3 outlines the major processes involved.
Algorithm 3: Tag recommendation system

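Since the algorithm listing is provided as a figure, a minimal PyTorch sketch of the described pipeline (ResNet-50 features, ReLU hidden layers, sigmoid output as in Eq. 1, top-5 tag selection) is shown below. The hidden-layer sizes, dropout rate, and tag vocabulary handling are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TagRecommender(nn.Module):
    """MLP over ResNet-50 features with per-tag sigmoid outputs (Eq. 1)."""
    def __init__(self, num_tags, feature_dim=2048, hidden_dims=(1024, 512)):
        super().__init__()
        layers, in_dim = [], feature_dim
        for h in hidden_dims:
            layers += [nn.Linear(in_dim, h), nn.ReLU(), nn.Dropout(0.3)]
            in_dim = h
        layers += [nn.Linear(in_dim, num_tags)]
        self.mlp = nn.Sequential(*layers)

    def forward(self, features):
        # Sigmoid yields an independent relevance probability per tag.
        return torch.sigmoid(self.mlp(features))

# Feature extractor f(I): ResNet-50 with its classification head removed.
backbone = resnet50(weights="IMAGENET1K_V2")
backbone.fc = nn.Identity()
backbone.eval()

def recommend_tags(image_tensor, model, tag_vocab, top_k=5):
    """Return the top-k tags for one preprocessed image tensor (3x224x224)."""
    with torch.no_grad():
        feats = backbone(image_tensor.unsqueeze(0))   # (1, 2048)
        probs = model(feats).squeeze(0)               # (num_tags,)
    top = torch.topk(probs, k=top_k).indices.tolist()
    return [tag_vocab[i] for i in top]
```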
Sentence recommendation system
The sentence recommendation system is built upon the W2VV model, which was enhanced with an attention mechanism to improve its performance in generating relevant captions. The attention mechanism allows the model to focus on specific regions of the image that are most relevant to the query sentence, ensuring that the generated captions are both accurate and contextually appropriate.
The attention mechanism in the W2VV model operates as follows:
1. Extract visual features from the image using a pre-trained CNN, resulting in a set of feature vectors, as shown in Eq. 2:

$$V=\{v_{1}, v_{2}, \dots , v_{k}\}$$ (2)

where \(V\) is the set of visual feature vectors extracted from the image, and \(v_{1}, v_{2}, \dots , v_{k}\) are the individual feature vectors representing different regions or parts of the image. \(k\) represents the total number of feature vectors extracted, which corresponds to the number of distinct regions in the image.
2. Represent the query sentence as a sequence of word embeddings, as shown in Eq. 3:

$$Q=\{q_{1}, q_{2}, \dots , q_{m}\}$$ (3)

where \(Q\) is the set of word embeddings for the query sentence, and \(q_{1}, q_{2}, \dots , q_{m}\) are the individual word embeddings representing each word in the query sentence. \(m\) is the number of words in the sentence.
3. Compute attention scores for each visual feature based on the query representation using a compatibility function, as shown in Eq. 4:

$${e}_{i}={q}_{t}^{T} {W}_{a} {v}_{i}$$ (4)

where \({e}_{i}\) is the attention score for the i-th visual feature vector, \({q}_{t}^{T}\) is the transpose of the query word embedding at time step \(t\), \({W}_{a}\) is the attention weight matrix, and \({v}_{i}\) is the i-th visual feature vector from the image.
4. Normalize the attention scores using the softmax function to obtain the attention weights, as shown in Eq. 5:

$${\alpha }_{i}=\frac{\exp({e}_{i})}{{\sum }_{j=1}^{k}\exp({e}_{j})}$$ (5)

where \({\alpha }_{i}\) is the normalized attention weight for the i-th feature vector, \(\exp({e}_{i})\) is the exponentiated attention score \({e}_{i}\), and \({\sum }_{j=1}^{k}\exp({e}_{j})\) is the sum of the exponentiated attention scores across all feature vectors, used for normalization.
5. Compute the context vector as a weighted sum of the visual features, as shown in Eq. 6:

$${c}_{t}={\sum }_{i=1}^{k}{\alpha }_{i}{v}_{i}$$ (6)

where \({c}_{t}\) is the context vector at time \(t\), representing the weighted sum of visual features, \({\alpha }_{i}\) is the attention weight for the i-th feature vector, and \({v}_{i}\) is the i-th visual feature vector.
6. Combine the context vector and the query representation to generate the final attended representation, as shown in Eq. 7:

$${h}_{t}=\tanh\left({W}_{c} [{c}_{t};{q}_{t}] + {b}_{c}\right)$$ (7)

where \({h}_{t}\) is the attended representation at time \(t\), \({W}_{c}\) is the weight matrix for combining the context vector and word embedding, \({c}_{t}\) is the context vector at time \(t\), and \({q}_{t}\) is the word embedding at time \(t\). \({b}_{c}\) is the bias term, and \(\tanh\) is the activation function used to introduce non-linearity. The next word in the sentence is then generated using the attended representation \({h}_{t}\).
7. A contrastive loss function was incorporated to effectively train the model to distinguish between relevant and irrelevant captions. This function works by minimizing the distance between relevant image-caption pairs and maximizing the distance for irrelevant pairs. The contrastive loss function is defined as shown in Eq. 8:

$$L=\frac{1}{N} {\sum }_{i=1}^{N} \left[ {y}_{i}\cdot D\left({x}_{i}, {x}_{i}^{+}\right)+(1- {y}_{i})\cdot \max\left(0,m-D\left({x}_{i}, {x}_{i}^{-}\right)\right)\right]$$ (8)

where \(L\) is the contrastive loss, \(N\) is the number of training pairs, and \({y}_{i}\) is a binary label indicating whether the pair is a positive match (\({y}_{i}=1\)) or a negative match (\({y}_{i}=0\)). \(D\left({x}_{i}, {x}_{i}^{+}\right)\) is the distance between the image features and the caption for positive pairs, and \(D\left({x}_{i}, {x}_{i}^{-}\right)\) is the distance between the image features and the caption for negative pairs. \(m\) is the margin hyperparameter that defines the minimum acceptable distance for negative pairs, and \(\max\left(0,m-D\left({x}_{i}, {x}_{i}^{-}\right)\right)\) ensures that a loss is only incurred when the negative distance falls within the margin \(m\).
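As an illustration of Eq. 8, a minimal PyTorch sketch of this contrastive loss is given below. The choice of Euclidean distance and the margin value are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, paired, labels, margin=0.2):
    """Eq. 8: pull matched image-caption pairs together, push mismatches apart.

    anchor, paired: (N, d) embeddings of images and captions.
    labels: (N,) with 1 for a correct pair and 0 for a mismatched pair.
    Euclidean distance and margin=0.2 are illustrative assumptions.
    """
    dist = F.pairwise_distance(anchor, paired)                        # D(x_i, x_i^±)
    positive_term = labels * dist                                     # y_i · D(x_i, x_i^+)
    negative_term = (1 - labels) * torch.clamp(margin - dist, min=0)  # max(0, m - D)
    return (positive_term + negative_term).mean()
```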
Algorithm 4 illustrates the implementation steps of the sentence recommendation system.
Algorithm 4: Sentence recommendation system

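Since the listing appears as a figure, a minimal PyTorch sketch of the attention step (Eqs. 4-7) used to score candidate sentences against an image is shown below. The embedding dimensions and the scoring head are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class AttentiveW2VV(nn.Module):
    """Sketch of the attention mechanism used for image-sentence matching."""
    def __init__(self, visual_dim=2048, word_dim=300, joint_dim=512):
        super().__init__()
        self.w_a = nn.Linear(visual_dim, word_dim, bias=False)  # W_a in Eq. 4
        self.w_c = nn.Linear(visual_dim + word_dim, joint_dim)  # W_c, b_c in Eq. 7
        self.score = nn.Linear(joint_dim, 1)                    # relevance head (assumption)

    def forward(self, V, Q):
        # V: (k, visual_dim) region features; Q: (m, word_dim) word embeddings.
        e = Q @ self.w_a(V).t()                   # (m, k) attention scores, Eq. 4
        alpha = torch.softmax(e, dim=-1)          # Eq. 5
        context = alpha @ V                       # (m, visual_dim) context vectors, Eq. 6
        h = torch.tanh(self.w_c(torch.cat([context, Q], dim=-1)))  # Eq. 7
        # Average the per-word relevance into a single image-sentence score.
        return self.score(h).mean()
```

Candidate sentences from the caption pool can then be ranked by this score, with the top-ranked sentences surfaced to annotators as recommendations.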
Model training and evaluation
The models were trained and evaluated using the AraTraditions10k dataset, following a systematic process designed to ensure robust performance. The dataset was first partitioned into three subsets: training, validation, and test sets. The training set was utilized to train the models, the validation set was employed for tuning hyperparameters, and the test set was reserved for evaluating the final performance of the models. The training process involved the application of backpropagation and stochastic gradient descent (SGD) with adaptive learning rates, ensuring efficient learning. To mitigate the risk of overfitting, regularization techniques, such as dropout, were implemented.
The training objective was defined by Eq. 9:

$$\underset{\theta }{\min}\; \frac{1}{N} {\sum }_{i=1}^{N} L\left(f\left({x}_{i}; \theta \right), {y}_{i}\right) + \lambda R(\theta )$$ (9)

where \(\theta\) represents the model parameters, \(N\) is the number of training samples, \(L\) is the loss function, \(f\left({x}_{i}; \theta \right)\) is the model prediction for input \({x}_{i}\), \({y}_{i}\) is the corresponding ground-truth label, \(\lambda\) is the regularization parameter, and \(R(\theta )\) is the regularization term.
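As a concrete illustration of this training setup (Eq. 9), the sketch below combines SGD, a learning-rate scheduler, dropout inside the model, and weight decay as the regularization term; all hyperparameter values are illustrative assumptions.

```python
import torch
from torch import nn, optim

def train(model, train_loader, val_loader, epochs=30, lr=0.01, weight_decay=1e-4):
    """Training loop sketch: loss L plus L2 regularization via weight decay."""
    criterion = nn.BCELoss()  # multi-label tag loss (assumed choice of L)
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9,
                          weight_decay=weight_decay)  # weight_decay plays the role of lambda*R(theta)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=2)

    for epoch in range(epochs):
        model.train()
        for features, targets in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(features), targets)
            loss.backward()          # backpropagation
            optimizer.step()

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for features, targets in val_loader:
                val_loss += criterion(model(features), targets).item()
        scheduler.step(val_loss / max(len(val_loader), 1))  # adapt the learning rate
```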
The performance of the models was evaluated using a combination of automated and human evaluation metrics. Automated metrics included Precision, Recall, F-measure, BLEU33, METEOR34, ROUGE-L35, SPICE36, and CIDEr37, providing a comprehensive assessment of the models’ effectiveness. Additionally, human evaluations were conducted through a user study in which participants rated the relevance and fluency of the generated captions on a Likert scale from 1 to 5, offering qualitative insights into the models’ performance.
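For illustration, BLEU-4 for a single candidate caption can be computed with NLTK as follows; this is a minimal example with naive whitespace tokenization, not the exact evaluation toolkit used in the study.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu4(reference_captions, candidate):
    """BLEU-4 for one candidate caption against several reference captions."""
    refs = [r.split() for r in reference_captions]   # naive tokenization (assumption)
    hyp = candidate.split()
    return sentence_bleu(refs, hyp, weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=SmoothingFunction().method1)

score = bleu4(
    ["a man in a traditional thobe trains a falcon in the desert",
     "a falconer wearing a white thobe holds a falcon"],
    "a man wearing a thobe holds a falcon in the desert",
)
```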
Results and analysis
The outcomes of the model evaluations are presented in this section, with accompanying tables and figures that showcase the overall performance metrics and visual examples of the annotated images. The analysis emphasizes the advancements achieved through the enhanced models and demonstrates the effectiveness of the AraTraditions10k dataset in facilitating cross-lingual image annotation, retrieval, and tagging tasks.
Automated evaluation metrics
Tag recommendation performance
The performance of the tag recommendation system was evaluated using Precision, Recall, and F-measure metrics, with a focus on the top 5 tags predicted for each image. These metrics provide a comprehensive view of the system’s accuracy and efficiency in correctly identifying and recommending relevant tags for images in the AraTraditions10k dataset. The detailed performance results are presented in Table 5.
In addition to the standard metrics, further insights were gained through the visual analysis of the model’s performance, as illustrated in the provided figures (Figs. 7, 8, 9 and 10). These figures include the Tag Recommendation Model Accuracy, Model Loss, Confusion Matrix, and ROC Curve, which depict the model’s overall effectiveness.
Figure 7 presents the accuracy of the Tag Recommendation Model across training epochs, distinguishing between training accuracy and validation accuracy. As the number of epochs increases, both the training and validation accuracies exhibit a consistent upward trajectory, stabilizing around 92.3% for the training set and 92.2% for the validation set by the end of the training process. This close alignment between training and validation accuracies indicates that the model has effectively learned the underlying patterns in the data without overfitting, which is critical for generalizability. The incremental improvements in accuracy across epochs suggest that the model’s parameters have been tuned, allowing it to maintain a high level of performance on unseen data.
Figure 8 illustrates the loss curves for both the training and validation sets across successive training epochs. The loss metric, which quantifies the error between predicted and actual tags, decreases steadily, with the training loss settling at approximately 0.1645 and the validation loss stabilizing around 0.1670. The decreasing trend in both curves indicates effective optimization, with the loss for the validation set closely mirroring that of the training set. This parallel decline further confirms the model’s robustness and its ability to generalize beyond the training data, minimizing the risk of overfitting. The convergence of the loss values suggests that the model has reached an optimal point where further training would likely yield diminishing returns.
The confusion matrix in Fig. 9 provides a detailed breakdown of the model’s performance across two classes: non-relevant tags (class ‘0’) and relevant tags (class ‘1’). The model correctly identified 29,511 instances of non-relevant tags and 4,001 instances of relevant tags, while making errors in 1,230 cases of non-relevant tags and 1,442 cases of relevant tags. These results yield precision values of 0.95 for class ‘0’ and 0.76 for class ‘1’, with corresponding recall scores of 0.96 and 0.74. The high precision and recall for non-relevant tags demonstrate the model’s ability to effectively filter out incorrect tags, whereas the slightly lower scores for relevant tags indicate a need for improvement in capturing all relevant tags without introducing noise. However, the overall strong performance across both classes reflects a balanced model that maintains a reliable trade-off between precision and recall.
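As a consistency check, the per-class precision and recall quoted above can be recomputed directly from the confusion-matrix counts; the mapping of the two error counts to false positives and false negatives below is our reading of Fig. 9.

```python
# Counts read from the confusion matrix in Fig. 9 (class 1 = relevant tag).
tn, fp, fn, tp = 29_511, 1_230, 1_442, 4_001

precision_relevant = tp / (tp + fp)                    # 4001 / 5231  ≈ 0.76
recall_relevant    = tp / (tp + fn)                    # 4001 / 5443  ≈ 0.74
precision_nonrel   = tn / (tn + fn)                    # 29511 / 30953 ≈ 0.95
recall_nonrel      = tn / (tn + fp)                    # 29511 / 30741 ≈ 0.96
accuracy           = (tp + tn) / (tp + tn + fp + fn)   # 33512 / 36184 ≈ 0.93
```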
The Receiver Operating Characteristic (ROC) curve in Fig. 10 offers a visual representation of the model’s discriminatory power between the two classes. The Area Under the Curve (AUC) value of 0.9683 underscores the model’s high accuracy in distinguishing between relevant and non-relevant tags. The curve’s proximity to the top-left corner of the plot signifies a high true positive rate coupled with a low false positive rate, indicating that the model is highly effective in identifying relevant tags while minimizing incorrect classifications. This high AUC value reinforces the model’s reliability and its potential applicability in real-world scenarios where accurate tag prediction is critical.
The detailed analysis of these figures demonstrates that the Tag Recommendation Model exhibits strong performance across various metrics, with a particular emphasis on its ability to generalize effectively. The consistent alignment between training and validation metrics, combined with the high precision, recall, and AUC values, indicates that the model is both accurate and robust. These findings underscore the strength of the AraTraditions10k dataset and the effectiveness of the models developed within this research, paving the way for their application in cross-lingual multimedia tasks where cultural and linguistic accuracy is paramount.
Sentence recommendation performance
The performance of the sentence recommendation model was rigorously evaluated using several advanced metrics, widely recognized in the field of natural language processing and image captioning. These metrics, including BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE, collectively offer a comprehensive assessment of the model’s ability to generate accurate, semantically rich, and contextually appropriate captions in both English and Arabic.
The BLEU-4 scores for the English and Arabic datasets demonstrated strong alignment with human judgment, reaching 78.2 and 74.1, respectively, indicating the model’s high precision in capturing relevant content. Similarly, the METEOR scores, reflecting the model’s focus on word choice and synonymy, were 68.3 for English and 64.6 for Arabic, showing the model’s ability to maintain linguistic diversity while producing culturally relevant captions. Furthermore, the ROUGE-L scores, which emphasize the model’s recall of essential content, were 75.8 and 70.4 for English and Arabic datasets, respectively, reinforcing the model’s capability to generate comprehensive captions.
The CIDEr metric, designed to evaluate how captions align with human-generated descriptions, showcased particularly high scores of 136.7 for English and 133.2 for Arabic, underscoring the model’s proficiency in generating human-like, coherent captions. Lastly, the SPICE metric, which assesses the model’s semantic richness and fine-grained understanding of content, yielded scores of 52.0 for English and 47.5 for Arabic, highlighting the model’s effectiveness in capturing the nuanced relationships between objects and scenes.
As summarized in Table 6 and visualized in Fig. 11, these results illustrate the model’s robustness in handling bilingual caption generation, achieving strong performance across both languages and indicating its suitability for cross-cultural image captioning tasks.
Statistical and experimental analysis
To strengthen the experimental analysis, an error analysis was conducted to provide insights into the reliability of the reported accuracy values. The tag recommendation system, which originally reported an accuracy of 93%, was further examined by computing confidence intervals and statistical significance tests. A 95% confidence interval was estimated using bootstrapping techniques, resulting in an accuracy range of [92.1%, 93.8%]. Additionally, a t-test was performed to compare the tag recommendation model against a baseline system, yielding a p-value of 0.002, confirming statistical significance.
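A minimal sketch of the percentile bootstrap for the accuracy confidence interval and the paired significance test is given below; the resampling count, random seed, and use of `scipy.stats.ttest_rel` are illustrative assumptions about how such an analysis can be run.

```python
import numpy as np
from scipy import stats

def bootstrap_accuracy_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy; `correct` is a 0/1 array per test item."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct)
    boots = [
        rng.choice(correct, size=correct.size, replace=True).mean()
        for _ in range(n_boot)
    ]
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return correct.mean(), (lo, hi)

def compare_to_baseline(model_scores, baseline_scores):
    """Paired t-test on per-item scores of the proposed model vs. a baseline."""
    t_stat, p_value = stats.ttest_rel(model_scores, baseline_scores)
    return t_stat, p_value
```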
In evaluating sentence recommendations, BLEU-4 and METEOR scores were originally reported without addressing their limitations in multilingual contexts. To enhance the assessment, the COMET score was introduced, as it better captures semantic fidelity and cross-lingual consistency. The updated evaluation results are presented in Table 7.
Impact assessment of LLM-based validation
To evaluate the effectiveness of LLM-based validation, a comparative analysis was conducted on a subset of captions before and after refinement. In addition, to enhance the evaluation of AraTraditions10k, additional benchmarking techniques have been introduced beyond traditional metrics such as BLEU, METEOR, and CIDEr. The dataset’s performance is now assessed using retrieval accuracy under adversarial conditions, measuring its robustness in cross-lingual image-text retrieval tasks. Linguistic coherence scores generated via LLM-based validation models are also incorporated to evaluate fluency and semantic consistency in Arabic-English captions. These metrics provide deeper insights into the quality and cross-lingual alignment of the dataset. The results are summarized in Table 8.
The BLEU score improvement indicates better linguistic coherence and consistency, while the fluency score—evaluated by independent annotators—shows a notable enhancement in readability. Additionally, the increase in human agreement suggests improved alignment between the generated captions and human expectations.
Ablation study of W2VV components
An ablation study was conducted to evaluate the impact of individual components of the W2VV model architecture on sentence recommendation performance. This analysis systematically examines how key features, such as the Attention Mechanism and Contrastive Loss, contribute to improvements in both English and Arabic sentence recommendation tasks. The results of the ablation study are presented in Tables 9 and 10.
The ablation results in Tables 9 and 10 demonstrate the effectiveness of the W2VV model, particularly when enhanced with attention mechanisms and contrastive loss. The baseline MLP model, while adequate, shows noticeably lower performance across all metrics in both English and Arabic tasks. The addition of the W2VV model improves the BLEU-4 score to 74.0 in English and 70.0 in Arabic, showing that the model benefits from the sophisticated feature extraction and semantic matching capabilities of W2VV.
The full model, incorporating both Attention and Contrastive Loss, outperforms the simpler configurations, achieving BLEU-4 scores of 78.2 for English and 74.1 for Arabic, along with the highest CIDEr scores across both datasets (136.7 for English and 133.2 for Arabic). Additionally, the SPICE scores—52.0 for English and 47.5 for Arabic—demonstrate the model’s ability to capture semantic richness and the fine-grained relationships between objects in the images. These improvements emphasize the value of attention mechanisms and contrastive loss in enhancing the model’s capacity to generate accurate, contextually relevant captions.
The ablation study confirms that both the attention mechanism and the contrastive loss are vital to the success of the W2VV model in sentence recommendation tasks. Integrating these components yields substantial performance gains across all evaluated metrics—BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE—further reinforcing the model’s robustness in generating precise, fluent, and semantically rich bilingual captions.
Figures 12 and 13 visualize the progressive improvements resulting from integrating attention and contrastive loss for both English and Arabic tasks, further illustrating the significance of these components. This comprehensive analysis highlights how the W2VV with Attention and Contrastive Loss model excels in cross-lingual and cross-cultural sentence recommendation tasks, demonstrating its effectiveness in generating accurate, contextually aware, and linguistically diverse captions.
Human evaluation study
To supplement the automated metrics, a human evaluation study was conducted to assess the relevance and fluency of the captions generated by the models. The study involved 50 participants, comprising both native Arabic speakers and proficient English speakers, with demographics as follows: 60% male and 40% female, aged between 20 and 50 years, with 80% holding a university degree. All participants were proficient in both languages, enabling them to evaluate the captions accurately.
Participants were randomly presented with 100 images from the AraTraditions10k dataset. For each image, they evaluated captions generated by three different models: the Baseline MLP, W2VV, and the W2VV with Attention and Contrastive Loss (Full Model). They rated each caption on a Likert scale from 1 to 5 for relevance—how accurately the caption described the image content—and fluency—how grammatically and syntactically correct the caption was. The results were analyzed using paired t-tests to assess the statistical significance of differences between the models, with a p-value of less than 0.05 considered statistically significant.
Participants also provided qualitative feedback, which highlighted the strengths and weaknesses of each model. The W2VV model was praised for capturing essential elements, such as traditional clothing and activities, more effectively than the baseline model. The full model was noted for generating more contextually relevant and detailed captions, often addressing specific cultural aspects missed by other models. Additionally, the full model produced the most fluent and coherent captions, closely resembling human-generated descriptions.
Statistical analysis confirmed that both the W2VV and the full models significantly outperformed the Baseline MLP in relevance and fluency. The full model also showed a statistically significant improvement over the W2VV in relevance and a marginally significant improvement in fluency.
Participants appreciated the cultural nuances captured by the W2VV and the full models, with some suggesting improvements for handling more complex scenes. The overall evaluation indicated that the full model generated the most contextually appropriate and grammatically correct captions, as reflected in the scores shown in Table 11 and visualized in Fig. 14.
In-depth analysis
The integration of LLM-based validation significantly improved caption quality by reducing inconsistencies and refining linguistic structures. However, some challenges remain, including the potential for over-correction, where nuanced cultural expressions were occasionally modified unnecessarily. Future work will explore fine-tuning the validation model further to balance refinement with cultural authenticity. By incorporating LLM-assisted validation, the dataset achieves greater reliability and linguistic precision, making it more suitable for cross-lingual image captioning tasks. Future improvements may include leveraging additional multimodal models to further enhance caption-image alignment.
Several studies highlight the importance of text preprocessing in natural language processing tasks. A comparative survey by Siino et al.38 examined the influence of preprocessing techniques on transformer-based models and traditional classifiers, emphasizing that preprocessing methods must be tailored to the target language and task. The authors found that while certain preprocessing techniques like tokenization and stopword removal are effective, their impact can be minimal in transformer models compared to traditional classifiers. This suggests that more sophisticated methods, such as contextualized embeddings, may offer greater benefits for models used in cross-lingual tasks like the current study. Similarly, Hickman et al.39 reviewed text preprocessing strategies for organizational research, providing insights into their impact on textual analysis. Their work underscored the value of preprocessing in improving model performance but also cautioned against over-reliance on generic preprocessing methods, especially when dealing with culturally and linguistically diverse datasets. This aligns with the need for custom preprocessing solutions in Arabic-English image captioning tasks. Finally, Chai40 compared various text preprocessing methods, highlighting the advantages and drawbacks of approaches like stemming, lemmatization, and character-level tokenization. Chai’s findings suggest that for languages like Arabic, where morphological richness is a key feature, preprocessing methods that retain linguistic nuances—such as rule-based lemmatization—are often more effective than standard techniques.
However, the implementation of the attention mechanism improved the model’s performance by enabling it to focus on the most relevant aspects of an image when generating captions. This targeted attention allows the model to produce descriptions that are more accurate and contextually appropriate. The process involves calculating attention scores for each visual feature relative to the query representation, normalizing these scores, and creating a context vector as a weighted combination of the visual features. This context vector, combined with the query representation, guides the generation of the subsequent word in the caption, improving the overall quality of the descriptions. The observed improvements in BLEU-4, METEOR, ROUGE-L, SPICE, and CIDEr scores indicate that the attention mechanism enhances the model’s ability to capture essential details and generate more coherent and relevant sentences. Additionally, human evaluations, which show higher scores for grammar and relevance, support the positive impact of this mechanism.
Similarly, the use of the contrastive loss function was instrumental in improving the model’s performance by refining its ability to differentiate between relevant and irrelevant captions. This function works by minimizing the distance between positive pairs (correct image-caption matches) and maximizing the distance between negative pairs (incorrect matches), thereby training the model to better distinguish between similar and dissimilar data points. This refinement process results in more accurate tag recommendations and higher-quality captions. The application of contrastive loss led to notable improvements in F-measure, BLEU-4, and CIDEr scores, demonstrating its effectiveness in enhancing the model’s discriminative capacity. Additionally, the improvement in the SPICE metric highlights the enhanced semantic richness of the captions generated by the model.
Notable improvements in various NLP tasks, including cross-lingual image annotation, have been achieved through recent advancements in large language models (LLMs). One such advancement is prompt engineering, in which the performance of LLMs in cross-lingual tasks, such as caption generation, can be enhanced by carefully crafted prompts. It has been shown through techniques from prompt engineering that the structure and content of prompts can be tailored to reduce ambiguity and improve accuracy in the outputs generated by models like GPT-4 and BERT. The role of specific prompts in enhancing model effectiveness in cross-lingual NLP tasks has been emphasized in studies such as41,42. Even more accurate and contextually relevant captions in both Arabic and English could be achieved by integrating these techniques into the AraTraditions10k dataset.
A promising approach for expanding the effectiveness of the AraTraditions10k dataset without requiring exhaustive labeled data is provided by zero-shot and few-shot learning. Through the use of LLMs, such as GPT-4 and T5, these paradigms allow tasks to be performed with little to no ___domain-specific training data. Cross-lingual captioning can be improved without requiring the manual annotation of a vast number of image-caption pairs by leveraging these models. The ability of these models to adapt to new languages and tasks with minimal fine-tuning has been showcased in recent works on zero-shot learning techniques in LLMs, such as43, making them highly applicable to AraTraditions10k’s cross-lingual image captioning tasks.
Additionally, the quality of image annotations can be significantly enhanced by incorporating retrieval-augmented generation (RAG) techniques. The power of information retrieval has been combined with generative models in RAG, allowing for the generation of more contextually appropriate and accurate captions. In this context, the generation of captions for images could be improved by RAG through the retrieval of relevant information from a large corpus of images and textual data, which is then used to generate contextually rich captions. The potential of RAG in improving annotation quality by incorporating external knowledge into the generation process has been demonstrated in recent studies, such as44, and these findings could be instrumental in further enhancing the AraTraditions10k dataset.
In the context of cross-lingual NLP tasks, modern preprocessing techniques in large language model (LLM) pipelines—such as tokenization, normalization, and embedding-based methods—are critical to ensuring high-quality input data. Our dataset preprocessing pipeline leverages established methods; however, recent advancements in GPT-based preprocessing have introduced more advanced techniques. These include tokenization strategies that adapt to different linguistic structures and embeddings that capture semantic meaning more effectively across languages. For instance, the WECHSEL method45 facilitates the transfer of monolingual language models to new languages by effectively initializing subword embeddings for cross-lingual transfer. Additionally, trans-tokenization46 has been proposed as a cross-lingual vocabulary transfer strategy, designed to adapt high-resource monolingual LLMs to new target languages by initializing token embeddings based on semantic similarity. By comparing our preprocessing approach to these modern LLM techniques, future work could explore how such methods might be integrated into AraTraditions10k to improve cross-lingual accuracy and caption generation quality.
Visual examples
To showcase the effectiveness of the AraTraditions10k dataset and the developed models, Fig. 15 presents visual examples of images alongside their corresponding captions and tags. These examples also include side-by-side comparisons of outputs generated by different models, clearly illustrating the improvements achieved.
These visual demonstrations highlight the models’ capability to generate accurate and culturally relevant annotations for images that depict Arabic traditions. The results underscore that the AraTraditions10k dataset, when paired with advanced annotation techniques, is a powerful tool for cross-lingual image annotation, retrieval, and tagging. The improvements in tag and sentence recommendations, confirmed by both automated metrics and human evaluations, further validate the effectiveness of the enhanced models.
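For reference, the snippet below shows how one of the reported automated metrics, corpus-level BLEU-4, can be computed with NLTK. The reference and candidate captions are placeholders rather than actual model outputs from this study.

```python
# Illustrative computation of corpus-level BLEU-4 with NLTK; the tokenized captions
# below are placeholder examples, not outputs of the models described in this work.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["a", "man", "pouring", "arabic", "coffee", "from", "a", "dallah"]]]
candidates = [["a", "man", "pours", "arabic", "coffee", "from", "a", "dallah"]]

bleu4 = corpus_bleu(
    references, candidates,
    weights=(0.25, 0.25, 0.25, 0.25),                 # equal n-gram weights up to 4-grams
    smoothing_function=SmoothingFunction().method1,   # avoid zero scores on short texts
)
print(f"BLEU-4: {bleu4:.3f}")
```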
Comparative analysis with recent studies
To demonstrate the superiority of the AraTraditions10k dataset and the advanced models developed in this research, a comparison was conducted with six notable recent studies in the field. This analysis highlights the distinct strengths and contributions of AraTraditions10k. The key findings from this comparison are summarized in Table 12.
Guo et al.47 focused on cross-lingual information retrieval using large language models and handled primarily textual data, which limits the applicability of their approach to visual content. In contrast, this study presents a comprehensive visual dataset of 10,000 images annotated in both MSA and English. By addressing the gap in visual data for Arabic culture, this study improves the accuracy and contextual relevance of image annotations, showing a 20% increase in BLEU scores and a 15% improvement in F-measure.
Wang et al.48 introduced CVLUE, a benchmark dataset for Chinese vision-language understanding, focusing on the Chinese language. This study fills a critical void by providing Arabic and English datasets, thus enhancing cross-lingual capabilities for less-represented languages. Both studies emphasize high-quality annotations, but this study stands out by employing professional translators and recommendation-assisted annotation systems, ensuring superior annotation quality and cultural relevance. Its unique representation of Arabic culture, through images depicting traditional attire, activities, and architecture, makes it an invaluable resource for understanding and preserving Arabic traditions.
Nie et al.49 proposed a 1-to-K contrastive learning approach to enhance cross-lingual retrieval consistency. This study also incorporates contrastive loss functions within its W2VV model, resulting in measurable improvements in BLEU, METEOR, ROUGE-L, SPICE, and CIDEr scores. The meticulously curated AraTraditions10k dataset, with its high-quality annotations, ensures semantic accuracy and strong cultural representation, providing a robust foundation for model training and evaluation. On cross-lingual image annotation tasks, this work outperforms the approach of Nie et al.
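For clarity, the sketch below shows an InfoNCE-style image-text contrastive loss of the general kind referred to here. The temperature value and exact formulation used in this work are assumptions for illustration, not the implemented objective.

```python
# A minimal sketch of a symmetric image-text contrastive loss over in-batch
# negatives; hyperparameters and the precise formulation used in this study
# are assumptions for illustration.

import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor,
                     txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Cross-entropy in both directions (image->text and text->image)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature              # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)   # and each caption to its image
    return 0.5 * (loss_i2t + loss_t2i)
```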
Ren et al.50 introduced a transformer-based approach with dual visual align-cross attention for image captioning. Similarly, this study incorporates attention mechanisms in its W2VV model, enhancing its ability to generate contextually relevant and culturally accurate captions. This study achieves higher BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE scores, showcasing the effectiveness of its enhanced annotation strategies and model improvements. Furthermore, its focus on Arabic culture provides a unique and valuable dataset for cross-lingual research, compared with Ren et al.'s broader focus on technical advances in transformer models.
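As a simple illustration of attention in a W2VV-style sentence encoder, the sketch below pools token embeddings by learned relevance weights. The dimensions and layer choices are assumptions, not the configuration used in this study.

```python
# A hedged sketch of attention pooling over token embeddings, the kind of mechanism
# that lets a sentence encoder weight informative words more heavily; dimensions and
# layer choices are illustrative assumptions.

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # scalar relevance score per token

    def forward(self, token_embs: torch.Tensor) -> torch.Tensor:
        # token_embs: (batch, seq_len, dim)
        weights = torch.softmax(self.score(token_embs), dim=1)   # (batch, seq_len, 1)
        return (weights * token_embs).sum(dim=1)                 # (batch, dim)

pooled = AttentionPooling(dim=512)(torch.randn(2, 20, 512))
print(pooled.shape)  # torch.Size([2, 512])
```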
Song et al.51 employed an embedded heterogeneous attention transformer for cross-lingual image captioning. While both studies integrate attention mechanisms, this study also incorporates contrastive loss functions, leading to measurable improvements in model performance. The professional translation and high-quality annotations in AraTraditions10k ensure greater accuracy and cultural relevance, providing a more robust foundation for model training and evaluation. This study outperforms Song et al.'s approach in BLEU, METEOR, ROUGE-L, and CIDEr scores, demonstrating the effectiveness of its methodology and dataset quality.
Lastly, Lovenia et al.52 introduced SEACrowd, a multilingual multimodal data hub focusing on Southeast Asian languages. While SEACrowd addresses a different linguistic region, this study uniquely targets Arabic and English, filling a crucial gap in cross-lingual datasets for these languages. This work’s emphasis on Arabic traditions creates a culturally rich dataset essential for preserving and understanding Arabic heritage. This cultural focus distinguishes AraTraditions10k as a specialized and valuable resource. Moreover, its advanced annotation strategies, including recommendation-assisted tools and enhanced W2VV models, ensure high-quality annotations and superior model performance.
Conclusion
This study introduced the AraTraditions10k dataset, designed to enhance cross-lingual image annotation, retrieval, and tagging, focusing on Arabic and English. The dataset features culturally relevant images annotated with captions in both Modern Standard Arabic (MSA) and English, offering an accurate representation of Arabic traditions. Advanced models, including the MLP-based recommendation system and the enhanced W2VV model with attention mechanisms and contrastive loss functions, were developed to improve annotation accuracy. Evaluation results demonstrated significant improvements in key metrics such as Precision, Recall, F-measure, BLEU, METEOR, ROUGE-L, SPICE, and CIDEr, with both automated and human evaluations confirming the high accuracy of the generated annotations. The comparative analysis with existing datasets highlighted the superior performance of AraTraditions10k, demonstrating its potential in cross-lingual image annotation. However, the study’s limitations include a focus on Arabic traditions and a need for broader cultural representation. Future work will aim to expand the dataset to include more diverse languages and cultures, as well as improve the scalability and efficiency of the models for wider application in cross-lingual multimedia tasks.
Data availability
The dataset used in this study is available upon request. Please contact “Emran Al-Buraihy” for data from this study.
References
Giouli, V., Vacalopoulou, A., Sidiropoulos, N., Flouda, C., Doupas, A., Giannopoulos, G., Bikakis, N., Kaffes, V., & Stainhaouer, G. Placing multi-modal, and multi-lingual data in the humanities ___domain on the map: The Mythotopia geotagged corpus. In N. Calzolari (Ed.), Proceedings of the Language Resources and Evaluation Conference, LREC 2022 (pp. 2856–2864). European Language Resources Association, Marseille, France (2022).
Salar, A. & Ahmadi, A. Improving loss function for deep convolutional neural network applied in automatic image annotation. Visual Comput. 40, 1617–1629. https://doi.org/10.1007/s00371-023-02873-3 (2024).
Zhang, Y. et al. Interactive medical image annotation using improved Attention U-net with compound geodesic distance. Expert Syst. Appl. 237, 121282. https://doi.org/10.1016/j.eswa.2023.121282 (2024).
Inoue, M. Mining visual knowledge for multi-lingual image retrieval. In N. Calzolari (Ed.), Proceedings of the 21st International Conference on Advanced Information Networking and Applications Workshops/Symposia, AINAW’07 (Vol. 2, pp. 307–312). IEEE, Marseille, France (2007). https://doi.org/10.1109/AINAW.2007.251
Palekar, V. & Kumar, L. S. Adaptive optimized residual convolutional image annotation model with bionic feature selection model. Comput. Stand. Interfaces 87, 103780. https://doi.org/10.1016/j.csi.2023.103780 (2024).
Adnan, M. M. et al. Image annotation with YCbCr color features based on multiple deep CNN-GLP. IEEE Access 12, 11340–11353. https://doi.org/10.1109/ACCESS.2023.3330765 (2024).
Chen, H. H. & Chang, Y. C. Language translation and media transformation in cross-language image retrieval. In Digital Libraries: Achievements, Challenges and Opportunities, 9th International Conference on Asian Digital Libraries (ICADL 2006) (eds Sugimoto, S.) 350–359 https://doi.org/10.1007/11931584_38 (Springer, Kyoto, Japan, 2006).
Petkova, D., & Ballesteros, L. Categorizing and annotating medical images by retrieving terms relevant to visual features. In C. Peters, P. Clough, M. Sanderson, F.C. Gey, J. Gonzalo, & G.J.F. Jones (Eds.), CLEF (Working Notes). CEUR Workshop Proceedings, Vienna, Austria (2005).
Li, X. et al. COCO-CN for cross-lingual image tagging, captioning, and retrieval. IEEE Trans. Multim. 21, 2347–2360. https://doi.org/10.1109/TMM.2019.2896494 (2019).
Lan, W., Li, X., & Dong, J. Fluency-guided cross-lingual image captioning. In Q. Huang, Q. Tian, R. Zimmermann, Z.H. Zhou, T.S. Chua, & R. Lienhart (Eds.), Proceedings of the 25th ACM International Conference on Multimedia (pp. 1549–1557). ACM, New York, NY, USA (2017). https://doi.org/10.1145/3123266.3123366
Li, X., Lan, W., Dong, J., & Liu, H. Adding Chinese captions to images. In D.A. Shamma, N. Ferro, Y. Kompatsiaris, & K. Schoeffmann (Eds.), ICMR 2016 - Proceedings of the 2016 ACM International Conference on Multimedia Retrieval (pp. 271–275). ACM, New York, NY, USA (2016). https://doi.org/10.1145/2911996.2912049
Wu, J., Zheng, H., Zhao, B., et al. AI Challenger: A large-scale dataset for going deeper in image understanding. arXiv:1711.06475. (2017) https://doi.org/10.48550/arXiv.1711.06475
Lan, W., Wang, X., Yang, G. & Li, X. Improving Chinese image captioning by tag prediction. Chin. J. Comput. 42, 136–147. https://doi.org/10.11897/SP.J.1016.2019.00136 (2019).
Sharma, H. & Padha, D. A comprehensive survey on image captioning: From handcrafted to deep learning-based techniques, a taxonomy and open research issues. Artif. Intell. Rev. 56, 13619–13661. https://doi.org/10.1007/s10462-023-10616-9 (2023).
Adnan, M. M. et al. Automatic image annotation based on deep learning models: A systematic review and future challenges. IEEE Access 9, 50253–50264. https://doi.org/10.1109/ACCESS.2021.3068897 (2021).
Burmania, A., Parthasarathy, S. & Busso, C. Increasing the reliability of crowdsourcing evaluations using online quality assessment. IEEE Trans. Affect. Comput. 7(4), 374–388. https://doi.org/10.1109/TAFFC.2015.2493525 (2015).
Sugimoto, S., Hunter, J., Rauber, A. & Morishima, A. (Eds.). Digital Libraries: Achievements, Challenges and Opportunities: 9th International Conference on Asian Digital Libraries, ICADL 2006, Kyoto, Japan, November 27–30, 2006, Proceedings (Springer LNCS, 2006). http://hdl.handle.net/20.500.12708/22332.
Caputo, B., Muller, H., Thomee, B., et al. ImageCLEF 2013: The vision, the data and the open challenges. In P. Forner, H. Müller, R. Paredes, P. Rosso, & B. Stein (Eds.), Information Access Evaluation. Multilinguality, Multimodality, and Visualization: 4th International Conference of the CLEF Initiative, CLEF 2013 (pp. 250–268). (Springer, 2013). https://doi.org/10.1007/978-3-642-40802-1_26
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. Microsoft COCO: Common objects in context. In D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, Proceedings, Part V (pp. 740–755). (Springer, 2014). https://doi.org/10.1007/978-3-319-10602-1_48
Xiao, Y., Jiang, A., Wang, M. & Jie, A. Chinese image captioning based on middle-level visual-semantic composite attributes. J. Chin. Inf. Process. 35, 129–138 (2021).
Chai, Y., Jin, S., & Xing, J. RefineCap: Concept-aware refinement for image captioning. arXiv:2109.03529. (2021) https://doi.org/10.48550/arXiv.2109.03529
Elliott, D., Frank, S., Sima’an, K., & Specia, L. Multi30k: Multilingual English-German image descriptions. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 70–74). (Association for Computational Linguistics, 2016). https://doi.org/10.18653/v1/w16-3210
Miyazaki, T., & Shimizu, N. Cross-lingual image caption generation. In K. Erk & N.A. Smith (Eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1780–1790). (Association for Computational Linguistics, 2016). https://doi.org/10.18653/v1/P16-1168
Tsutsui, S., & Crandall, D. J. Using artificial tokens to control languages for multilingual image caption generation. arXiv:1706.06275. (2017) https://doi.org/10.48550/arXiv.1706.06275
Li, J., Li, D., Xiong, C., & Hoi, S. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, & S. Sabato (Eds.), Proceedings of the 39th International Conference on Machine Learning (Vol. 162, pp. 12888–12900). (PMLR, 2022).
Lee, Y., Lim, K., Baek, W., Roh, B., & Kim, S. Efficient multilingual multi-modal pre-training through triple contrastive loss. In N. Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T.K. Lee, E. Santus, & F. Bond (Eds.), Proceedings of the 29th International Conference on Computational Linguistics (pp. 5730–5744). (ACL, 2022).
Gu, J., Meng, X., Lu, G., Hou, L., et al. Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark. In Advances in Neural Information Processing Systems 35 (pp. 26418–26431). (NeurIPS, 2022). https://doi.org/10.48550/arXiv.2202.06767
Al-Buraihy, E. & Wang, D. Enhancing cross-lingual image description: A multimodal approach for semantic relevance and stylistic alignment. Comput. Mater. Contin. 79, 3913–3938. https://doi.org/10.32604/cmc.2024.028929 (2024).
Al-Buraihy, E. & Wang, D. Cross-lingual visual understanding: A transformer-based approach for bilingual image caption generation. Afr. J. Biol. Sci. 6 (2024). https://doi.org/10.48047/AFJBS.6.7.2024.2990-3002
Schuhmann, C. et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Adv. Neural Inf. Process. Syst. 35, 25278–25294 (2022).
Gemini Team, Anil, R., Borgeaud, S., Wu, Y., Alayrac, J. B., Yu, J., & Hauth, A. Gemini: A family of highly capable multimodal models. arXiv:2312.11805 (2024).
He, K., Zhang, X., Ren, S., & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778). (IEEE, 2016). https://doi.org/10.1109/CVPR.2016.90
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). (Association for Computational Linguistics, 2002). https://doi.org/10.3115/1073083.1073135
Banerjee, S., & Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (pp. 65–72). (Association for Computational Linguistics, 2005).
Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (pp. 74–81). (Association for Computational Linguistics, 2004). https://doi.org/10.3115/1076090.1076095
Anderson, P., Fernando, B., Johnson, M., & Gould, S. SPICE: Semantic propositional image caption evaluation. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, Proceedings, Part V (Vol. 9909, pp. 382–398). (Springer, 2016). https://doi.org/10.1007/978-3-319-46493-0_26
Vedantam, R., Zitnick, C.L., & Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 4566–4575). IEEE, Boston, MA, USA (2015). https://doi.org/10.1109/CVPR.2015.7299087
Siino, M., Tinnirello, I. & La Cascia, M. Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on Transformers and traditional classifiers. Inf. Syst. 121, 102342. https://doi.org/10.1016/j.is.2023.102342 (2024).
Hickman, L., Thapa, S., Tay, L., Cao, M. & Srinivasan, P. Text preprocessing for text mining in organizational research: Review and recommendations. Organ. Res. Methods 25(1), 114–146. https://doi.org/10.1177/1094428120971683 (2022).
Chai, C. P. Comparison of text preprocessing methods. Nat. Lang. Eng. 29(3), 509–553. https://doi.org/10.1017/S1351324922000213 (2023).
Marvin, G., Hellen, N., Jjingo, D., & Nakatumba-Nabende, J. Prompt engineering in large language models. In International conference on data intelligence and cognitive informatics (pp. 387–402). (Springer, 2023). https://doi.org/10.1007/978-981-99-7962-2_30
Chen, B., Zhang, Z., Langrené, N., & Zhu, S. Unleashing the potential of prompt engineering in large language models: A comprehensive review (2023). arXiv:2310.14735.
Yong, G., Jeon, K., Gil, D. & Lee, G. Prompt engineering for zero-shot and few-shot defect detection and classification using a visual-language pretrained model. Comput.-Aid. Civ. Infrastruct. Eng. 38(11), 1536–1554. https://doi.org/10.1111/mice.12954 (2023).
Chirkova, N., Rau, D., Déjean, H., Formal, T., Clinchant, S., & Nikoulina, V. Retrieval-augmented generation in multilingual settings (2024). arXiv:2407.01463.
Minixhofer, B., Paischer, F., & Rekabsaz, N. WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. arXiv:2112.06598 (2021) https://doi.org/10.18653/v1/2022.naacl-main.293
Remy, F., Delobelle, P., Avetisyan, H., Khabibullina, A., de Lhoneux, M., & Demeester, T. Trans-tokenization and cross-lingual vocabulary transfers: Language adaptation of LLMs for low-resource NLP. arXiv:2408.04303 (2024).
Guo, P., Ren, Y., Hu, Y., et al. Steering large language models for cross-lingual information retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 585–596). (ACM, 2024). https://doi.org/10.1145/3626772.3657819
Wang, Y., Liu, Y., Yu, F., et al. CVLUE: A new benchmark dataset for Chinese vision-language understanding evaluation. arXiv:2407.01081. (2024) https://doi.org/10.48550/arXiv.2407.01081
Nie, Z., Zhang, R., Feng, Z., et al. Improving the consistency in cross-lingual cross-modal retrieval with 1-to-K contrastive learning. arXiv:2406.18254. (2024) https://doi.org/10.1145/3637528.3671787
Ren, Y. et al. Dual visual align-cross attention-based image captioning transformer. Multimed. Tools Appl. https://doi.org/10.1007/s11042-024-19315-4 (2024).
Song, Z. et al. Embedded heterogeneous attention transformer for cross-lingual image captioning. IEEE Trans. Multimed. 26, 9008–9020. https://doi.org/10.1109/TMM.2024.3384678 (2024).
Lovenia, H., Mahendra, R., Akbar, S.M., et al. SEACrowd: A multilingual multimodal data hub and benchmark suite for Southeast Asian languages. arXiv:2406.10118. (2024) https://doi.org/10.48550/arXiv.2406.10118
Funding
This work was supported by the Researchers Supporting Project number (RSP2025R395), King Saud University, Riyadh, Saudi Arabia, and by the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R343), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. It was also supported by the Department of Education of Guangdong Province (2024GCZX014), by the Key Area Special Project of the Guangdong Provincial Department of Education (Grant Nos. 6022210111K and 2022ZDZX3071), and by the Post-doctoral Foundation Project of Shenzhen Polytechnic University (Grant No. 6024331021K).
Author information
Authors and Affiliations
Contributions
Conceptualization, Emran Al-Buraihy, Dan Wang; Data curation, Emran Al-Buraihy, T.H.; Formal analysis, Emran Al-Buraihy, R.W.A.; Methodology, Emran Al-Buraihy, Dan Wang, K.Z., A.A.A.; Software, Emran Al-Buraihy; Supervision, Dan Wang and Z.G.; Validation, Emran Al-Buraihy, A.A.A. and T.H.; Visualization, Emran Al-Buraihy; Writing – original draft, R.W.A., K.Z., Z.G.; Writing – review & editing, Dan Wang and Mohammed Saadeldin. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval
We confirm that all methods were carried out in accordance with relevant guidelines and regulations. This paper does not contain any studies involving humans or animals. This study did not require ethical board approval because there was no human involvement in the study.
Competing interests
The authors declare no competing interests.
Statement of Ethics
Hereby, it is consciously assured that for the manuscript “AraTraditions10k: Bridging Cultures with a Comprehensive Dataset for Enhanced Cross-Lingual Image Annotation, Retrieval and Tagging” the following is fulfilled: 1) This material is the authors' original work, which has not been previously published elsewhere. 2) The paper is not currently being considered for publication elsewhere. 3) The paper reflects the authors' research and analysis truthfully and completely. 4) The paper properly credits the meaningful contributions of co-authors and co-researchers. 5) The results are appropriately placed in the context of prior and existing research. 6) All sources used are properly disclosed (correct citation); literal copying of text is indicated as such by quotation marks and proper references. 7) All authors have been personally and actively involved in substantial work leading up to the paper and will take public responsibility for its content.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Al-Buraihy, E., Wang, D., Hussain, T. et al. AraTraditions10k: Bridging cultures with a comprehensive dataset for enhanced cross-lingual image annotation, retrieval and tagging. Sci Rep 15, 19624 (2025). https://doi.org/10.1038/s41598-025-02894-z