Introduction

The development of the World Wide Web in recent years has completely changed the advertising landscape, and online advertising is evolving rapidly [1]. Nearly half of the world's population uses the Internet, and people of every age group are connected to the Web. New platforms, ad types, and targeting capabilities are constantly emerging [2].

In today's world, online advertising is central to the success of a business [3]. Research shows that the use of the Internet to search for information about products and services before purchasing them has increased dramatically [4]. This means that the key role of online advertising in business success cannot be ignored, since most customers review products online. Every entrepreneur and marketer should use this modern advertising tool.

Online advertising can attract customers and ultimately help increase profits. With an online advertising strategy, it is easy to advertise an e-commerce store to a global audience [5], because Internet advertising lets a business reach beyond its geographical ___location and target audiences around the world. There is no need to travel anywhere to grow the business; the Internet makes it easy to communicate with the audience, making the business accessible to millions of customers [6].

Sooner or later, potential customers convert into customers and generate more profit. Digital marketing on the Internet has become a vital part of today's marketing strategies, and advertisers therefore use different methods to deliver targeted advertisements to users and increase click-through rates [7]. User interest modeling is a crucial part of online advertising that aims to understand what users are interested in so that they can be shown relevant advertisements. Effective user interest modeling is essential for online advertising effectiveness: it helps increase the return on investment for advertisers and, in turn, improves the user experience.

However, user interest modeling is a dynamic and complicated task. A user's interests and preferences can change rapidly, shaped by multiple factors, including the user's browsing history, prior searches, and social media activity. User interest modeling methods have mostly relied upon static feature sets and have failed to account for the temporal dynamics of user behavior. They typically describe users according to a static list of features, such as demographics, interests, and behaviors, and do not reflect the dynamic aspect of user engagement.

In addition, with the increasing availability of user interaction data, such as browsing behaviors, search queries, and social media activity, exciting opportunities arise to improve user interest modeling. This data is inherently sequential, with each interaction building on past interactions. Accurately modeling these sequential dependencies is therefore critical for user interest modeling, as it allows more precise prediction of user behavior and preferences.

This research introduces a novel user interest model based on a Bi-LSTM network optimized by an Updated Parrot Optimizer (UPO). The proposed method represents user behavior as a temporal sequence and captures the temporal nature of user interests and preferences.

Related works

Recent developments in deep learning have demonstrated potential in overcoming these challenges, with approaches such as neural networks and recurrent neural networks (RNNs) being investigated for user interest modeling. For example, the work presented in [8] shows that a graph-based approach can be used for recommendation systems, which relates to our work on predicting user interests and reactions in online display advertising using a Bidirectional LSTM optimized by the Updated Parrot Optimizer.

Compared to our Bidirectional LSTM with the Updated Parrot Optimizer for user interest modeling, the results of [9] demonstrated the effectiveness of a generative approach for recommender systems.

A graph cross-correlated recommender network was proposed in [10], which is also related to our work, since we use a Bidirectional LSTM for user interest representation and that paper indicates the effectiveness of graph methodology in recommender systems.

The method introduced in [11], for example, presented unsupervised gradient semantic models to align users, which informed the selection of prediction models for online display advertising from Bidirectional LSTM approaches optimized by the Updated Parrot Optimizer in our study.

Related to our work on user interest modeling, [12] demonstrated the capability of recommendation systems based on pupil morphology and also showed the possibility of using physiological signals for personalized advertisements.

In [13], the authors studied spatiotemporal app usage behavior for user profiling, which is similar to but less effective than a Bidirectional LSTM.

The work in [14] studied finding spatiotemporal patterns of mobile application usage, which is related to our work on predicting users' likes and reactions in online display advertising.

Qu et al. [15] proposed Product-based Neural Networks (PNN) using an embedding layer to learn a distributed representation of the categorical data, fully connected layers to explore high-order feature interactions, and a product layer to capture interactive patterns among inter-field categories. The outcomes demonstrated that the suggested PNN could perform better than competing models: IPNN achieved the best AUC value of 77.79% on the Criteo dataset, and OPNN achieved the best AUC value of 81.74% on the iPinYou dataset.

Gharibshah et al. [16] suggested two deep learning structures, called LSTMip and LSTMcp, for user interest prediction and click prediction modeling. In that study, the pages shown to users were gathered and represented as a temporal sequence, and an LSTM was employed to learn latent attributes that reflect user interest. The outcomes demonstrated that considering the temporal variations and sequences improved ad click and ad response forecasts. LSTMcp achieved 0.3140, 0.5183, 0.3910, 0.7003, and 0.8481 for Precision, Recall, F1-measure, AUC, and Accuracy, respectively. Furthermore, LSTMip achieved values of 0.4015, 0.3447, 0.3845, 0.7624, and 0.3845 for Precision, Recall, F1-measure, AUC, and Accuracy, respectively.

Yang et al. [17] suggested an Operation-aware Neural Network (ONN) that learned separate representations for different operations. Findings on two large real-world ad click/conversion datasets showed that the suggested network outperformed the other networks considered, in both online- and offline-training environments. On the Criteo dataset in the online-training setting, ONN achieved values of 0.43748 ± 7.63e−5, 0.81020 ± 5.41e−5, 0.49823 ± 1.11e−4, and 0.37581 ± 3.80e−5 for Logloss, AUC, Pearson's R, and RMSE, respectively. Moreover, on the Tencent Ad dataset in the online-training setting, ONN achieved values of 0.10594 ± 1.76e−4, 0.82792 ± 9.68e−4, 0.27012 ± 8.93e−4, and 0.15763 ± 1.25e−4 for Logloss, AUC, Pearson's R, and RMSE.

Kim et al. [18] aimed to enhance the prediction of Click-Through Rate (CTR), for which the Deep User Segment Interest Network was offered. Three layers were suggested to improve efficacy: a segment interest activation, a segment interest extractor, and an individual interest extractor. The TaoBao dataset was used to validate the CTR prediction improvement achieved by representing segment interest. The suggested algorithm gained an AUC improvement of 0.0029 with a behavior sequence length of 100, the best enhancement compared to the other baseline approaches.

Zhou et al. [19] suggested a hybrid model called DGRU that combined a Gated Recurrent Unit (GRU) and a Factorization-Machine Based Neural Network (DeepFM). The DeepFM module was responsible for automatically combining features, while the GRU module captured user preferences and their changes over time. The GRU module received a sequence of 1s and 0s representing user clicks, providing insights into user preferences; the concise format also helps prevent overfitting. An investigation using three real datasets showed that the suggested model outperformed existing networks in terms of CTR forecast efficacy and robustness.

Existing methodologies frequently depend on oversimplified models that do not adequately represent the intricate and sequential characteristics of user behavior. For example, numerous techniques treat user behavior as a bag-of-words or employ basic Markov models, which overlook the temporal dependencies and interrelations among user interactions.

Dataset description

In this study, the effectiveness of the model has been evaluated on two datasets, the "Post-View Click Dataset" and the "Multi-Campaign Click Dataset", which are explained below.

Post-view click dataset

This dataset has been specifically created to validate models for predicting binary user clicks. It comprises 5.6 million anonymous user records derived from the log events of a single day, which document a chronological array of request categories indicative of user browsing patterns [16]. The dataset features two categories of user responses: positive (click) and negative, with a positive response occurring when a user clicks on a post following a series of impressions. Nonetheless, the dataset is challenged by a considerable class imbalance, as positive responses are infrequent in digital advertising. To mitigate this issue, we implemented random down-sampling to establish a class distribution of 10% positive and 90% negative samples. This down-sampling mitigates the class imbalance in the training data in order to improve the machine learning model's generalization, although it discards some useful information from the majority class, which may slightly decrease overall predictive accuracy in exchange for greater sensitivity to minority class instances.
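As an illustration, the sketch below shows one way to perform this down-sampling, assuming the records sit in a pandas DataFrame with a binary click column (the column name, target fraction, and seed are illustrative, not the paper's exact pipeline):

```python
import pandas as pd

def downsample_to_ratio(df: pd.DataFrame, label_col: str = "click",
                        pos_frac: float = 0.10, seed: int = 42) -> pd.DataFrame:
    """Randomly down-sample the majority (negative) class so that
    positives make up `pos_frac` of the resulting dataset."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    # Number of negatives needed for the target positive fraction.
    n_neg = int(len(pos) * (1 - pos_frac) / pos_frac)
    neg_sampled = neg.sample(n=min(n_neg, len(neg)), random_state=seed)
    # Recombine and shuffle.
    return pd.concat([pos, neg_sampled]).sample(frac=1.0, random_state=seed)
```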

Multi-campaign click dataset

This dataset contains historical records of user interactions across multiple campaigns, where a positive response is characterized by a user clicking on one of ten different campaign types [20]. The main purpose of this dataset is to validate multi-class user interest prediction models. However, a significant challenge with this dataset is the extreme variability in sequence lengths: a considerable number of sequences are notably short, frequently comprising fewer than three time-steps. To tackle this issue, we devised a method that integrates bucketing and padding, as outlined in Sect. 3.2, to effectively manage the inconsistent sequence lengths and facilitate precise analysis.

Bidirectional LSTM network for user interest and response prediction

A Bi-LSTM (Bidirectional Long Short-Term Memory) network is a class of RNN that pairs two LSTM units, one processing the input from beginning to end and another from end to beginning [21]. With this dual processing ability, the network can collect contextual information from both past and future, making it an ideal candidate for modeling user behavior in online display advertising. The Bi-LSTM network includes two separate LSTM units:

  1. The Forward LSTM (denoted \(LST{M}_{F}\)) processes the input sequence from the beginning to the end, allowing it to learn information about previous events within the input sequence.

  2. The Backward LSTM (denoted \(LST{M}_{B}\)) processes the input sequence in reverse, analyzing it from the end to the beginning, which enables it to acquire information related to future occurrences.

Then the outputs of both LSTMs are concatenated to produce the output. Figure 1 shows the architecture of the Bidirectional LSTM Network.

Fig. 1. Architecture of the bidirectional LSTM network.

Consider \(X = ({x}_{1}, {x}_{2}, \dots, {x}_{n})\) as the input sequence, where each \({x}_{i}\) represents the \(i\)-th feature vector. The Bi-LSTM network analyzes this sequence in both the forward and backward directions.

Forward LSTM

The forward LSTM processes the input sequence from \({x}_{1}\) to \({x}_{n}\). At each time step \(t\), the LSTM cell modifies its hidden state \({h}_{t}\) and cell state \({c}_{t}\) according to the following equations:

$$\begin{array}{l}{h}_{t}=\sigma \left({W}_{h}\times {h}_{t-1}+{U}_{h}\times {x}_{t}+{b}_{h}\right)\\ {c}_{t}=\sigma \left({W}_{c}\times {h}_{t-1}+{U}_{c}\times {x}_{t}+{b}_{c}\right)\end{array}$$
(1)
$${o}_{t}=\sigma \left({W}_{o}\times {h}_{t}+{U}_{o}\times {x}_{t}+{b}_{o}\right)$$
(2)
$${h}_{t}={o}_{t}\times \text{tanh}\left({c}_{t}\right)$$
(3)

where \(\sigma\) denotes the sigmoid function, tanh represents the hyperbolic tangent function, while \({W}_{h}\), \({U}_{h}\), \({W}_{c}\), \({U}_{c}\), \({W}_{o}\), and \({U}_{o}\) are matrices of learnable weights, and \({b}_{h}\), \({b}_{c}\), and \({b}_{o}\) are vectors of learnable biases.

Backward LSTM

The backward LSTM processes the input sequence in reverse, from \({x}_{n}\) to \({x}_{1}\). At each time step \(t\), the LSTM cell updates its hidden state \({h}_{t}\) and cell state \({c}_{t}\) using the following equations:

$${h}_{t}=\sigma \left({W}_{h}\times {h}_{t+1}+{U}_{h}\times {x}_{t}+{b}_{h}\right)$$
(4)
$${c}_{t}=\sigma \left({W}_{c}\times {h}_{t+1}+{U}_{c}\times {x}_{t}+{b}_{c}\right)$$
(5)
$${o}_{t}=\sigma \left({W}_{o}\times {h}_{t}+{U}_{o}\times {x}_{t}+{b}_{o}\right)$$
(6)
$${h}_{t}={o}_{t}\times \text{tanh}\left({c}_{t}\right)$$
(7)

It is essential to recognize the difference in the recurrence relation, where \({h}_{t+1}\) is employed in place of \({h}_{t-1}\).

The outputs generated by both LSTMs are combined through either concatenation or addition to produce the final output.

$${y}_{t}=\left[{h}_{t}^{F}; {h}_{t}^{B}\right] \left(concatenation\right)$$
(8)
$${y}_{t}={h}_{t}^{F}+{h}_{t}^{B}\left(addition\right)$$
(9)

where \({h}_{t}^{F}\) and \({h}_{t}^{B}\) represent the hidden states of the forward and backward LSTMs at time step \(t\), respectively.

The Bi-LSTM network undergoes training by employing an appropriate loss function, which may include mean squared error or cross-entropy, alongside an optimization algorithm, such as stochastic gradient descent or Adam.
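For concreteness, the following minimal Keras sketch assembles such a Bi-LSTM for binary click prediction; the unit count, dropout rate, and input shape are illustrative assumptions rather than the tuned configuration reported later:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_bilstm(seq_len: int, n_features: int, n_units: int = 64,
                 dropout: float = 0.2) -> tf.keras.Model:
    """Bi-LSTM over a sequence of multi-hot category vectors,
    ending in a sigmoid unit for binary click prediction."""
    inputs = layers.Input(shape=(seq_len, n_features))
    # Forward and backward LSTMs run in parallel over the sequence.
    x = layers.Bidirectional(
        layers.LSTM(n_units, dropout=dropout), merge_mode="concat")(inputs)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

The `merge_mode="concat"` argument corresponds to the concatenation of Eq. (8); `merge_mode="sum"` would instead give the addition of Eq. (9).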

To enhance the performance of Bi-LSTM networks through metaheuristics, the selection of an appropriate loss function is essential, given that Bi-LSTM networks are frequently employed for tasks related to predicting user interests and responses.

For prediction tasks, the following weighted cross-entropy has been used. This function measures the performance of the prediction model and is expressed as follows:

$$L=-\frac{1}{n}\sum_{i=1}^{n}\left[w\times {y}_{i}\times \text{log}\left(p\left({x}_{i}\right)\right)+\left(1-{y}_{i}\right)\times \text{log}\left(1-p\left({x}_{i}\right)\right)\right]$$
(10)

where \(n\) denotes the sample quantity, \(w\) specifies the cost associated with positive errors relative to the misclassification of negative instances, \({y}_{i}\) represents the target label taking a value of 0 or 1, and \(p\left({x}_{i}\right)\) signifies the network's predicted value, a probability between 0 and 1 that sample \({x}_{i}\) will ultimately elicit a click response.
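A direct NumPy transcription of Eq. (10) might look as follows; the clipping constant `eps` is an implementation detail added here to avoid log(0):

```python
import numpy as np

def weighted_cross_entropy(y_true: np.ndarray, p_pred: np.ndarray,
                           w: float = 1.0, eps: float = 1e-12) -> float:
    """Weighted binary cross-entropy of Eq. (10): positive-class
    errors are scaled by w relative to negative-class errors."""
    p = np.clip(p_pred, eps, 1.0 - eps)  # keep log() finite
    loss = -(w * y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return float(loss.mean())
```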

When employing a Bi-LSTM network for a prediction task with MAPE as the loss function, it is essential to optimize the network's architecture to achieve minimal MAPE. The following parameters warrant consideration (a search-space sketch follows the list):

  1. Number of Layers (L): The number of layers within the Bi-LSTM network can significantly influence its capacity to discern intricate patterns in the data. While adding layers may enhance performance, it also raises the potential for overfitting.

  2. Number of Units (U): The number of units in each layer plays a crucial role in performance outcomes. More units can enhance the network's learning capacity, yet may also heighten the risk of overfitting.

  3. Input Sequence Length (T): The length of the input sequence determines the extent of contextual information available for the network's predictions. Lengthening the input sequence can enrich the context, although it may also elevate the risk of overfitting.

  4. Output Sequence Length (T'): The length of the output sequence influences the number of predictions generated by the network at each time step. Extending it can yield additional predictions, but may also increase the likelihood of overfitting.

  5. Dropout Rate (D): Dropout serves as a regularization method that mitigates overfitting by randomly omitting units during training. While a higher dropout rate can effectively reduce overfitting, it may also lead to underfitting.

  6. Learning Rate (LR): The learning rate determines the speed at which the network assimilates information from the data. A higher learning rate may facilitate quicker convergence; however, it can also cause the network to overshoot the optimal solution.

  7. Batch Size (BS): The batch size dictates the number of samples used to calculate the gradient at each iteration. A larger batch size can yield more consistent gradients, yet it may also heighten the likelihood of overfitting.

  8. Epochs (E): The number of epochs specifies how many times the network is exposed to the complete training dataset. More epochs can enhance training, but may also elevate the risk of overfitting.

  9. Activation Function (AF): The activation function governs the output of each neuron within the network. Typical activation functions employed in Bi-LSTM networks include ReLU, Tanh, and Sigmoid.

  10. Optimizer (O): The optimizer regulates how the network adjusts its weights throughout training. Frequently used optimizers for Bi-LSTM networks include Adam, RMSProp, and SGD.
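As a sketch of how these hyperparameters can be handed to a metaheuristic such as UPO, the snippet below encodes a candidate solution as a bounded real vector and decodes it into concrete settings; all bounds are illustrative assumptions, and categorical choices such as the activation function or optimizer can be handled by rounding an index in the same way:

```python
# Hypothetical search space: each hyperparameter maps to a (low, high) bound.
SEARCH_SPACE = {
    "layers":     (1, 4),      # L
    "units":      (16, 256),   # U
    "seq_len":    (10, 200),   # T
    "dropout":    (0.0, 0.5),  # D
    "log10_lr":   (-4, -2),    # LR, searched on a log scale
    "batch_size": (16, 256),   # BS
    "epochs":     (5, 50),     # E
}

def decode(x):
    """Map a position vector x in [0, 1]^d to concrete hyperparameters."""
    out = {}
    for xi, name in zip(x, SEARCH_SPACE):
        lo, hi = SEARCH_SPACE[name]
        val = lo + xi * (hi - lo)
        # Integer-valued hyperparameters are rounded; the rest stay real.
        out[name] = round(val) if name in {"layers", "units", "seq_len",
                                           "batch_size", "epochs"} else val
    return out

# Usage: lr = 10 ** decode(x)["log10_lr"]
```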

Updated Parrot Optimizer (UPO)

This section presents the theory and background of the Parrot Optimizer together with its governing equations.

Inspiration

A particular type of parrot, Pyrrhura molinae, is popular among pet lovers due to its fascinating features, close relationship with its owner, and ease of training. Prior studies have shown that this species exhibits four distinct behaviors: fear of strangers, communication, staying, and foraging [22]. These behaviors are the foundations of this optimizer's development and are listed and explained below:

Foraging. The parrots forage in small groups where nutrition is plentiful. They find food by moving toward it, using the presence of the group and the position of their owner, and they refine their search using visual cues and smell.

Staying. The parrots settle at random on different parts of their owner's body.

Communication. These friendly animals use various signals to communicate within their groups, for spreading information and for social purposes.

Fear of strangers. An instinctive fear of strangers is common across all species of these birds. This natural behavior drives them to move away from unfamiliar individuals and to seek shelter close to their owners.

Notably, these behaviors are highly unpredictable, occurring at random within each iteration, which motivated the design of this optimizer.

Mathematical model

Initialization of population

The initialization is formulated by the following equation, where the swarm size is \(N\), the maximum number of iterations is \({\text{max}}_{iter}\), and the lower and upper bounds of the solution space are \(lb\) and \(ub\), respectively:

$${X}_{i}^{0}=lb+rand\left(0, 1\right)\times (ub-lb)$$
(11)

where \(rand(0, 1)\) is a random number between 0 and 1, and \({X}_{i}^{0}\) is the initial position of parrot \(i\).
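Eq. (11) translates directly into NumPy:

```python
import numpy as np

def init_population(N: int, dim: int, lb: np.ndarray, ub: np.ndarray) -> np.ndarray:
    """Eq. (11): uniform random initialization of N parrots within [lb, ub]."""
    return lb + np.random.rand(N, dim) * (ub - lb)
```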

Foraging manner

In PO, when foraging, the birds estimate the food's position by observing it or by taking into account the owner's position, and then flock toward the estimated ___location. Their movement is calculated as:

$${X}_{i}^{t+1}=\left({X}_{i}^{t}-{X}_{best}\right)\times Levy\left(\text{dim}\right)+rand\left(\text{0,1}\right)\times {\left(1-\frac{t}{{Max}_{iter}}\right)}^{\frac{2t}{{Max}_{iter}}}\times {X}_{mean}^{t}$$
(12)

where \({X}_{i}^{t}\) is the current position and \({X}_{i}^{t+1}\) the position after the next update. \({X}_{mean}^{t}\) denotes the mean position of the current population, and \(Levy(dim)\) denotes the Levy distribution, employed to describe the parrots' flight. \({X}_{best}\) is the best position found so far, which also indicates the present ___location of the host, and \(t\) is the current iteration number. The term \(\left({X}_{i}^{t}-{X}_{best}\right)\times Levy\left(\text{dim}\right)\) represents the movement of the parrot relative to its owner, while \(rand\left(\text{0,1}\right)\times {\left(1-\frac{t}{{Max}_{iter}}\right)}^{\frac{2t}{{Max}_{iter}}}\times {X}_{mean}^{t}\) represents observation of the whole population's position to find the orientation of the food.

The mean position of the current swarm, \({X}_{mean}^{t}\), is computed using the following equation:

$${X}_{mean}^{t}=\frac{1}{N}{\sum }_{k=1}^{N}{X}_{k}^{t}$$
(13)

Moreover, the Levy distribution is computed by the formula below, with \(\gamma =1.5\):

$$\left\{\begin{array}{l}Levy\left(dim\right)=\frac{\mu \times \sigma }{{\left|v\right|}^{\frac{1}{\gamma }}} \\ \mu \sim N\left(0,dim\right) \\ v\sim N\left(0,dim\right) \\ \sigma ={\left(\frac{\Gamma \left(1+\gamma \right)\times \text{sin}\left(\frac{\pi \gamma }{2}\right)}{\Gamma \left(\frac{1+\gamma }{2}\right)\times \gamma \times {2}^{\frac{\gamma -1}{2}}}\right)}^{\frac{1}{\gamma }}\end{array}\right.$$
(14)
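The Levy step can be sampled with the standard Mantegna algorithm that Eq. (14) is based on; in this sketch μ is drawn with scale σ and v with unit scale, a common implementation convention:

```python
import numpy as np
from math import gamma, pi, sin

def levy(dim: int, g: float = 1.5) -> np.ndarray:
    """Mantegna-style Levy flight step of Eq. (14), with gamma = 1.5."""
    sigma = (gamma(1 + g) * sin(pi * g / 2)
             / (gamma((1 + g) / 2) * g * 2 ** ((g - 1) / 2))) ** (1 / g)
    mu = np.random.normal(0.0, sigma, dim)
    v = np.random.normal(0.0, 1.0, dim)
    return mu / np.abs(v) ** (1 / g)
```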

Staying manner

These parrots are so friendly that they suddenly fly to any part of their owner's body and stay there for a while. This process is modeled as:

$${X}_{i}^{t+1}={X}_{i}^{t}+{X}_{best}\times Levy\left(\text{dim}\right)+rand\left(\text{0,1}\right)\times ones(1,\text{dim})$$
(15)

where \(ones(1,dim)\) is the all-ones vector of dimension \(dim\), \({X}_{best}\times Levy(dim)\) represents the flight toward the host, and \(rand\left(\text{0,1}\right)\times ones(1,dim)\) depicts the random stop at a part of the host's body.

Communication manner

The parrots are naturally social creatures known for close interaction within their groups. Their communication includes both communicating while staying in one place and flying to the flock. In this optimizer, both behaviors are assumed to occur with equal likelihood, and the mean position of the current population represents the center of the flock. This is modeled as:

$${X}_{i}^{t+1}=\left\{\begin{array}{l}0.2\times rand\left(\text{0,1}\right)\times \left(1-\frac{t}{{Max}_{iter}}\right)\times \left({X}_{i}^{t}-{X}_{mean}^{t}\right), P\le 0.5\\ 0.2\times rand\left(\text{0,1}\right)\times \text{exp}\left(-\frac{t}{rand\left(\text{0,1}\right)\times {Max}_{iter}}\right), P>0.5\end{array}\right.$$
(16)

where \(0.2\times rand\left(\text{0,1}\right)\times \left(1-\frac{t}{{Max}_{iter}}\right)\times \left({X}_{i}^{t}-{X}_{mean}^{t}\right)\) represents an individual joining a group of parrots to communicate, and \(0.2\times rand\left(\text{0,1}\right)\times \text{exp}\left(-\frac{t}{rand\left(\text{0,1}\right)\times {Max}_{iter}}\right)\) represents an individual flying away after communicating. The two behaviors are selected using a randomly generated \(P\) in the range 0 to 1.

Fear of stranger manner

Birds typically show an innate fear of unfamiliar individuals, and these parrots are no different. Their tendency to keep their distance from unknown individuals and to seek security with their owners is modeled as:

$${X}_{i}^{t+1}={X}_{i}^{t}+rand\left(\text{0,1}\right)\times \text{cos}\left(0.5\pi \times \frac{t}{{Max}_{iter}}\right)\times \left({X}^{best}-{X}_{i}^{t}\right)-\text{cos}\left(rand\left(\text{0,1}\right)\times \pi \right)\times {\left(\frac{t}{{Max}_{iter}}\right)}^{\frac{2}{{Max}_{iter}}}\times ({X}_{i}^{t}-{X}_{best})$$
(17)

where \(rand\left(\text{0,1}\right)\times \text{cos}\left(0.5\pi \times \frac{t}{{Max}_{iter}}\right)\times \left({X}^{best}-{X}_{i}^{t}\right)\) represents the reorientation toward the owner, and \(\text{cos}\left(rand\left(\text{0,1}\right)\times \pi \right)\times {\left(\frac{t}{{Max}_{iter}}\right)}^{\frac{2}{{Max}_{iter}}}\times ({X}_{i}^{t}-{X}_{best})\) represents the process of moving away from strangers.
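Putting the four behaviors together, one PO iteration can be sketched as below. The uniform random choice among the four updates follows the randomness described above; the `levy()` helper is the sketch from the foraging subsection, and the bound clipping is a standard practice assumed here rather than specified in the text (the communication updates are transcribed literally from Eq. (16)):

```python
import numpy as np

def po_step(X, X_best, t, max_iter, lb, ub):
    """One PO iteration: each parrot applies one of the four behaviors
    (Eqs. 12, 15, 16, 17), chosen uniformly at random."""
    N, dim = X.shape
    X_mean = X.mean(axis=0)                       # Eq. (13)
    X_new = X.copy()
    r = np.random.rand
    for i in range(N):
        b = np.random.randint(4)
        if b == 0:    # foraging, Eq. (12)
            X_new[i] = ((X[i] - X_best) * levy(dim)
                        + r() * (1 - t / max_iter) ** (2 * t / max_iter) * X_mean)
        elif b == 1:  # staying, Eq. (15)
            X_new[i] = X[i] + X_best * levy(dim) + r() * np.ones(dim)
        elif b == 2:  # communication, Eq. (16)
            if r() <= 0.5:
                X_new[i] = 0.2 * r() * (1 - t / max_iter) * (X[i] - X_mean)
            else:
                X_new[i] = 0.2 * r() * np.exp(-t / (r() * max_iter))
        else:         # fear of strangers, Eq. (17)
            X_new[i] = (X[i]
                        + r() * np.cos(0.5 * np.pi * t / max_iter) * (X_best - X[i])
                        - np.cos(r() * np.pi) * (t / max_iter) ** (2 / max_iter)
                        * (X[i] - X_best))
    return np.clip(X_new, lb, ub)
```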

Updated Parrot Optimizer (UPO)

To enhance the exploration and exploitation capabilities of the Parrot Optimizer (PO), we propose several modifications, including adjustments to agent initialization, the design of triangular topological units, and the local aggregation function. The main goal is to establish a balanced strategy that broadens the search space (exploration) while concurrently concentrating on the most promising solutions (exploitation).

To increase randomness and diversity in population initialization, we recommend shifting from uniform random sampling to more advanced distributions: a quasi-random sequence such as the Sobol sequence promotes a more even spread of the initial agents. The fear-of-strangers update of Eq. (17) is likewise modified by replacing \(rand(0,1)\) with a Sobol-based factor \(w\left(i,D\right)\):

$${X}_{i}^{t+1}={X}_{i}^{t}+w\left(i,D\right)\times \text{cos}\left(0.5\pi \times \frac{t}{{Max}_{iter}}\right)\times \left({X}^{best}-{X}_{i}^{t}\right)-\text{cos}\left(w\left(i,D\right)\times \pi \right)\times {\left(\frac{t}{{Max}_{iter}}\right)}^{\frac{2}{{Max}_{iter}}}\times ({X}_{i}^{t}-{X}_{best})$$
(18)

The Sobol sequence function (denoted \(w(i,D)\)) produces the \({i}^{th}\) vector of \(D\) quasi-random coordinates. Each coordinate is a fractional value between 0 and 1 indicating the position of a point within a unit hypercube along one dimension of the Sobol sequence.

$$w\left(i,D\right)= \left(\frac{9\times exp\left(-\beta \times \left(st{d}_{dev}\left(t\right)\right)\right)}{1+exp\left(\alpha \times \left(t -\frac{T}{2}\right)\right)}\right)$$
(19)

The term \(st{d}_{dev}\left(t\right)\) represents the standard deviation of the solutions at iteration \(t\), which provides a good indication of search diversity. The parameters \(\alpha\) and \(\beta\) define the shape of the logistic function.
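A direct reading of Eq. (19) in code, with illustrative values assumed for α and β:

```python
import numpy as np

def w_factor(t: int, T: int, std_dev_t: float,
             alpha: float = 0.1, beta: float = 1.0) -> float:
    """Adaptive weight of Eq. (19): shrinks as population diversity
    (std_dev_t) drops and as iterations t pass the midpoint T/2."""
    return 9 * np.exp(-beta * std_dev_t) / (1 + np.exp(alpha * (t - T / 2)))
```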

Validate the algorithm

This section illustrates how the proposed UPO algorithm outperforms other metaheuristic algorithms based on convergence curves for some benchmark functions from 2018 IEEE Congress on Evolutionary Computation (CEC2018). The curves illustrate relative speed and accuracy of convergence towards the optimal solution for each approach through iterative evaluations. We can evaluate both the convergence speed and stability by plotting the average fitness values against the iterations. This plot provides a visual depiction of the quantitative results shared above and illustrates UPO’s prowess in reaching optimal performance more quickly and reliably, reinforcing its viability for optimization in complex scenarios such as deep learning hyperparameter adjustment. Figure 2 shows the comparative convergence performance.

Fig. 2. Sample comparative convergence performance: (A) F1, and (B) F3.

As shown in the convergence analysis, the UPO outperforms the competing optimizers, including Lévy Flight Distribution (LFD) [23], Harris Hawks Optimization (HHO) [24], Equilibrium Optimizer (EO) [25], World Cup Optimization (WCO) [26], and Tunicate Swarm Algorithm (TSA) [27], in every scenario. UPO performs strongly in the early iterations, indicating that it identifies promising regions of the search space faster (the evolution of fitness values is visibly sharper early on). Figure 2A compares the approaches on function F1, where UPO reaches an average fitness value of 4.22 in 200 iterations, while the average fitness values of LFD and HHO stabilize earlier at about 8.67 and 14.29, respectively. Likewise, on function F3, UPO converged better than both EO (12.55) and TSA (16.69), with a final fitness of 3.75. These observations confirm that UPO achieves a better balance of exploration and exploitation. The relatively smoother descent of UPO's curves suggests better stability, with fewer fluctuations and less risk of premature convergence or entrapment in local optima.

Discussions about enhancing the problem by the proposed UPO

UPO introduces several important changes, such as preferring more advanced distributions (the Sobol sequence) over a uniform distribution for agent initialization, which increases randomness and diversity in the population initialization phase (although the effect of this change on convergence speed and solution quality deserves further analysis). The exploration–exploitation trade-off is also addressed through the design of triangular topological units and the local aggregation function, which balance exploratory and exploitative behavior; how these elements help avoid premature convergence or entrapment in local optima specifically for user interest modeling is, however, discussed only briefly. Moreover, while the governing equations are presented throughout the text, a broader explanation of how each contributes to the algorithm's overall efficiency and robustness would consolidate understanding. These improvements are plausibly helpful for sequential data modeling tasks (e.g., capturing temporal dependencies between users' behaviors), but a concrete example or visualization connecting the modifications to the performance gains shown in the experiments would make the case clearer. Broadening these points would reinforce the rationale for UPO and provide additional insight into its application to similar deep learning optimization problems.

Problem definition

Consider a group of users, referred to as \(U\), which includes \({u}_{1}, {u}_{2}, {u}_{3},\dots , {u}_{n}\), alongside a set of occurrences, denoted as \(R\). Each occurrence, represented as \({r}_{{u}_{j}}^{{t}_{i}}\), indicates the presentation of an advertisement to a specific individual \({u}_{j}\) within a defined context at a particular time \({t}_{i}\), and is represented as a real-valued vector (\({r}_{{u}_{j}}^{{t}_{i}}\in {\mathbb{R}}^{d}\)). In display advertising, the term "context" refers to the webpage accessed by the individual, characterized by a hierarchical arrangement of webpage category IDs that offer varying levels of detail.

Let \({\mathbb{C}}\) symbolize the established set of webpage categories, comprising \({c}_{1}, {c}_{2},..., {c}_{|{\mathbb{C}}|}\), where \(|{\mathbb{C}}|\) signifies the total number of categories [16]. For a webpage accessed by individual \({u}_{j}\) at time step \({t}_{i}\), the webpage categories can be written as an array, such as \([{c}_{1}, {c}_{2}, {c}_{3},\dots ]\). For each individual \({u}_{j}\in U\), we compile the sequential order of webpages visited, denoted as \({r}_{{u}_{j}}=\left[{r}_{{u}_{j}}^{{t}_{1}}, {r}_{{u}_{j}}^{{t}_{2}},\dots , {r}_{{u}_{j}}^{{t}_{m}}\right]\). Given the differences in the number of websites visited by individuals, we represent \({r}_{{u}_{j}}\in {\mathbb{R}}^{m\times d}\), where \(m\) indicates the maximum sequence length.

Consequently, with the historical data of all individuals represented as \(R=[{r}_{{u}_{1}}, {r}_{{u}_{2}},\dots , {r}_{{u}_{n}}]\), where \(R\in {\mathbb{R}}^{n\times m\times d}\) and \(d<|{\mathbb{C}}|\), the aim is to use each individual's historical activity, as a sequential record of requests leading up to an arbitrary time step \({t}_{i}\), for two tasks: user engagement forecasting, which estimates the likelihood that a user will interact with an advertisement at time step \({t}_{i}\) by producing a click response (a binary classification problem), and user preference prediction, which predicts the category of campaign advertisement that a user is inclined to click on (a multi-class classification problem).
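The following sketch illustrates the resulting data layout, with illustrative sizes assumed for \(n\), \(m\), and \(d\):

```python
import numpy as np

n, m, d = 1000, 50, 128   # illustrative: users, max sequence length, feature size
R = np.zeros((n, m, d), dtype=np.float32)   # R in R^{n x m x d}

def encode_visit(category_ids, d):
    """Multi-hot vector r_{u_j}^{t_i} for one page visit."""
    v = np.zeros(d, dtype=np.float32)
    v[category_ids] = 1.0
    return v

# User u_1, first time step: a page in categories c_4 and c_18 (0-based 3, 17).
R[0, 0] = encode_visit([3, 17], d)
```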

User modeling

The online display advertising environment presents two major challenges for deep learning methodologies such as our proposed Bi-LSTM/UPO model: the inherent complexity of user data and the variability in user interactions. Specifically, the data is marked by two primary issues: a multitude of categorical attributes, as a single webpage may fall into several categories, and variation in sequence lengths, as users' online activities and responses differ in duration and frequency over time.

Consequently, this leads to a dataset consisting of sequences of varying lengths, where each time step may include its own set of features. In our model, this complexity has been addressed by integrating historical data on user behavior, which encompasses page category IDs for the webpages a user has visited within a designated timeframe. This information is represented as a variable-length array of category IDs, indicated as \([{c}_{1}, {c}_{2},\dots ]\), as illustrated in Table 1.

Table 1 The sequential input data used for user modeling.

The methodology conceptualizes individual audience members and their behaviors as intricate, multi-dimensional temporal sequences, with each sequence consisting of a series of events that unfold over time. Each entry in the table represents a distinct audience member, while the columns \({t}_{1}, {t}_{2}, {t}_{3},\dots , {t}_{n}\) illustrate the chronological progression of events within their sequence. The temporal relationships among events are defined such that if \(i\) is greater than \(j\), the event at \({t}_{i}\) follows the event at \({t}_{j}\).

The categorical labels \({c}_{1}, {c}_{2}, {c}_{3},\dots ,{c}_{m}\) signify the Interactive Advertising Bureau (IAB) tier 2 page category linked to the web page accessed by the audience member at each time point. The presence of a click icon denotes that an Ad click event took place, although not all sequences result in such an event.

This table offers insight into the sequential input data employed for user modeling, emphasizing the intricacies involved in managing diverse and dynamic user behavior data.

Challenges

Sequences of varying lengths pose a significant challenge for our machine learning model. A common method is to pad the shorter sequences with zeros to match the length of the longer ones; however, this can be computationally intensive and may introduce bias into the outcomes.

Instead, we present an approach that combines padding and bucketing. This (a) preserves the temporal information present in the sequences and (b) minimizes the number of padding symbols.

We first examine the length of the sequences and find that most are quite short, with the majority under 100, fitting a power-law distribution. This insight lets us segregate the sequences into buckets based on their lengths, zero-padding the shorter sequences in each bucket to match the longest sequence in that bucket. We use pre-padding, which places the zeros at the beginning of the sequence rather than the end, since the important information is usually found at the end and we want to preserve it.

Consequently, we establish several buckets within our training samples, ensuring that sequences in each bucket share similar lengths that correspond to the dataset’s length range.

Each sample is allocated to a specific bucket based on its length, and padding is applied solely to accommodate the sample within the bucket, thereby avoiding excessive zero-padding. Furthermore, an ensemble learning strategy has been adopted, which entails training multiple models on various representations of the data. This is achieved by trimming the sequences to different lengths and subsequently training the proposed Bi-LSTM/UPO model for each representation. Ultimately, we merge the outputs of all models through a majority voting technique, enhancing the accuracy of our predictions while mitigating the risk of overfitting.
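A minimal sketch of the bucketing and pre-padding step, with illustrative bucket edges; sequences longer than the largest bucket are truncated from the front so their most recent steps survive:

```python
import numpy as np

def bucket_and_prepad(sequences, bucket_edges=(10, 25, 50, 100)):
    """Assign each sequence to a length bucket and pre-pad it with zeros
    up to that bucket's upper edge, preserving the sequence tail."""
    buckets = {edge: [] for edge in bucket_edges}
    for seq in sequences:
        for edge in bucket_edges:
            if len(seq) <= edge:
                padded = [0] * (edge - len(seq)) + list(seq)
                buckets[edge].append(padded)
                break
        else:
            # Longer than the largest bucket: keep only the most recent steps.
            buckets[bucket_edges[-1]].append(list(seq[-bucket_edges[-1]:]))
    return {k: np.array(v) for k, v in buckets.items() if v}
```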

In Fig. 3, the lengths of user sequences have been presented for two categories of click users, defined as those who have engaged in a click event, and non-click users, who have not participated in any click events.

Fig. 3. The distribution of sequence lengths for (A) users who clicked and (B) users who did not click.

In Fig. 3, the horizontal axis denotes the length of the user interaction sequences, while the vertical axis shows how often those sequence lengths appear. For both clickers and non-clickers, the distribution follows a power law: short user sequences are abundant, and long user sequences are rare.

The results indicate that the sequence lengths for both click and non-click users adhere to a power-law distribution, suggesting that a significant proportion of user sequences are relatively short, specifically those with lengths below 100.

One of the main weaknesses of the proposed model is its high computational complexity, since it relies on Bi-LSTM networks and the UPO optimization process; it also requires robust hardware and numerous preprocessing steps that add overhead. Random under-sampling for class imbalance causes some information loss and can reduce predictive accuracy, especially with respect to the majority class. While bucketing and padding handle variable sequence lengths, they can still introduce bias, e.g., when many short sequences are zero-padded and the padding dilutes patterns. The study focuses on models that learn from user interaction data sequentially, so it may struggle with noisy or missing data. Generalization to contexts beyond the datasets used remains to be explored, and the limited explainability of the Bi-LSTM/UPO framework may hinder its adoption in sectors where interpretability is critical. These limitations suggest that lighter-weight architectures, additional methods for handling imbalanced datasets, and, in some cases, ___domain adaptation should be explored further.

Another problem in this context is that we possess a diverse array of page categories, such as "sports", "news", and "entertainment", which we intend to use as features to characterize each page visited by a user. The challenge arises from the fact that each page may belong to multiple categories, and the number of categories varies across pages.

We address this problem by employing "one-hot encoding", which transforms these categories into a binary representation: a vector of 0s and 1s that maps each category to a specific position in the vector. If a page is associated with a particular category, the relevant position is marked with a 1; otherwise, it is marked with a 0. For instance, if a user accesses a page at time step \({t}_{1}\) that includes the categories "sports" and "news", we would represent this as a binary vector such as \([0, 1, 0, 1, 0, 0,..., 0]\), where the positions marked 1 correspond to "sports" and "news". The main downside to this method is that it generates very long vectors of 0s and 1s, which can be computationally intensive to process, particularly when dealing with a large number of categories.

An alternative solution for this problem is to implement a strategy that retains only the most frequently occurring categories. We rank the categories based on their frequency of appearance and retain only those that exceed a specified threshold. This approach effectively reduces the number of groups we must manage, enhancing efficiency.

Consider it this way: envision a large container filled with various colored balls, each representing a category. One-hot encoding is similar to forming an extensive sequence of 0s and 1s to indicate which balls are present in the container. However, when faced with an overwhelming number of balls, managing them becomes challenging. Therefore, we adopt the alternative method to retain only the most prevalent balls while disregarding the others.

As mentioned, we employed one-hot encoding, a key preprocessing approach that converts categorical webpage data into binary vectors suitable for input into the Bi-LSTM network. One-hot encoding creates a position for each category, set to 1 or 0 depending on whether the category is present. If a page belongs to the categories "sports" and "news", for instance, the corresponding positions in the binary vector are 1 and all the rest are 0. This can result in very long binary vectors when there are many categories, affecting computational efficiency as well as the training and performance of the model. As a remedy, we retain only the categories that occur frequently, meaning at least 1,000 times within the dataset. This reduces the dimension of the input data and consequently the computational burden of the algorithm, without losing relevant information. By concentrating on the categories of web pages that users encounter most, the model can better capture the patterns in user clicks and interests. This preprocessing step significantly impacts the final results by balancing model complexity against performance, ultimately allowing the Bi-LSTM/UPO framework to reach higher accuracy and generalizability in both binary and multi-class classification.
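A compact sketch of this frequency-thresholded multi-hot encoding (the toy data and `min_count=1` are for demonstration; the paper's threshold is 1,000):

```python
from collections import Counter

def build_vocab(pages, min_count=1000):
    """Keep only categories appearing at least `min_count` times."""
    counts = Counter(c for cats in pages for c in cats)
    kept = sorted(c for c, n in counts.items() if n >= min_count)
    return {c: i for i, c in enumerate(kept)}

def multi_hot(cats, vocab):
    v = [0] * len(vocab)
    for c in cats:
        if c in vocab:          # infrequent categories are dropped
            v[vocab[c]] = 1
    return v

vocab = build_vocab([["sports", "news"], ["news"], ["finance"]], min_count=1)
print(multi_hot(["sports", "news"], vocab))  # [0, 1, 1]: finance, news, sports
```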

User click and user interest prediction framework

Figure 4 provides a concise overview of the framework we propose for addressing the challenges of user click and user interest prediction.

Fig. 4. A concise overview of the framework.

The user click prediction task is treated as a binary classification problem, while user interest prediction is approached as a multi-class classification task aimed at categorizing the number of clicks across ten distinct advertising campaigns.

We bucket the dataset's sequence lengths uniformly into a set number of buckets. The samples in each bucket are then pre-padded and truncated to that bucket's sequence length to create one data representation per bucket, and predictions are made using ensemble learning. The objective function in the user interest prediction framework is the unweighted categorical cross-entropy, i.e., Eq. (10) with \(w = 1\), where \(p\left({x}_{i}\right)\) is the network output after the Softmax layer.
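The majority-voting step over the per-bucket models can be sketched as follows, assuming each model emits integer class labels:

```python
import numpy as np

def majority_vote(predictions):
    """Combine per-model class predictions by majority voting.
    `predictions` has shape (n_models, n_samples)."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # For each sample (column), pick the most frequent predicted class.
    return np.apply_along_axis(
        lambda col: np.bincount(col, minlength=n_classes).argmax(),
        axis=0, arr=predictions)
```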

Simulation results

In this section, the simulation results of the proposed methodology are provided. We performed an extensive set of simulations to assess the performance of our approach for user click and user interest prediction, using the "Post-View Click Dataset" and the "Multi-Campaign Click Dataset". Training and validation of the Bi-LSTM/UPO model were run on an NVIDIA GeForce RTX 3060 Laptop GPU to ensure adequate graphics processing performance, on a machine with an Intel Core i7-11260H hexa-core processor with a base clock speed of 2.60 GHz. With 6 GB of VRAM and 32 GB of system memory, this setup managed large datasets well. The system ran Windows 11 and used MATLAB R2019b for the deep learning jobs.

Different approaches, including our proposed method, were used for performance comparison. The data was split into a training set and a validation set with a ratio of 80:20. At the preprocessing level, the input sequential data with multiple categorical campaign IDs was one-hot encoded into binary vectors, discarding infrequent categories: we applied a threshold on the distribution of categorical campaign IDs, retaining only those that appeared more than 1,000 times in our dataset. Fivefold cross-validation was used to assess all the experiments.

Algorithm validation

The purpose of this section is to assess the performance of the newly designed Updated Parrot Optimizer (UPO) through a set of controlled experiments performed in a standardized testing framework. To evaluate its optimization abilities, the UPO algorithm is tested on the CEC-BC-2017 benchmark suite and compared against five other prominent metaheuristic algorithms:

  • Lévy flight distribution (LFD) [23]

  • Harris Hawks Optimization (HHO) [28]

  • Equilibrium Optimizer (EO) [25]

  • World Cup Optimization (WCO) [29]

  • Tunicate Swarm Algorithm (TSA) [27]

To obtain a complete and thorough assessment, every algorithm will be run for 20 executions over a range of testing functions. The detailed parameter configurations of every algorithm used in this study are summarized in Table 2.

Table 2 The specific parameter configurations utilized for each algorithm.

Statistical measures, namely means and standard deviations, were used to evaluate our UPO algorithm against the competing methods. Table 3 summarizes the comparison results between the proposed UPO and the other optimization algorithms.

Table 3 The results of the comparison between the proposed UPO and other optimization algorithms.

The results in Table 3 depict the superiority of the suggested UPO over the five other metaheuristic algorithms: Lévy flight distribution (LFD), Harris Hawks Optimization (HHO), Equilibrium Optimizer (EO), World Cup Optimization (WCO), and Tunicate Swarm Algorithm (TSA).

Across the 20 test functions, UPO outperformed the competing algorithms in all but 5 instances, achieving the lowest average results on 15 functions. The average values achieved by UPO on F1, F3, F5, F7, and F9 were 53.2%, 72.6%, 16.9%, 12.8%, and 41.1% lower, respectively, than those of the next best-performing algorithm.

Moreover, the UPO exhibited smaller standard deviations on 12 of the 20 functions, indicating its robustness and stability. The findings suggest that the UPO is an advanced optimizer that can handle complex optimization problems more efficiently and with higher precision and accuracy than the other algorithms considered. Overall, the UPO algorithm is significantly better than the competing methods, making it a good candidate for practical optimization problems.

Applying the proposed UPO to the loss in Eq. (10) yields the optimal structure for our network, shown in Table 4.

Table 4 Comparative results of the optimal values.

The minimum MAPE value of 0.118, the optimal point for the Bi-LSTM network, was obtained with the UPO metaheuristic. The LFD metaheuristic recorded a MAPE of 0.125, marginally greater than UPO's. Conversely, the HHO metaheuristic showed the highest MAPE at 0.142, indicating the least favorable solution of the three metaheuristics.

Analyzing indexes

To evaluate the effectiveness of the proposed Bi-LSTM/UPO model, six measurement indicators have been used: Accuracy, Precision, Sensitivity, Specificity, Positive Predictive Value, and Negative Predictive Value.

Accuracy refers to the ratio of correctly predicted instances to the total number of instances. Precision indicates the ratio of true positives to the total number of predicted positive instances. Sensitivity, also known as recall, measures the ratio of true positives to the total number of actual positive instances. Specificity is defined as the ratio of true negatives to the total number of actual negative instances. Positive predictive value (PPV) is defined as the proportion of true positives among all predicted positives, and negative predictive value (NPV) as the proportion of true negatives among all predicted negatives. The indicators are defined mathematically below.

$$Accuracy=\frac{tp+tn}{tp+tn+fp+fn}\times 100$$
(20)
$$Precision=\frac{tp}{tp+fp}\times 100$$
(21)
$$Sensitivity=\frac{tp}{tp+fn}\times 100$$
(22)
$$Specificity=\frac{tn}{tn+fp}\times 100$$
(23)
$$Positive\, Predictive\, Value\, (PPV)=\frac{tp}{tp+fp}\times 100$$
(24)
$$Negative\, Predictive\, Value\, (NPV)=\frac{tn}{tn+fn}\times 100$$
(25)

where \(tn\), \(fn\), \(tp\), and \(fp\) represent true negative, false negative, true positive, and false positive, respectively.
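These six indicators follow directly from the confusion-matrix counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs. (20)-(25) expressed over confusion-matrix counts."""
    return {
        "accuracy":    100 * (tp + tn) / (tp + tn + fp + fn),
        "precision":   100 * tp / (tp + fp),          # equals PPV
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
        "ppv":         100 * tp / (tp + fp),
        "npv":         100 * tn / (tn + fn),
    }
```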

Comparative analysis

For extensive validation, the effectiveness of the proposed Bi-LSTM/UPO has been compared with five recent methods: Product-based Neural Networks (PNN) [15], LSTMip and LSTMcp (LSTMicp) [16], Operation-aware Neural Network (ONN) [17], Deep User Segment Interest Network (DUSIN) [18], and the combined Gated Recurrent Unit and Factorization-Machine Based Neural Network (DGRU) [19].

Post-view click dataset

The findings from the fivefold comparative analysis of the Post-View Click Dataset are detailed in Table 5, which offers an extensive summary of the performance of both the proposed model and the baseline models across various evaluation metrics.

Table 5 Fivefold comparative analysis of the Post-View Click Dataset.

As the results show, the Bi-LSTM/UPO model outperforms all the baseline models. It reached an accuracy of 0.873, outperforming the second-best model, DGRU, by a margin of 0.009. For precision, the Bi-LSTM/UPO scores 0.871, 0.009 above DGRU. In terms of sensitivity, the Bi-LSTM/UPO model achieves 0.893, indicating an ability to retrieve 89.3% of positive samples and exceeding the sensitivity of DGRU by 0.010. The model also achieves a specificity of 0.859, meaning that it correctly identifies negative instances 85.9% of the time.

Exploring the results further, DGRU's PPV is 0.845, showing that 84.5% of the instances it predicts as positive are positive, while the Bi-LSTM/UPO model achieved a PPV of 0.856, 0.011 higher than DGRU. Finally, the model reached an NPV of 0.911, meaning that 91.1% of instances predicted as negative are indeed negative.

These results further suggest the effectiveness of the Bi-LSTM/UPO model in forecasting the post-view click behavior of a user. Compared to baseline models, its performance across different metrics testifies to the robustness and reliability of the method in modeling user behavior in online advertising.

Multi-campaign click dataset

The findings from the fivefold comparative analysis of the Multi-Campaign Click Dataset are detailed in the subsequent tables, which offer an extensive summary of the performance of both the proposed model and the baseline models across various evaluation metrics (see Table 6).

Table 6 Fivefold comparative analysis of the Multi-Campaign Click Dataset.

The results indicate that on this dataset as well, the Bi-LSTM/UPO model exceeds all baseline models across the performance metrics. Notably, the Bi-LSTM/UPO model records an accuracy of 0.870, exceeding the accuracy of the next highest model, DGRU, by 0.013. In terms of precision, the Bi-LSTM/UPO model achieves a score of 0.869, which is 0.014 higher than that of DGRU.

Regarding sensitivity, the Bi-LSTM/UPO model demonstrates a value of 0.891, signifying its ability to accurately identify 89.1% of positive instances, surpassing DGRU’s sensitivity by 0.018. The model also exhibits a specificity of 0.859, indicating that it correctly identifies 85.9% of negative instances. Furthermore, the Bi-LSTM/UPO model achieves a PPV of 0.862, meaning that 86.2% of instances predicted as positive are indeed positive, which is 0.015 higher than DGRU’s PPV.

Finally, the model reaches an NPV of 0.915, meaning that 91.5% of cases predicted as negative are truly negative. The results show that Bi-LSTM/UPO outperforms all the other models tested across all six evaluation metrics on both datasets: it produces the best values for accuracy, precision, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). The proposed model produces highly accurate results thanks to its ability to capture the temporal dynamics of user behavior and interests in the optimized Bi-LSTM network complemented with the UPO algorithm.

In comparison, the other models, while demonstrating commendable performance, exhibit limitations in their ability to capture the sequential nature of user interactions. For instance, PNN and LSTMip depend on static feature sets, which may not sufficiently represent the temporal fluctuations in user behavior. Although LSTMcp and ONN take the sequential aspect into account, their effectiveness is constrained by their inability to adequately capture long-term dependencies. Meanwhile, the Deep User Segment Interest Network and DGRU show improved performance; however, their complexity and restricted capacity to capture temporal dynamics impede their effectiveness relative to the Bi-LSTM/UPO model.

Conclusions

Online digital advertising has become an integral part of modern marketing campaigns, and advertisers rely on modeling user interests accurately to show users relevant advertisements and improve click-through rates. Yet, despite its importance, user interest modeling is not an easy task, owing to the complexity and dynamics of user behavior. Traditional techniques based on static feature sets have overlooked temporal changes and shifts in user behavior. These deficiencies have motivated new user interest modeling methods that are both accurate and efficient, since their absence leads to less effective ad targeting and reduced return on investment for advertisers.

This study presented a novel approach to user interest modeling by enhancing the structure of a Bidirectional Long Short-Term Memory (Bi-LSTM) network with the aid of an Updated variant of the Parrot Optimizer (UPO). By representing user behavior as a temporal sequence, the method effectively captures the dynamic characteristics of user interests and preferences. The experimental results validated the efficacy of the approach in two primary tasks: forecasting the likelihood of a user clicking on an advertisement and estimating the probability of a user engaging with a specific type of ad campaign.

In real-world advertising systems, the proposed model could have a profound impact on user interest modeling and ad targeting. Its ability to capture the temporal dynamics and sequential dependencies of user behavior allows advertisers to deliver more personalized, context-aware advertisements, with higher click-through rates and user engagement. In practice, the system would be plugged into an existing ad-serving stack and consume just-in-time user interaction records (browsing visits (IP, date, time), query URLs, social network posts) to build a profile of responses and preferences dynamically. Despite its potential, implementing such a system is not without challenges, especially regarding the infrastructure required to process vast amounts of data efficiently and to perform model inference at scale, given the complexity of training deep learning models such as Bi-LSTM networks. Real-time prediction in low-latency environments requires careful optimization of the model architecture and hardware resources, which may call for high-performance GPUs or cloud-based solutions. There are also data privacy concerns, as the model requires a large amount of user interaction data, so it must comply with regulations such as GDPR or CCPA and implement secure data handling practices. Nevertheless, the power of state-of-the-art sequence models of the kind used in Natural Language Processing (NLP), combined with the abundance of scalable cloud computing resources and improvements in predicting which users will show interest in a particular advertisement, could drive substantial improvements in online advertising systems. The results contribute to the refinement of more effective user interest modeling techniques, with potential implications for online advertising, recommendation systems, and personalized marketing.
Exploring attention mechanisms in conjunction with the Bi-LSTM architecture is a possible avenue for future work, as attention has been shown to improve models in ways a bidirectional recurrent architecture alone cannot, by allowing focus on the most salient temporal interactions. Furthermore, deploying this framework in live online advertising systems would help assess its scalability and adaptability in rapidly changing, high-throughput environments.