Introduction

Cultural heritage (CH) has increasingly been recognized as a vital element of sustainability. Traditionally, sustainability debates have focused primarily on ecological and physical aspects—such as environmental conservation and urban infrastructure—while often overlooking the multifaceted contributions of CH. UNESCO has long underscored the significance of CH in sustainable development, as evidenced by the 2011 Recommendation on the Historic Urban Landscape (HUL)1 and its incorporation into the UN 2030 Agenda for Sustainable Development2. These frameworks demonstrate that CH not only bolsters environmental and economic sustainability but also plays a crucial role in promoting social equity and reinforcing cultural identity.

Many urban CH sites have been integrated into city systems, functioning as essential components that provide Cultural Ecosystem Services (CES) and enhance the well-being of urban residents3,4,5,6. Through the 1972 World Heritage Convention, UNESCO has spearheaded global CH conservation by designating 897 sites worldwide, and numerous countries and regions have also implemented policies to protect both tangible and intangible CH7. Despite these efforts, urbanization, commercialization, and tourism development continue to challenge CH preservation8. Overuse and rapid urban expansion threaten the integrity of many heritage sites, underscoring the need for a balanced approach between heritage protection and sustainable development9.

CH attracts higher tourist flows and augments local commercial value, thereby significantly influencing city investments10,11, but its excessive development and commercialization can also yield adverse effects12. For example, the protection of historical sites may constrain urban planning and contribute to traffic congestion, as observed in ancient urban areas like Beijing13. Moreover, CH can affect housing equity; in areas such as Hangzhou’s West Lake, the presence of CH has led to a significant premium in surrounding property prices, as it is viewed both as an irreplaceable resource and a key selling point for developers14.

Both the positive and negative impacts of CH are poised to have long-term and far-reaching effects on urban and CH sustainability. Especially for cities that center their tourism and consumption industries around these resources, achieving a balance between heritage protection and sustainable development is imperative4,5,15,16. Otherwise, CH may deviate from its authentic historical form, resulting in negative experiences for stakeholders and diminishing benefits for the broader community17,18,19,20. Moreover, as a special form of urban green space, CH serves as the main body of urban CES and promotes sustainable urban development. Finding a win-win solution between CH and sustainable development is therefore a concern for scholars, planners, and policymakers. A human-centered sustainable development approach, one that is firmly based on residents’ well-being, offers a robust framework for ensuring equitable and enduring urban progress5,21.

With rapid urbanization, public engagement with urban environments has attracted widespread scholarly attention22,23. Prior studies on urban green spaces have demonstrated that natural settings can significantly contribute to residents’ physical and mental well-being, for instance, by reducing stress through activities like walking or enjoying scenic views24,25,26,27. Beyond their ecological benefits, cultural and spiritual values within urban environments are equally important28,29. Modern urban planning increasingly integrates cultural and natural landscapes, reflecting a shift toward holistic, people-centered development30.

Urban perception refers to how residents interpret and interact with their surroundings, including streets, parks, and historic sites31. While previous studies in the field of urban research have laid a valuable foundation by demonstrating how environmental perception influences well-being, understanding public perceptions of cultural space requires an additional focus on historical, cultural, and social dimensions7,32,33,34,35. In this context, modern human-centered urban planning increasingly emphasizes public engagement and participatory decision-making, specifically aimed at enhancing the management and conservation of CH36,37.

CH itself, which includes both tangible elements (e.g., historic buildings, gardens) and intangible aspects (e.g., traditions, folklore), is inherently linked to subjective perception and place attachment38,39. For example, while heritage sites in locations like West Lake, Hangzhou, contribute to both ecological and cultural experiences, their true value lies in the rich cultural narratives and emotional connections they foster among local communities15. Intangible CH, being harder to preserve, relies on social perception, education, and intergenerational transmission40,41. Consequently, CH conservation extends beyond material protection to include cultural significance and experiential continuity42.

Compared to natural environments, cultural heritage or landscape perception is more dependent on human mediation, involving multi-sensory engagement that shapes visitors’ experiences43. This highlights the need to study CH perception beyond visual elements, incorporating auditory, olfactory, and tactile dimensions. The perception of the site is shaped by multi-sensory interactions, with vision being the most extensively studied sense44,45,46. Factors such as color, shape, and spatial composition significantly influence visitors’ experiences47,48,49,50,51,52. However, other sensory modalities, such as soundscapes, scents, and textures, remain underexplored due to methodological challenges53. Emerging studies suggest that multi-sensory engagement enhances visitor satisfaction, cultural appreciation, and place attachment, emphasizing the need for a holistic approach to cultural perception54,55,56,57.

Despite its importance, multi-sensory CH perception remains understudied across different cultural contexts. Unlike purely natural settings, CH environments worldwide include both natural and human-made elements, such as architecture, cultural activities, and intangible heritage58. Consequently, the sensory experience involved is not limited to vision; other senses may also affect tourists’ overall perception, evaluations, willingness to pay, and other behavioral decisions59. Understanding multi-sensory experiences in various CH contexts not only provides feedback on heritage conservation strategies but also fosters cross-cultural learning and enhances the global appreciation of CH33. By analyzing public perception across different geographical and historical settings, this research contributes to a broader discourse on CH perception, offering insights for the sustainable management of heritage sites worldwide.

Traditionally, studies on public perception of CH have relied on surveys, interviews, and controlled experiments60,61,62. While these methods provide valuable insights, they are often time-consuming, resource-intensive, and geographically limited63,64. In contrast, online reviews offer a rich, user-generated dataset that captures real-time, large-scale public perceptions of CH sites65,66. The increasing accessibility of CH tourism has led to an abundance of online narratives, reflecting visitors’ experiences, preferences, and cultural interpretations67,68,69,70. These online reviews, which include text, images, and ratings posted on platforms such as TripAdvisor, Google Reviews, and social media, serve as valuable data sources for analyzing public perceptions of CH. Compared to traditional survey methods, online reviews provide more spontaneous and diverse feedback, covering a broader range of visitor demographics and capturing multi-dimensional insights into CH experiences. Social media and review platforms have been widely used in urban studies, disaster response, and environmental planning, demonstrating their reliability as perception assessment tools62. For CH, integrating online user-generated content into heritage conservation can enhance public participation and decision-making, helping policymakers understand how visitors engage with and understand cultural heritage and what improvements are needed71,72.

Despite the growing recognition of multi-sensory experiences in CH studies, previous research has rarely utilized large-scale online reviews to analyze visitor perceptions. Most existing studies either rely on small-scale surveys or analyze textual and visual data separately, failing to capture the holistic, multi-sensory experience of CH sites. Additionally, comparative studies across culturally similar heritage sites remain limited, making it difficult to identify shared characteristics, contextual differences, and opportunities for mutual learning.

To address this gap, this study leverages large-scale online reviews to investigate visitor perceptions of CH sites in Suzhou, China, and Kyoto, Japan, both of which share an East Asian CH context. By applying a multi-sensory analytical framework, this research aims to link subjective perceptions to objective physical spaces, uncovering how different sensory elements shape visitors’ experiences. Furthermore, this study explores how such insights can enhance CH management, allowing for targeted improvements based on visitor feedback and cross-cultural knowledge exchange.

This study examines how multi-sensory perceptions influence CH experiences by comparing CH sites in Suzhou and Kyoto through large-scale online reviews. The key research questions are: (1) What are the unique multi-sensory characteristics of CH in Suzhou and Kyoto based on subjective perception? (2) What are the key differences in sensory experiences between the CH of Suzhou and Kyoto? (3) How do these multi-sensory experiences influence overall visitor perception in each city? (4) How can multi-sensory elements be integrated into sustainable conservation strategies for CH?

By addressing these questions, this research provides a systematic approach to understanding visitor perceptions. It offers insights into the multi-sensory composition of public perception and the unique characteristics of different CH sites, thereby informing differentiated conservation measures and integrating subjective visitor experiences into management optimization and sustainable preservation strategies.

Methods

Study area

Suzhou and Kyoto are famous for their CH; both cities have many renowned UNESCO World Cultural Heritage (WCH) sites and many non-WCH sites protected by local governments. Although these non-WCH sites do not have the same high reputation as the WCH sites, they still attract many visitors and were also selected as our research sites. In total, 20 CH sites from Suzhou and 20 CH sites from Kyoto were selected for this study to investigate whether the perception of the two cities demonstrates differences in online reviews and how landscape and multi-sensory elements affect overall perceptions (Fig. 1 and Table 1).

Fig. 1: Research Site.

a The red dots indicate the locations of the two cities. b World Cultural Heritage sites in Suzhou. Blue dots represent World Cultural Heritage sites, red dots represent non-World Cultural Heritage sites, and darker colors indicate a greater number of reviews. c World Cultural Heritage sites in Kyoto. Blue dots represent World Cultural Heritage sites, red dots represent non-World Cultural Heritage sites, and darker colors indicate a greater number of reviews.

Table 1 Cultural Heritage Sites Included in This Study

Suzhou is in East China, in the southeast of Jiangsu Province. It is one of China’s first national historical and cultural cities, with a history of nearly 2500 years73. The Classical Gardens of Suzhou and the Suzhou section of the Beijing-Hangzhou Grand Canal have been listed as UNESCO WCH sites74. Kyoto, located in western Japan, is an important city in the Osaka metropolitan area. Some of the city’s historic buildings and gardens were listed as WCH sites in 1994 as part of the Historic Monuments of Ancient Kyoto75.

Both Suzhou and Kyoto have a long history and rich CH, with numerous gardens and historic buildings, several of which are recognized as WCH sites, making them prototypical CH cities. Moreover, both cities are economically developed and have a longstanding tradition of heritage protection and sustainable urban development. As prominent CH tourism destinations, they attract many visitors each year, generating ample online review data that supports comprehensive analysis of overall visitor perceptions. Collecting this big data further aids in understanding site perception and the distinctive urban characteristics of each locale. The similarities and differences between Suzhou and Kyoto provide a robust context for applying these methods to detect variations in multi-sensory experiences, helping to identify the distinctive characteristics of each city, especially subtle multi-sensory differences, and to offer targeted recommendations for spatial environment management and heritage conservation.

Data collection—integrating photos and text for a holistic understanding of visitor perception

Previous studies have commonly used Visitor-Employed Photography (VEP) to analyze tourist perceptions, where visitors take photos to document their experiences and highlight visually significant elements76,77,78. However, while photos capture explicit preferences in terms of landscape features and spatial composition, they do not provide insight into the emotions, reasoning, or sensory experiences that influence those choices.

In contrast, textual reviews, often generated in unstructured formats, reflect visitors’ subjective evaluations, detailing their likes, dislikes, and overall impressions. However, these textual descriptions may lack clear spatial references, making it difficult to determine which specific landscape elements or cultural features contributed to sentiments.

By combining photo and text analysis, this study provides a more comprehensive understanding of CH perception, offering several key advantages. First, it enhances interpretability by using photos to identify specific landscape elements that attract visitors, while text provides the reasoning behind their preferences. Additionally, it captures multi-sensory experiences, as text reveals subjective perceptions and sensory details such as sound and atmosphere, which cannot be directly inferred from images alone. Finally, this approach provides a holistic understanding by allowing us to explore correlations between landscape elements and textual sentiment, offering a more comprehensive analysis of CH perception.

The review data for the CH sites in Suzhou and Kyoto came from Ctrip, China’s largest travel platform, and Google Maps in Japan. Because of their popularity, many visitors voluntarily upload reviews to these platforms, and both are widely used in research79,80. Moreover, because the visitors come from different language backgrounds, Google Translate was used to unify the languages. To satisfy the requirement for formal agreements governing data collection, we created a Google Cloud account and used a valid API key to access data through the official Google Maps Platform APIs, strictly adhering to the Google Maps Platform Terms of Service and Google Terms of Service. Similarly, for Ctrip, we adhered to Ctrip’s terms of service and privacy policies. In both cases we collected only publicly available reviews without personal information and used them anonymously for research, not for commercial purposes or competition with Google services81,82. Data collection was completed in October 2023, so the dataset covers reviews posted up to that time. After removing duplicates and empty comments, 37,649 reviews remained for Suzhou and 17,485 for Kyoto (Table 1). Such disparities in review volume are common in cross-site comparative studies and do not compromise the validity of perception analysis. Instead of relying on absolute review counts, we focus on content-based and proportional analysis, so that key themes and sensory elements are assessed relative to the total dataset size. These adjustments keep the results comparable across the two cities despite the difference in sample size.

To ensure reliable cross-linguistic analysis, we adopted a translation strategy supported by recent advances in neural machine translation. Transformer-based models like Google Translate83 and large language models such as GPT-484 have shown strong performance in preserving semantic and cultural context across languages. To verify translation quality, we sampled 200 reviews containing culturally specific terms, drawn proportionally from Suzhou and Kyoto. Comparing machine translations to human-annotated versions, we observed 93.0% agreement and a Cohen’s Kappa of 0.657 (p < 0.001), indicating substantial consistency (Table 2 and Table 3). To further reduce ambiguity, we applied synonym normalization and manually reviewed unclear terms during preprocessing.
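The agreement check described above can be illustrated with a minimal sketch of how percent agreement and Cohen’s Kappa are derived from a confusion matrix. The function name `agreement_and_kappa` and the counts in `toy_matrix` are hypothetical and chosen for illustration; they are not the values reported in Table 2.

```python
def agreement_and_kappa(confusion):
    """Return (observed agreement, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] counts items that rater A put in class i and rater B in class j.
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")

    # Observed agreement: share of items on the diagonal.
    observed = sum(confusion[i][i] for i in range(size)) / total

    # Chance agreement: product of the two raters' marginal shares, summed over classes.
    row_share = [sum(confusion[i]) / total for i in range(size)]
    col_share = [sum(confusion[i][j] for i in range(size)) / total for j in range(size)]
    expected = sum(r * c for r, c in zip(row_share, col_share))

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return observed, (observed - expected) / (1.0 - expected)


# Hypothetical counts (human vs. machine judgement), NOT the values in Table 2.
toy_matrix = [[40, 10],
              [5, 45]]
observed, kappa = agreement_and_kappa(toy_matrix)
print(f"agreement={observed:.3f}, kappa={kappa:.3f}")
```

For the toy matrix, observed agreement is 0.85 and chance agreement is 0.5, so Kappa is (0.85 − 0.5) / (1 − 0.5) = 0.7. Kappa can sit well below raw agreement when one class dominates, which is consistent with the gap between the 93.0% agreement and the Kappa of 0.657 reported in the text.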

Table 2 Confusion Matrix of Translation Consistency between Human and Machine on Culturally Specific Terms
Table 3 Cohen’s Kappa statistics for translation consistency

Deep learning processing

In this study, two deep learning (DL) techniques are employed to comprehensively mine effective information from online reviews. We adopt a fully convolutional network (FCN) for semantic segmentation of images to extract landscape element information, and natural language processing (NLP) for text analysis to capture sensory descriptions and identify CH features (Fig. 2).

Fig. 2: Picture and text processing workflow.

The application of DL in multidisciplinary fields has greatly enhanced our ability to process various data types. For instance, DL-based analysis of large volumes of street view maps is widely used across many research domains37,80,85. Semantic analysis of photographs and emotional analysis of online text have become common practices, for example in studies exploring the relationship between landscape facilities and residential rental prices86. Although deep learning algorithms remain somewhat immature, they have already proven to be superior to manual methods, especially when processing and analyzing multimodal data such as text, images, and audio87. For instance, some studies have demonstrated that the use of convolutional neural networks (CNNs) can improve data accuracy by more than 15%69.

Geo-tagged photos from social media platforms are typically used to reveal visitors’ landscape preferences, but most research has primarily focused on their geographic and temporal distribution rather than delving into the rich, underlying information. Manual extraction of elements from photos in online reviews is inefficient70,88. Therefore, we employ FCN for automated image segmentation. Similarly, we use NLP methods to analyze the text in online reviews by clustering feature words with similar word vectors, thereby facilitating both statistical and emotional analyses80. Given the absence of a mature dataset for training models to identify characteristic sensory vocabulary, we followed the approach of Koblet et al., who used a dictionary-based method to compile lists of sensory-related adjectives and nouns, and created a tailored vocabulary list for this study6. Python 3.7.4 was used for data acquisition, model construction, and processing.

Visitors tend to post photos that reflect their own preferences. Traditional methods such as VEP are widely used to understand visitor perceptions78,89. However, manually classifying or counting landscape elements from a large volume of photos is both labor-intensive and prone to inaccuracies.

The FCN method addresses these challenges by replacing the fully connected layers of a conventional convolutional network with convolutional layers. This allows FCN to accept input images of any size and to use deconvolution layers to upsample feature maps back to the original image size, enabling pixel-level classification. Consequently, we adopted the FCN method for photo processing88. For training the model, we used the ADE20K dataset, annotated and published by the MIT CSAIL Computer Vision Group90,91,92. This dataset comprises 25,000 multi-scene photos annotated with 150 object categories, making it suitable for analyzing the complex, multi-scene images generated by visitors93.

In line with previous studies and the aims of our research, we removed irrelevant scene elements, retained 144 types of markers, and consolidated these into 12 major landscape element categories for subsequent model training (Table 4)6,47. In total, 25,574 images from the dataset were used as the training set, and 2000 as the validation set. The FCN model was implemented and fine-tuned using PyTorch.

Table 4 Classification of Semantic Segmentation Elements

After deduplication and cleaning, the photos from online reviews of Suzhou and Kyoto were organized into two separate folders. Custom code was used to read the photo data, which was then input into the FCN to generate semantic segmentation results. These results allowed us to calculate the proportion of landscape elements favored by visitors, providing the basis for subsequent statistical analyses (Fig. 3).
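As a rough illustration of the last step, the sketch below turns a per-pixel label map into consolidated landscape element proportions. It assumes the segmentation output is available as a 2-D grid of label names; a real FCN would emit integer class indices, and the `CATEGORY_OF` mapping here is a hypothetical fragment, not the 144-to-12 consolidation of Table 4.

```python
from collections import Counter

# Hypothetical fragment of a fine-label -> consolidated-category mapping.
# Labels missing from the mapping are treated as irrelevant and not reported.
CATEGORY_OF = {
    "tree": "vegetation",
    "grass": "vegetation",
    "building": "architecture",
    "sky": "sky",
    "water": "water",
}


def element_proportions(label_map, category_of):
    """Share of the whole photo covered by each consolidated category."""
    counts = Counter()
    total_pixels = 0
    for row in label_map:
        for label in row:
            total_pixels += 1  # every pixel counts toward the denominator
            category = category_of.get(label)
            if category is not None:
                counts[category] += 1
    if total_pixels == 0:
        raise ValueError("label map is empty")
    return {category: n / total_pixels for category, n in counts.items()}


# Toy 2 x 4 "photo"; "person" is not a retained landscape element.
toy_mask = [
    ["sky", "sky", "tree", "building"],
    ["water", "grass", "tree", "person"],
]
for category, share in sorted(element_proportions(toy_mask, CATEGORY_OF).items()):
    print(f"{category}: {share:.3f}")
```

Dividing by all pixels, rather than only the retained ones, follows the definition used later in the text, where a landscape element factor is the proportion of an element within an entire photo.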

Fig. 3: FCN flowchart of this study. FCN: fully convolutional network.

For text data, we employed Bidirectional Encoder Representations from Transformers (BERT) as our primary tool for NLP. BERT is a transformer-based bidirectional encoder designed to pre-train deep representations from unlabeled text by capturing contextual information7. In our approach, a pre-trained BERT model was fine-tuned on a large-scale corpus to generate word vectors that capture rich semantic information (Fig. 4). However, because the resulting word vectors include a vast array of terms, we subsequently removed those deemed irrelevant to our study.

Fig. 4: Overview of BERT pre-training, fine-tuning procedures, and input representation.

a Overall pre-training and fine-tuning procedures for BERT. Apart from the output layers, the same architecture is used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different downstream tasks. During fine-tuning, all parameters are updated. [CLS] is a special token added at the beginning of each input example, and [SEP] is a special separator token (e.g., between questions and answers). b BERT input representation. The input embeddings are the sum of token embeddings, segment embeddings, and position embeddings115. This study uses three short sentences as an example.

The text data from online reviews of Suzhou and Kyoto underwent an extensive preprocessing procedure. First, we cleaned the data by removing noise, such as spelling errors and non-target language content. We then used Jieba for tokenization, word stemming, and part-of-speech tagging, followed by stop word removal based on a standard stop word library. The resulting tokens were used to construct vocabulary lists and perform word frequency statistics. Given that the generated word vectors typically span several hundred dimensions, we applied Principal Component Analysis (PCA) to reduce these vectors to two dimensions, facilitating more intuitive clustering (Fig. 5).

Fig. 5: Technical framework of text processing.

Next, tokens were screened to form an environmental feature lexicon that aligns with the study’s objectives, guided by K-means clustering outcomes. We created a list of terms to extract multi-sensory and landscape elements (Table 5). Recognizing that CH encompasses many artificial components, we explicitly separated natural elements from artificial ones. For the sensory descriptions, including aspects related to hearing, smell, and taste94, we enriched our vocabulary using resources such as WordNet6,95,96. This sensory category includes nouns that denote sounds or smells as well as adjectives describing sensory experiences. Moreover, because terms associated with taste and smell clustered indistinctly, as did terms associated with mountain, water, and stone, we merged each set into a single group. We also observed clusters that conveyed overall affective impressions; although these clusters encompass multiple sensory inputs and do not specify individual senses, clear positive and negative groupings emerged, which we subsequently categorized as positive and negative feelings.
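The clustering that guided this screening can be illustrated with a minimal sketch of K-means (Lloyd’s algorithm) on two-dimensional points, standing in for PCA-reduced word vectors. The points, the initial centroids, and the function names are hypothetical; the study’s actual implementation and parameters are not given in the text.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iterations=50):
    """Plain Lloyd's algorithm: assign to nearest centroid, then recompute means."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # converged
        centroids = updated
    return labels, centroids


# Toy 2-D coordinates for six hypothetical terms forming two obvious groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(toy_points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
```

Fixed initial centroids keep the sketch deterministic; in practice a library implementation with randomized, repeated initialization would be the more usual choice.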

Table 5 Environmental Feature Lexicon Used for Text Analysis

Statistical analysis

After the DL processing of photos and text, the resulting data were compiled and organized using Microsoft Excel. Subsequent statistical analyses were performed in Python using libraries such as Statsmodels and SciPy for quantitative assessments, while Plotly was employed for data visualization.

To test for significant differences, independent-samples t-tests were conducted to compare tourists’ sensory experiences and the prevalence of landscape elements between the two cities, with a significance threshold set at p < 0.05.
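For illustration, the statistic behind such a comparison can be sketched as below. The text does not state whether equal variances were assumed, so this sketch uses Welch’s form as an assumption; the sample values and the function name `welch_t` are hypothetical. In practice the test, including its p-value, would normally come from SciPy’s `scipy.stats.ttest_ind`.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard error of each sample mean (sample variance uses n - 1).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical per-photo proportions of one landscape element in two cities.
city_a = [0.10, 0.12, 0.08, 0.11, 0.09]
city_b = [0.20, 0.22, 0.18, 0.21, 0.19]
t_stat, dof = welch_t(city_a, city_b)
print(f"t={t_stat:.2f}, df={dof:.1f}")
```

The sign of t depends only on which sample is passed first, so the ordering of the two cities needs to be fixed before signs are interpreted.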

Furthermore, multiple linear regression analyses were performed using various landscape elements and sensory factors (extracted from both text and photos) as predictor variables. This approach aimed to determine the extent to which these factors influence overall visitor perceptions.

To investigate the relationships between landscape elements and sensory experiences, Pearson correlation analysis was used. This method allowed us to quantify the degree of association between variables, again with statistical significance defined at p < 0.05.
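As a minimal sketch of the coefficient itself, the function below computes Pearson’s r for two equally long series. The variable names and values are hypothetical; the significance test reported in the study would additionally require a p-value, for example from SciPy’s `scipy.stats.pearsonr`.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical per-site scores: a landscape element factor vs. a sensory factor.
element_factor = [1, 2, 3, 4, 5]
sensory_factor = [2, 1, 4, 3, 5]
print(f"r={pearson_r(element_factor, sensory_factor):.2f}")
```

For these toy values the cross-product sum is 8 and both sums of squares are 10, so r = 0.8.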

Results

Public overall perception

In our study, we employed the Google Cloud Natural Language platform for text sentiment analysis97. This platform processes input text through contextual and syntactic analysis to compute a sentiment score, which ranges from −1 (strongly negative sentiment) to 1 (strongly positive sentiment), with 0 representing neutrality. These scores were then normalized to a 0 to 1 scale, so that higher values correspond to more positive sentiment. In addition, the platform calculates a sentiment magnitude, reflecting the overall strength or intensity of the emotional content irrespective of its polarity.
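The rescaling from the −1 to 1 range to the 0 to 1 range can be sketched as follows. The text does not spell out the mapping, so a linear min-max rescaling is assumed here, and the function name `normalize_sentiment` is hypothetical.

```python
def normalize_sentiment(score):
    """Map a raw sentiment score in [-1, 1] linearly onto [0, 1].

    Assumption: linear min-max rescaling, (score + 1) / 2.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("raw sentiment score must lie in [-1, 1]")
    return (score + 1.0) / 2.0


for raw in (-1.0, 0.0, 0.5, 1.0):
    print(raw, "->", normalize_sentiment(raw))
```

Under this assumption a neutral raw score of 0 maps to 0.5, so normalized averages above 0.5 indicate net positive sentiment.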

Emotion recognition is a fundamental capability in modern natural language processing and human-computer interaction. General-purpose large models, such as ChatGPT, rely on understanding and processing emotional cues as part of their core functionality. These models are trained on vast amounts of data that encompass not only information but also the subtleties of human emotion, enabling them to interpret sentiment, tone, and context effectively. This capability underpins the generation of contextually appropriate and empathetic responses, thereby enhancing human-AI interactions. In essence, emotion recognition is not only critical for specialized tasks like sentiment analysis but also forms a vital foundation for the broader understanding mechanisms in state-of-the-art large language models.

By integrating these advanced emotion recognition capabilities, the Google Cloud Natural Language platform provides a robust means to convert qualitative language descriptions from online reviews into quantifiable sentiment metrics. These metrics serve as reliable proxies for the public’s overall perception of CH sites in Suzhou and Kyoto, thereby offering valuable insights for our study.

As illustrated in Fig. 6 and Table 6, the CH sites in both cities received a relatively positive overall perception. Suzhou’s reviews were 66.90% positive, 19.23% neutral, and 13.19% negative, with an average score of 0.829. Kyoto’s reviews were 66.76% positive, 13.54% neutral, and 11.45% negative, with an average score of 0.834.

Fig. 6: Overall sentiment score distribution.

Sentiment score distribution of online reviews for Suzhou (N = 37,649) and Kyoto (N = 17,485).

Table 6 Distribution of Different Sentiment Scores

Photo data analysis

Using the semantic segmentation classes provided by the ADE20K dataset, the photo data from online reviews were processed with the trained model. Following FCN processing, as depicted in Fig. 3, the landscape element factor, defined as the proportion of a specific landscape element within an entire photo, was extracted for both Suzhou and Kyoto, with the results illustrated in Fig. 7.

Fig. 7: Proportion of landscape elements in photos.

Analysis of the photos uploaded by visitors reveals that the most frequently occurring landscape elements in Suzhou include architecture, vegetation, sky, water, and interior elements. In contrast, Kyoto’s dominant landscape elements consist of vegetation, sky, buildings, roads, interior elements, and rock formations. This suggests that architecture plays a more prominent role in Suzhou’s visual perception, whereas vegetation is more significant in Kyoto. Additionally, the proportion of the sky element in Kyoto is greater than that in Suzhou, indicating a higher degree of visual openness in Kyoto’s landscape.

Furthermore, the proportion of water elements in Suzhou’s images surpasses that in Kyoto’s, highlighting the integral role of water in Suzhou’s CH. This finding underscores the distinct characteristics of each city and the different ways in which visitors perceive and capture their surroundings.

Figure 8 presents radar charts of the landscape element factors for the CH sites in Suzhou and Kyoto. The visualization reveals that even within the same city, different CH sites exhibit distinct landscape characteristics. For instance, in Suzhou, S17 features the highest proportion of architectural elements, S8 is dominated by vegetation, S10 stands out for its water and sky elements, and S3 has the most significant interior representation. Meanwhile, in Kyoto, K5 is characterized by abundant vegetation, K4 by a prominent sky element, K11, K13, and K16 by striking architectural features, and K8 by its well-defined road network, all of which are particularly appealing to tourists.

Fig. 8: Radar chart of selected landscape elements.

a Suzhou. b Kyoto.

Text data analysis

We extracted high-frequency words, including nouns, adjectives, and verbs, from the text data of online reviews. After processing these words using the NLP workflow illustrated in Fig. 5 and eliminating irrelevant clusters, we identified key feature word clusters for each city. These clusters represent potential characteristic factors influencing overall perceptions and encompass sensory elements, landscape features, and descriptions of experiential mood (Fig. 9).

Fig. 9: Expression of different elements in Suzhou and Kyoto.

To systematically analyze these perceptual clusters, we categorized them into three groups: natural elements, artificial elements, and sensory descriptions. As an intermediate step, we visualized these clusters within a two-dimensional coordinate system to refine the feature words and construct a structured word list. The feature list was developed using cosine similarity calculations, incorporating words with similarity values greater than 0.5 into the lexicon. A second layer of vocabulary classification was then conducted based on the initial classification framework, as illustrated in Table 5 and Fig. 5. Using this refined high-frequency vocabulary, we generated a chart (Fig. 9) to further illustrate the relationships among key terms.
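The cosine-similarity rule for admitting words into the lexicon can be sketched as below. The two-dimensional vectors, the candidate words, and the function names are hypothetical stand-ins; the study’s vectors come from BERT and have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_lexicon(seed_vector, candidates, threshold=0.5):
    """Keep candidate words whose similarity to the seed exceeds the threshold."""
    return [
        word
        for word, vector in candidates.items()
        if cosine_similarity(seed_vector, vector) > threshold
    ]


# Toy vectors: a hypothetical seed for "water" and three candidate terms.
water_seed = [1.0, 0.0]
toy_candidates = {
    "pond": [0.9, 0.1],
    "ticket": [0.0, 1.0],
    "canal": [0.6, 0.8],
}
print(expand_lexicon(water_seed, toy_candidates))
```

With these toy vectors, "pond" and "canal" clear the 0.5 threshold (similarities of about 0.99 and exactly 0.6) while "ticket" is orthogonal to the seed and is excluded.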

Based on these results, we observed that in the sensory descriptions of Suzhou and Kyoto, landscape elements that evoke sensory responses are predominantly represented by nouns, whereas sensory perceptions and actions are mainly expressed through adjectives and verbs. While these terms are distributed across multiple clusters, their patterns remain discernible. Among the senses, vision is the most frequently described, particularly in relation to color. Consistent with prior photo analysis, both artificial and natural elements—such as buildings and vegetation—exhibit distinct clustering in terms of their visual impact. The second most frequently mentioned sense is auditory perception, which is often associated with sounds from birds, music, human voices, or an overall sense of tranquility.

In contrast, olfactory and gustatory descriptions frequently co-occur, leading us to analyze these two senses together. This clustering reveals strong associations with flowers and food, which may correspond to smell and taste, respectively. Descriptions of touch primarily relate to the perception of natural conditions, such as temperature. Additionally, the expressions of positive and negative experiential moods show distinct clustering patterns. Given that sensory experiences are inherently multisensory, we classified these mood-related descriptions within the broader framework of sensory perception.

Statistical analysis

Suzhou and Kyoto share an Asian cultural background, and the architecture and landscape styles of their CH sites exhibit strong similarities. However, visitors’ descriptions of landscape elements and sensory experiences reveal significant differences (Table 7 and Fig. 10), as reflected in both photographic and textual data.

Fig. 10: Differences in elements and perceptions in online reviews of the two cities.

Table 7 Analysis of Significant Differences in Subjective Perception Between Suzhou and Kyoto

In the photo data, the most pronounced differences were observed in water (t = −83.94, p < 0.01) and architecture (t = −52.83, p < 0.01), both of which were significantly more prominent in Suzhou. This result aligns with Suzhou’s historical identity as a water town, where traditional gardens integrate pavilions, bridges, and canals as core elements of the landscape. In contrast, sky (t = 64.70, p < 0.01) and roads (t = 53.33, p < 0.01) were significantly more emphasized in Kyoto, likely reflecting the prevalence of open landscapes, Karesansui (dry rock gardens), and structured pathways in Japanese cultural sites. These findings indicate that Kyoto’s landscapes emphasize expansive, open spaces, whereas Suzhou’s CH landscape is more architecturally enclosed and water-integrated.
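The sign convention of these statistics can be made concrete with a short sketch. This section does not state which variant of the t-test was used, so the example assumes Welch's independent-samples t statistic computed as Kyoto minus Suzhou, which is consistent with the negative values reported for elements that are more prominent in Suzhou. The function name and the sample values are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for mean(sample_a) - mean(sample_b)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Standard error from the two sample variances (unequal variances allowed).
    se = math.sqrt(variance(sample_a) / n_a + variance(sample_b) / n_b)
    return (mean(sample_a) - mean(sample_b)) / se

# Hypothetical per-photo water proportions, invented for illustration only.
kyoto_water = [0.02, 0.04, 0.03, 0.05]
suzhou_water = [0.20, 0.25, 0.22, 0.27]

# Negative under the assumed Kyoto-minus-Suzhou convention: more water in Suzhou.
t_water = welch_t(kyoto_water, suzhou_water)
```

Under this convention a negative t indicates a larger mean in Suzhou and a positive t a larger mean in Kyoto.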

In the text data, significant differences were also observed in the frequency of descriptions of landscape elements and sensory experiences (p < 0.01), demonstrating distinct visitor perceptions of tangible landscape features and the sensory experiences they evoke. Among these, the most pronounced difference was found in cost perception (t = −47.963, p < 0.01), suggesting that discussions related to pricing and expenses were notably more frequent in Suzhou. Kyoto visitors, on the other hand, more frequently described flower-related elements (t = 43.838, p < 0.01) and seasonal aspects (t = 34.654, p < 0.01), likely due to the city’s well-known seasonal aesthetics, such as cherry blossoms and autumn foliage. Additionally, descriptions of Kyoto’s CH sites emphasized festival activities (t = 33.897, p < 0.01) and vision-related experiences (t = 23.514, p < 0.01), whereas Suzhou’s text data exhibited a stronger focus on landscape structure (t = −16.271, p < 0.01) and natural elements such as water, mountains, and stones (t = −14.497, p < 0.01).

The most significant differences in textual descriptions relate to cost and fees. While Suzhou exhibits a greater volume of sensory descriptions, Kyoto’s depictions of seasonal flowers make it particularly appealing.

To understand the factors influencing overall perception in Suzhou and Kyoto, we conducted multiple linear regression analyses using landscape elements, artificial features, multi-sensory experiences, and photographic data as predictors (Table 8 and Fig. 11). To mitigate multicollinearity, we constructed separate regression models for each predictor category. The models demonstrated varying levels of explanatory power, with traditional R² values of 64.1% for Suzhou and 63.9% for Kyoto, while Pseudo R² values ranged from 0.22 to 0.36 in Suzhou and from 0.27 to 0.36 in Kyoto, depending on the predictor category. Notably, photo-based predictors explained 21.4% of the variance in Suzhou and 25.3% in Kyoto, indicating that photographic representations contribute moderately to visitor perceptions of CH sites.

Fig. 11: Influence of landscape elements and senses on overall perception.

Table 8 Summary of multiple linear regression models predicting overall perception in Suzhou (N = 20) and Kyoto (N = 20)

To ensure the robustness of the regression models, a Kolmogorov-Smirnov (K-S) test was conducted to examine the normality of the residuals. The results confirmed that the residuals followed a normal distribution (p > 0.05), satisfying the assumption of normality. Additionally, a variance inflation factor (VIF) test was performed to assess multicollinearity, with all VIF values below 5, indicating the absence of severe collinearity among predictors.
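The multicollinearity check can be sketched as follows. The example computes the variance inflation factor of each predictor as 1 / (1 − R²), where R² comes from an ordinary least squares regression of that predictor on the remaining ones, and flags values of 5 or more. The helper functions and the two toy predictor columns are hypothetical. The study's own software and data are not described in this section.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def r_squared(predictors, target):
    """R^2 of an OLS fit (with intercept) of target on the predictor columns."""
    n = len(target)
    rows = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * target[i] for i, r in enumerate(rows)) for p in range(k)]
    beta = solve_linear_system(xtx, xty)  # normal equations
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    y_bar = sum(target) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(target, fitted))
    ss_tot = sum((y - y_bar) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot

def variance_inflation_factors(predictors):
    """VIF_j = 1 / (1 - R_j^2), regressing predictor j on all other predictors."""
    vifs = []
    for j, col in enumerate(predictors):
        others = predictors[:j] + predictors[j + 1:]
        vifs.append(1.0 / (1.0 - r_squared(others, col)))
    return vifs

# Hypothetical, deliberately collinear predictor columns (illustration only).
greening = [1.0, 2.0, 3.0, 4.0, 5.0]
flowers = [1.0, 2.0, 3.0, 4.0, 6.0]
vifs = variance_inflation_factors([greening, flowers])
flagged = [v >= 5 for v in vifs]  # True marks a predictor that would fail the check
```

In this toy case both predictors are flagged because they are nearly collinear, whereas uncorrelated predictors yield a VIF of 1.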

Among text-based natural elements, greening had the strongest positive effect in Suzhou (β = 0.3466) and a smaller effect in Kyoto (β = 0.1062), suggesting that vegetation contributes more prominently to visitor perceptions in Suzhou. Seasonal descriptions positively influenced perception in both cities (Suzhou: β = 0.0755, Kyoto: β = 0.0942), reinforcing the importance of seasonal aesthetics in both cultural landscapes. However, descriptions of water, mountains, and stones negatively influenced perception in Suzhou (β = −0.032) but positively in Kyoto (β = 0.0779), indicating that while Kyoto’s visitors associate these elements with a more favorable experience, Suzhou’s visitors may perceive them as less influential or even unfavorable. The overall explanatory power of these natural elements was higher in Kyoto (Pseudo R² = 0.36) than in Suzhou (Pseudo R² = 0.22), suggesting that text-based natural features play a more central role in shaping Kyoto’s visitor perceptions.

The influence of artificial elements varied across the two cities. Architecture had a positive effect in both Suzhou (β = 0.1366) and Kyoto (β = 0.1162), indicating its consistent role in shaping visitor perceptions. However, cultural facilities had a strong negative effect in Suzhou (β = −0.2453) but little influence in Kyoto (β = −0.0115), suggesting that built heritage is perceived differently between the two cities. Service and cost-related factors negatively influenced perception in both cities, with stronger effects in Suzhou (Service: β = −0.2424, Cost: β = −0.0898) than in Kyoto (Service: β = −0.0886, Cost: β = −0.0812). This result suggests that service quality and expenses play a more critical role in Suzhou’s visitor satisfaction than in Kyoto. The model for artificial elements explained 30% of the variance in Suzhou (Pseudo R² = 0.30) and 27% in Kyoto (Pseudo R² = 0.27), indicating similar explanatory power across both cities.

Among sensory modalities, visual perception was the strongest predictor of overall perception in Suzhou (β = 0.4069), whereas its effect in Kyoto was much weaker (β = 0.1070). This suggests that Suzhou’s visitors rely more on visual aesthetics in evaluating their experience. Auditory perception had a positive effect in Kyoto (β = 0.1571) but a negative effect in Suzhou (β = −0.0635), indicating that Kyoto’s soundscape, potentially influenced by cultural elements like temple bells and ambient music, enhances visitor perception. In contrast, olfactory and taste-related experiences had a negative effect in Suzhou (β = −0.1286) but a minor positive effect in Kyoto (β = 0.0448), implying that Kyoto’s food and scent-related experiences contribute more positively to overall perception. The explanatory power of the multi-sensory model was slightly higher in Suzhou (Pseudo R² = 0.29) than in Kyoto (Pseudo R² = 0.253), reinforcing the stronger role of visual factors in Suzhou’s visitor experience.

In the photo-based regression model, transportation infrastructure had the most substantial effect in both cities, positively influencing perception in Suzhou (β = 0.4476) but negatively in Kyoto (β = −1.1192). This stark contrast suggests that transportation access enhances visitor perception in Suzhou but may be seen as disruptive or overwhelming in Kyoto. The presence of water in photos had a positive effect in Suzhou (β = 0.0583) but a significant negative effect in Kyoto (β = −0.2333), further supporting the idea that water is a defining and favorable feature in Suzhou’s landscape but less valued in Kyoto’s cultural setting. Similarly, structural elements positively influenced perception in Suzhou (β = 0.1164) but negatively in Kyoto (β = −0.3388), suggesting different visitor responses to the built environments in the two cities. The explanatory power of the photo-based model was similar in both cities (Suzhou: Pseudo R² = 0.214, Kyoto: Pseudo R² = 0.253), indicating that photographic elements play a comparable role in shaping visitor perceptions.

The analysis of landscape elements in photos not only reveals the subjective preferences of visitors but also helps to objectively characterize each cultural landscape entity. In the correlation heat map between landscape elements (Fig. 12), architectural elements in both Suzhou and Kyoto are most strongly negatively correlated with vegetation elements, followed by the sky. The sky elements in Suzhou are positively correlated with water, whereas in Kyoto water is less related to the sky.

Fig. 12: Correlation heat map. a Suzhou. b Kyoto.
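A correlation matrix of the kind shown in such a heat map can be sketched as follows. The example assumes Pearson correlation coefficients computed over per-photo shares of each landscape element. The coefficient actually used is not stated in this section, and the function names and toy values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Nested dict of pairwise Pearson correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical per-photo element shares, invented for illustration only.
shares = {
    "architecture": [0.6, 0.4, 0.3, 0.1],
    "vegetation": [0.1, 0.3, 0.4, 0.6],
    "sky": [0.2, 0.3, 0.2, 0.3],
}
matrix = correlation_matrix(shares)
```

In this toy example architecture and vegetation are perfectly negatively correlated, the kind of trade-off that appears as a strongly negative cell in the heat map.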

The correlation analysis of landscape-element and sensory descriptions in the text demonstrates that visual perception of CH in the two cities is positively correlated with most natural and artificial landscape elements. The positive correlation between visual perception and architecture is stronger in Kyoto. Overall, the artificial (people, space) and natural (seasons, mountains and waters, greenery) elements are more relevant to visual perception in Kyoto, while in Suzhou, non-visual senses are more relevant to other landscape elements.

Regarding other sensory descriptions, different sensory experiences in Suzhou are positively correlated with various landscape elements. The description of hearing is positively correlated with landscape stone, natural phenomena, people, cultural facilities, landscape structures, space, and activity festivals in Kyoto. In Suzhou, artificial elements (structures, events-festivals, space) and natural elements (landscape stone, natural phenomena, flowers) demonstrate a positive correlation with smell-taste. Artificial (cultural facilities, structures) and natural (landscape stone, natural phenomena) elements correlate strongly with somatosensory descriptions.

In general, however, although the natural and artificial elements of Suzhou’s CH sites elicit more multi-sensory descriptions, descriptions of the five senses in Kyoto are more closely associated with positive experiences and mood.

Discussion

This study explores how multi-sensory perceptions influence cultural landscape experiences by analyzing large-scale online reviews of CH sites in Suzhou and Kyoto. To systematically address the key research questions, we interpret the findings from CH perception analysis, correlation analysis, and regression models in relation to multi-sensory experiences, visitor perceptions, and sustainable conservation strategies.

This study highlights both the specificity and universality of CH sites in sustainable urban development research98,99. Visitors’ subjective perceptions are valuable for CH conservation policies, and with the rise of online reviews, local heritage values and landscape characteristics are increasingly reflected in digital discourse. These reviews contain implicit evaluations that are widely recognized for their policymaking significance in CH protection and sustainable development100.

To analyze public perceptions, FCN and NLP offer reliable support101 and are widely applied in commercial strategy and data analysis102. However, few studies have integrated them into CH research, and sensory information in text analysis is often overlooked. This study addresses these gaps by proposing a new DL-based method for processing online reviews, demonstrating its potential for multi-sensory CH analysis and highlighting both its advantages and limitations.

To begin with, one particular aspect of cultural heritage (CH) protection lies in its unique contextual and historical characteristics. CH serves as a carrier of local culture, shaped by aesthetic, cultural, and environmental interactions7. While CH sites in Eastern cultural cities often share similar architectural and landscape styles, this study reveals significant differences in visitors’ sensory experiences and perceptions of landscape elements, challenging the assumption of uniform subjective impressions.

Photo analysis reveals that while architecture, vegetation, and sky are the most perceived elements in both cities, their distribution differs. Kyoto’s open landscapes emphasize visual spaciousness, whereas Suzhou’s compact design follows the “heaven and earth in a pot” concept, creating diverse spatial changes in a confined area, leading to differences in landscape openness.

Suzhou’s CH, originating from the Ming and Qing dynasties103, developed under abundant rainfall and an intricate water system73, making it ideal for garden construction. Many banished officials built private gardens as a retreat, incorporating rockeries and vegetation to create dynamic spatial transformations that simulated nature in confined spaces104,105. While symbolizing withdrawal from secular life, the architectural elements, plaques, and couplets still reflected their political aspirations. In contrast, public gardens—including temples and ancient streets along river systems—featured buildings, sky, and water as dominant landscape elements, offering a more open and accessible environment.

Kyoto’s CH, though influenced by China’s Sui and Tang Dynasties, evolved under Japanese geographical and cultural conditions, resulting in a distinct philosophy from Suzhou106. It preserves buildings and gardens from multiple historical periods, including temples and shogunate-era private gardens. Garden designs often simulate the sea using water or white sand and represent sacred mountains with stones, reflecting Zen and tea ceremony traditions. Unlike Suzhou, water is not an essential element in Kyoto’s courtyards; instead, vegetation, sky, sand, and gravel paths dominate, creating a minimalist, open landscape that emulates nature.

The significance analysis of textual differences indicates that visitor descriptions of cultural heritage in Suzhou are more likely to focus on cost and expenses, which may reflect a sensitivity to economic considerations. Additionally, Suzhou exhibits a higher frequency of sensory descriptions. In contrast, Kyoto features significantly more references to seasonal events and floral activities, emphasizing the dynamic and aesthetically rich experiences shaped by seasonal changes.

Subjective perceptions of CH vary due to differences in landscape elements, management practices, and cultural experience activities. Therefore, CH conservation should adopt customized strategies that consider local history, natural environment, and existing built heritage, rather than applying uniform preservation methods. This approach aligns with findings from previous research13,107.

In terms of overall perception, the multiple linear regression analysis highlights the significant role of multi-sensory experiences in shaping visitor satisfaction. Both multi-sensory engagement and visual aesthetics contribute positively to the overall perception of CH sites in Suzhou and Kyoto, reinforcing the importance of preserving architectural and natural elements.

The historical and cultural depth of CH in Suzhou and Kyoto has long been associated with multi-sensory engagement, as evidenced by ancient poems and artworks describing experiences such as listening to the rain under plantain trees, hearing temple bells at dusk, meditating in Karesansui landscapes, inhaling the fragrance of flowers, or touching natural elements like water and tree bark for restoration108. However, in modern tourism-driven experiences, rapid visits may have weakened non-visual sensory engagement, leading to a diminished role of hearing, smell, and touch in visitor perception.

The role of visual elements is particularly significant, with vegetation playing a key role in shaping overall evaluations. This aligns with research on urban green spaces, which highlights the positive psychological effects of natural scenery47,109. Some studies suggest that virtual reality (VR) or photographic representations of landscapes can replicate the relaxing effects of nature, but whether such technologies can fully substitute real-world CH landscapes remains an open question110. Beyond vision, non-visual sensory factors also play a role in perception, albeit with mixed effects. While they cannot be captured in images, they are evident in visitor descriptions111.

In Suzhou, hearing, smell-taste, and somatosensory factors negatively influence visitor perception, suggesting that issues such as noise, air quality, and thermal comfort require further improvement to enhance satisfaction. In Kyoto, the higher frequency of floral and seasonal terms suggests that flower-related experiences and seasonal aesthetics play a stronger role in shaping perceptions. Although Kyoto has fewer sensory descriptions than Suzhou, non-visual sensory elements contribute more positively to the overall perception. At the same time, because issues related to accessibility contribute the most to negative perceptions, it is essential to optimize the accessibility of various sites, reduce travel costs, and enhance accessibility equity.

In both cities, service quality and cost-related concerns negatively predict overall visitor perception, highlighting the importance of balancing accessibility, pricing, and visitor expectations. As CH sites function as public resources, facility management—including visitor capacity control, parking infrastructure, public rest areas, and maintenance—plays a crucial role in ensuring visitor satisfaction. High-quality multisensory experiences contribute to enhancing overall perception. Suzhou should focus on optimizing non-visual aspects to enrich sensory diversity and improve the visitor experience. In contrast, Kyoto has already gained stronger overall perception support through intangible sensory experiences. Future improvements could emphasize enhancing the richness of visual elements.

Based on the correlation analysis, different sensory experiences are closely linked to specific landscape elements. To preserve tangible CH, maintaining visual experiences is crucial, as it guides cultural perception and visitor engagement. Architectural landmarks and culturally significant vegetation enhance visitor interest, while the integration of natural landscapes and artificial structures strengthens visual sensory engagement.

Beyond vision, other senses also shape cultural perception, often influenced by landscape features or cultural activities. Regarding hearing, natural sounds such as birdsong and water flow are essential in shaping visitor experiences, reinforcing the perception of tranquility. However, urban noise pollution negatively impacts visitor evaluations, as confirmed by previous studies112,113.

To counter this, it is important to curate soundscapes by incorporating traditional auditory elements such as temple bells or cultural music, enhancing the cultural experience and auditory perception.

Similarly, olfactory and taste experiences are strongly tied to flowers and traditional cuisine. Seasonal flower-viewing festivals, historically significant in both Suzhou and Kyoto, remain influential in enhancing sensory engagement. These experiences stimulate multiple senses simultaneously, particularly through floral scents and birdsong. The sense of touch, primarily related to temperature and material interaction, is also a frequently mentioned factor in visitor reviews. Intangible cultural activities, such as opera appreciation, offer a comprehensive sensory experience, combining visual, auditory, and tactile engagement with CH landscapes.

The excavation and preservation of sensory experiences play a key role in strengthening subjective conservation awareness among residents and visitors. Ultimately, both tangible landscape elements and intangible cultural activities contribute to multi-sensory engagement, positively impacting visitor perception and heritage conservation114.

This study highlights the importance of integrating multisensory elements into heritage preservation, with non-visual experiences potentially exerting an even greater impact on overall perception. Mapping tourist perceptions to specific locations enables targeted optimization, and findings confirm that high-quality sensory experiences consistently enhance overall satisfaction.

However, limitations remain. Current multimodal methods still struggle to effectively align emotional information between images and text in online reviews, due to variability in visual content and the complexity of cross-modal semantics. Additionally, image segmentation lacks consistency compared to structured data sources, and the incomplete sensory lexicon may lead to omissions in textual analysis.

With rapid advancements in technology, especially in multimodal deep learning and large-scale language–vision models, future research is expected to better address the alignment between images and text. These cutting-edge technologies can more accurately understand and integrate complex semantic relationships across visual and textual data, enhancing sensory simulation and emotion-aware analysis. For example, when analyzing online reviews of cultural heritage sites, it will be possible not only to detect emotions in the text but also to combine photos, videos, and other media uploaded by visitors to deeply capture their emotional responses and experiences related to intangible cultural elements. This will offer a more holistic perspective for cultural heritage protection, fostering both sustainability and differentiated conservation strategies.