Introduction

Depression is a leading cause of disability among older adults, yet the world’s aging population continues to have limited depression screening and prevention resources1. Brief and scalable screening tools indicating which older adults are at risk for future depression, followed by interventions targeting depression-specific pathways, may be beneficial. Self-reported sleep characteristics could contribute meaningfully to such tools and interventions. Up to 90% of people with major depressive disorder report sleep-related symptoms2 and studies link sleep to underlying depression pathophysiology3. Although sleep is known to have bi-directional associations with depression, experimental evidence suggests a potential causal relationship3. Among healthy individuals, experimental manipulation of sleep increases negative affect4, and among people with depression, clinical trials demonstrate that therapeutic sleep interventions reduce depressive symptoms5.

Despite the strong links between sleep and depression, the multi-dimensional nature of sleep is a barrier to determining which self-reported sleep features should be considered in future investigations aimed at enhancing screening and prevention tools. Insomnia symptoms and diagnoses can more than double the risk for future depression3,6. The ‘insomnia with short sleep’ phenotype is reported to have a stronger association with depression than insomnia alone7, although often this phenotype is defined based on objective (rather than self-reported) sleep duration. Sleep apnea frequently co-occurs with depression and may also triple the risk of future depression8. Studies have also considered whether ‘sleep health’ characteristics (e.g., Satisfaction, Alertness/Sleepiness, Timing, Efficiency, and Duration; abbreviated ‘SATED’), either individually or in a composite score, predict depressive symptoms in older adults. For example, in older women, extreme daytime sleepiness was the strongest individual sleep health characteristic and doubled the risk for depression, but reporting ≥ 3 adverse sleep health characteristics tripled the risk for future depression9.

Another challenge in the study of sleep and depression—and health research more broadly—is that extant findings are based on studies with different populations, obscuring our ability to evaluate which findings are generalizable versus population-specific. Rates and predictors of incident depression may differ by demographic and health factors10; however, potential confounders are not consistently measured or incorporated in analyses performed across different studies. Additional sources of heterogeneity include different analytic approaches and small sample sizes. Analyses that aggregate data at the person level, rather than the study level, address many of these limitations: they allow for covariation of important individual differences, standardize analytic procedures, and boost power for moderation analysis needed to examine differential associations11.

In the current study, we analyzed data from a harmonized multi-cohort United States sample (‘US Sample’) and, separately, from the Rotterdam Study from The Netherlands (‘RS Sample’). Participants did not have clinically relevant depressive symptoms at an initial visit and were reassessed for depressive symptoms 3–6 years later. Leveraging these samples, our specific aims were to: (1) determine which individual self-reported sleep measures have the strongest associations with future clinically relevant depressive symptoms; (2) develop and validate composite sleep scores that are associated with future clinically relevant depressive symptoms; and (3) explore whether findings differ across subgroups stratified by sex, race, age and cohort.

Methods

Cohorts and analytic samples

We harmonized and aggregated four cohorts of older adults from the US: the Study of Osteoporotic Fractures12 (SOF), the Osteoporotic Fractures in Men Study13 (MrOS), the Memory and Aging Study14 (MAP), and the Minority Aging Research Study15 (MARS). We also developed a Rotterdam Study16 (RS) sample, harmonized to have identical variable names and coding as the US sample. RS and US samples were kept separate because current European data sharing laws generally prohibit RS from being directly analyzed within US-based institutions. Our harmonization procedures across these cohorts were reported previously17. Briefly, content experts were given a list of all items and then used an iterative process to group them into conceptual domains and subdomains. Within (sub)domains, we selected up to one item from each cohort and then recoded the items to be comparable across cohorts. Finally, experts not involved in the harmonization process rated the perceived harmonizability of the sleep items.

To select the initial visit and accompanying follow-up visit for MrOS, SOF, and RS, we balanced several factors including the follow-up interval length, cohort age, data availability, and harmonizability. These considerations resulted in initial visits and follow-up visits that were 3–6 years apart. Because MAP and MARS have annual assessments, for each participant we selected the initial visit as the first observation year with complete sleep data, and the follow-up visit as the observation year that was closest to, but not beyond, 6 years from the selected initial visit.
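The follow-up selection rule for cohorts with annual assessments can be sketched as follows. This is a minimal illustration, not the study's code: the function name `select_follow_up` and the representation of visits as years since enrollment are assumptions made for the example.

```python
from typing import Optional, Sequence


def select_follow_up(initial_year: float,
                     visit_years: Sequence[float],
                     max_gap: float = 6.0) -> Optional[float]:
    """Pick the visit closest to, but not beyond, `max_gap` years after the initial visit.

    Hypothetical helper: visits are given as numeric years (an assumption).
    Returns None when no later visit falls inside the window.
    """
    # Keep only visits strictly after the initial visit and within the window.
    candidates = [y for y in visit_years
                  if initial_year < y <= initial_year + max_gap]
    # The latest eligible visit is the one closest to the 6-year mark.
    return max(candidates) if candidates else None


# A participant whose initial visit is at year 2 and who has annual visits through year 10.
print(select_follow_up(2, range(0, 11)))
```

In this example the visit at year 8 is chosen, because year 9 would exceed the 6-year limit.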

Inclusion criteria for the US and RS analytic samples were: (1) age ≥ 60; (2) no known or suspected dementia at the initial visit; (3) complete data on self-report sleep characteristics, depressive symptoms, and antidepressant use at the initial visit (i.e., no missing data related to sleep or the outcomes of interest); (4) complete data for depressive symptoms at the follow-up visit; and (5) no clinically relevant depressive symptoms at the initial visit. All participants provided informed consent and Institutional Review Boards at respective institutions approved each study.

Further cohort information and details of how inclusion/exclusion criteria were defined in each cohort are provided in Supplementary Tables 1–2. A flow chart depicting sample derivations is provided in Supplementary Fig. 1.

Measures

Outcome

‘Clinically relevant depressive symptoms’ was selected as our primary outcome because it is interpretable, clinically relevant and harmonizable. Three depression scales were used across the cohorts: the 10-item Center for Epidemiologic Studies Depression Scale18 (CESD-10; MAP and MARS), the 20-item CESD19 (RS), and the 15-item Geriatric Depression Scale (GDS)20 (MrOS and SOF). Based on prior studies’ determinations of optimal cutoffs for the diagnosis of major depressive disorder, the presence of clinically relevant depressive symptoms at follow-up was indicated by a GDS score ≥ 6 (ref. 20), a CESD-10 score ≥ 4 (ref. 21) or a 20-item CESD score ≥ 16 (ref. 19).
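The scale-specific dichotomization described above can be expressed as a small lookup. The sketch below uses the three cutoffs stated in the text; the scale labels and the function name are illustrative choices, not identifiers from the study's code.

```python
# Cutoffs taken from the text; the dictionary keys are illustrative labels.
CUTOFFS = {
    "GDS-15": 6,    # MrOS and SOF
    "CESD-10": 4,   # MAP and MARS
    "CESD-20": 16,  # RS
}


def clinically_relevant(scale: str, score: int) -> bool:
    """Return True when the score meets or exceeds the cutoff for its scale."""
    return score >= CUTOFFS[scale]


print(clinically_relevant("GDS-15", 6))    # meets the cutoff
print(clinically_relevant("CESD-20", 15))  # just below the cutoff
```

Using one binary indicator per participant is what allows cohorts with different scales to share a single outcome variable.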

Self-reported sleep

We selected nine harmonized self-reported sleep health and sleep disorder symptom characteristics that were potentially relevant for future depressive symptoms: ‘sleep quality’, ‘daytime symptoms’, ‘midpoint timing’, ‘sleep efficiency’, ‘sleep duration’, ‘difficulty falling asleep’, ‘difficulty staying asleep’, ‘frequency of snoring’ and ‘frequency of stopping breathing during sleep’. All self-reported sleep characteristics were rigorously harmonized across cohorts and previously judged to have high harmonizability, except ‘daytime symptoms’, which had moderate harmonizability17. The ‘daytime symptoms’ sleep characteristic was more heterogeneously measured, with cohort items inquiring about tiredness and fatigue, difficulty staying awake, and sleep problems hindering daytime activities. Because different cohorts administered these different items, we did not have information on the degree to which participants viewed these various daytime symptoms as being similar; thus, we take a conservative approach and refer to them only as daytime symptoms. For interpretability and comparability, we categorized each sleep characteristic based on potentially adverse levels for older adults. Whenever possible, definitions were based on pre-existing and published cutoffs22. Otherwise, we used the frequency distributions and clinical content to indicate potentially adverse levels of sleep characteristics. See Table 1 for definitions of harmonized sleep measures and their cut-offs and Supplementary Table 3 for the original cohort item wording of each item.

Table 1 Individual and composite sleep features.

We considered five self-reported sleep composite scores derived from the combinations of the nine categorical sleep characteristics: ‘All-Unweighted’, ‘All-Weighted’, ‘SATED’, ‘Selected’, and ‘Insomnia with Short Sleep’. The All-Unweighted and All-Weighted scores incorporate the full set of sleep features, with the former a simple sum of sleep indicators and the latter a sum incorporating weights derived in the other (external) sample (i.e., weights for the US score derived in RS, and vice versa). The Selected score was derived to offer a composite score requiring only the sleep items with the strongest associations with depression, selected based on model results from the other sample. SATED and Insomnia with Short Sleep are composite sleep indices frequently considered in the sleep literature7,9. Composite score definitions and derivations are provided in Table 1, with technical details in Supplementary Text 1 and Supplementary Table 5.
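To make the difference between the summed scores concrete, the sketch below computes an unweighted sum, a weighted sum, and a two-item score from binary indicators. Several things here are assumptions for illustration only: the item names, the 0/1 coding of each characteristic, the example weights (the actual weights are described in the supplementary material and are not reproduced here), and the two items used for the Selected score, which are inferred from the Discussion.

```python
# Illustrative names for the nine harmonized sleep characteristics (1 = adverse level).
SLEEP_ITEMS = (
    "poor_quality", "daytime_symptoms", "late_midpoint", "low_efficiency",
    "short_duration", "difficulty_falling_asleep", "difficulty_staying_asleep",
    "snoring", "stopping_breathing",
)
# Assumed membership of the two-item score, inferred from the Discussion.
SELECTED_ITEMS = ("difficulty_falling_asleep", "daytime_symptoms")


def unweighted_score(indicators: dict) -> int:
    """Simple count of adverse sleep characteristics."""
    return sum(indicators[item] for item in SLEEP_ITEMS)


def weighted_score(indicators: dict, weights: dict) -> float:
    """Sum of indicators, each multiplied by a weight estimated in the other sample."""
    return sum(weights[item] * indicators[item] for item in SLEEP_ITEMS)


def selected_score(indicators: dict) -> int:
    """Count of adverse levels among the selected items only."""
    return sum(indicators[item] for item in SELECTED_ITEMS)


# Hypothetical participant with three adverse characteristics.
person = {item: 0 for item in SLEEP_ITEMS}
person.update(daytime_symptoms=1, difficulty_falling_asleep=1, snoring=1)

# Hypothetical weights; real weights would come from the external sample.
example_weights = {item: 0.5 for item in SLEEP_ITEMS}

print(unweighted_score(person), weighted_score(person, example_weights), selected_score(person))
```

Deriving weights in one sample and applying them in the other keeps the weighted score from being tuned to the data on which it is evaluated.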

Covariates

Education, marital status, cohort (US only), sex, race (US only), age, and follow-up time were considered potential confounders and included in all models. These are denoted as base covariates. Smoking status, alcohol use, body mass index, number of physical health comorbidities (considering stroke, thyroid disease, heart attack/congestive heart failure, hypertension, diabetes), use of sedating medications, and use of non-sedating antidepressant medications were hypothesized to be either confounders or theoretical mediators10. As the design of our study does not allow for formal differentiation of confounders versus mediators, we refrain from distinguishing their role and instead refer to them as secondary covariates.

While a variety of medications can have subtle effects on sleep, we focused on those with well-described and consistent effects on sleep and/or mood when defining specific medications to be included in the ‘sedating’ and ‘non-sedating antidepressant’ categories. Sedating medications included any tricyclic antidepressant, mirtazapine, nefazodone, non-benzodiazepine non-barbiturate sedative hypnotic medications, or trazodone, coded from individual lists of drugs. Non-sedating antidepressant medications included any antidepressant except tricyclic antidepressants, mirtazapine, or nefazodone. In the US cohort, medications were collected via visual assessment of medication containers brought to the visit. In the RS cohort, medication use was determined based on pharmacy dispensary records.

Four covariates in the US sample each had < 0.5% missing data. Covariate missingness in RS was < 0.5%, except for education (1.37%) and BMI (2.5%). Missing covariate values were imputed using the missForest package in R. Further details of categorization and measurement of all covariates are provided in Table 2 and Supplementary Table 4.

Table 2 Sample characteristics for the harmonized U.S. sample and the RS sample.

Statistical analyses

Variables in the US and RS cohorts were harmonized so that identical code could be directly applied to each dataset, except for RS code excluding indicators of race and cohort. For all analyses, code was originally developed at the University of Pittsburgh for the US cohort and then sent to Erasmus MC University Medical Center to be run independently on the RS cohort. In preliminary analyses, we used descriptive statistics to assess sleep and covariate distributions in full and stratified samples and explored Spearman correlations among sleep indicators. Across all analyses, our focus was on interpreting effect sizes and 95% confidence intervals for inference. When examining sleep and depression associations, we considered a Risk Ratio (RR) of 1.86 as a guiderail to indicate a potentially moderate effect size23. To account for multiple comparisons, we applied Benjamini-Hochberg corrections across tests within each of our three aims to underscore the most robust findings. RStudio 2023.09.1 was used for all analyses.
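For readers unfamiliar with the Benjamini-Hochberg procedure, the following sketch computes adjusted p-values from a list of raw p-values. It is a generic textbook implementation written for illustration, not the code used in the study, and the example p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values for four tests within one aim.
raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 3) for p in benjamini_hochberg(raw)])
```

A test is then retained when its adjusted p-value falls below the chosen threshold; applying the correction separately within each aim treats each aim as its own family of tests.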

Aims 1 and 2: individual sleep characteristics and composite scores

We used generalized linear models with a log link (i.e., Poisson regression) and robust standard errors to regress clinically relevant depressive symptoms at follow-up on each individual sleep feature (Aim 1) and each composite sleep score (Aim 2), adjusting for base covariates. This approach produces RR estimates that are interpretable and robust for low-incidence outcomes. A separate model was fit for each sleep characteristic or score.
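As a simplified illustration of what the RR represents, consider the special case of one binary sleep indicator and no covariates. In that case the exponentiated coefficient from a log-link model equals the ratio of the observed risks in the two groups. The sketch below computes that ratio with a large-sample confidence interval on the log scale. It deliberately omits the covariate adjustment used in the study, and the counts are hypothetical.

```python
import math


def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total,
               z=1.959964):
    """Unadjusted risk ratio with an approximate 95% confidence interval.

    Simplified stand-in for the adjusted models in the text: no covariates,
    one binary exposure. `z` is the normal quantile for a 95% interval.
    """
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    rr = risk_exposed / risk_unexposed
    # Standard large-sample standard error of log(RR).
    se_log_rr = math.sqrt(1 / exposed_cases - 1 / exposed_total
                          + 1 / unexposed_cases - 1 / unexposed_total)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts: 30/200 cases with the sleep feature, 50/1000 without.
rr, lower, upper = risk_ratio(30, 200, 50, 1000)
print(f"RR = {rr:.2f} (95% CI {lower:.2f}, {upper:.2f})")
```

With covariates in the model, the estimate is no longer a simple ratio of raw risks, which is why a regression framework is needed for the adjusted analyses.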

For each individual sleep item and composite score, we used contrasts to estimate RRs across the observed range, with the reference group set at zero (optimal sleep health based on that characteristic or score). Given our interest in health screening and reducing false negatives, we examined the sensitivity associated with each model in Aims 1 and 2 plus a model including only base covariates. However, for completeness, we also secondarily report specificity and accuracy. When computing these performance metrics, we indicated a person was ‘positive’ for future clinically relevant depressive symptoms if their predicted probability was > 0.07, the average rate of future clinically relevant depressive symptoms across US and RS samples.
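The performance metrics described above can be sketched as follows, using the 0.07 threshold stated in the text. The function name and the example probabilities are hypothetical, and the sketch assumes that both outcome classes are present in the data.

```python
def screening_metrics(predicted_probs, outcomes, threshold=0.07):
    """Sensitivity, specificity, and accuracy for a probability threshold.

    A person is flagged 'positive' when the predicted probability is strictly
    greater than the threshold. Assumes both outcome classes occur.
    """
    tp = fp = tn = fn = 0
    for prob, outcome in zip(predicted_probs, outcomes):
        flagged = prob > threshold
        if flagged and outcome == 1:
            tp += 1
        elif flagged and outcome == 0:
            fp += 1
        elif not flagged and outcome == 0:
            tn += 1
        else:
            fn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical predicted probabilities and observed outcomes for six people.
probs = [0.10, 0.05, 0.08, 0.02, 0.07, 0.20]
observed = [1, 1, 0, 0, 0, 1]
print(screening_metrics(probs, observed))
```

Setting the threshold at the average outcome rate, rather than at 0.5, matters for a low-incidence outcome: almost no one would be flagged otherwise, and sensitivity would be close to zero.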

We performed three sets of sensitivity analyses for Aims 1 and 2: (1) allowing for new use of any antidepressant medication to count as incident depressive symptoms, and thus excluding participants using sedating or non-sedating antidepressants at the initial visit; (2) removing the sleep item from the CESD-10 and CESD-20 (the GDS does not include a sleep item); and (3) adding secondary covariates to the models to assess whether effect sizes remained consistent.

Exploratory aim 3: moderation

Using Poisson regression, we explored whether sex, age, race (US only), and cohort (US only) moderated the association of individual and composite sleep measures with future clinically relevant depressive symptoms. These analyses are exploratory because we generally did not have specific hypotheses about which demographic profiles would have better or worse sleep health for each sleep score or sleep characteristic. We also ran models within samples stratified by age, sex, race (US only), and cohort (US only), adjusting for base covariates.
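The structure of a moderation model can be written out as below. The notation is introduced here for illustration and is not taken from the source: $Y_i$ is the depressive symptom indicator at follow-up, $S_i$ a binary sleep characteristic, $M_i$ a binary moderator (for example, sex), and $Z_i$ the vector of remaining base covariates.

```latex
\log \mathrm{E}[Y_i] = \beta_0 + \beta_1 S_i + \beta_2 M_i + \beta_3 S_i M_i + \gamma^{\top} Z_i

\mathrm{RR}_{S \mid M=0} = e^{\beta_1}, \qquad
\mathrm{RR}_{S \mid M=1} = e^{\beta_1 + \beta_3}, \qquad
\frac{\mathrm{RR}_{S \mid M=1}}{\mathrm{RR}_{S \mid M=0}} = e^{\beta_3}
```

Under this formulation, the exponentiated interaction coefficient is a ratio of risk ratios, so a value near 1 indicates that the sleep characteristic has a similar association in both groups. Estimates from separately stratified models need not reproduce this ratio exactly, because each stratum then receives its own covariate coefficients.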

Results

Preliminary descriptive analyses

The US sample is older than the RS sample by about 10 years. In the US sample, 15% of people identify as Black or African American, while RS primarily includes people of Dutch ancestry without self-reported race. Relative to RS, US participants have a shorter median follow-up time and higher prevalences of daytime symptoms, low sleep efficiency, difficulty staying asleep, and stopping breathing during sleep. In contrast, RS participants have greater prevalences of late timing and snoring. We observed low-to-moderate correlations among individual sleep characteristics (median [Q1, Q3] magnitude = 0.07 [0.03, 0.48] in RS; 0.05 [0.02, 0.12] in US). Tables 2 and 3 show full clinical, sociodemographic and sleep characteristics for the US and RS samples. Supplementary Tables 6–11 include demographic and sleep characteristics stratified by age, sex, race (US only) and cohort (US only).

Table 3 Self-report sleep measures in US sample and stratified samples.

Aims 1 and 2: individual sleep characteristics and composite scores

Figure 1 provides RRs (95% CIs) for all individual sleep characteristics in the RS and US samples. In both the US and RS samples, difficulty falling asleep and daytime symptoms had moderate associations with future clinically relevant depressive symptoms. Poor quality, short duration, and difficulty staying asleep were also associated with future clinically relevant depressive symptoms in both samples, but with smaller effect sizes. In the US sample only, stopping breathing during sleep and late midpoint increased risk for future clinically relevant depressive symptoms.

Fig. 1

Risk Ratios (RRs) and 95% confidence intervals for self-reported sleep features in the United States (US) and Rotterdam (RS) samples. Models are adjusted for education, marital status, cohort (US only), sex, race (US only), age, and follow-up time.

All five composite sleep scores were associated with future clinically relevant depressive symptoms after adjusting for multiple comparisons; however, only four of these scores (Selected, SATED, All-Weighted, and All-Unweighted) had observed levels that tripled risk in either the US or RS sample (Fig. 2). Compared to participants endorsing zero items, endorsing two out of two (2/2) items in the Selected score was associated with a RR (95% CI) of 3.32 (2.33, 4.73) in the US sample and 3.19 (2.11, 4.82) in the RS sample. Endorsing 5/5 items on the SATED score was associated with a RR (95% CI) of 3.33 (2.02, 5.49) in the US sample and 2.65 (1.61, 4.38) in the RS sample. Endorsing 7/9 items on the All-Unweighted score was associated with a RR (95% CI) of 4.01 (2.50, 6.43) in the US sample and 3.16 (1.97, 5.06) in the RS sample. A value of 7/9 on the All-Weighted score was associated with a RR (95% CI) of 3.35 (2.26, 4.98) in the US sample and 3.10 (2.05, 4.69) in the RS sample. The percentages of participants endorsing these levels in the US and RS samples are provided in Table 3. The Selected score had the highest sensitivity for predicting future clinically relevant depressive symptoms (0.724 and 0.590 in US and RS, respectively), followed closely by difficulty falling asleep (0.712 and 0.586). Adding either the Selected score or difficulty falling asleep to the base model increased sensitivity by 11–12%. All composite score contrasts and performance results are provided in Supplementary Tables 12–13.

Fig. 2

Risk ratios (reference group = 0) and 95% confidence intervals across the range of possible values for each composite score in the Rotterdam (RS) and United States (US) samples. All composite scores passed multiple comparison correction. Models are adjusted for education, marital status, cohort (US only), sex, race (US only), age, and follow-up time.

Across all three planned sets of sensitivity analyses, difficulty falling asleep and daytime symptoms remained strong individual sleep characteristics. Poor quality and short sleep duration were also repeatedly identified as predictive individual sleep characteristics. When considering composite scores, the Selected score had the strongest effect size while requiring the fewest items, as endorsing both items consistently produced RRs > 3.0, while other scores required endorsing at least five items to reach the same effect size. Full sensitivity analysis results are provided in Supplementary Figs. 2–7.

Exploratory aim 3: moderation

The only formal interaction passing multiple comparison corrections was between sex and stopping breathing in the US sample (RR [95% CI] 4.23 [1.95, 9.16] for sex-by-stopping breathing interaction). Accordingly, in stratified US analyses we observed a large effect of stopping breathing during sleep in women (RR [95% CI] 4.77 [2.73, 8.35]) but no association in men (1.23 [0.73, 2.10]). The All-Unweighted composite score (which includes stopping breathing) was slightly stronger among women than men in the US sample, although this interaction was not statistically significant (RR [95% CI] 1.25 [1.14, 1.36] in women; 1.19 [1.08, 1.30] in men). In RS, a small effect size for stopping breathing during sleep and future clinically relevant depressive symptoms was observed in women (1.52 [0.87, 2.64]) but not in men (0.94 [0.55, 1.61]), although this difference was not statistically significant. See Fig. 3 for RRs (95% CIs) of individual sleep characteristics by sex, and Supplementary Fig. 12 for RRs (95% CIs) of composite sleep scores by sex.

Fig. 3

Association of sleep features and clinically relevant depressive symptoms, stratified by sex. The bracket and asterisk indicate a significant interaction term between sex and stopping breathing during sleep. Models are adjusted for education, marital status, cohort (US only), race (US only), age, and follow-up time.

Although we did not observe any statistically significant sleep-by-race interactions, four of the five SATED sleep health components (daytime symptoms, poor quality, short duration, and late midpoint) were associated with future clinically relevant depressive symptoms among White US participants but not among Black US participants. Accordingly, the SATED score was related to future depressive symptoms among White participants (RR [95% CI] 1.33 [1.19, 1.47]) but not among Black participants (RR [95% CI] 1.06 [0.85, 1.33]). Conversely, stopping breathing was strongly associated with future clinically relevant depressive symptoms among Black participants (RR [95% CI] 4.19 [1.84, 9.54]) but not among White participants (RR [95% CI] 1.50 [0.93, 2.43]). Plots illustrating results for all stratified analyses are provided in Supplementary Figs. 8–14.

Discussion

In large samples of non-depressed older adults from the United States and the Netherlands, we investigated nine self-reported sleep features, along with five specific combinations of these features, to determine which have the strongest associations with future clinically relevant depressive symptoms 3–6 years later. Difficulty falling asleep and sleep-related daytime symptoms had the largest effect sizes across samples. Older adults endorsing either of these features had approximately double the risk of future depressive symptoms, while those endorsing both features had approximately triple the risk.

Difficulty falling asleep and sleep-related daytime symptoms have each previously emerged as predictors of depression6. Their combination overlaps with the clinical diagnosis of insomnia, which is indicated by distress related to falling asleep, staying asleep, or early morning awakenings, with resulting daytime impairments. These daytime impairments can manifest heterogeneously and span a wide range of symptoms, including fatigue, impaired performance, mood disturbance, reduced motivation, and sleepiness24. Although the two features we identified are components of the insomnia diagnosis, other common insomnia symptoms including low sleep efficiency and difficulty staying asleep did not consistently emerge as predictors of future depressive symptoms. The specific combination of difficulty falling asleep and daytime symptoms may signify an insomnia subtype characterized by a vicious cycle of problems. Difficulty falling asleep at night may result in daytime sleepiness or fatigue, which in turn can reinforce difficulty falling asleep the subsequent night. Circular symptoms such as these may represent a form of conditioned arousal at bedtime that, in combination with sleep difficulty, could lead to affective dysregulation and depression25. This insomnia subtype should be examined in future studies to assess its potential predictive value for other health outcomes and other populations. Interventions such as Trans-C26, which considers a wider spectrum of sleep health problems beyond standard insomnia symptoms, may be beneficial for stopping this vicious sleep cycle and potentially preventing future depression, although further research is warranted.

Our large and heterogeneous sample allowed us to examine which findings were most similar or different across demographic strata. Difficulty falling asleep was among the most robust features identified, as it was associated with clinically relevant depressive symptoms across nearly all stratified samples, as well as all three sensitivity analyses. Conversely, stopping breathing during sleep showed much stronger associations with future depressive symptoms among women and Black participants than among men and White participants. Although the gold standard for diagnosis of sleep apnea is polysomnography, stopping breathing during sleep is a frequent clinical complaint related to sleep apnea. Previous studies have suggested that women with sleep apnea are at higher risk of future depression27, but few studies have directly examined the role of race in the relationship between sleep apnea and depression in older adults. However, among Black older adults, sleep apnea enhances vulnerability to other adverse health outcomes, including stroke28,29. As we do not consistently have sleep apnea characterized with polysomnography across cohorts, we could not examine this relationship in more detail; however, such analyses will be an important future direction.

Our investigation of composite sleep scores suggested that the combination of daytime symptoms and difficulty falling asleep may be an efficient and scalable way to identify people at risk for depression, especially in the US sample. Adding seven additional sleep features (for a total of nine features in the All-Unweighted score) provided some added benefit in effect sizes for those with very high scores, although in our sample ≤ 1% endorsed these higher levels. In a sample with poorer sleep health, it is possible that there may be a greater benefit to adding these additional questions. It is noteworthy that the weighted version of the overall score did not perform better than a simple sum of the items. However, studies with other outcomes, features and samples may reach different conclusions regarding the value of weighted versus unweighted composite scores. It is also likely that using machine learning models accounting for complex interactions among features could further enhance effect sizes, as we have shown previously using random forests30.

Major strengths of our study include the large sample sizes, analyses incorporating multiple demographically distinct cohorts, self-reported items that can be incorporated into clinical care, comprehensive variable harmonization, rigorous external validation to enhance generalizability outside the specific data considered here, and careful adjustment for potential confounders. Limitations are related to the observational nature of the study and some differences in the wording of the ‘daytime symptoms’ item across the US cohorts and the RS, with cohort items differentially emphasizing energy, enthusiasm, or fatigue. This heterogeneity in ‘daytime symptoms’ items presents a challenge to the field of sleep more broadly, extending beyond our own work17. Along these same lines, different cohorts used different depression scales, limiting our ability to create a rigorous and harmonized continuous depression measure. Instead, we used a dichotomous depression measure, which is less sensitive than a continuous measure and which precluded adjustment for subthreshold levels of depression at the initial sleep visit. Despite these cohort differences, findings were relatively consistent across cohorts. Finally, although our sample does not include people with clinically relevant depressive symptoms, we cannot rule out the possibility that the sleep characteristics we identified represent a prodromal or subthreshold form of depression.

Our work lays the foundation for several future directions. Investigating the mechanisms linking difficulty falling asleep and daytime symptoms with future depression is an important next step. Plausible pathophysiological pathways include inflammatory cytokines, neural circuit dysfunction, and neurochemical pathways related to monoamines, orexins and other neurotransmitters3. Further characterizing older adults based on actigraphy and polysomnography may add depth to our findings, as these objective measures inherently quantify different aspects of sleep than subjective self-report. However, self-reported sleep in particular may provide useful insight into depression risk precisely because its subjective nature reflects subtle aspects of sleep that cannot be detected with physiological and behavioral measures31. For instance, self-report can reflect affective and cognitive arousal in a way that wrist movements do not. Finally, future studies must assess whether early identification (e.g., using the combination of difficulty falling asleep and daytime symptoms), followed by sleep treatment to address these challenges, can reduce the risk of future depression onset in older adults.