Background & Summary

TAR DNA-binding protein 43 (TDP-43) is a ribonucleoprotein implicated in several neurodegenerative disorders (NDDs)1. Primarily localized in the nucleus under physiological conditions, TDP-43 is a critical regulator of RNA transcription, splicing, stability, and metabolism2,3,4,5. However, in several NDDs, TDP-43 undergoes a pathological shift in its localization to the cytoplasm6,7,8,9. Cytoplasmic aggregates of TDP-43, which can be phosphorylated and ubiquitinated, are observed in the brain and spinal cord in various NDDs, such as amyotrophic lateral sclerosis (ALS), frontotemporal lobar degeneration (FTLD), Alzheimer’s disease (AD) and Parkinson disease8,10. Notably, roughly 97% of ALS and 45% of FTLD patients show neuronal and glial inclusions containing TDP-43 with significant nuclear clearance of the protein5.

While emerging evidence supports a role of TDP-43 in RNA processes, understanding the mechanisms by which it contributes to neurodegeneration is complex. The TARDBP gene, responsible for encoding the TDP-43 protein, harbours over 50 pathogenic mutations associated with both sporadic and familial forms of ALS and FTLD4,11,12. These mutations predominantly cluster in the amyloidogenic C-terminal glycine-rich region of the protein, rendering TDP-43 prone to forming amyloid-like fibrils thus enhancing its propensity for aggregation5. One such mutation commonly found in ALS patients is the TDP-43M337V mutation where Methionine 337 is substituted by Valine13. This mutation promotes TDP-43 cytoplasmic mislocalization and disrupts TDP-43 regulated RNA processes ultimately leading to widespread transcriptomic dysregulation5,12,14. Notably, several of these changes overlap with those observed in TDP-43 knockdown models, suggesting a shared mechanism of RNA dysregulation5,11.

While the evidence indicates broad transcriptomic changes in various neuronal models after TDP-43 depletion and mutant expression, a parallel comparison of TDP-43 depletion and mutant conditions in the same cellular model is unprecedented. Furthermore, most transcriptomic studies in TDP-43 models have traditionally focused on either the coding or non-coding regions of the transcriptome, with limited consideration of how changes in one affect the other. Here, employing a library preparation method that captures both coding and non-coding fractions of RNA, we have generated comprehensive transcriptomic datasets following TDP-43 knockdown and mutant TDP-43M337V expression, as well as their respective controls, in NSC34 motor neuron-like cells. These datasets are a valuable resource for elucidating how TDP-43 loss of function or mutation can perturb the coding and non-coding transcriptome in a commonly used cellular model of motor neurons. Moreover, it facilitates the comprehension of how alterations in the non-coding transcriptome impact the coding transcriptome, and vice versa.

Methods

Cell culture

NSC34 motor neuron-like cells (Tebubio: CLU140-A) were cultured in DMEM culture medium (Sigma-Aldrich: D5796) supplemented with 10% Serum (Gibco: 10500064) and 1% Penicillin-Streptomycin (Gibco: 15140122) at 37 °C in a humidified incubator with 5% CO2. Twenty-four hours before transfection, cells were seeded to a confluency of 60% in a 6-well plate.

TDP-43 Knockdown in NSC34 motor neurons

TDP-43 expression was depleted in NSC34 motor neurons using RNA interference (Fig. 1a). NSC34 motor neurons were transfected with two TDP-43 siRNAs: Mm_Tardp 1 siRNA (AACGATGAACCCATTGAAATA, Qiagen: SI01441503|S2) and Mm_Tardp 2 siRNA (CAGTAACATGGTAACATTAAA, Qiagen: SI01441510|S2) as the TDP-43 knockdown group (TDP-43-KD, n = 8). The non-silencing (NS) siRNA (Qiagen: 1027281) was used as a control (n = 8). On the day of transfection, Lipofectamine RNAiMAX reagent(Invitrogen™:13778075) and siRNAs were individually diluted in Opti-MEM® Medium. These diluted components were then combined at a 1:1 ratio and incubated for 10 minutes to form the siRNA-lipid complex. The resulting complex was added to fresh media to achieve a final siRNA concentration of 20 nM. Subsequently, cells were incubated under standard conditions for 6 hours, after which their medium was replaced with fresh growth medium. Cells were then further incubated under standard conditions for up to 24 hours before proceeding to RNA extraction.

Fig. 1
figure 1

RNA-Seq workflow and analysis pipeline. (a) Steps for RNA-seq library preparation, including RNA extraction, quality checks, and library construction for both experimental settings: TDP-43 Knockdown (KD) with its respective control (NS), and TDP-43 mutant expression (MVT) with its respective control (WTT). (b) Bioinformatics pipeline for processing raw FASTQ data, from quality trimming to gene expression quantification, including the software tools used. (c) Workflow for read normalization, sample quality control, and gene biotype annotation as reported in this manuscript.

TDP-43 wildtype and mutant synthesis and plasmid expression in NSC34 motor neurons

The open reading frame of wild-type TDP-43 and mutant TDP-43M337V were synthesized and cloned into the Express Cloning Vector pcDNA3.1 + C-eGFP generating the Cloning vectors: TDP-43_WTT_pcDNA3.1( + )-C-eGFP plasmid (Genescript: SC1691) and TDP-43_MVT_pcDNA3.1( + )-C-eGFP (Genescript: SC1693), from here on referred to as WTT (Wild type TDP-43), and MVT (mutant TDP-43), respectively. Following synthesis and verification, WTT and MVT plasmids were transformed into E. coli and grown overnight. Three single colonies for each plasmid were selected, cultured in LB media with ampicillin, and subjected to mini-prep extraction using the Zyppy™ Plasmid Miniprep Kit (Zymo Research: D4036). The extracted plasmids were assessed for DNA concentration and purity using a Nanodrop Spectrophotometer and then analysed using long-read Sanger sequencing. Sequencing results were aligned with the original sequences using SnapGene for verification. Verified colonies were used to generate maxi-preps, ensuring high concentrations of the plasmid vector with the desired DNA sequences. For transfection, NSC34 cells were seeded to 60% confluency in 6-well plates and then transfected with WTT (control, n = 8) and MVT (mutant TDP-43, n = 8) using Lipofectamine™ 3000 Transfection Reagent (Invitrogen: L3000001) diluted in Opti-MEM. Plasmid DNA plus P300 reagents were also diluted in Opti-MEM. The diluted DNA and transfection reagent were mixed and added to fresh media. After incubation under normal conditions for up to 48 hours, transfection efficiency was assessed by observing green fluorescence before proceeding to RNA extraction.

RNA extraction and quality assessment

The culture media was removed and the cells were washed with ice-cold PBS. 1 ml of Trizol (Life Technologies: 15596018) was then added to each well, and plates were shaken for 5 minutes at room temperature on a rotary shaker at 300 RPM. The cell lysate was subsequently pipetted multiple times, transferred to a 1.5 ml centrifuge tube, and vigorously vortexed. After a 5-minute incubation at room temperature, 200 µl of chloroform was added to the lysate, vortexed, and incubated for 3 minutes at room temperature. Following centrifugation at 12,000 g for 15 minutes at 4 °C, 400 µl of the upper aqueous phase was transferred to a fresh 1.5 ml centrifuge tube, and an equal volume of chloroform was added. After another centrifugation at 12,000 g for 15 minutes at 4 °C, 400 µl of the upper aqueous phase was transferred to another fresh 1.5 ml centrifuge tube and 500 µl of isopropyl alcohol was added to the samples, mixed, and incubated at room temperature for 10 minutes. The samples were then centrifuged at 12,000 g for 10 minutes at 4 °C. After removal of the supernatant, the RNA pellet was washed twice with 1 ml 70% ethanol, followed by centrifugation at 7,500 g for 5 minutes. The pellet was then air-dried for 30 minutes, and the RNA resuspended in 30 µl DEPC-treated water. RNA quantification was performed using a Nanodrop Spectrophotometer. The integrity of total RNA was assessed using the Agilent 2100 Bioanalyzer with the RNA 6000 Nano Kit (Agilent Technologies, Ltd.). Libraries were exclusively prepared for samples with an RNA Integrity Number (RIN) greater than 9.0 (Fig. 2).

Fig. 2
figure 2

RNA quality assessment. RNA quality was evaluated across all experimental conditions, including non-silencing control (NS), TDP-43 knockdown (TDP43-KD), wild-type TDP-43 expression (WTT), and mutant TDP-43M337V expression (MVT). Electropherograms display characteristic ribosomal RNA peaks (18S and 28S), and the corresponding RNA Integrity Number (RIN) values are reported for each sample. All samples exhibit RIN values above 9, confirming high RNA integrity and suitability for RNA-seq library preparation.

Library preparation and RNA-sequencing

Library preparation was performed using the SEQuoia Complete Stranded RNA Library Prep Kit (Bio-Rad, cat. No. 17005726) according to the manufacturer’s protocol. Ribosomal depletion was achieved using the SEQuoia Ribodepletion Kit (Bio-Rad, cat. no. 17006487) as per the manufacturer’s instructions. Libraries were then constructed from 500 ng of total RNA. The procedure involved RNA fragmentation, end repair, addition of adenosines to the 3′ ends, and adapter addition with cDNA synthesis. Subsequently, cDNA fragments containing adapters were amplified by PCR. Library quality assessment was performed using the Agilent 2100 Bioanalyzer with the Agilent DNA High Sensitivity chip (Agilent Technologies, Ltd.), with a mean library size of 300 bp. Quantification of libraries was conducted using a Quantus fluorometer and QuantiFluor double-stranded DNA System (Promega, Madison, Wisconsin, USA). Sequencing was carried out as single-end 1 × 75 bp reads on the NovaSeq 6000 platform (Illumina, San Diego, USA) to generate approximately 50 million reads per library.

Quality control of sequencing reads

Raw data files - Binary Base Call (BCL) files generated by the Illumina sequencer (NovaSeq 6000) were converted using Bcl2Fastq tool (v2.20.0.422)15 to the text-based sequence file format that stores both raw sequence data and quality scores (FASTQ). The preprocessing of FASTQ files was performed using SeqSense NGS Data Analysis Toolkit from Bio-Rad (version 1.0)16 with the default parameters and with the option–skipUMI to skip the deduplication step. All the tools used in the implemented pipeline are presented in Fig. 1. The RNA-seq data were subjected to a series of quality control steps to ensure high-quality reads for downstream analysis. Initially, the raw FASTQ files were assessed using FASTQC (version 0.11.7)17 to evaluate the quality of the sequencing reads. Parameters such as per base sequence quality, GC content, and adapter content were inspected to identify potential issues (Fig. 3). Next, adapter sequences and low-quality bases were trimmed using Cutadapt (version 2.3)18 with Python 3.6.9. The following parameters were used: a minimum quality score of 0, a minimum read length of 15 bases, with an adapter sequence consisting of 10 repetitions of adenine to be trimmed, and the first nucleotide from the reads to be removed. This step removed residual adapter contamination and improved the overall quality of the reads (Fig. 3).

Fig. 3
figure 3

Overview of sequencing quality control. The left plots correspond to the KD vs. NS experiment, while the right plots correspond to the MVT vs. WTT experiment. (a) Per-base sequence quality scores demonstrate consistently high read-quality across the entire read length for all samples in both experiments, with most bases having Phred scores ≥ 30. (b) Per-sequence Phred quality scores show that the majority of reads exhibit high sequence quality, with Phred scores predominantly above 30, ensuring reliable base calls across all samples. (c) Sequence count distributions illustrate the estimated proportions of unique (blue) and duplicate (black) reads in each sample. The majority of reads are unique, confirming high sequencing depth and library complexity suitable for downstream differential expression analyses.

Alignment and quantification

The trimmed reads were then aligned to the mouse reference genome GRCm38/mm1019 using STAR (version 2.7.0 f)20. STAR was configured with default parameters. Post-alignment, SAM files were converted to BAM format, sorted, and indexed using Picard (version 2.20.0)21. The Picard tools were also used to mark duplicate reads to reduce bias in the downstream quantification steps. Bedtools (version 2.28.0)22 was used for the intersection of the reads that align within an annotated BED file containing known small and long RNAs (Fig. 1b).

Gene expression quantification on the small and long RNA BAM files with appropriate RNA annotation set was performed using the featureCounts function from the Subread package (version 1.6.4)23, exporting raw counts, and normalised expression in RPKMs and TPMs. The aligned reads were assigned to mouse genes annotated in the GRCm38 genome assembly. The resulting expression matrix was used for subsequent bioinformatics analyses (Fig. 1c). The MultiQC tool (version 1.14)24 was used in the SeqSense NGS Data Analysis Toolkit outputs to produce a merged report with the sequence’s quality.

RNA expression profiling

A quality control analysis of gene expression data was conducted, and the expression profiles for each experimental group were briefly examined using a customized pipeline of R scripts (R version 4.4.0)25. For this analysis, several R packages were used for data tidying, wrangling, and visualization, including tidyverse (version 2.0.0)26, DESeq2 (version 1.44.0)27, pheatmap (version 1.0.12)28, and patchwork (version 1.2.0)29.

To evaluate the overall similarity between samples, sample-level quality control was performed using two unsupervised clustering methods: Principal Component Analysis (PCA) and hierarchical clustering (Fig. 4). Both analyses were based on log2-transformed normalized counts to enhance distance metrics and clustering clarity for visualization. To mitigate bias from low-count genes, a regularized log transformation (rlog) was applied instead of a standard log2 transformation, as implemented by the rlogTransformation() function in the DESeq2 package (additional details on this approach are available in the DESeq2 vignette).

Fig. 4
figure 4

Sample variation and expression profile of TDP-43. (a) Principal Component Analysis (PCA) plot illustrating the clustering of samples based on gene expression profiles. Samples group according to their experimental conditions (NS, KD, WTT, and MVT), indicating consistent variation and clear segregation between groups, supporting the robustness of the dataset. (b) Heatmap of hierarchical clustering based on sample correlation coefficients. Strong intra-group similarity and distinct inter-group differences confirm the expected transcriptomic variation between conditions, validating the experimental design. (c) Bar plot of Tardbp (TDP-43) expression across experimental conditions. KD samples exhibit the lowest TDP-43 expression, confirming effective gene knockdown, while WTT and MVT samples show elevated expression, consistent with overexpression constructs. The NS control group exhibits intermediate endogenous expression levels, aligning with expected modulation based on genetic interventions.

Data Records

All sequencing data have been deposited in the ArrayExpress data repository with accession number E-MTAB-1373830. Metadata, including detailed experimental procedures, are also available at this repository. This dataset consists of 32 RNA sequencing raw data files in FASTQ format, corresponding to 8 replicates per experimental condition (NS, KD, WTT, and MVT). These raw reads are accessible via the European Nucleotide Archive (ENA) under accession number PRJEB7219931. Processed data (in plain text tabular format) containing counts and TPM normalised expression per gene feature (mouse genome annotation GRCm38) are deposited in Figshare32.

Technical Validation

Validation of experimental design strategy

The experimental design was assessed for any sources of variation and potential outliers by testing the sample variance via PCA (Fig. 4a). The first and second components accounted for 75% of the variance (67% in PC1 and 8% in PC2), clearly separating the four experimental groups into distinct clusters without overlap between replicates. This result supports the relevance of this dataset to test for differential expression between the experimental groups. Similarly, the hierarchical clustering of the sample pairwise correlation matrix (Fig. 4b) aligns with the experimental design, reinforcing the consistency across replicates.

Additionally, the expression levels of the Tardbp gene across the samples confirmed robust implementation of the experimental manipulations (Fig. 4c). As expected, the knockdown (KD) samples exhibit the lowest expression of the target gene. Conversely, the cells containing overexpression vectors (both the wild-type protein (WTT) and the mutant protein (MVT)) showed the highest expression values. The control group (NS), where the cells express endogenous gene levels, displays intermediate Tardbp expression values.

Validation of expression profiling

To validate the quantification and normalization steps, we assessed the read count distributions before and after TPM normalization, comparing the gene expression profiles across the four experimental groups (Fig. 5, top plot: before normalization; bottom plot: after normalization). The results clearly demonstrate that the samples share a consistent overall expression distribution post-normalization, confirming their suitability for meaningful future gene expression comparisons (e.g. differential expression analysis).

Fig. 5
figure 5

Impact of read count normalization. In both panels, the top plots represent raw expression data, while the bottom plots show data after TPM (transcripts per million) normalization. All values are log2-transformed for better visualization. (a) Boxplots of log2(counts) and log2(TPM) show gene expression distribution across samples before and after normalization. The consistent median expression levels across conditions indicate effective normalization, ensuring comparability across samples. (b) Density plots of log2(counts) and log2(TPM) illustrate the impact of normalization on expression distributions. The improved overlap across groups in the bottom panel demonstrates greater distribution consistency, reducing batch effects and making samples suitable for differential expression analysis.

Coding and non-coding gene expression

To validate the ribosomal depletion RNA-seq protocol and confirm the presence of both coding and non-coding genes in the dataset, we performed a biotype annotation count. The results, presented in Fig. 6 confirm the expected representation of both coding and non-coding genes. The bar plot highlights the distribution of the most frequent biotypes, classified into four groups: protein-coding genes, processed pseudogenes, and long non-coding RNAs (lncRNAs), with less abundant non-coding biotypes grouped under “other non-coding” RNA species. The inclusion of non-coding genes in this dataset provides an opportunity to explore both the transcriptomic landscape and the regulatory mechanisms of gene expression within the same dataset.

Fig. 6
figure 6

Biotype annotation counts. Distribution of coding and non-coding gene biotypes among the analyzed features. Protein-coding genes represent the majority (17,148 genes), followed by processed pseudogenes (2,869 genes), long non-coding RNAs (lncRNAs, 2,040 genes), and other non-coding RNAs (328 genes). In total, 22,385 genes were annotated by biotype out of 35,144 total genes present in the dataset, highlighting the inclusion of both coding and non-coding transcriptomic features.

Usage Notes

The gene expression data presented here provide a valuable resource for investigating the effects of TDP-43 depletion and mutant expression in NSC-34 cells, a widely used cellular model of ALS and related NDDs. This dataset supports differential expression analysis, enabling researchers to identify key cellular functions affected by the depletion, as well as overexpression of wild-type or mutant TDP-43. Importantly, the inclusion of non-coding RNAs expands the analytical potential by allowing the construction of comprehensive gene regulatory networks. This approach offers deeper insights into the interplay between coding and non-coding RNAs in regulating gene expression in motor neurons following TDP-43 loss-of-function or mutation.