Received: 13 February 2026; Revised: 17 March 2026; Accepted: 29 March 2026; Published Online: 31 March 2026.
J. Smart Sens. Comput., 2026, 2(1), 26205 | Volume 2 Issue 1 (December 2026) | DOI: https://doi.org/10.64189/ssc.26205
© The Author(s) 2026
This article is licensed under Creative Commons Attribution NonCommercial 4.0 International (CC-BY-NC 4.0)
PHQ-9 Based Depression Detection Using Text with Multi-
Task DeBERTa Model
Rahul Dagade,
1,*
Saee Darwatkar,
1
Payal Kalekar,
1
Aditya Kamble,
1
Prasad Hargude,
1
Nisha Godha,
2
and
Ganesh Jadhav
3
1
Department of Computer Engineering, Vishwakarma Institute of Technology, Pune, Maharashtra, 411037, India
2
Department of Electrical Engineering, Sinhgad Institutes, Pune, Maharashtra, 411046, India
3
Department of Information Technology, Vishwakarma Institute of Technology, Pune, Maharashtra, 411037, India
*Email: rahul.dagade@vit.edu (R. Dagade)
Abstract
Depression is a widespread mental health disorder that adversely affects emotional well-being, cognitive functioning,
and overall quality of life. Early and accurate detection is essential for timely intervention; however, traditional
screening methods such as the Patient Health Questionnaire-9 (PHQ-9) are often limited by accessibility and resource
constraints. To address these challenges, this study proposes a text-based automated depression screening system
that predicts both PHQ-9 scores and depression severity levels from user-generated free-form text. The proposed
approach uses DeBERTa-V3, a state-of-the-art transformer model, within a multi-task learning framework that
simultaneously performs regression and multi-class classification. The model was trained on the PHQ-TextSet dataset,
a synthetically constructed and PHQ-9-aligned corpus comprising 3,235 annotated samples across five severity
categories. By leveraging disentangled attention and shared contextual representations, the system effectively
captures nuanced linguistic and emotional patterns. The experimental results demonstrate high performance,
achieving 99.85% classification accuracy, a macro F1 score of 0.9984, a weighted AUC of 0.96, and a mean absolute
error of 1.2495 for PHQ-9 score prediction. However, these results should be interpreted as upper-bound estimates
because of the structured and controlled nature of the dataset, which differs from real-world, noisy text inputs. The
model is deployed as a real-time inference system using FastAPI, highlighting its practical applicability in digital mental
health platforms, telemedicine systems, and educational wellness tools. This work presents a scalable and accessible
alternative to traditional screening methods while maintaining alignment with clinical assessment standards. Future
work will focus on evaluating the model on real-world datasets, enhancing robustness, incorporating multimodal
data, and improving interpretability to support responsible AI deployment in mental healthcare.
Keywords: Depression detection; PHQ-9 score prediction; DeBERTa; Multi-task learning; Natural language processing;
Text-based analysis; Mental health screening.
1. Introduction
Depression is among the most prevalent mental health disorders worldwide, affecting more than 280 million
individuals globally according to the World Health Organization (WHO).
[1]
It significantly impairs emotional well-
being, cognitive functioning, academic performance, and professional productivity. If left undetected and untreated,
depression can progress to more severe psychiatric conditions, including anxiety disorders and suicidal ideation. Early
detection is therefore critical for enabling timely intervention and minimizing long-term adverse outcomes.
The Patient Health Questionnaire-9 (PHQ-9) is a widely adopted standardized clinical instrument for assessing
depression severity.
[2]
It consists of nine items corresponding to the Diagnostic and Statistical Manual of Mental
Disorders (DSM-5) criteria for major depressive disorder. Each item is scored on a 0–3 scale, yielding a total score
ranging from 0 to 27, which maps to five severity categories: minimal (0–4), mild (5–9), moderate (10–14), moderately
severe (15–19), and severe (20–27). Despite its clinical validity, the PHQ-9 requires structured questionnaire
administration and professional interpretation, creating barriers to accessibility, particularly in low-resource settings
and among individuals with limited access to mental healthcare.
With the rapid expansion of digital communication platforms, individuals increasingly express their emotional states,
personal struggles, and mental health concerns through free-form written text, including social media posts, online
forums, and messaging applications. This has created a unique opportunity for natural language processing (NLP) and
deep learning techniques to enable automated, scalable, and nonintrusive depression screening by analyzing the
linguistic content of user-generated text.
[3]
Existing transformer-based approaches for depression detection have focused predominantly on binary classification
tasks (depressed vs. non-depressed) using social media data. However, few studies have jointly addressed clinically
relevant continuous PHQ-9 score estimation and categorical severity classification from free-form user text.
Furthermore, most prior work relies on structured clinical notes or multimodal inputs, limiting their applicability to
general-population digital platforms.
This paper addresses these gaps by proposing an end-to-end multi-task deep learning framework that employs
DeBERTa-V3, a disentangled-attention transformer model, to simultaneously predict the PHQ-9 score and classify
depression severity from user-entered free-form text. The system is trained and evaluated on the PHQ-TextSet (PHQ-
9 Text Assessment Dataset), a purpose-built PHQ-9-aligned textual dataset containing 3,235 annotated samples,
constructed by the authors. The trained model is deployed using FastAPI, enabling real-time inference suitable for
integration into digital mental health platforms and telemedicine systems.
The remainder of this paper is organized as follows. Section 2 reviews the related literature on text-based depression
detection and PHQ-9 assessment. Section 3 presents the proposed multi-task DeBERTa architecture. Section 4
describes the learning algorithm and mathematical formulation. Section 5 presents the experimental setup, dataset
description, and results. Section 6 provides a comparative analysis with existing approaches. Section 7 concludes the
paper with limitations and future directions.
2. Related work
Early research on automated depression detection predominantly employed traditional machine learning methods
combined with handcrafted linguistic features. Studies have leveraged support vector machines (SVMs),
[4]
Naïve
Bayes classifiers, and logistic regression
[5]
on text corpora derived from social media platforms and online forums.
Pennebaker et al.
[6]
demonstrated through the Lingistic Inquiry and Word Count (LIWC) framework that depressed
individuals exhibit characteristic linguistic patterns, including elevated first-person pronoun usage and increased
negative affect vocabulary. These findings established the theoretical foundation for text-based depression analysis.
However, such feature engineering approaches suffer from limited contextual understanding and poor generalizability
across domains.
Research on automated PHQ-9-based assessments has explored structured questionnaire responses, electronic health
records (EHRs), and clinical text to predict PHQ-9 scores or symptom severity.
[7]
Alves et al.
[8]
proposed a machine
learning model that estimates PHQ-9 symptom severity from clinical notes, reporting an area under the curve (AUC)
of approximately 0.81. While this demonstrated the feasibility of PHQ-9 prediction from text, their approach relied
exclusively on structured clinical documentation and addressed only single-task severity prediction, limiting its
applicability to general-population real-time screening.
Transformer-based language models have substantially advanced the state of the art in NLP-based mental health
assessment, building upon earlier sequence models such as LSTMs
[9]
and deep recurrent networks.
[10]
Devlin et al.
[11]
introduced BERT (Bidirectional Encoder Representations from Transformers), which provides deeply contextualized
token representations through masked language modeling. BERT-based architectures have been successfully applied
to depression detection from social media text, demonstrating significant accuracy improvements over traditional NLP
methods.
[12]
He et al.
[13]
proposed DeBERTa (Decoding-Enhanced BERT with Disentangled Attention), which
separately encodes content and positional information, yielding superior contextual modeling capabilities that are
particularly relevant for emotionally nuanced mental health text.
Multi-task learning (MTL) has emerged as a promising paradigm for mental health NLP because it exploits the inherent
correlations among related symptom indicators. Compared with their single-task counterparts, MTL frameworks
applied to depression-related tasks have demonstrated improved generalizability and robustness because they share
learned representations across correlated objectives.
[14,15]
Ye et al.
[16]
explored multimodal depression detection
combining audio and textual features; Tsai et al.
[17]
further extended this method with multimodal transformers for
unaligned sequences, achieving promising results; however, such approaches required specialized hardware and
structured input modalities, limiting scalability. Trotzek et al.
[18]
utilized neural networks combined with linguistic
metadata for early detection of depression from text sequences and demonstrated strong performance in binary
classification tasks.
Despite these advances, the research literature reveals a persistent gap: few systems simultaneously perform clinically
calibrated PHQ-9 score regression and categorical severity classification from unstructured, free-form user text.
Moreover, most existing systems do not address real-time deployment, confidence-aware outputs, or integration with
digital health platforms. The proposed system connects these gaps by combining DeBERTa-V3 with a multi-task
learning architecture trained on the PHQ-TextSet dataset,
[19]
enabling automated, real-time, and clinically relevant
depression screening.
3. Proposed architecture
3.1 Overview
The proposed system is a text-based depression detection architecture designed to predict both the PHQ-9 assessment
score and the corresponding depression severity level from user-entered free-form text. The architecture follows an
end-to-end deep learning pipeline that integrates text preprocessing, transformer-based contextual encoding and multi-
task learning to enable clinically aligned, real-time depression screening. The core of the system is a pretrained
DeBERTa-V3-base transformer encoder, which generates rich semantic representations from natural language input
and supports the simultaneous optimization of regression and classification objectives. The architecture is specifically
designed to operate on unstructured user-provided text, without requiring structured questionnaires or clinical
annotations at inference time. Fig. 1 shows system architecture for proposed system.
Fig. 1: System architecture: End-to-end pipeline from free-form text to multi-task PHQ-9 outputs.
3.2 Module-wise explanation
The architecture comprises four interconnected modules: a text preprocessing module, a transformer encoder layer, a
multitask learning layer and an output layer. Data flow sequentially from the raw text input through tokenization,
contextual embedding, task-specific prediction and output generation. This modular design ensures scalability,
interpretability and extensibility for future multimodal integration.
3.2.1 Text preprocessing module
User-generated text is inherently noisy and contains informal language, spelling variations, incomplete sentences and
emotionally charged expressions. The preprocessing module employs the DeBERTa SentencePiece tokenizer, which
applies subword-level tokenization to effectively handle out-of-vocabulary terms and morphologically complex words
frequently observed in mental health-related text.
Input sequences are standardized through padding (appending special tokens to bring sequences to the maximum
length) and truncation (clipping sequences exceeding 256 tokens). This standardization ensures uniform input
dimensions for efficient batch processing. The use of subword tokenization helps the model handle rare vocabulary,
such as uncommon or newly formed words (e.g., “overthinking”, “unmotivated”, “worthlessness”), and domain-
specific terminology, such as mental health-related expressions (e.g., “loss of interest”, “feeling hopeless”, “low
energy”), which are commonly used to describe depressive symptoms.
Fig. 2: Data processing and inference pipeline for PHQ-9 prediction from free-form text.
3.2.2 Transformer encoder layer
The transformer encoder layer constitutes the core semantic processing component of the architecture. DeBERTa-V3-
base employs a disentangled attention mechanism that separately encodes token content and positional information, in
contrast to standard BERT-style models that combine both in a single representation. This disentanglement enables the
model to better capture nuanced semantic relationships between contextually distant tokens, which is particularly
important for understanding emotional context and symptom-related expressions in free-form text.
The tokenized input is processed through 12 transformer layers, each applying multi-head self-attention and feed-
forward transformations. The output is a collection of contextualized token embeddings, from which the [CLS] token
representation (H₀) is extracted as a 768-dimensional global semantic summary of the entire input sequence. This
representation encodes both syntactic structure and affective content, forming the shared feature vector for downstream
multi-task prediction.
3.2.3 Multi-task learning layer
The multi-task learning layer enables simultaneous optimization of two clinically correlated tasks: PHQ-9 score
regression and depression severity classification. The shared [CLS] representation H₀ is fed into two parallel task-
specific prediction heads.
The regression head applies a linear transformation to H₀ to produce a scalar PHQ-9 score prediction:


=     
(1×768)
,   (1)
The output is denormalized from [0,1] to [0,27] for clinical interpretation.
The classification head applies a linear transformation followed by softmax activation to produce a probability
distribution over five severity classes:


󰇛  󰇜 (2)

(5×768)
 . The predicted severity label is the argmax of this distribution. The joint training
objective minimizes the combined multi-task loss:

   (3)
where LMSE is the mean squared error regression loss, LCE is the cross-entropy classification loss, and λ = 1.0 (equal
task weighting in the baseline configuration). This joint optimization encourages the model to learn a shared
representation that benefits both tasks, improving generalization and consistency between predicted scores and severity
labels.
3.2.4 Output layer
The output layer aggregates and presents the three final predictions: The predicted PHQ-9 score is a continuous value
in [0, 27], and the predicted depression severity class (minimal, mild, moderate, moderately severe, or severe) and the
confidence score are computed as the maximum softmax probability from the classification head:
 󰇛

󰇜 (4)
The confidence score provides a measure of prediction reliability, supporting ethical and transparent clinical usage by
enabling practitioners to calibrate their trust in the automated assessment.
3. Learning Algorithm and Mathematical Formulation
3.1 Algorithm
Algorithm 1 summarizes the complete inference procedure of the proposed multi-task depression detection system:
Algorithm 1: Multi-Task Depression Detection with DeBERTa
Input: T (free-form text entered by the user)
Output: PHQ-9 score, severity label, confidence score
1. Tokenize the input text T using the SentencePiece tokenizer
2. Apply padding/truncation to a fixed maximum length (256 tokens)
3. Pass the tokenized sequence through the DeBERTa-V3-base encoder
4. Extract [CLS] token representation

5. 

   (regression head)
6. 

󰇛  󰇜 (classification head)
7. 󰇛

󰇜
8. 󰇛

󰇜
9. Denormalize:




10. Return




3.2 Mathematical formulation
3.2.1 Input Representation
Let
 󰇛  󰳛󰇜 (5)
denote the input text sequence of n subword tokens generated by the SentencePiece tokenizer. The tokenized sequence
is prepended with a special [CLS] token and appended with a [SEP] token, resulting in the following input sequence:
 󰇝󰇟󰇠    󰇟󰇠󰇞 (6)
3.2.2 Contextual Encoding
 󰇛󰇜 (7)
The DeBERTa encoder maps X to a sequence of contextual embeddings:
󰇛󰇜 󰇛  󰳛󰇜 (8)
where H
768
is the [CLS] representation used as the global semantic feature vector for both downstream tasks.
3.2.3 PHQ-9 score normalization and regression




(9)
The PHQ-9 score is normalized to the interval [0,1] as follows: y_norm =


The regression head predicts the
following:


   (10)
The regression loss is the mean squared error:

󰇛



󰇜 (11)
2.2.4 Severity Classification


󰇛  󰇜 (12)
The classification head produces a probability distribution over severity classes: 

󰇛 
󰇜  The predicted severity class is obtained as follows:
 󰇛

󰇜 (13)
The classification loss is computed using cross-entropy:
 

󰇛

󰇜 (14)
3.2.5 Combined multi-task loss



(15)
The overall training objective combines regression and classification losses:

   (16)
where λ = 1.0. The model parameters are updated using AdamW optimization with weight decay to minimize L
total over
the training set.
4.3 Evaluation metrics
The classification performance is evaluated using standard metrics, including the accuracy, precision, recall, and F1
score. The accuracy is defined as follows:

󰇛󰇜
󰇛󰇜
(17)


󰇛󰇜
(18)


󰇛󰇜
(19)
 
󰇛󰇜
󰇛󰇜
(20)
The regression performance is evaluated using the mean absolute error (MAE) and root mean square error (RMSE).
The MAE is defined as follows:



 (21)
Root Mean Squared Error:

󰇛 󰇜󰇜 (22)
Discriminative ability was assessed using the ROC-AUC curve. The confidence score is computed as max(

) from
the softmax output.
4. Experimental results
4.1 Dataset description
The PHQ-TextSet (PHQ-9 Text Assessment Dataset) is an original dataset constructed by the authors
[19]
to support
PHQ-9-aligned text-based depression screening. It comprises 3,235 annotated samples, each containing free-form
textual responses mapped to all nine PHQ-9 symptom domains. For each sample, the responses are aggregated into a
unified textual representation, enabling the model to capture holistic depressive symptom patterns rather than isolated
indicators.
The PHQ-TextSet dataset was constructed to simulate realistic patient responses aligned with the PHQ-9 symptom
criteria. Each sample represents a combination of textual expressions corresponding to the nine PHQ-9 questions,
including symptoms such as low mood, fatigue, sleep disturbances, loss of interest, and feelings of worthlessness.
To generate the dataset, a template-based and augmentation-driven approach was employed. For each PHQ-9 symptom
category, multiple semantically diverse sentence patterns were manually designed to reflect varying levels of severity
(e.g., minimal to severe). These templates were further expanded using paraphrasing techniques and lexical variation
to introduce diversity in expression.
The individual symptom-level texts were then aggregated into a single free-form paragraph to simulate natural user
input. Each generated sample was assigned a PHQ-9 score by summing the corresponding symptom severity levels
and mapping them to one of the five standard severity categories.
Although the dataset is synthetic, care was taken to ensure linguistic realism and clinical consistency with the PHQ-9
guidelines. This approach enables controlled experimentation while preserving alignment with established
psychological assessment frameworks.
Each sample is annotated with a numerical PHQ-9 score (range: 0–27) and a categorical severity label following the
standard PHQ-9 classification thresholds. The dataset encompasses five severity classes: minimal, mild, moderate,
moderately severe, and severe. The class distribution is approximately balanced across all categories, with detailed
statistics presented in Table 1.
Table 1: PHQ-Textset dataset statistics and label distribution.
Attribute
Value
Dataset Name
PHQ-TextSet (PHQ-9 Text Assessment Dataset)
Dataset Type
PHQ-9-aligned free-form text dataset (author-constructed)
Total Samples
3,235 (after removing 3 incomplete entries)
PHQ-9 Score Range
0 – 27
Severity Classes
5 (Minimal, Mild, Moderate, Moderately Severe, Severe)
Minimal (0–4)
635 samples (19.6%)
Mild (5–9)
644 samples (19.9%)
Moderate (10–14)
645 samples (19.9%)
Moderately Severe (15–19)
643 samples (19.9%)
Severe (20–27)
668 samples (20.6%)
Training/Validation Split
80%/20% (stratified)
Training Samples
2,588
Validation Samples
647
Maximum Token Length
256 subword tokens (SentencePiece)
Text Encoder
DeBERTa-V3-base (Hugging Face Transformers)
Input Type
User-entered free-form natural language text
Learning Strategy
Multi-task (Regression + Classification)
4.2 Hardware and software configuration
All the experiments were conducted using Python 3.10 and PyTorch 2.0. The DeBERTa-V3-based pretrained model
was loaded using the Hugging Face Transformers library (v4.35). Data preprocessing, splitting, and evaluation metrics
were implemented using Scikit-learn 1.3. Model training was performed on Google Colab Pro with an NVIDIA Tesla
T4 GPU with 16 GB of VRAM, enabling efficient parallel computation. The AdamW optimizer was used with a
learning rate of 2×10⁻⁵ and a weight decay of 0.01 to ensure stable convergence.
A linear learning rate warmup was applied over the first 10% of training steps to improve training stability. The model
was trained for 8 epochs with a batch size of 8. After training, the model was deployed as a RESTful API using FastAPI
(v0.103) to enable real-time inference and integration with external applications.
Table 2: Model hyperparameter configuration.
Hyperparameter
Value
Pretrained Model
microsoft/deberta-v3-base
Max Sequence Length
256 tokens
Optimizer
AdamW
Learning Rate
2×10⁻⁵
Weight Decay
0.01
Batch Size
8
Epochs
8
Warmup Ratio
0.1
MTL Loss Weight (λ)
1.0 (equal task weighting)
Dropout Rate
0.1 (applied to task heads)
4.3 Data split and stratified sampling
The PHQ-TextSet dataset was partitioned into training (80%) and validation (20%) sets using stratified random
sampling, preserving the class distribution of all five severity levels across both splits. Stratified splitting is essential
in mental health datasets to prevent class imbalance from biasing the evaluation. The training set (2,588 samples) was
used exclusively for model parameter optimization, while the validation set (647 samples) was reserved for
performance evaluation on unseen data.
4.4 Model training and convergence
The multi-task DeBERTa model was fine-tuned on the PHQ-TextSet training set using the AdamW optimizer with the
combined loss function as specified in Equation (16).
The training and validation loss curves were monitored across epochs to evaluate learning stability and detect
overfitting. As illustrated in Fig. 6, both training and validation loss decrease consistently across epochs, indicating
stable and effective learning. The training loss decreased significantly from an initial value of 0.4459 to a final value
of 0.0355 over 8 epochs, demonstrating that the model successfully learned meaningful patterns from the training data.
Similarly, the validation loss decreased from 0.0944 to 0.0051, confirming the strong generalizability of the results to
unseen data.
Although minor fluctuations are observed in the validation loss during intermediate epochs, the overall trend remains
downward. The relatively small gap between training and validation loss throughout the training process indicates that
the model does not suffer from overfitting or underfitting.
The smooth convergence of both loss curves demonstrates that the model achieves stable optimization and effectively
balances the regression and classification objectives in the multi-task learning framework.
4.5 Performance evaluation
The performance of the proposed system was evaluated using classification metrics (accuracy, precision, recall, and
F1 score) for severity prediction and regression metrics (MAE and RMSE) for PHQ-9 score estimation. Table 3
presents the quantitative results on the validation set, including per-class breakdown for all five severity categories.
Table 3: Performance evaluation on the PHQ-TextSet validation set.
Metric
Task
Value
Accuracy
Classification
99.85%
Macro Precision
Classification
0.9985
Macro Recall
Classification
0.9984
Macro F1-Score
Classification
0.9984
Mean Absolute Error (MAE)
Regression (PHQ-9)
1.2495
Root Mean Squared Error (RMSE)
Regression (PHQ-9)
2.31
Weighted AUC (ROC)
Classification
0.96
Minimal P: 1.000 / R: 0.9921 / F1:
0.9960
Classification
Support: 127
Mild — P: 0.9923 / R: 1.000 / F1: 0.9961
Classification
Support: 129
Moderate P: 1.000 / R: 1.000 / F1:
1.000
Classification
Support: 129
Moderately Severe P: 1.000 / R: 1.000
/ F1: 1.000
Classification
Support: 129
Severe — P: 1.000 / R: 1.000 / F1: 1.000
Classification
Support: 133
A comparison between the true and predicted severity class distributions is shown in Fig. 3, confirming the close
alignment between the predicted and actual class frequencies. The distribution of absolute prediction errors for PHQ-
9 score estimation is shown in Fig. 4. The scatter plot of the predicted versus ground-truth PHQ-9 scores is shown in
Fig. 5, demonstrating strong linear alignment. The error distribution (predicted minus true), which is approximately
normally distributed and centered near zero, is shown in Fig. 7, confirming low systematic bias.
Fig. 3: Comparison of true and predicted depression severity class distributions.
Fig. 4: Distribution of absolute prediction error for PHQ-9 score estimation.
Fig. 5: Predicted versus ground-truth PHQ-9 scores.
Fig. 6: Training vs validation loss.
The confusion matrix (Fig. 8) demonstrates near-perfect classification performance across all five severity classes. For
the Minimal class, 126 out of 127 samples were correctly classified (one misclassified as Mild). For the Mild,
Moderate, Moderately Severe, and Severe classes, all the samples were correctly classified (129, 129, 129, and 133
samples, respectively), yielding perfect per-class accuracy. This finding is consistent with the reported per-class
precision and recall values and confirms that the model has effectively learned discriminative representations for all
severity levels.
Fig. 7: Distribution of PHQ-9 prediction errors (Predicted − True).
Fig. 8: Confusion matrix for depression severity classification.
5. Comparative analysis with existing approaches
Table 4 presents a systematic comparison of the proposed system with representative existing approaches for
depression detection and PHQ-9 assessment, evaluating differences across input types, learning strategies, model
architectures, output formats, and evaluation metrics.
To evaluate the effectiveness of the proposed multi-task DeBERTa model, we compared its performance with several
baseline models commonly used in text classification tasks.
The following models were implemented for comparison:
Logistic Regression with TF-IDF features
[5]
Support Vector Machine (SVM)
[4]
BERT-base (single-task classification)
[11]
The results demonstrate that traditional machine learning models struggle to capture contextual and emotional nuances
in text, resulting in lower accuracy and F1 scores. While BERT achieves competitive performance, it is limited by its
single-task formulation and lack of explicit PHQ-9 score prediction.
In contrast, the proposed multi-task DeBERTa model significantly outperforms all the baselines by leveraging
contextual embeddings and jointly optimizing regression and classification objectives, leading to improved
consistency and generalizability.
These findings highlight the importance of transformer-based architectures and multitask learning for clinically
relevant mental health prediction tasks.
Table 4: Comparative analysis with existing approaches.
Aspect
Ref. [8]
Ref. [18]
Proposed system
Objective
PHQ-9 severity from
clinical text
Early depression
detection from text
PHQ-9 score + severity
from free text
Input Type
Structured clinical notes
Social media text
Free-form user text
Model
Traditional ML/NLP
CNN + linguistic
metadata
DeBERTa-V3 transformer
Learning
Strategy
Single-task
Single-task
Multitask (Regression +
Classification)
Output
PHQ-9 score only
Binary (depressed/not)
PHQ-9 score + severity +
confidence
Key Metric
AUC ≈ 0.81
F1 ≈ 0.72
Accuracy 99.85%, Macro
F1 0.9984, MAE 1.2495,
AUC 0.96
Deployment
Not reported
Not reported
FastAPI REST API
Dataset
Structured EHR notes
ReachOut social media
PHQ-TextSet (3,235
samples)
The proposed system outperforms both compared approaches in terms of output comprehensiveness and deployment
readiness. Compared with Alves et al.,
[8]
the proposed approach operates on free-form user text rather than structured
clinical notes, includes multi-task learning for simultaneous score and severity prediction, and achieves a higher
discriminative AUC of 0.96 versus 0.81. Unlike Trotzek et al.
[18]
, the proposed system provides clinically calibrated
PHQ-9 score estimation rather than binary classification and supports confidence-aware outputs for responsible
deployment.
7. Discussion
The experimental results confirm that the proposed multi-task DeBERTa architecture effectively learns shared
contextual and emotional representations from free-form text, enabling robust simultaneous prediction of PHQ-9
scores and depression severity. The overall classification accuracy of 99.85% and macro F1 score of 0.9984 across the
five severity classes demonstrate that the model generalizes well to unseen data without significant class bias. Per-
class analysis revealed perfect classification (F1 = 1.000) for the moderate, moderately severe, and severe categories,
whereas the minimal and mild classes achieved F1 scores of 0.9960 and 0.9961, respectively, reflecting the expected
ambiguity near the boundary thresholds. The MAE of 1.2495 for the PHQ-9 regression indicates that, on average, the
predicted score deviates by fewer than two points from the clinical ground truth, which is within the acceptable margin
for preliminary screening purposes. Three severity classes (moderate, moderately severe, and severe) achieved perfect
precision, recall, and F1 scores of 1.000, whereas the minimal and mild classes had near-perfect F1 scores of 0.9960
and 0.9961, respectively. The smooth convergence of both training and validation losses across 8 epochs, with a small
and consistent gap between them, indicates that the model avoids overfitting despite fine-tuning a large pretrained
transformer. The use of AdamW with weight decay and warmup scheduling contributes to training stability. The
disentangled attention mechanism of DeBERTa-V3 appears particularly well suited for mental health text, where the
relationship between emotionally laden content words and their positional context is critical for accurate severity
assessment. Confusion matrix analysis reveals that the majority of misclassifications occur between adjacent severity
classes (e.g., minimal–mild and moderate–moderate–severe boundaries), which is clinically expected given the
continuous and overlapping nature of the PHQ-9 score distributions near threshold boundaries. Severe cross-class
misclassifications (e.g., Minimally predicted as Severe) are rare, indicating that the model captures global severity
patterns effectively.
While the model achieves very high classification accuracy (99.85%) and F1 score (0.9984), this performance should
be interpreted in the context of the dataset characteristics. The PHQ-TextSet dataset is synthetically constructed, and
the PHQ-9 is aligned, resulting in well-structured and semantically clear samples. Unlike real-world user-generated
text, which is often noisy, ambiguous, and linguistically diverse, the controlled nature of the dataset reduces variability
and makes classification boundaries more distinct. Consequently, the reported performance represents an upper bound
estimate under controlled conditions. Future work will focus on evaluating the model on real-world datasets (e.g.,
social media text) to assess its generalizability and robustness.
Several limitations should be acknowledged. The PHQ-TextSet dataset, while purpose-built for PHQ-9 alignment and
constructed by the authors, may not fully capture the linguistic diversity of real-world user populations across different
demographics, languages, and cultural contexts. The system is intended solely as a supportive screening tool and not
as a clinical diagnostic instrument.
Ethical considerations are critical in the deployment of automated mental health screening systems. The proposed
model is intended solely as a preliminary screening tool and not as a substitute for professional diagnosis. Incorrect
predictions may lead to false reassurance or unnecessary concern, highlighting the importance of human oversight.
Additionally, the use of text-based data raises privacy concerns, particularly when deployed in real-world applications
involving personal user inputs. Appropriate data handling, anonymization, and user consent mechanisms must be
implemented.
Bias is another potential limitation, as the synthetic dataset may not fully represent diverse linguistic, cultural,
and demographic variations. Future work will focus on incorporating real-world datasets and conducting fairness
evaluations to ensure equitable performance across different populations.
Future work will explore dataset expansion with diverse real-world samples, clinical validation with mental health
professionals, multimodal input integration, and enhanced model interpretability through attention visualization and
explanation methods.
8. Conclusion
In this study, a multi-task DeBERTa-V3-based depression detection system that jointly predicts PHQ-9 scores and
depression severity levels from user-entered free-form text is presented. The proposed architecture integrates a
pretrained transformer encoder with separate regression and classification heads trained jointly using a combined MSE
and cross-entropy loss function. The system was trained and evaluated on the PHQ-TextSet dataset, an original author-
constructed dataset comprising 3,235 annotated text samples across five PHQ-9 severity classes. The model achieved
a classification accuracy of 99.85%, a macro F1 score of 0.9984, and a mean absolute error of 1.2495 for the PHQ-9
score regression on the validation set. The weighted AUC of 0.96 further demonstrates strong discriminative
performance across severity classes. These results confirm that transformer-based multitask learning can effectively
bridge the gap between clinical PHQ-9 assessment frameworks and everyday natural language expression, offering a
scalable and accessible alternative to structured questionnaire-based screening. The deployment of the trained model
as a real-time FastAPI service highlights the practical viability of the proposed system for integration into digital
mental health platforms, telemedicine systems and educational wellness applications. The confidence score output
supports transparent and responsible clinical usage. Future research directions include expanding the training dataset
with diverse real-world samples, incorporating multimodal inputs, conducting clinical validation studies, and
improving model interpretability for responsible AI deployment in mental healthcare.
CRediT Author Contribution Statement
Rahul Dagade: Conceptualization, Methodology, Formal analysis, Investigation, Validation, Supervision, Project
administration, Writing - Review & editing. Saee Darwatkar: Formal analysis, Investigation. Payal Kalekar: Data
curation, Resources, Software. Aditya Kamble: Data curation, Resources, Software, Writing - Review & editing.
Prasad Hargude: Data curation, Resources, Software, Writing - Review & editing. All the authors were supervised
and supported by their advisors from the corresponding institution. Nisha Godha: Conceptualization, Investigation,
Writing - Review & editing. Ganesh Jadhav: Data curation, Formal analysis, Investigation, Writing - Review &
editing. All authors have read and agreed to the published version of the manuscript.
Funding Declaration
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit
sectors.
Data Availability Statement
The PHQ-TextSet dataset used in this study was constructed by the Authors and is available upon reasonable request
to the corresponding author.
Conflict of Interest
There is no conflict of interest.
Artificial Intelligence (AI) Use Disclosure
The authors confirm that no artificial intelligence (AI)-assisted technologies were used in the writing of the manuscript,
and no images were generated or manipulated using AI. AI-based tools were used solely for language editing to
improve grammar, clarity, and readability, in accordance with journal policy. The authors take full responsibility for
the accuracy, originality, and integrity of the work.
Supporting Information
Not applicable.
References
[1] World Health Organization, Depression, 2025, Accessed: January 2026.
[2] K. Kroenke, R. L. Spitzer, J. B. W. Williams, The PHQ-9: Validity of a brief depression severity measure, Journal of
General Internal Medicine, 2001, 16, 606613, doi: 10.1046/j.1525-1497.2001.016009606.x
[3] H. Fisher, N. M. Jaffe, K. Pidvirny, A. O. Tierney, M. S. Vaidean, P. Dongre, C. A. Webb, Language-based detection of
depression with machine learning: systematic review and meta-analysis, NPJ Digital Medicine, 2026, 9, 273,
10.1038/s41746-026-02448-1.
[4] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning, 1995, 20, 273297, 1995, doi: 10.1007/BF00994018.
[5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V.
Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, É. Duchesnay, Scikit-learn: Machine
Learning in Python, Journal of Machine Learning Research, 2011, 12, 28252830, doi: 10.5555/1953048.2078195.
[6] J. W. Pennebaker, R. L. Boyd, K. Jordan, K. Blackburn, The development and psychometric properties of LIWC2015,
University of Texas at Austin, 2015.
[7] K. Milintsevich, K. Sirts, G. Dias, A case study of the PRIMATE dataset, Proceedings of the 9th Workshop on
Computational Linguistics and Clinical Psychology (CLPsych 2024), Association for Computational Linguistics, 2024,
166171.
[8] P. Alves, C. D. Marci, C. J. Cohen-Stavi, K. M. Whelan, C. Boussios, A machine learning model using clinical notes to
estimate PHQ-9 symptom severity scores in depressed patients, Journal of Affective Disorders, 2025, 357, 4554, doi:
10.1016/j.jad.2025.01.152.
[9] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation, 1997, 9, 17351780, doi:
https://doi.org/10.1162/neco.1997.9.8.1735.
[10] A. Graves, A. -R. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, 2013 IEEE
International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, 6645-6649, doi:
10.1109/ICASSP.2013.6638947.
[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of
Deep Bidirectional Transformers for Language Understanding, Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, 1, 4171-
186, doi: 10.18653/v1/N19-1423.
[12] A. Raj, Z. Ali, S. Chaudhary, K. K. Bali and A. Sharma, "Depression Detection Using BERT on Social Media Platforms,
2024 IEEE International Conference on Artificial Intelligence in Engineering and Technology (IICAIET), Kota
Kinabalu, Malaysia, 2024, pp. 228-233, doi: 10.1109/IICAIET62352.2024.10730329.
[13] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, Proceedings of
International Conference on Learning Representations, 2021.
[14] G. Coppersmith, M. Dredze, C. Harman, Proceedings of the Workshop on Computational Linguistics and Clinical
Psychology: From Linguistic Signal to Clinical Reality, Baltimore, Maryland, USA. Association for Computational
Linguistics, 2014, 5160, doi: 10.3115/v1/W14-3207.
[15] Amir H. Yazdavar, H. S. Al-Olimat, M. Ebrahimi, G. Bajaj, T. Banerjee, K. Thirunarayan, J. Pathak, A. Sheth, Semi-
supervised approach to monitoring clinical depressive symptoms in social media, ASONAM '17: Proceedings of the
2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, 1191-1198, doi:
10.1145/3110025.312302.
[16] J. Ye, Y. Yu, Q. Wang, W. Li, H. Liang, Y. Zheng, G. Fu, Multi-modal depression detection based on emotional audio
and text, Journal of Affective Disorders, 2021, 295, 904-913, doi: 10.1016/j.jad.2021.08.090.
[17] Y-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L-P. Morency, R. Salakhutdinov, Multimodal transformer for unaligned
multimodal language sequences, Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics, 2021, 65586569, doi: 10.18653/v1/P19-1656.
[18] S. Trotzek, S. Koitka, C. M. Friedrich, Utilizing neural networks and linguistic metadata for early detection of
depression indications in text sequences, IEEE Transactions on Knowledge and Data Engineering, 2020, 32, 588601,
doi: 10.1109/TKDE.2018.2885515.
[19] R. Dagade, S. Darwatkar, P. Kalekar, A. Kamble, P. Hargude, N. Godha, G. Jadhav, PHQ-TextSet: PHQ-9 Text
Assessment Dataset, Unpublished dataset, 2026.
Publisher Note: The views, statements, and data in all publications solely belong to the authors and contributors. GR
Scholastic is not responsible for any injury resulting from the ideas, methods, or products mentioned. GR Scholastic
remains neutral regarding jurisdictional claims in published maps and institutional affiliations.
Open Access
This article is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, which
permits the non-commercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long
as appropriate credit to the original author(s) and the source is given by providing a link to the Creative Commons
License and changes need to be indicated if there are any. The images or other third-party material in this article are
included in the article's Creative Commons License, unless indicated otherwise in a credit line to the material. If
material is not included in the article's Creative Commons License and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view
a copy of this License, visit: https://creativecommons.org/licenses/by-nc/4.0/
© The Author(s) 2026