Hypertension affects over 1.4 billion adults worldwide, and antihypertensive dose titration is a common but complex clinical decision. Although electronic health records contain longitudinal data on medication adjustments and blood pressure outcomes, determining optimal individualized dosing remains challenging due to confounding in observational data, where patients receiving higher doses often have worse baseline health. We propose a transformer-based model with causal attention masking to estimate counterfactual blood pressure outcomes under alternative dose regimens. The architecture ensures temporal validity by preventing information leakage from future events and encodes medication dose changes in a continuous representation. It includes a dose encoder, outcome predictor, and counterfactual contrastive loss to distinguish between competing treatment paths. This framework learns patient-specific dose–response relationships and enables personalized predictions for antihypertensive adjustments. While it supports individualized treatment planning from observational EHR data, prospective validation is still required before clinical deployment.
Hypertension affects approximately 1.4 billion adults globally and is a primary modifiable risk factor for cardiovascular disease, stroke, and renal failure [1, 2]. Clinical guidelines recommend target blood pressure below 130/80 mmHg, but achieving this target frequently requires dose titration of first-line agents (ACE inhibitors, ARBs, calcium channel blockers, thiazides) or the addition of multiple drug classes [3, 4]. For an individual patient on lisinopril 10mg, the clinician faces a counterfactual question: Would increasing to 20mg lower blood pressure more effectively than adding amlodipine 5mg? Observational EHR data contain historical records of such dose changes and subsequent blood pressure measurements, yet extracting causal answers remains challenging because treatment decisions depend on patient characteristics [5, 6].
Standard machine learning approaches—including recurrent neural networks and conventional transformers—are confounded by indication when applied to treatment effect estimation [7, 8]. Patients who receive dose escalations are systematically sicker than those maintained on stable doses: they have higher baseline blood pressure, more comorbidities such as diabetes or chronic kidney disease, and poorer responses to initial therapy [9, 10]. A naive transformer predicting blood pressure from observed dose histories would learn that higher doses are associated with worse outcomes, precisely because higher doses are given to sicker patients. This bias cannot be resolved by simply including all measured covariates, as the same clinical logic generates the confounding structure [11, 12].
Causal inference methods for time-series data, including marginal structural models, G-computation, and instrumental variable approaches, have been developed to address confounding by indication [13, 14]. However, these methods rely on strong assumptions such as sequential ignorability (no unmeasured confounders) and positivity (non-zero probability of any treatment at any time), and they often require manual specification of propensity score models that become impractical when treatment options include multiple drugs and continuous doses [15, 16]. Furthermore, traditional causal methods do not naturally leverage the representational learning capabilities of modern sequence models, limiting their ability to capture complex temporal patterns in blood pressure trajectories and medication responses [17].
This paper proposes a causal transformer architecture with attention masking that directly encodes the temporal ordering required for counterfactual estimation of antihypertensive dose responses [5, 18]. The framework extends the standard transformer by introducing a causal mask that prevents information leakage from future time steps when predicting potential outcomes under alternative dose regimens [19]. Unlike conventional causal inference pipelines that separate confounder adjustment from outcome modeling, our approach learns a causally valid representation of patient history end-to-end.
First-line antihypertensive drug classes include ACE inhibitors (e.g., lisinopril, ramipril), angiotensin receptor blockers (e.g., losartan, valsartan), calcium channel blockers (e.g., amlodipine, nifedipine), thiazide diuretics (e.g., hydrochlorothiazide, chlorthalidone), and beta-blockers (e.g., metoprolol, atenolol) [1, 2]. Dose-response curves vary across both drugs and patient subgroups: for ACE inhibitors, blood pressure reduction follows a log-linear relationship with dose from 5mg to 40mg, but the incremental benefit diminishes above 20mg in many patients. Current hypertension guidelines from the European Society of Cardiology and Hypertension Canada recommend target blood pressure below 130/80 mmHg for most adults, with treatment intensification when readings exceed this threshold on three consecutive measurements [20, 21].
Clinical practice proceeds through well-documented escalation patterns: monotherapy at low dose → titration to maximum tolerated dose → addition of a second agent from a complementary class → dual therapy titration → triple therapy with fixed-dose combination pills [6, 7]. The decision to increase an existing dose versus adding a new agent depends on multiple factors including the patient's current blood pressure, side effect profile, adherence, age, renal function, and prior medication failures [19, 22]. Fixed-dose combination therapies combining perindopril, indapamide, and amlodipine in a single pill have demonstrated improved adherence and blood pressure control compared to free combinations, suggesting that regimen complexity directly influences outcomes [23-26].
Confounding by indication arises when the treatment assignment (dose increase) is causally influenced by the same factors that predict the outcome (future blood pressure), creating non-causal associations [12, 13]. In hypertension management, clinicians escalate doses precisely when patients exhibit uncontrolled blood pressure despite current therapy, meaning that the indication for dose increase (high measured BP) is also a strong predictor of future high BP. Standard regression adjustment fails when confounders are measured with error, when interactions are misspecified, or when unmeasured factors such as dietary sodium intake or medication adherence drive both treatment decisions and outcomes [14, 27]. The problem is exacerbated in time-series settings where time-varying confounders (e.g., interim blood pressure readings) are themselves affected by prior treatments, creating feedback loops that bias naive estimators [15, 16].
Marginal structural models with inverse probability of treatment weighting address time-varying confounding by estimating stabilized weights that adjust for both baseline and time-dependent confounders [14, 17]. G-computation provides an alternative by directly modeling the outcome distribution under hypothetical treatment regimes. Each method depends on correct specification of its own nuisance model: the treatment (propensity score) model for weighting and the outcome model for G-computation [18, 28]. Instrumental variable methods can handle unmeasured confounding when a valid instrument exists, but such instruments (e.g., clinician prescribing preferences) are difficult to identify in hypertension dose titration [29]. Recurrent marginal structural networks and adversarially balanced representations have been proposed to learn balanced representations of treatment history, yet these methods still rely on the sequential ignorability assumption that all confounders are measured [6-8].
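To make the weighting step concrete, the following minimal sketch computes cumulative stabilized weights for a single patient. It assumes the per-step treatment probabilities have already been estimated by some model; the function name, argument names, and the example probabilities are illustrative and not part of the proposed framework. For continuous doses the probabilities would be replaced by conditional densities.

```python
from typing import List, Sequence


def stabilized_weights(
    p_treat_given_past_treat: Sequence[float],
    p_treat_given_history: Sequence[float],
) -> List[float]:
    """Cumulative stabilized inverse-probability-of-treatment weights.

    Both inputs hold, per time step, the probability of the dose that was
    actually received. The numerator conditions on past treatment only; the
    denominator conditions on the full observed history (past treatments
    and time-varying covariates).
    """
    if len(p_treat_given_past_treat) != len(p_treat_given_history):
        raise ValueError("numerator and denominator must have equal length")
    weights: List[float] = []
    running = 1.0
    for num, den in zip(p_treat_given_past_treat, p_treat_given_history):
        if not 0.0 < den <= 1.0:
            raise ValueError("denominator probability must be in (0, 1]")
        running *= num / den  # product over all steps up to the current one
        weights.append(running)
    return weights


# Hypothetical probabilities for three weekly decision points.
sw = stabilized_weights([0.5, 0.5, 0.4], [0.8, 0.5, 0.2])
```

A small denominator at the third step (the received dose was unlikely given the history) doubles the weight, which illustrates why near-violations of positivity produce unstable weights.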
The causal transformer framework takes as input a sequence of patient observations across time steps (typically weekly intervals): systolic blood pressure, diastolic blood pressure, current medication type, current dose in milligrams, co-medications, age, weight, and serum creatinine [17, 18]. The transformer encoder processes this sequence with a causal attention mask that prevents information from future time steps from influencing predictions at the current time. The output layer produces counterfactual predictions for blood pressure measurements at specified future horizons under alternative dose regimens that may differ from the dose actually administered [5, 19].
Figure 1 presents the proposed causal transformer architecture as a directional framework linking longitudinal EHR history, temporal attention restriction, counterfactual dose-regimen prediction, and individualized antihypertensive decision support.
Figure 1. Causal Transformer Framework for Counterfactual Antihypertensive Dose-Response Estimation from Observational EHR Data
The framework operates under three standard causal assumptions adapted to the time-series setting. Sequential ignorability requires that, conditional on the observed history up to time t (including all past treatments, outcomes, and covariates), the treatment assigned at time t is independent of the potential outcomes under any treatment sequence [6, 14]. Positivity requires that for any patient history observed in the data, there is a non-zero probability of receiving each possible dose level at that time. Consistency requires that the observed outcome for a patient who received a particular dose equals the potential outcome under that dose [7, 15, 16].
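One way to write these three assumptions formally is sketched below. The notation is introduced here for illustration only and does not appear in the original text: \bar{H}_t denotes the observed history up to time t, A_t the dose assigned at time t, \bar{A} the observed dose path, and Y_{t+k}(\bar{a}) the potential blood pressure at time t+k under dose path \bar{a}. For continuous doses the positivity condition would be stated with a conditional density rather than a probability.

```latex
\begin{align*}
\text{Sequential ignorability:}\quad
  & Y_{t+k}(\bar{a}) \perp\!\!\!\perp A_t \mid \bar{H}_t
  && \text{for all } t \text{ and all dose paths } \bar{a}, \\
\text{Positivity:}\quad
  & P(A_t = a \mid \bar{H}_t = \bar{h}_t) > 0
  && \text{for every dose } a \text{ and every } \bar{h}_t
     \text{ with } P(\bar{H}_t = \bar{h}_t) > 0, \\
\text{Consistency:}\quad
  & Y_{t+k} = Y_{t+k}(\bar{a})
  && \text{whenever } \bar{A} = \bar{a}.
\end{align*}
```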
The framework adheres to four design principles motivated by clinical hypertension management. Causal validity requires that all predictions respect temporal ordering and do not use future information when estimating counterfactual outcomes under past dose decisions [8, 9]. Patient-specificity requires that the model produce individualized dose-response curves rather than population-average effects, enabling personalized treatment recommendations [10, 11]. Time-awareness requires that the architecture explicitly model the timing of dose changes, including the duration since the last dose adjustment. Uncertainty-awareness, though not fully addressed here, would require probabilistic outputs that reflect the increased uncertainty for counterfactual regimens far from the observed treatment path [1, 2].
Table 1 clarifies how each architectural element contributes to causal validity, counterfactual estimation, and clinical interpretability beyond conventional sequence prediction.
Table 1. Causal Design Logic of the Proposed Transformer Framework
Framework element | Causal problem addressed | Operational role in the model | Analytical contribution beyond standard ML
Longitudinal EHR history | Treatment assignment depends on prior BP, comorbidity, renal function, and medication response | Encodes observed patient history before each dose decision | Shifts prediction from cross-sectional association to temporally conditioned estimation |
Causal attention mask | Future blood pressure or covariates may leak into treatment-effect prediction | Blocks attention from post-decision time steps | Enforces the “no future information” condition required for counterfactual estimation |
Dose change encoder | Medication class and dose intensity are heterogeneous and not purely categorical | Represents drug type, normalized dose, and dose-transition structure | Allows comparison across titration, switching, and add-on therapy decisions |
Time-aware positional encoding | Dose response depends on timing since treatment change | Encodes pre-dose and post-dose intervals relative to clinical decision points | Captures pharmacologic response latency and plateau effects |
Counterfactual prediction head | Observed outcomes exist only for the treatment actually received | Generates potential BP outcomes under alternative dose paths | Enables patient-level estimation of unobserved dose-response trajectories |
Contrastive counterfactual loss | Factual prediction alone may ignore dose variation | Separates representations for clinically distinct dose alternatives | Encourages treatment-sensitive rather than purely prognostic representations |
Positivity diagnostics | Some dose alternatives may be rare or absent for specific patient profiles | Flags unreliable counterfactual predictions in unsupported regions | Prevents overconfident extrapolation beyond observed clinical practice |
Sensitivity analysis | Sequential ignorability cannot be proven from observational EHR data | Tests robustness to unmeasured adherence, diet, socioeconomic status, and missingness | Makes causal uncertainty explicit rather than hidden inside model performance metrics |
The standard transformer attention mechanism computes attention weights between all pairs of positions in a sequence, allowing information to flow from future tokens to current predictions [5]. This design is appropriate for language modeling where the full context is known at inference, but it violates causal validity for counterfactual estimation: predicting the effect of a dose change at time t should not incorporate blood pressure readings from time t+1 that occur after the dose was administered. The causal attention mask modifies the attention computation by setting the pre-softmax attention scores to negative infinity for all pairs where the key position (source) is greater than the query position (target). The corresponding attention weights are therefore exactly zero after the softmax, ensuring that predictions for time t depend only on information from times ≤ t [22, 23].
The masking structure is defined as follows: for a sequence of length T, the attention weight from position i (query) to position j (key) is masked when j > i. When predicting the potential outcome at time t+1 under a dose change that occurred at time t, the transformer can attend to all pre-treatment information including baseline covariates, prior blood pressure readings, medication history, and the dose change decision itself—but cannot attend to any post-treatment outcomes or future covariate values [24, 25]. This masking structure operationalizes the "no future information" condition required for counterfactual identification in time-series settings, directly encoding the temporal ordering that standard causal inference methods impose through separate modeling steps [26, 27].
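The masking rule can be illustrated with a minimal single-head sketch in plain Python. It omits learned projections, multiple heads, and batching, and the function name and toy inputs are assumptions made for illustration; it is not the framework's implementation.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def causal_attention(q: Matrix, k: Matrix, v: Matrix) -> Tuple[Matrix, Matrix]:
    """Single-head scaled dot-product attention with a causal mask.

    Query position i may attend to key position j only when j <= i.
    Returns the attended outputs and the attention-weight matrix.
    """
    d_k = len(q[0])
    outputs: Matrix = []
    weights: Matrix = []
    for i, q_i in enumerate(q):
        # Scores for future keys (j > i) are set to -inf before the softmax.
        scores = [
            sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k)
            if j <= i
            else -math.inf
            for j, k_j in enumerate(k)
        ]
        top = max(scores)  # finite, because j == i is always allowed
        exps = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        outputs.append(
            [sum(w * v_j[c] for w, v_j in zip(row, v)) for c in range(len(v[0]))]
        )
    return outputs, weights


# Toy sequence of three time steps with two features each.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_attention(x, x, x)
```

In this toy run the first time step can attend only to itself, so its output equals its own value vector, and every weight above the diagonal of `attn` is zero.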
Standard positional encodings in transformers encode absolute or relative position in the sequence, but counterfactual dose estimation requires distinguishing periods relative to treatment changes. The framework uses relative positional encodings that explicitly encode the time since the most recent dose change and the time until the next scheduled blood pressure measurement [5, 28]. For a patient whose dose was increased from lisinopril 10mg to 20mg at week 4, the encoding for week 5 should indicate "1 week post-dose increase," while the encoding for week 3 should indicate "1 week pre-dose increase." This temporal structuring enables the attention mechanism to learn patterns such as "blood pressure changes typically plateau 4 weeks after dose adjustment" across patients with different absolute treatment timings [20, 21].
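The backward-looking part of this encoding can be sketched as a simple feature computed from the dose history. The function name and the weekly indexing are assumptions for illustration. The sketch covers only the time since the most recent change, because the pre-change offset requires the timing of the change to be known from the decision point being evaluated.

```python
from typing import List, Optional, Sequence


def weeks_since_dose_change(doses_mg: Sequence[float]) -> List[Optional[int]]:
    """For each weekly step, weeks elapsed since the most recent dose change.

    Returns None for steps before any change has been observed. Only past
    and current doses are inspected, so the feature uses no future data.
    """
    result: List[Optional[int]] = []
    last_change: Optional[int] = None
    for t, dose in enumerate(doses_mg):
        if t > 0 and dose != doses_mg[t - 1]:
            last_change = t  # a change takes effect at this step
        result.append(None if last_change is None else t - last_change)
    return result


# Weeks 0-6: lisinopril increased from 10 mg to 20 mg at week 4.
offsets = weeks_since_dose_change([10, 10, 10, 10, 20, 20, 20])
```

Here `offsets[5]` is 1, matching the "1 week post-dose increase" label in the example above; the resulting integer could then be mapped to a learned or sinusoidal embedding.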
Each time step in the input sequence (typically weekly intervals, though irregularly sampled data can be handled with time-aware positional encodings) comprises a vector of clinical variables: systolic blood pressure (mmHg), diastolic blood pressure (mmHg), medication type encoded as a categorical variable with levels for ACEi, ARB, CCB, thiazide, beta-blocker, and combinations thereof, current dose in milligrams normalized by the maximum approved dose, co-medication indicators, age in years, weight in kilograms, and estimated glomerular filtration rate from serum creatinine [25, 26]. Missing values are common in EHR data; the framework uses forward imputation for blood pressure (carrying last observation forward) and indicator flags for missingness, though more sophisticated approaches such as missingness attention masks could be incorporated [27, 29].
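The two preprocessing steps named above, dose normalization and forward imputation with missingness flags, might look as follows. The helper names are hypothetical, and the 40 mg maximum used in the example is illustrative and should be taken from the relevant drug label.

```python
from typing import List, Optional, Sequence, Tuple


def forward_fill_with_flags(
    values: Sequence[Optional[float]],
) -> Tuple[List[Optional[float]], List[int]]:
    """Carry the last observation forward and flag imputed steps with 1."""
    filled: List[Optional[float]] = []
    flags: List[int] = []
    last: Optional[float] = None
    for v in values:
        if v is None:
            filled.append(last)  # stays None until a first value is observed
            flags.append(1)
        else:
            last = v
            filled.append(v)
            flags.append(0)
    return filled, flags


def normalize_dose(dose_mg: float, max_approved_mg: float) -> float:
    """Express a dose as a fraction of the maximum approved dose."""
    if max_approved_mg <= 0:
        raise ValueError("max_approved_mg must be positive")
    return dose_mg / max_approved_mg


# Weekly systolic readings with gaps (None marks a missing measurement).
sbp_filled, sbp_missing = forward_fill_with_flags([148.0, None, None, 139.0, None])
dose_fraction = normalize_dose(20.0, 40.0)  # illustrative 40 mg maximum
```

Keeping the flag alongside the imputed value lets the model distinguish a measured 148 mmHg from a carried-forward one, which matters when measurement frequency itself depends on disease control.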
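The encoder consists of L stacked layers, each containing multi-head self-attention with the causal mask described in Section 4, followed by layer normalization and position-wise feedforward networks [1]. Each attention head computes scaled dot-product attention, with the causal mask M added to the scores before the softmax:
\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} + M \right) V, \qquad M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases} \]
where Q, K, and V are the query, key, and value projections of the input sequence, d_k is the key dimension, i indexes the query position, and j indexes the key position.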
The counterfactual prediction task is defined as follows: given a patient's observed history up to time t, predict the blood pressure at time t+k (typically 4 weeks for antihypertensive dose response) under an alternative dose path that may differ from the dose actually administered [27, 28]. During training, the model learns from observed transitions where the actual dose and actual outcome are known; the objective is to minimize prediction error for factual outcomes while enforcing constraints that counterfactual predictions differ from factual predictions when the alternative dose differs from the observed dose. For a patient who received a dose increase from 10mg to 20mg at time t and had a subsequent blood pressure of 135/85 at t+4, the model must learn to predict that outcome for the factual scenario while also predicting what the blood pressure would have been under counterfactual scenarios (remain at 10mg, increase to 40mg, or switch to amlodipine) [8-10].
Standard training with only factual prediction error does not ensure that the model learns a valid causal mapping, as the model could ignore the dose variable entirely and predict based solely on baseline characteristics. The framework incorporates a contrastive loss that explicitly regularizes the representation to distinguish between different dose regimens [1, 11]. For a given patient and time point, the model generates predictions under multiple dose alternatives; the loss encourages the representations for different doses to diverge when the predicted outcomes differ, while remaining similar when the predicted outcomes are clinically equivalent. This approach builds on adversarial balanced representations and generative counterfactual estimation methods, adapted to the transformer architecture with causal masking [2, 3, 6]. The complete loss function combines factual mean squared error (between predicted and observed blood pressure for the actual dose), contrastive loss (encouraging dose-specific representation separation), and a regularization term that penalizes violations of the causal hierarchy constraints derived from the causal graph [4, 7].
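The text does not give the exact form of the contrastive term, so the sketch below substitutes a generic margin-based pairwise loss as one possible instantiation. The function names, the 2 mmHg equivalence threshold `eps`, the `margin`, and the weights `lam` and `gamma` are all assumptions. The causal-hierarchy regularizer is not specified in the text and is therefore passed in as a precomputed scalar.

```python
import math
from typing import Sequence


def factual_mse(pred: Sequence[float], obs: Sequence[float]) -> float:
    """Mean squared error between predicted and observed blood pressure."""
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(pred)


def contrastive_term(
    reps: Sequence[Sequence[float]],
    outcomes: Sequence[float],
    eps: float = 2.0,
    margin: float = 1.0,
) -> float:
    """Pairwise loss over dose alternatives for one patient and time point.

    Pairs whose predicted outcomes lie within eps mmHg are pulled together;
    pairs whose outcomes differ more are pushed at least margin apart.
    """
    total, n_pairs = 0.0, 0
    for i in range(len(reps)):
        for j in range(i + 1, len(reps)):
            dist = math.dist(reps[i], reps[j])
            if abs(outcomes[i] - outcomes[j]) <= eps:
                total += dist ** 2
            else:
                total += max(0.0, margin - dist) ** 2
            n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


def total_loss(pred, obs, reps, outcomes, reg=0.0, lam=0.1, gamma=0.01) -> float:
    """Weighted sum of the factual, contrastive, and regularization terms."""
    return factual_mse(pred, obs) + lam * contrastive_term(reps, outcomes) + gamma * reg


# Hypothetical values: two factual predictions, three dose alternatives.
loss = total_loss(
    pred=[130.0, 140.0],
    obs=[132.0, 138.0],
    reps=[[0.0, 0.0], [0.0, 0.5], [3.0, 4.0]],
    outcomes=[140.0, 139.0, 128.0],
)
```

In this example only the first two alternatives are treated as clinically equivalent, so the contrastive term penalizes the small distance between their representations and leaves the well-separated third alternative unpenalized.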
For each patient at each clinical decision point, the causal transformer estimates individualized dose-response curves by feeding multiple counterfactual dose regimens through the encoder with the same historical context [12, 13]. The output includes predicted systolic and diastolic blood pressure at 4, 8, and 12 weeks post-dose-change for each candidate dose: for lisinopril, predictions are generated for 5mg, 10mg, 20mg, and 40mg (or up to the maximum approved dose). The predicted dose-response curve enables identification of the minimum effective dose, defined as the lowest dose that achieves target blood pressure (below 130/80), as well as estimation of the incremental benefit of each dose escalation step. For a patient with baseline systolic pressure of 148 mmHg on lisinopril 10mg, the model might predict that increasing to 20mg achieves 130 mmHg (18 mmHg reduction) but increasing to 40mg achieves only 128 mmHg (2 mmHg additional reduction), suggesting diminishing returns beyond 20mg [14, 20].
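Given a set of predicted outcomes per candidate dose, the minimum effective dose and the incremental benefit of each escalation step follow from a short post-processing step. The sketch below assumes the predictions are already available as a dose-to-(SBP, DBP) mapping; the function names and the predicted values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

BP = Tuple[float, float]  # (systolic, diastolic) in mmHg


def minimum_effective_dose(
    predicted: Dict[float, BP], target: BP = (130.0, 80.0)
) -> Optional[float]:
    """Lowest dose whose predicted SBP and DBP are both below target."""
    for dose in sorted(predicted):
        sbp, dbp = predicted[dose]
        if sbp < target[0] and dbp < target[1]:
            return dose
    return None  # no candidate dose is predicted to reach target


def incremental_sbp_reduction(
    predicted: Dict[float, BP],
) -> List[Tuple[float, float, float]]:
    """(from_dose, to_dose, additional SBP reduction) per escalation step."""
    doses = sorted(predicted)
    return [
        (lo, hi, predicted[lo][0] - predicted[hi][0])
        for lo, hi in zip(doses, doses[1:])
    ]


# Hypothetical 4-week predictions for lisinopril (dose in mg).
curve = {10.0: (148.0, 92.0), 20.0: (129.0, 79.0), 40.0: (127.0, 78.0)}
med = minimum_effective_dose(curve)
steps = incremental_sbp_reduction(curve)
```

With these values the 20 mg dose is the first to bring both readings below 130/80, and the step from 20 mg to 40 mg adds only 2 mmHg of systolic reduction, the kind of diminishing return described above.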
Table 2 translates the model’s counterfactual outputs into clinically interpretable decision options while identifying the principal causal validity threat associated with each option.
Table 2. Counterfactual Decision Matrix for Individualized Antihypertensive Dose Selection
Clinical decision option | Counterfactual question estimated by the model | Required model output | Main validity threat | Clinical interpretation |
Maintain current dose | What would BP be if the patient remained on the present dose? | Predicted SBP/DBP at 4, 8, and 12 weeks under no titration | Confounding from patients with stable disease being more likely to remain on an unchanged dose | Appropriate when predicted BP reaches target or escalation benefit is minimal
Increase current dose | What would BP be if the current medication were titrated upward? | Dose-response curve across approved dose levels | Confounding by indication, because escalation is given to patients with uncontrolled BP | Supports titration when predicted incremental BP reduction is clinically meaningful |
Increase to maximum tolerated dose | Would maximal dose provide additional benefit beyond moderate titration? | Marginal BP reduction from intermediate to high dose | Positivity violation in patients rarely prescribed high doses | Useful for identifying diminishing returns or excessive extrapolation risk |
Add second agent | Would combination therapy outperform dose escalation alone? | Predicted BP under add-on therapy versus higher monotherapy dose | Treatment-selection bias from comorbidity, side effects, and clinician preference | Supports drug-class diversification when monotherapy response is predicted to plateau |
Switch drug class | Would an alternative drug class produce better BP control? | Predicted BP under replacement regimen | Sparse switching data and unmeasured intolerance history | Relevant when prior response or subgroup profile suggests poor class-specific benefit |
Defer automated recommendation | Are predictions insufficiently supported by observed data? | Uncertainty estimate, overlap warning, or missingness warning | Weak positivity, missing covariates, suspected unmeasured confounding | Preserves clinical oversight when counterfactual estimates are unreliable |
Beyond individual predictions, the framework supports subgroup analysis to identify patient characteristics associated with differential treatment benefit [15, 16]. By aggregating individualized dose-response estimates across patient cohorts, the model can query which subgroups show large responses to dose escalation versus which subgroups benefit more from add-on therapy. Potential effect modifiers include age (older adults often show greater sensitivity to ACE inhibitors but higher risk of adverse effects), race (Black patients typically have smaller renin-angiotensin system-mediated responses and greater benefit from thiazides or CCBs), baseline blood pressure severity, presence of chronic kidney disease (which alters drug pharmacokinetics and contraindicates certain agents at high doses), and genetic polymorphisms in drug-metabolizing enzymes [21, 29]. The transformer's attention weights themselves can be analyzed to identify which clinical features the model relies upon when making dose-specific counterfactual predictions, providing a form of explainable AI that complements the causal framework [17, 18].
In the absence of a gold-standard counterfactual dataset (where both the observed outcome and the outcome under an alternative dose are known for the same patient), evaluation relies on backtesting procedures that assess factual prediction accuracy and calibration [19, 22]. The model's factual predictions (blood pressure following the observed dose) are compared to actual outcomes using metrics including mean absolute error, root mean squared error, and calibration of prediction intervals across clinically relevant subgroups (e.g., defined by baseline BP strata, age groups, and comorbidity categories). A well-specified model should show no systematic bias across dose levels: overprediction of blood pressure in patients who received high doses would suggest residual confounding [23, 24]. Temporal cross-validation (training on earlier time periods, testing on later periods) assesses whether the causal structure remains stable over calendar time, which is particularly important given changes in hypertension treatment guidelines and the introduction of fixed-dose combination pills during the study period [25, 26].
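The factual backtesting metrics and the per-dose bias check can be sketched as follows. Function names and the example numbers are illustrative assumptions; calibration of prediction intervals and the temporal split are not shown.

```python
import math
from collections import defaultdict
from typing import Dict, Sequence


def backtest_metrics(pred: Sequence[float], obs: Sequence[float]) -> Dict[str, float]:
    """Mean absolute error and root mean squared error of factual predictions."""
    errors = [p - o for p, o in zip(pred, obs)]
    n = len(errors)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
    }


def mean_signed_error_by_dose(
    pred: Sequence[float], obs: Sequence[float], doses: Sequence[float]
) -> Dict[float, float]:
    """Average of (predicted - observed) within each observed dose level.

    A positive value means blood pressure is overpredicted at that dose.
    """
    buckets = defaultdict(list)
    for p, o, d in zip(pred, obs, doses):
        buckets[d].append(p - o)
    return {d: sum(e) / len(e) for d, e in sorted(buckets.items())}


# Hypothetical systolic predictions and outcomes for four patients.
pred = [130.0, 140.0, 150.0, 145.0]
obs = [132.0, 138.0, 144.0, 141.0]
doses = [10.0, 10.0, 40.0, 40.0]
overall = backtest_metrics(pred, obs)
bias = mean_signed_error_by_dose(pred, obs, doses)
```

In this toy example the overall error looks moderate, but the signed error is zero at 10 mg and 5 mmHg at 40 mg, the dose-dependent overprediction pattern that the text identifies as a sign of residual confounding.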
Performance is benchmarked against established causal inference methods for time-series treatment effect estimation, including marginal structural models fitted with inverse probability of treatment weighting, G-computation with recurrent neural networks, and doubly robust estimators that combine outcome modeling with propensity score weighting [14, 27]. Since ground-truth counterfactuals are never observed in real EHR data, benchmarking requires semi-synthetic datasets where the data-generating process is known and the true counterfactual outcomes can be computed. The simulation framework generates realistic hypertension patient trajectories based on clinical parameters extracted from published trial data (e.g., the SPRINT trial for intensive blood pressure lowering), with known dose-response functions and specified confounding structures [28, 29]. The causal transformer's mean squared error for counterfactual predictions is compared with that of the benchmark methods across varying degrees of confounding and different patterns of treatment discontinuation, with particular attention to performance under positivity violations (e.g., no patient with stage 2 hypertension remains on the lowest dose for extended periods) [5, 20].
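The value of a known data-generating process can be seen in a deliberately small single-step simulation. Every number below (baseline distribution, decision noise, a true effect of -10 mmHg, sample size, seed) is an assumption chosen for illustration and is not drawn from SPRINT or any other trial; the actual simulation framework would be longitudinal and far richer.

```python
import random
from statistics import mean
from typing import List, Tuple

Row = Tuple[bool, float, float, float]  # (escalated, observed, y0, y1)


def simulate(n: int = 5000, true_effect: float = -10.0, seed: int = 0) -> List[Row]:
    """Single-step simulation with confounding by indication.

    Patients with higher baseline systolic pressure are more likely to be
    escalated, and baseline pressure also drives the outcome.
    """
    rng = random.Random(seed)
    rows: List[Row] = []
    for _ in range(n):
        baseline = rng.gauss(150.0, 10.0)
        escalated = baseline + rng.gauss(0.0, 5.0) > 150.0  # sicker -> escalate
        noise = rng.gauss(0.0, 5.0)
        y0 = baseline + noise                # potential outcome without escalation
        y1 = baseline + true_effect + noise  # potential outcome with escalation
        rows.append((escalated, y1 if escalated else y0, y0, y1))
    return rows


rows = simulate()
# Naive contrast: mean observed outcome, escalated minus non-escalated.
naive = mean(r[1] for r in rows if r[0]) - mean(r[1] for r in rows if not r[0])
# True average effect, computable only because both potential outcomes are simulated.
truth = mean(r[3] - r[2] for r in rows)
```

Because escalation is concentrated in patients with higher baseline pressure, the naive contrast is pushed upward relative to the true effect of -10 mmHg, so a method's counterfactual error can be scored against `truth` in a way that real EHR data never allows.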
Three sensitivity analyses assess the robustness of conclusions to violations of the causal assumptions underlying the framework. Unmeasured confounding analysis introduces realistic unmeasured confounders (e.g., medication adherence estimated from pharmacy refill data but often missing in EHRs, dietary sodium intake, physical activity levels, socioeconomic status) into the simulation and quantifies how severely the causal transformer's estimates are biased as a function of the unmeasured confounder's strength [6, 21]. Positivity violation analysis examines performance in regions of the covariate space where certain dose levels are rarely prescribed, such as very high ACE inhibitor doses in patients with advanced chronic kidney disease, using propensity score overlap diagnostics to identify unreliable predictions. Missingness analysis evaluates the impact of non-random missing blood pressure measurements—for example, patients with poorly controlled hypertension may have more frequent clinic visits (more measurements) while those with excellent control may have fewer—by artificially inducing missingness patterns and comparing complete-case analysis to the proposed forward imputation with missingness indicators [7, 8].
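As a simplified stand-in for propensity score overlap diagnostics, the sketch below uses empirical dose frequencies within coarse patient strata to flag candidate doses with little or no support. This is a swapped-in, cruder technique than a fitted propensity model, and the function name, the stratum labels, and the 5% threshold are assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Tuple


def unsupported_doses(
    records: Iterable[Tuple[Hashable, float]],
    candidate_doses: List[float],
    min_share: float = 0.05,
) -> Dict[Hashable, List[float]]:
    """Per stratum, list candidate doses observed in under min_share of records."""
    counts: Dict[Hashable, Counter] = {}
    for stratum, dose in records:
        counts.setdefault(stratum, Counter())[dose] += 1
    flagged: Dict[Hashable, List[float]] = {}
    for stratum, c in counts.items():
        total = sum(c.values())
        # Counter returns 0 for doses never observed in the stratum.
        flagged[stratum] = [d for d in candidate_doses if c[d] / total < min_share]
    return flagged


# Hypothetical (stratum, observed dose in mg) records.
records = (
    [("ckd", 10.0)] * 18 + [("ckd", 20.0)] * 2
    + [("no_ckd", 10.0)] * 8 + [("no_ckd", 20.0)] * 8 + [("no_ckd", 40.0)] * 4
)
flags = unsupported_doses(records, [10.0, 20.0, 40.0])
```

In this example the 40 mg dose is never observed in the chronic kidney disease stratum, so counterfactual predictions for that combination would be marked as extrapolation rather than reported as ordinary estimates.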
The causal transformer inherits several technical limitations from its parent architecture and causal framework. The sequential ignorability assumption remains fundamentally untestable from observational data; although the causal attention mask enforces temporal ordering, it cannot detect or correct for unmeasured confounders that affect both treatment decisions and outcomes [9-11]. Attention patterns may capture spurious correlations that happen to satisfy the causal mask but do not correspond to genuine causal mechanisms, particularly when time-varying confounders are measured with error or at irregular intervals. The computational complexity of multi-head self-attention scales quadratically with sequence length O(T²), which becomes prohibitive for patients with very long hypertension histories (e.g., 10+ years of weekly observations), though this can be mitigated with sparse attention mechanisms or sequence truncation at the cost of losing early history information [1, 2, 5]. Finally, the framework models dose as a categorical or discrete-continuous variable, but continuous dose optimization (e.g., finding the exact milligram dose that achieves target BP) would require either extensive discretization or a fundamentally different approach such as a dose-conditioned neural network [3, 4].
From a clinical perspective, several important limitations constrain the framework's immediate applicability. Unmeasured confounders common in hypertension research—including medication adherence (patients who take 80% of prescribed doses vs those who take 50%), dietary sodium intake (which modifies BP response to ACE inhibitors and diuretics), physical activity, alcohol consumption, and socioeconomic status—remain major concerns that no purely observational method can fully resolve [12-14]. The framework assumes that treatment decisions occur at discrete time points (typically clinic visits), but in reality, patients may adjust doses in response to home blood pressure monitoring between visits, and such adjustments are rarely recorded in structured EHR fields. Validation in prospective randomized trials would be required before clinical deployment, as the framework cannot replace the evidence standard of a well-conducted randomized controlled trial such as those evaluating fixed-dose triple combinations or single-pill combinations [15, 16, 26, 27]. Additionally, the framework currently models blood pressure as the only outcome, but clinical decision-making also balances efficacy against adverse effects (e.g., cough with ACE inhibitors, edema with CCBs, electrolyte disturbances with thiazides), and extending the framework to multi-outcome counterfactual estimation with trade-offs remains an open challenge [17, 18].
This manuscript has presented a causal transformer architecture with attention masking for estimating counterfactual antihypertensive dose responses from observational electronic health records. The framework extends standard sequence models by explicitly encoding the temporal ordering required for causal identification, ensuring that predictions of potential outcomes under alternative dose regimens use only information available before the treatment decision. The causal attention mask, contrastive loss for counterfactual separation, and individualized dose-response estimation together provide a principled approach to personalized treatment recommendation from observational data.
The key advantages of this framework over conventional methods include its end-to-end learning of causally valid representations without manual specification of propensity score models, its ability to handle both dose titration and add-on therapy decisions within a unified architecture, and its patient-specific predictions that capture treatment effect heterogeneity across clinically relevant subgroups. Unlike marginal structural models that require separate modeling steps, the causal transformer directly encodes the temporal ordering of sequential decision-making in hypertension management.
Nevertheless, important limitations remain. The sequential ignorability assumption that all confounders are measured cannot be validated from observational data alone, and unmeasured confounders such as adherence, diet, and socioeconomic status may bias estimates in ways that no purely statistical method can detect. Missing data patterns in EHRs are often non-random, and the computational demands of full self-attention limit applicability to very long patient histories. Prospective validation in randomized controlled trials or targeted trial emulations would be required before clinical deployment.
We call for implementation of the proposed causal transformer framework on large-scale hypertension EHR cohorts, including the UK Biobank, the All of Us Research Program, and national hypertension registries from healthcare systems with comprehensive structured data on drug dosing and blood pressure measurements. Such implementations would enable empirical evaluation of the framework's performance relative to existing causal inference methods, characterization of the types of patients for whom counterfactual dose predictions are most reliable, and ultimately, the development of clinical decision support tools that provide evidence-based, personalized antihypertensive dosing recommendations. Translating this framework from proof-of-concept to clinical practice will require interdisciplinary collaboration among machine learning researchers, causal inference methodologists, clinical pharmacologists, and practicing hypertension specialists.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.