Chest X-ray is a commonly used imaging tool in both acute and routine care, but the growing reporting workload creates a need for structured preliminary reports that aid triage, reduce delays, and stay clinically relevant. Current AI systems often focus on classification or generic report generation and neglect critical inputs such as the free-text radiology request, the clinical history, and the comparison context, which yields reports that are fluent but insufficiently focused on the question asked. This article proposes a multimodal vision-language model that interprets both chest X-ray images and free-text radiology requests to generate structured preliminary reports that directly address the clinical question. The model combines a radiographic encoder based on vision transformers, a text encoder for requests and prior reports, a cross-modal attention module, and a structured report decoder that organizes the output into sections such as indication, technique, findings, impression, comparison, and answer-to-request. By aligning report generation with the clinical request, the model is designed to answer specific questions (for example, concerns about pneumonia, pulmonary oedema, or pneumothorax), with the aim of improving report relevance, reducing misinterpretation, and supporting safer human-in-the-loop review. Its effectiveness, however, depends on accurate alignment of the report with the image and the request, factual consistency, explicit handling of uncertainty, and validation in real-world radiology settings.
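To make the structured output concrete, the following is a minimal sketch of how the report sections named above could be represented and checked before human review. It is an illustration under stated assumptions, not the article's implementation: the class name `PreliminaryReport`, the helper `parse_report`, the `HEADING: text` line format, and the sample report text are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass
class PreliminaryReport:
    """Hypothetical container; section names follow the list in the abstract."""
    indication: str = ""
    technique: str = ""
    findings: str = ""
    impression: str = ""
    comparison: str = ""
    answer_to_request: str = ""

    def missing_sections(self) -> List[str]:
        # Sections left empty, so a reviewer can see what the decoder omitted.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        # One "HEADING: text" line per section, in declaration order.
        lines = []
        for f in fields(self):
            heading = f.name.replace("_", "-").upper()
            body = getattr(self, f.name).strip() or "[not provided]"
            lines.append(f"{heading}: {body}")
        return "\n".join(lines)


def parse_report(decoded: str) -> PreliminaryReport:
    """Parse 'HEADING: text' lines (an assumed decoder output format).

    Lines with unknown headings or without a colon are ignored.
    """
    known = {f.name for f in fields(PreliminaryReport)}
    values: Dict[str, str] = {}
    for line in decoded.splitlines():
        heading, sep, body = line.partition(":")
        key = heading.strip().lower().replace("-", "_")
        if sep and key in known:
            values[key] = body.strip()
    return PreliminaryReport(**values)


# Invented example text, for illustration only.
draft = parse_report(
    "INDICATION: Cough and fever, ?pneumonia\n"
    "FINDINGS: Right lower zone consolidation.\n"
    "ANSWER-TO-REQUEST: Appearances are consistent with pneumonia."
)
print(draft.render())
print("Missing:", draft.missing_sections())
```

Keeping answer-to-request as its own field, rather than folding it into the impression, means a reviewer or an automated check can flag drafts that never address the referrer's question.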