Rare dermatological conditions, often classed as orphan diseases, present major diagnostic challenges: low prevalence, limited clinician exposure, and a scarcity of well-labeled datasets together hinder the development of conventional AI systems. As a result, most deep learning models trained with supervised learning perform well on common skin diseases but fail to generalize to rare conditions. This leaves a significant gap in clinical decision support and contributes to delayed diagnoses and worse patient outcomes, especially in regions with limited specialist access. Contrastive language-image pre-training offers a promising alternative: it leverages dermatological images paired with unstructured clinical notes from electronic health records in a self-supervised manner, with the notes themselves serving as the training signal. This allows models to learn meaningful visual–textual relationships without large-scale manual annotation. The framework typically comprises an image encoder, a clinical text encoder, a contrastive alignment objective that pulls matching image-note pairs together in a shared embedding space while pushing mismatched pairs apart, and a zero-shot classification mechanism that assigns an image to the condition whose text prompt is most similar to it in that space. Because they learn from existing multimodal clinical data, such systems can generalize to previously unseen rare conditions and enable zero-shot diagnosis, reducing dependence on labeled datasets. The approach turns routine physician documentation into a rich supervisory signal, helping to overcome annotation bottlenecks and improving the applicability of AI in real-world dermatology settings. Ultimately, foundation models trained in this way offer a scalable path toward more inclusive and effective AI-assisted diagnosis of rare skin diseases.
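To make the contrastive alignment objective and the prompt-similarity classifier concrete, the following is a minimal NumPy sketch rather than the implementation of any particular system. It assumes the image and text encoders have already produced fixed-size embeddings, and random vectors stand in for those embeddings here. The function names (`contrastive_loss`, `zero_shot_classify`), the symmetric cross-entropy form of the loss, and the temperature value of 0.07 are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of image_emb is assumed to be paired with row i of text_emb.
    The temperature is a hypothetical default, not a tuned value.
    """
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image should pick out its own note among all notes.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each note should pick out its own image among all images.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


def zero_shot_classify(image_emb, prompt_emb, temperature=0.07):
    """Assign each image to the class whose prompt embedding is most similar."""
    logits = l2_normalize(image_emb) @ l2_normalize(prompt_emb).T / temperature
    probs = np.exp(log_softmax(logits, axis=1))
    return probs.argmax(axis=1), probs


# Toy demonstration with random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
note_emb = rng.normal(size=(4, 8))

# Images that nearly coincide with their paired notes give a low loss.
aligned_loss = contrastive_loss(note_emb + 0.01 * rng.normal(size=(4, 8)), note_emb)
# Reversing the batch mismatches every pair and gives a higher loss.
shuffled_loss = contrastive_loss(note_emb[::-1], note_emb)

# Three hypothetical class prompts; two query images lie near prompts 2 and 0.
prompt_emb = rng.normal(size=(3, 8))
query_emb = prompt_emb[[2, 0]] + 0.01 * rng.normal(size=(2, 8))
predictions, probabilities = zero_shot_classify(query_emb, prompt_emb)
```

In this sketch, a previously unseen condition is handled by adding one more row to `prompt_emb` (the embedding of a new text prompt); no retraining or additional labeled images are involved, which is the property the zero-shot mechanism relies on.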