Immune checkpoint inhibitors are a standard treatment for advanced non-small cell lung cancer (NSCLC), but durable responses occur only in a subset of patients. Whole-slide histopathology images (WSIs) provide morphological and immune microenvironment information, while gene expression data capture pathway activity and resistance mechanisms. Approaches based on a single modality, either histopathology or genomics, capture only part of this complementary information, which limits accurate stratification of responders and non-responders and can lead to suboptimal treatment selection. We propose a multimodal fusion network that integrates WSIs and gene expression data to predict immunotherapy response in NSCLC. The network is trained end to end and has three components: a multiple instance learning (MIL) module that encodes each WSI, a gene expression encoder with attention over gene sets, and a cross-attention fusion module that learns a joint representation from the two encoders. Given paired slide and expression data, the model outputs a response probability that can be thresholded into a binary responder or non-responder call. By relating immune infiltration patterns to transcriptomic activity, the model is designed to capture complementary morphological and molecular signals. The attention weights support interpretability by highlighting the tissue regions and gene pathways that drive each prediction, and the architecture is intended to remain robust when one modality is partially missing. By integrating histopathology and genomic data, the framework aims to improve NSCLC immunotherapy response prediction and offers a step toward more precise patient stratification in precision oncology.
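To make the data flow of the architecture described above concrete, the following is a minimal forward-pass sketch in plain Python, not the authors' implementation. It assumes that patch embeddings and gene-set embeddings have already been produced by the two encoders (random vectors stand in for them here), uses attention-based pooling as one possible MIL aggregator, and omits the learned query, key, and value projections that a trained cross-attention layer would normally have. All function names, parameter names, and dimensions (`cross_attention`, `attention_pool`, `predict_response`, `params`, `dim`) are hypothetical, and the weights are untrained, so the output probability has no clinical meaning.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    """Weighted sum of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    return math.exp(x) / (1.0 + math.exp(x))


def attention_pool(vectors, scorer):
    """Attention-based pooling: softmax over per-item scores, then a weighted sum.

    Returns the pooled vector and the attention weights (usable for inspection).
    """
    weights = softmax([dot(scorer, v) for v in vectors])
    return weighted_sum(weights, vectors), weights


def cross_attention(queries, context):
    """Scaled dot-product attention of each query over `context`.

    Simplification (assumption): `context` serves as both keys and values,
    with no learned projections. A residual connection keeps the query signal.
    """
    scale = math.sqrt(len(queries[0]))
    fused = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in context])
        attended = weighted_sum(weights, context)
        fused.append([a + b for a, b in zip(q, attended)])
    return fused


def predict_response(patch_emb, geneset_emb, params):
    """Fuse both modalities and return (probability, patch weights, gene-set weights)."""
    # Gene-set tokens query the slide patches for morphological context.
    fused_tokens = cross_attention(geneset_emb, patch_emb)
    # Pool each stream to a single vector.
    slide_vec, patch_w = attention_pool(patch_emb, params["patch_scorer"])
    fused_vec, gene_w = attention_pool(fused_tokens, params["gene_scorer"])
    # Concatenate and apply a linear head followed by a sigmoid.
    joint = slide_vec + fused_vec
    logit = dot(params["head"], joint) + params["bias"]
    return sigmoid(logit), patch_w, gene_w


# Toy usage: random stand-ins for encoder outputs and untrained weights.
rng = random.Random(0)
dim = 4


def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


patches = [rand_vec(dim) for _ in range(6)]    # 6 hypothetical WSI patches
gene_sets = [rand_vec(dim) for _ in range(3)]  # 3 hypothetical gene sets
params = {
    "patch_scorer": rand_vec(dim),
    "gene_scorer": rand_vec(dim),
    "head": rand_vec(2 * dim),
    "bias": 0.0,
}

prob, patch_w, gene_w = predict_response(patches, gene_sets, params)
print(f"response probability: {prob:.3f}")
print("patch attention:", [round(w, 3) for w in patch_w])
print("gene-set attention:", [round(w, 3) for w in gene_w])
```

In this sketch the two lists of pooling weights play the interpretability role the abstract attributes to attention: `patch_w` ranks tissue patches and `gene_w` ranks gene sets. Letting the gene-set tokens act as queries over the patches is only one possible direction for the cross-attention; the reverse direction, or a bidirectional variant, would fit the description equally well.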