Convolutional neural networks (CNNs) are widely used for the recognition and classification of scene images due to their effectiveness in this task. However, their performance degrades when the input data undergo geometric variations such as rotation, scaling, and translation. To overcome this drawback, this study presents a feature fusion technique that combines Hu moments with deep features derived from a CNN model. Hu moments are statistical values computed from image pixel intensities that are invariant to geometric transformations. These moments are combined with the features of the fully connected layer of the CNN model, making the proposed method more accurate and robust. The study also utilizes data augmentation, specifically geometric transformations such as rotation, scaling, flipping, and translation, to balance the class distribution in the training datasets and reduce the inter-class bias resulting from the unequal number of images across classes. The fused feature representation was evaluated on three benchmark datasets: MIT67, AID, and Scene15. Detailed experiments with different CNN models were conducted, and Inception-ResNetV2 as the deep feature extractor combined with Hu moments demonstrated the effectiveness of the proposed approach, delivering significant improvements in accuracy: 96.4% on Scene15, 94.1% on AID, and 87.1% on MIT67. These results present a novel avenue for enhancing the robustness and accuracy of scene understanding.
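The core of the described pipeline can be illustrated with a minimal sketch: the seven Hu invariant moments are computed from normalized central moments of the image, and the resulting vector is concatenated with the CNN's fully connected features. The sketch below is an assumption-based illustration, not the authors' implementation; the `cnn_features` vector and its dimensionality are hypothetical placeholders for the output of a deep feature extractor such as Inception-ResNetV2.

```python
import numpy as np

def hu_moments(img):
    """Compute the seven Hu invariant moments of a 2-D grayscale image."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    y, x = np.mgrid[:h, :w].astype(np.float64)
    m00 = img.sum()
    xc, yc = (x * img).sum() / m00, (y * img).sum() / m00

    # Central moment mu_pq and normalized central moment eta_pq
    def mu(p, q):
        return ((x - xc) ** p * (y - yc) ** q * img).sum()

    def eta(p, q):
        return mu(p, q) / m00 ** (1 + (p + q) / 2)

    e20, e02, e11 = eta(2, 0), eta(0, 2), eta(1, 1)
    e30, e03, e21, e12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    return np.array([
        e20 + e02,
        (e20 - e02) ** 2 + 4 * e11 ** 2,
        (e30 - 3 * e12) ** 2 + (3 * e21 - e03) ** 2,
        (e30 + e12) ** 2 + (e21 + e03) ** 2,
        (e30 - 3 * e12) * (e30 + e12) * ((e30 + e12) ** 2 - 3 * (e21 + e03) ** 2)
        + (3 * e21 - e03) * (e21 + e03) * (3 * (e30 + e12) ** 2 - (e21 + e03) ** 2),
        (e20 - e02) * ((e30 + e12) ** 2 - (e21 + e03) ** 2)
        + 4 * e11 * (e30 + e12) * (e21 + e03),
        (3 * e21 - e03) * (e30 + e12) * ((e30 + e12) ** 2 - 3 * (e21 + e03) ** 2)
        - (e30 - 3 * e12) * (e21 + e03) * (3 * (e30 + e12) ** 2 - (e21 + e03) ** 2),
    ])

# Toy grayscale image: a bright rectangle on a dark background.
img = np.zeros((64, 64))
img[20:40, 10:50] = 1.0

# Hypothetical CNN fully connected features (e.g. a 1536-d pooled vector).
cnn_features = np.random.rand(1536)

# Feature fusion by simple concatenation, as described in the abstract.
fused = np.concatenate([cnn_features, hu_moments(img)])
```

Because the Hu moments are rotation-, scale-, and translation-invariant, the fused vector retains geometric robustness that the raw CNN features alone may lack; a 90-degree rotation of the image, for instance, leaves all seven invariants unchanged up to floating-point rounding.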