Automatic image content description is a key problem in artificial intelligence that connects computer vision and natural language processing. The primary challenge is the design of a multi-modal approach rich enough to reason simultaneously about the contents of images and their representation in terms of words or sentences. We present a multi-modal approach based on a deep learning architecture that combines recent advances in computer vision, such as salient object proposal prediction and object detection, to generate natural sentences describing an image. Leveraging recent advances in the recognition of objects, their attributes, and their locations allows us to drive natural language generation systems, but such systems remain limited in their expressivity. Moreover, current object detection methods still suffer from localization errors and are slow at test time, which makes them unreliable and inadequate. We attack this problem within three main research scenarios: (i) category-independent, saliency-based object proposals with two distinct classification schemes; (ii) learning with a single but large Operational Neural Network (ONN); and (iii) ‘divide and conquer’ learning with an ensemble of simple but ‘expert’ ONNs, each dedicated to a subset of the dataset with a certain homogeneity.
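The ‘divide and conquer’ ensemble scenario can be sketched as follows. As an illustrative simplification (not the ONN architecture itself), we assume a k-means partitioning of the data into homogeneous subsets and a least-squares regressor standing in for each ‘expert’: every expert is trained only on its own cluster, and at test time a sample is routed to the expert of its nearest centroid.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain k-means: partition X into k homogeneous subsets."""
    # Deterministic init: centroids spread across the data order.
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each sample to its nearest centroid.
        labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

class ExpertEnsemble:
    """'Divide and conquer': one simple expert per data cluster.

    Each expert here is a linear least-squares model, a stand-in for
    the compact 'expert' ONNs described in the text.
    """
    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        self.centroids, labels = kmeans(X, self.k)
        self.experts = []
        for j in range(self.k):
            Xj, yj = X[labels == j], y[labels == j]
            A = np.hstack([Xj, np.ones((len(Xj), 1))])  # bias column
            w, *_ = np.linalg.lstsq(A, yj, rcond=None)  # fit expert j on its subset only
            self.experts.append(w)
        return self

    def predict(self, X):
        # Route each test sample to the expert of its nearest centroid.
        idx = np.argmin(((X[:, None, :] - self.centroids[None, :, :]) ** 2).sum(-1), axis=1)
        A = np.hstack([X, np.ones((len(X), 1))])
        return np.array([A[i] @ self.experts[j] for i, j in enumerate(idx)])
```

On data drawn from two well-separated regimes with different input–output mappings, a single linear expert cannot fit both, while the routed ensemble fits each regime exactly; this is the homogeneity argument behind the ensemble of ‘expert’ ONNs.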