Uncrewed Aerial Vehicles (UAVs) have become a pivotal platform for data acquisition in a variety of applications, particularly inspection, monitoring, and modeling. However, their limited flight time and onboard energy budget necessitate intelligent data acquisition systems for these platforms. When no geometric proxy of the target exists and its location is unknown, it becomes crucial to establish a model that enables the target to be detected from visual data. The UAV then acquires data incrementally as it navigates around the target, using onboard sensors to complete its interpretation of the target or the scene. To accomplish this, the UAV follows a strategy known as view planning, which plays a critical role in the three-dimensional (3D) reconstruction of infrastructure from UAV-based imaging and strongly influences the quality of the reconstruction results. The selected views must reveal as much previously unknown information about the target as possible, ensuring both efficiency and utility; an illustrative sketch of this selection criterion is given below. In this work, we propose three essential components of an intelligent data acquisition system: i) identification of the optimal views to prioritize, ii) multi-task sensor fusion for depth completion and object detection, and iii) reinforcement learning for appearance-based next-best-view (NBV) planning. The main focus of this study is to establish a relationship between the visual features observed in two-dimensional (2D) images and the corresponding 3D model of the target, thereby avoiding the computational cost of handling 3D data.
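To make the view-selection criterion concrete, the following minimal sketch shows a conventional greedy NBV loop that scores each candidate view by how much currently unknown space it would observe and picks the highest-scoring one. The names `ViewCandidate`, `visible_from`, and `information_gain` are hypothetical placeholders chosen for illustration; this is not the appearance-based RL planner proposed in this work, only the baseline selection principle it builds on.

```python
# Illustrative greedy next-best-view (NBV) selection over a voxelized scene.
# All names here are illustrative assumptions, not the proposed method.
from dataclasses import dataclass
from typing import Callable, Iterable, Set, Tuple

Voxel = Tuple[int, int, int]  # integer grid index of a voxel


@dataclass
class ViewCandidate:
    position: Tuple[float, float, float]     # camera position in the world frame
    orientation: Tuple[float, float, float]  # e.g., yaw, pitch, roll


def information_gain(view: ViewCandidate,
                     unknown: Set[Voxel],
                     visible_from: Callable[[ViewCandidate], Set[Voxel]]) -> int:
    """Count how many still-unknown voxels this view would observe."""
    return len(visible_from(view) & unknown)


def next_best_view(candidates: Iterable[ViewCandidate],
                   unknown: Set[Voxel],
                   visible_from: Callable[[ViewCandidate], Set[Voxel]]) -> ViewCandidate:
    """Greedy NBV: choose the candidate that reveals the most unknown space."""
    return max(candidates,
               key=lambda v: information_gain(v, unknown, visible_from))
```

In practice, `visible_from` would be implemented by ray casting against the current occupancy map; the appearance-based approach studied here replaces this explicit 3D bookkeeping with a policy learned from 2D image features.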