This thesis presents an end-to-end object reconstruction pipeline that covers image data acquisition, processing, and visualization. The proposed system produces high-fidelity 3D models by tracking a handheld object through a sequence of RGB-D frames. A semantic segmentation network first removes the operator's hand from each frame to create an object mask. The mask is then used to track the pose of the object over time. Finally, a truncated signed distance function (TSDF) representation is used to fuse the aligned frames into a global model.
Previous approaches either target static scene reconstruction, and therefore ignore dynamic elements, or depend on the availability of a well-defined reference platform for scanning. In contrast to these conventional approaches, the proposed pipeline has an operator handle the object in the field of view of a stationary depth-enabled camera, similar to how humans visually inspect objects. As a result, the scanning process is intuitive and easy for end users. Further, a two-stage training regime is proposed to enhance hand segmentation accuracy. Our approach can serve as an end-to-end scanning module capable of producing true-to-scale 3D CAD models ready for use in other computer vision tasks such as object tracking or pose estimation.
A manually labeled hand segmentation dataset was created to evaluate the effectiveness of the proposed training regime, different segmentation algorithms, and normalization techniques. The performance of the proposed pipeline is demonstrated by scanning a variety of objects with two different camera setups. Reconstructed models are compared against ground truth, and reconstruction accuracy is measured with a normalized error metric.
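
To make the fusion step concrete, the following is a minimal sketch of the standard running weighted-average update used with truncated signed distance functions, reduced to the voxels along a single camera ray. It is an illustration under stated assumptions, not the implementation used in this thesis: the class name `TsdfRay`, the 2 cm truncation band, the unit per-frame weight, and the `valid` flag (standing in for the hand and object mask) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TsdfRay:
    """Voxels along one camera ray: a hypothetical 1-D stand-in for a 3-D grid."""
    voxel_depths: List[float]      # distance of each voxel centre from the camera (m)
    truncation: float = 0.02       # truncation band in metres (assumed value)
    tsdf: List[float] = field(default_factory=list)
    weight: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Unobserved voxels start as "free space" with zero confidence.
        self.tsdf = [1.0] * len(self.voxel_depths)
        self.weight = [0.0] * len(self.voxel_depths)

    def integrate(self, measured_depth: float, valid: bool = True) -> None:
        """Fuse one aligned depth measurement into the ray.

        `valid` is False when the pixel is masked out (for example, hand pixels),
        in which case the measurement is ignored.
        """
        if not valid:
            return
        for i, z in enumerate(self.voxel_depths):
            sdf = measured_depth - z            # positive in front of the surface
            if sdf < -self.truncation:
                continue                        # far behind the surface: unobserved
            d = min(1.0, sdf / self.truncation) # truncate and normalise to [-1, 1]
            w = self.weight[i]
            # Running weighted average with a unit weight per frame (assumption).
            self.tsdf[i] = (w * self.tsdf[i] + d) / (w + 1.0)
            self.weight[i] = w + 1.0

    def surface_depth(self) -> Optional[float]:
        """Return the first zero crossing along the ray, or None if none is observed."""
        for i in range(len(self.tsdf) - 1):
            a, b = self.tsdf[i], self.tsdf[i + 1]
            if self.weight[i] > 0 and self.weight[i + 1] > 0 and a > 0 >= b:
                t = a / (a - b)                 # linear interpolation between voxels
                step = self.voxel_depths[i + 1] - self.voxel_depths[i]
                return self.voxel_depths[i] + t * step
        return None


if __name__ == "__main__":
    ray = TsdfRay([0.40 + 0.01 * i for i in range(21)])
    ray.integrate(0.495)   # two noisy observations of the same surface point
    ray.integrate(0.505)
    print(ray.surface_depth())
```

Averaging the truncated distances across frames is what lets noisy per-frame depth readings settle on a single surface estimate. In a full pipeline the same update would run over a 3D voxel grid, with each voxel projected into the pose-aligned frame to look up its depth measurement.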