Modern computer vision models mostly rely on massive human-annotated datasets for supervised training, and they are typically learned passively from a static dataset.
This work explores three settings in which large-scale dataset supervision is scarce, and proposes novel learning paradigms that go beyond passive training. First, we address active learning for histopathological image diagnosis systems and propose an active selection algorithm based on constrained submodular function maximization; the results show that the actively selected training set is compact and outperforms those chosen by state-of-the-art selection algorithms. Second, we propose a novel semantic amodal segmentation task, in which the segmentation masks of occluded objects are predicted, and hard occluded examples are actively synthesized for training. Third, we address learning a visual grounding task through natural language interaction, in which two agents are trained to interact via interpretable dialogue to achieve a common goal.
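To make the first contribution more concrete, the following is a minimal sketch of greedy selection under a submodular objective. It is an illustration under stated assumptions, not the algorithm proposed in this work: the abstract does not specify the objective or the constraint, so the sketch assumes a facility-location objective over a pairwise similarity matrix and a simple cardinality budget. The names `facility_location`, `greedy_select`, `sim`, and `budget` are hypothetical.

```python
from typing import List, Sequence


def facility_location(sim: Sequence[Sequence[float]], selected: List[int]) -> float:
    """Assumed objective: how well the selected items 'cover' every item.

    For each item i, take its best similarity to any selected item,
    then sum over all items. This function is monotone and submodular
    when similarities are non-negative.
    """
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_select(sim: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Greedily pick up to `budget` items, maximizing marginal gain each step.

    The cardinality budget is an assumed stand-in for the constraint;
    the actual constraint used in the work may differ.
    """
    n = len(sim)
    selected: List[int] = []
    current = 0.0
    for _ in range(min(budget, n)):
        best_gain, best_item = 0.0, None
        for j in range(n):
            if j in selected:
                continue
            gain = facility_location(sim, selected + [j]) - current
            if gain > best_gain:
                best_gain, best_item = gain, j
        if best_item is None:  # no remaining item improves the objective
            break
        selected.append(best_item)
        current += best_gain
    return selected


if __name__ == "__main__":
    # Toy similarity matrix with two clusters: items {0, 1} and items {2, 3}.
    toy_sim = [
        [1.0, 0.9, 0.1, 0.0],
        [0.9, 1.0, 0.1, 0.0],
        [0.1, 0.1, 1.0, 0.8],
        [0.0, 0.0, 0.8, 1.0],
    ]
    print(greedy_select(toy_sim, budget=2))
```

With a budget of two on the toy matrix, the greedy procedure selects one item from each cluster, which illustrates why a selected training set can stay compact while still covering the data. For monotone submodular objectives under a cardinality constraint, this greedy scheme is known to achieve a (1 - 1/e) approximation of the optimum.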