CS Events

PhD Defense

Beyond Instance-level Reasoning in Object Pose Estimation and Tracking for Robotic Manipulation



Monday, May 02, 2022, 02:00pm - 04:00pm


The defense can also be joined online via Zoom:
Bowen Wen is inviting you to a scheduled Zoom meeting.

Topic: Bowen's PhD defense
Time: May 2, 2022 02:00 PM Eastern Time (US and Canada)

Join Zoom Meeting


Meeting ID: 952 2048 8975
Password: 887107
One tap mobile
+13126266799,,95220488975# US (Chicago)
+16465588656,,95220488975# US (New York)

Join By Phone
+1 312 626 6799 US (Chicago)
+1 646 558 8656 US (New York)
+1 301 715 8592 US (Washington DC)
+1 346 248 7799 US (Houston)
+1 669 900 9128 US (San Jose)
+1 253 215 8782 US (Tacoma)
Find your local number: https://rutgers.zoom.us/u/adnNdXgMMU

If you have any questions, please contact the Office of Information Technology Help Desk: https://it.rutgers.edu/help-support/

Speaker: Bowen Wen

Location: 1 Spring St, Rm 319, New Brunswick, NJ 08901


Committee:

Kostas Bekris (advisor)

Abdeslam Boularias

Dimitris Metaxas

Shuran Song (Columbia University - external)

Event Type: PhD Defense

Abstract: This thesis addresses object pose estimation and tracking, and applies them to solve robot manipulation tasks. It aims to address uncertainty due to dynamics and to generalize to novel object instances by reducing the dependency on either instance-level or category-level 3D models. Robot object manipulation often requires reasoning about object poses given visual data. For instance, pose estimation can be used to initiate pick-and-drop manipulation and has been studied extensively. Purposeful manipulation, however, such as precise assembly or within-hand re-orientation, requires sustained reasoning about an object's state, since dynamic effects due to contacts and slippage may alter the relative configuration between the object and the robotic hand. This motivates the temporal tracking of object poses over image sequences, which reduces computational latency while maintaining or even enhancing pose quality relative to single-shot pose estimation. Most existing techniques in this domain assume instance-level 3D models. This complicates generalization to novel, unseen instances and thus hinders deployment to novel environments. Even if instance-level 3D models are unavailable, however, it may be possible to access category-level models. Thus, it is desirable to learn category-level priors, which can be used for the visual understanding of novel, unknown object instances. In the most general case, where the robot has to deal with out-of-distribution instances or cannot access category-level priors, object-agnostic perception methods are needed. Given this context, this thesis proposes a category-level representation, called NUNOCS, to unify the representation of various intra-class object instances and facilitate the transfer of category-level knowledge across such instances.
This work also integrates the strengths of both modern deep learning and pose graph optimization to achieve generalizable object tracking in the SE(3) space, without needing either instance-level or category-level 3D models. When instance-level object models are available, a synthetic data generation pipeline is developed to learn the relative motion along manifolds by reasoning over image residuals. This achieves state-of-the-art SE(3) pose tracking results while circumventing manual effort in data collection or annotation. The thesis also demonstrates that the developed solutions for object tracking provide efficient solutions to multiple manipulation challenges. Specifically, this thesis starts from a single-image object pose estimation approach that deals with severe occlusions during manipulation. It then moves to long-term object pose tracking via reasoning over image residuals between consecutive frames, while training exclusively on synthetic data. In the case of object tracking along a video sequence, the dependency on either instance-level or category-level CAD models is reduced by leveraging multi-view consistency, in the form of a memory-augmented pose graph optimization, to achieve spatio-temporal consistency. For initializing pose estimates in video sequences involving novel unseen objects, category-level priors are extracted by taking advantage of easily accessible virtual 3D model databases. Following these ideas, frameworks for category-level, task-relevant grasping and for vision-based, closed-loop manipulation are developed, which resolve complicated, high-precision tasks. The learning process is scalable, as training is performed exclusively on synthetic data or through a robot's self-interaction process conducted solely in simulation.
The proposed methods are evaluated first on public computer vision benchmarks, boosting the previous state-of-the-art tracking accuracy from 33.3% to 87.4% on the NOCS dataset, despite reducing the dependency on category-level 3D models for training. When applied to real robotic setups, they significantly improve category-level manipulation performance, validating their effectiveness and robustness. In addition, this thesis unlocks and demonstrates multiple complex manipulation skills in open-world environments. This is achieved despite limited input assumptions, such as training solely on synthetic data, dealing with novel unknown objects, or learning from a single visual demonstration.


Rutgers University School of Arts and Sciences

Contact: Kostas Bekris