01:198:210 Data Management for Data Science
- Course Number: 01:198:210
- Instructor: Sesh Venugopal
- Course Type: Undergraduate
- Semester 1: SPRING
- Credits: 4
This course is intended to be the third offering in the sequence of courses for the SAS Data Science Certificate program. It is designed to provide students with the knowledge and skills needed to acquire and curate real-world data, to explore the data to discover patterns and distributions, and to manage large datasets with databases.
Students will learn the minimal aspects of Python needed to acquire and curate datasets. Much of their work will be done using Python libraries that deliver maximum benefit with minimal programming effort. With these libraries, students will get data from various online sources, detect which aspects of the data are uncurated or unreliable and understand why, learn various domain-independent and domain-dependent ways to curate the data, and get the curated data into a form that can be explored, managed, and analyzed. Students will also learn how to get datasets into database-ready form and do basic analysis of such datasets using relational databases with SQL, as well as NoSQL databases.
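As an illustration only (not taken from the course materials), the curate-then-query workflow described above can be sketched with Python's standard library. The sample data, the column names `city` and `temp_f`, the table name `readings`, and the helper functions `curate` and `load_and_query` are all hypothetical; the course itself may use different libraries and datasets.

```python
import csv
import io
import sqlite3

# Hypothetical raw data with typical problems: stray whitespace,
# a missing value, and a non-numeric placeholder.
RAW_CSV = """city,temp_f
 New Brunswick ,68
Newark,
Princeton,n/a
Trenton,75
"""


def curate(raw_text):
    """Return cleaned (city, temp_f) rows, dropping rows with unusable readings."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        city = record["city"].strip()
        try:
            temp = float(record["temp_f"])
        except (TypeError, ValueError):
            continue  # missing or non-numeric reading: skip this row
        rows.append((city, temp))
    return rows


def load_and_query(rows):
    """Load curated rows into an in-memory SQLite table and return the average reading."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (city TEXT, temp_f REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    (avg_temp,) = conn.execute("SELECT AVG(temp_f) FROM readings").fetchone()
    conn.close()
    return avg_temp


clean_rows = curate(RAW_CSV)
average = load_and_query(clean_rows)
print(clean_rows)
print(average)
```

Dropping unusable rows is only one possible curation choice; depending on the domain, imputing or flagging such values may be more appropriate.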
The course content is designed to be accessible to all SAS students regardless of their major. Although the course lists CS 111 as one of its prerequisites (for Computer Science students), it does not require students to have any programming experience, since the other prerequisite course (CS 142) gives students only nominal exposure to R.
- Prerequisite Information:
Prerequisites: CS 142 (Data 101: Data Literacy) OR CS 111 (Introduction to Computer Science)
- Expected Work: To assess whether students have acquired basic literacy in all the concepts, tools, and techniques they are taught, they will be given 6 quizzes through the semester (typically one every two weeks), of which the lowest-scoring quiz will be dropped for grading.
Students are expected to work on 4-5 homework assignments through the semester. These assignments are intended for students to learn by doing, since data management is very much a hands-on experience. By doing the assignments, students will learn firsthand how to seek out various data sources, get data from such sources, and curate, structure, visualize, explore, and manage such data. Each assignment will typically have a 2- to 3-week window for submission. The assignments will be done either individually or in groups.
Students will be required to do one course project, in groups. They will start working on the project midway through the semester. Each group will pick one rich data source from which they will acquire data, curate, structure, and explore the data, and present the results of their analysis and discovery. Each group will be required to use the subset of concepts and tools learned in the course that applies most effectively to their choice of data source and dataset. Student groups will submit their project in incremental stages, each of which will be graded with feedback to keep them on track for successful completion.
The final exam will assess each student's ability to combine the concepts and tools learned in the course to solve a range of specific, localized data management problems arising in various real-world scenarios.
No required text.
Content for the Python part of the course will be made available via Jupyter notebooks, online documentation (usage samples and APIs accessible from within Jupyter notebooks, and other external sites), and various online data sources.