CS530 homework 3, due 11/1
In ss04pnj.csv.gz (gzip format) and csv_pnj.zip (zip format) you'll find the New Jersey population records from the U.S. Census Bureau's 2004 American Community Survey Public Use Microdata Sample.
The data is in comma-separated-values format, which means that each line (except the first line) is a record (representing a person) whose attributes are delimited by commas.
The first line of the file lists the names of the attributes.
For this homework, we only care about the following attributes.
- AGEP: Age, in years (0–92)
- JWMNP: Travel time to work, in minutes (1–166) or blank (not a worker or worker who worked at home; just treat this as 0)
- SEX: Sex, 1 (male) or 2 (female)
- RACAIAN: 1 (American Indian or Alaska Native) or 0 (not)
- RACASN: 1 (Asian) or 0 (not)
- RACBLK: 1 (Black or African American) or 0 (not)
- RACNHPI: 1 (Native Hawaiian or Other Pacific Islander) or 0 (not)
- RACWHT: 1 (White) or 0 (not)
- HISP: some number greater than 1 (Spanish/Hispanic/Latino) or 1 (not)
- MAR: Marital status, 1 (married) or 2 (widowed) or 3 (divorced) or 4 (separated) or 5 (never married or under 15 years old)
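As a starting point, here is a minimal sketch of one way to read the records in Python. It assumes the header line spells the attribute names exactly as listed above. The function names `parse_records` and `load_records` are made up for illustration, and the sample row is invented, not real census data.

```python
import csv
import gzip
import io

# Attributes the assignment uses; every other column in the file is ignored.
ATTRIBUTES = ["AGEP", "JWMNP", "SEX", "RACAIAN", "RACASN",
              "RACBLK", "RACNHPI", "RACWHT", "HISP", "MAR"]

def parse_records(lines):
    """Yield one dict of ints per record from an iterable of CSV lines."""
    for row in csv.DictReader(lines):
        # Blank fields become 0. The assignment specifies this only for
        # JWMNP; applying it to the other columns is an assumption.
        yield {name: int(row[name]) if row[name].strip() else 0
               for name in ATTRIBUTES}

def load_records(path="ss04pnj.csv.gz"):
    """Read every record from the gzip-compressed data file."""
    with gzip.open(path, mode="rt", newline="") as f:
        return list(parse_records(f))

# Invented two-line sample showing the blank JWMNP handling.
sample = io.StringIO(
    "AGEP,JWMNP,SEX,RACAIAN,RACASN,RACBLK,RACNHPI,RACWHT,HISP,MAR,EXTRA\n"
    "30,,2,0,0,0,0,1,1,5,x\n"
)
records = list(parse_records(sample))
```

With the real file, `load_records()[:5000]` would give the training set and `load_records()[5000:10000]` the next 5000 records.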
Your mission is to use the data to construct a decision tree that predicts someone's marital status from the other attributes.
Each branch in the decision tree can test
- whether the person is younger or older than a given age;
- whether the person's travel time to work is longer or shorter than a given number of minutes;
- whether the person is female or male; or
- whether one of RACAIAN, RACASN, RACBLK, RACNHPI, RACWHT, and HISP is 1.
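One possible way to represent these tests in code is as `(attribute, threshold)` pairs meaning "attribute ≤ threshold". This uniform encoding is an assumption made for the sketch, not something the assignment requires: `("SEX", 1)` separates male from female, `(flag, 0)` separates a race flag of 0 from 1, and `("HISP", 1)` separates HISP = 1 from larger values. The helper names `passes` and `candidate_tests` are hypothetical.

```python
RACE_FLAGS = ["RACAIAN", "RACASN", "RACBLK", "RACNHPI", "RACWHT"]

def passes(record, test):
    """Return True if the record satisfies `attribute <= threshold`."""
    attribute, threshold = test
    return record[attribute] <= threshold

def candidate_tests(records):
    """List every test worth trying on this set of records."""
    tests = []
    for attribute in ["AGEP", "JWMNP"]:
        values = sorted({r[attribute] for r in records})
        # Splitting at the largest observed value would send every record
        # the same way, so it is left out.
        tests += [(attribute, v) for v in values[:-1]]
    tests.append(("SEX", 1))                     # male vs. female
    tests += [(flag, 0) for flag in RACE_FLAGS]  # flag is 0 vs. flag is 1
    tests.append(("HISP", 1))                    # HISP is 1 vs. greater than 1
    return tests

# Invented records containing only the numeric attributes.
demo = [{"AGEP": 10, "JWMNP": 0}, {"AGEP": 44, "JWMNP": 30}]
demo_tests = candidate_tests(demo)
```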
Write code to answer the following questions.
Please submit the code as well as your write-up.
- (40 points) Grow a decision tree from the first 5000 records in the data file as the training set, using the information-gain heuristic to choose the attributes to test.
Your decision tree should not be obviously bigger than necessary; that is, it should never branch to two identical subtrees.
How large is the tree—in other words, how many tests does it have?
Describe the tree, especially the short paths from the root to a leaf: do they make sense?
- (40 points) Prune the decision tree by cross validation, using the next 5000 records in the data file as the test set.
How large is the pruned tree?
Pick one subtree that was pruned away and explain the cross validation that led to its demise.
- (20 points) Compare the performance (% examples correctly classified) of the unpruned and pruned decision trees on the training set, the cross-validation test set, and the remainder of the data file as the final test set.
Did pruning reduce overfitting and improve generalization?
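For part 1, the information-gain heuristic scores a test by how much it reduces the entropy of MAR. The sketch below shows the usual textbook definition, assuming records are dicts of integers and tests have the form "attribute ≤ threshold". The function names and the four sample records are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(records, attribute, threshold, target="MAR"):
    """Entropy of the target minus its expected entropy after the split."""
    left = [r[target] for r in records if r[attribute] <= threshold]
    right = [r[target] for r in records if r[attribute] > threshold]
    if not left or not right:
        return 0.0  # the test does not separate anything
    total = len(records)
    remainder = (len(left) / total) * entropy(left) \
              + (len(right) / total) * entropy(right)
    return entropy([r[target] for r in records]) - remainder

# Invented records: age 20 separates the two classes perfectly.
demo = [{"AGEP": 10, "MAR": 5}, {"AGEP": 20, "MAR": 5},
        {"AGEP": 50, "MAR": 1}, {"AGEP": 60, "MAR": 1}]
gain_at_20 = information_gain(demo, "AGEP", 20)  # 1.0 bit
gain_at_10 = information_gain(demo, "AGEP", 10)  # about 0.311 bits
```

Growing the tree then amounts to picking the test with the highest gain at each node and recursing on the two halves until no test has positive gain.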
To help you check your code for part 1, below is the decision tree we grew using the first 10 records.
This is not the only correct tree because, for example, nobody among the first 10 records is 43 years old.
- If age ≤ 42, then predict “never married or under 15 years old”.
- Otherwise, if age ≤ 45, then predict “married”.
- Otherwise, predict “widowed”.
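The tree above can be written down as a nested structure, reading the second rule as applying only when the first one fails. This is one possible encoding, chosen for the sketch: a leaf is a MAR code, and an internal node is a tuple `(attribute, threshold, if_le, if_gt)`. The names `EXAMPLE_TREE`, `predict`, and `count_tests` are hypothetical.

```python
MAR_LABELS = {1: "married", 2: "widowed", 3: "divorced", 4: "separated",
              5: "never married or under 15 years old"}

# age <= 42 -> 5; otherwise age <= 45 -> 1; otherwise -> 2
EXAMPLE_TREE = ("AGEP", 42, 5, ("AGEP", 45, 1, 2))

def predict(tree, record):
    """Walk from the root to a leaf and return the predicted MAR code."""
    while isinstance(tree, tuple):
        attribute, threshold, if_le, if_gt = tree
        tree = if_le if record[attribute] <= threshold else if_gt
    return tree

def count_tests(tree):
    """Size of the tree, measured as the number of internal (test) nodes."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + count_tests(tree[2]) + count_tests(tree[3])

label = MAR_LABELS[predict(EXAMPLE_TREE, {"AGEP": 44})]  # "married"
size = count_tests(EXAMPLE_TREE)                         # 2
```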
To help you check your code for parts 2 and 3, pruning the decision tree above using the next 10 records yields the following decision tree, which performs just as well (40% correct) on that test set.
- Predict “never married or under 15 years old”.
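The exact pruning procedure is not spelled out here, so the following is only a sketch of one common reading, reduced-error pruning: working bottom-up, replace a subtree with a single leaf whenever the leaf classifies at least as many of the held-out records reaching it correctly. Collapsing on ties matches the example above, where the pruned tree performs just as well. The tree encoding (a leaf is a MAR code, a node is `(attribute, threshold, if_le, if_gt)`), the choice of the training majority as the leaf label, and all sample records are assumptions made for illustration.

```python
from collections import Counter

def predict(tree, record):
    """Walk from the root to a leaf and return the predicted MAR code."""
    while isinstance(tree, tuple):
        attribute, threshold, if_le, if_gt = tree
        tree = if_le if record[attribute] <= threshold else if_gt
    return tree

def accuracy(tree, records, target="MAR"):
    """Fraction of records whose MAR the tree predicts correctly."""
    return sum(predict(tree, r) == r[target] for r in records) / len(records)

def prune(tree, train, held_out, target="MAR"):
    """Bottom-up pruning; `train` and `held_out` are the records reaching `tree`."""
    if not isinstance(tree, tuple) or not train:
        return tree
    attribute, threshold, if_le, if_gt = tree
    def split(records):
        return ([r for r in records if r[attribute] <= threshold],
                [r for r in records if r[attribute] > threshold])
    (train_le, train_gt), (held_le, held_gt) = split(train), split(held_out)
    tree = (attribute, threshold,
            prune(if_le, train_le, held_le, target),
            prune(if_gt, train_gt, held_gt, target))
    # Candidate leaf: the most common label among training records here.
    leaf = Counter(r[target] for r in train).most_common(1)[0][0]
    kept = sum(predict(tree, r) == r[target] for r in held_out)
    collapsed = sum(leaf == r[target] for r in held_out)
    return leaf if collapsed >= kept else tree

# Invented data, not taken from the census file.
tree = ("AGEP", 42, 5, ("AGEP", 45, 1, 2))
train = [{"AGEP": 10, "MAR": 5}, {"AGEP": 20, "MAR": 5}, {"AGEP": 30, "MAR": 5},
         {"AGEP": 44, "MAR": 1}, {"AGEP": 60, "MAR": 2}, {"AGEP": 70, "MAR": 2}]
held_out = [{"AGEP": 15, "MAR": 5}, {"AGEP": 25, "MAR": 5},
            {"AGEP": 50, "MAR": 1}, {"AGEP": 43, "MAR": 2}]
pruned = prune(tree, train, held_out)
```

On this invented data the inner test on age 45 is collapsed into a "widowed" leaf, while the root test survives because a single leaf would classify fewer held-out records correctly.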