CS530 homework 3, due 11/1

In ss04pnj.csv.gz (gzip format) and csv_pnj.zip (zip format) you'll find the New Jersey population records from the U.S. Census Bureau's 2004 American Community Survey Public Use Microdata Sample. The data is in comma-separated-values format, which means that each line (except the first line) is a record (representing a person) whose attributes are delimited by commas. The first line of the file lists the names of the attributes. For this homework, we only care about the following attributes.

AGEP: Age, in years (0–92)
JWMNP: Travel time to work, in minutes (1–166) or blank (not a worker or worker who worked at home; just treat this as 0)
SEX: Sex, 1 (male) or 2 (female)
RACAIAN: 1 (American Indian or Alaska Native) or 0 (not)
RACASN: 1 (Asian) or 0 (not)
RACBLK: 1 (Black or African American) or 0 (not)
RACNHPI: 1 (Native Hawaiian or Other Pacific Islander) or 0 (not)
RACWHT: 1 (White) or 0 (not)
HISP: some number greater than 1 (Spanish/Hispanic/Latino) or 1 (not)
MAR: Marital status, 1 (married) or 2 (widowed) or 3 (divorced) or 4 (separated) or 5 (never married or under 15 years old)
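
As a starting point, the records could be read along the following lines. This is only a sketch: the helper names (`parse_records`, `load_records`, `split_records`) are made up for illustration, it assumes the header row uses exactly the attribute names listed above, and it assumes that JWMNP is the only one of these attributes that can be blank. The two demo rows are invented, not taken from the survey.

```python
import csv
import gzip

FEATURES = ["AGEP", "JWMNP", "SEX", "RACAIAN", "RACASN",
            "RACBLK", "RACNHPI", "RACWHT", "HISP"]
TARGET = "MAR"

def parse_records(lines):
    """Turn CSV lines (header first) into (features, marital_status) pairs."""
    records = []
    for row in csv.DictReader(lines):
        features = {}
        for name in FEATURES:
            text = row[name].strip()
            # Assumption: only JWMNP may be blank; a blank is treated as 0.
            features[name] = 0 if (name == "JWMNP" and not text) else int(text)
        records.append((features, int(row[TARGET])))
    return records

def load_records(path="ss04pnj.csv.gz"):
    """Read the gzip-compressed file named in the assignment."""
    with gzip.open(path, "rt", newline="") as f:
        return parse_records(f)

def split_records(records, n=5000):
    """First n records for training, next n for pruning, the rest for final testing."""
    return records[:n], records[n:2 * n], records[2 * n:]

# Two invented rows (not real survey data) to show the blank-JWMNP handling.
demo = parse_records([
    "AGEP,JWMNP,SEX,RACAIAN,RACASN,RACBLK,RACNHPI,RACWHT,HISP,MAR",
    "30,,1,0,0,0,0,1,1,5",
    "52,25,2,0,0,0,0,1,1,1",
])
```

Columns in the file other than the ten listed are simply ignored by this sketch, since each row is looked up by attribute name.
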

Your mission is to use the data to construct a decision tree that predicts someone's marital status from the other attributes. Each branch in the decision tree can test […]

Write code to answer the following questions. Please submit the code as well as your write-up.
  1. (40 points) Grow a decision tree from the first 5000 records in the data file as the training set, using the information-gain heuristic to choose the attributes to test. Your decision tree should not be obviously bigger than necessary; that is, it should never branch to two identical subtrees. How large is the tree—in other words, how many tests does it have? Describe the tree, especially the short paths from the root to a leaf: do they make sense?
  2. (40 points) Prune the decision tree by cross validation, using the next 5000 records in the data file as the cross-validation test set. How large is the pruned tree? Pick one subtree that was pruned away and explain the cross validation that led to its demise.
  3. (20 points) Compare the performance (% examples correctly classified) of the unpruned and pruned decision trees on the training set, the cross-validation test set, and the remainder of the data file as the final test set. Did pruning reduce overfitting and improve generalization?
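
For part 1, the information-gain heuristic might be computed roughly as below. This is a sketch under an assumption: it scores tests of the form "attribute ≤ threshold", which matches the shape of the sample tree given for the first 10 records but is not spelled out in the assignment. The function names and the four toy records are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the test does not separate anything
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

def best_threshold(values, labels):
    """Threshold with the highest gain, or None if all values are equal."""
    candidates = sorted(set(values))[:-1]  # the largest value cannot split
    return max(candidates,
               key=lambda t: information_gain(values, labels, t),
               default=None)

# Toy data (invented): two young never-married people, two older married people.
ages = [10, 20, 50, 60]
status = [5, 5, 1, 1]
gain = information_gain(ages, status, 20)
split = best_threshold(ages, status)
```

On the toy data, splitting at age ≤ 20 separates the two classes perfectly, so the gain equals the full entropy of the labels (1 bit) and `best_threshold` picks 20. A full tree grower would repeat this choice over all attributes and recurse on each side of the split.
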
To help you check your code for part 1, below is the decision tree we grew using the first 10 records. This is not the only correct tree because, for example, nobody among the first 10 records is 43 years old, so testing age ≤ 43 at the root would split those records exactly as age ≤ 42 does.
  1. If age ≤ 42, then predict “never married or under 15 years old”.
  2. Otherwise, if age ≤ 45, then predict “married”.
  3. Otherwise, predict “widowed”.
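
Read in order, the three rules above amount to a tree with two tests. One possible encoding, purely as an illustration, is to let a leaf be an integer MAR code and a test be a tuple `(attribute, threshold, if_true, if_false)`; this representation and the helper names are assumptions, not something the assignment prescribes.

```python
# Leaf: an int MAR code. Test: (attribute, threshold, if_true, if_false),
# where if_true is followed when record[attribute] <= threshold.
example_tree = ("AGEP", 42, 5, ("AGEP", 45, 1, 2))

def predict(tree, record):
    """Follow tests from the root until a leaf (an int) is reached."""
    while not isinstance(tree, int):
        attr, threshold, if_true, if_false = tree
        tree = if_true if record[attr] <= threshold else if_false
    return tree

def count_tests(tree):
    """Size of the tree, measured as the number of tests."""
    if isinstance(tree, int):
        return 0
    _, _, if_true, if_false = tree
    return 1 + count_tests(if_true) + count_tests(if_false)

def collapse_identical(tree):
    """Remove any test whose two branches lead to identical subtrees."""
    if isinstance(tree, int):
        return tree
    attr, threshold, if_true, if_false = tree
    if_true = collapse_identical(if_true)
    if_false = collapse_identical(if_false)
    if if_true == if_false:
        return if_true
    return (attr, threshold, if_true, if_false)
```

With this encoding, `count_tests(example_tree)` is 2, and a redundant test such as `("SEX", 1, 5, 5)` collapses to the leaf `5`, which is the "never branch to two identical subtrees" requirement from part 1.
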
To help you check your code for parts 2 and 3, pruning the decision tree above using the next 10 records yields the following decision tree, which performs just as well (40% correct) on that test set.
  1. Predict “never married or under 15 years old”.
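
One way to read "prune by cross validation" is bottom-up reduced-error pruning: replace a subtree by a single leaf whenever the leaf does at least as well on the held-out records that reach it. The sketch below takes that reading as an assumption, along with two further ones: the replacement leaf predicts the majority label among the training records reaching the node, and ties are resolved in favor of the smaller tree. Trees are encoded as an int leaf or a tuple `(attribute, threshold, if_true, if_false)`, and the records are invented toy data, not the first 20 Census records.

```python
from collections import Counter

def predict(tree, record):
    """Leaf: int label. Test: (attr, threshold, if_true, if_false)."""
    while not isinstance(tree, int):
        attr, threshold, if_true, if_false = tree
        tree = if_true if record[attr] <= threshold else if_false
    return tree

def accuracy(tree, examples):
    """Fraction of (record, label) pairs classified correctly."""
    return sum(predict(tree, x) == y for x, y in examples) / len(examples)

def prune(tree, train, valid):
    """Prune using only the train/valid examples that reach this node."""
    if isinstance(tree, int) or not train or not valid:
        return tree  # nothing to prune, or no evidence either way
    attr, threshold, if_true, if_false = tree

    def split(examples):
        yes = [e for e in examples if e[0][attr] <= threshold]
        no = [e for e in examples if e[0][attr] > threshold]
        return yes, no

    (train_yes, train_no), (valid_yes, valid_no) = split(train), split(valid)
    tree = (attr, threshold,
            prune(if_true, train_yes, valid_yes),
            prune(if_false, train_no, valid_no))
    # Candidate leaf: majority training label at this node (an assumption).
    leaf = Counter(y for _, y in train).most_common(1)[0][0]
    return leaf if accuracy(leaf, valid) >= accuracy(tree, valid) else tree

# Invented toy records: (age, MAR code).
train = [({"AGEP": a}, y) for a, y in
         [(5, 5), (10, 5), (20, 5), (30, 5), (35, 5), (40, 5), (42, 5),
          (44, 1), (45, 1), (60, 2)]]
valid = [({"AGEP": a}, y) for a, y in
         [(12, 5), (25, 5), (30, 5), (38, 1), (41, 1),
          (44, 5), (50, 1), (55, 3), (65, 4), (80, 2)]]

unpruned = ("AGEP", 42, 5, ("AGEP", 45, 1, 2))
pruned = prune(unpruned, train, valid)
```

On this toy data the unpruned tree and the pruned tree score the same on `valid`, so the tie-breaking rule collapses the whole tree to the single leaf `5`. Calling `accuracy` on the training records, the pruning records, and the remaining records for both trees gives the comparison asked for in part 3.
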