Learning Decision Trees Machine Learning Task
Description
The attached Zip folder has extra materials, but use whatever you need.
In this assignment, you’ll implement the classic decision tree algorithm. The amount of code you need to write is not huge, but there are some mental hurdles you’ll need to cross to completely understand what’s going on.
I have provided some skeleton code to get you started and guide you through the implementation. You are welcome to make any changes you like, but please think carefully before doing so; the skeleton is designed to make the implementation easy.
Datasets:
There are three datasets included with the assignment.
- Toy datasets:
- the tennis dataset
- the restaurant dataset.
- These are both useful for testing your code; they’re small, and you know what the correct answers are.
- Breast Cancer data. We used this dataset in assignment 1. It contains medical records from a large number of women who have had breast cancer. Based on their characteristics, we would like to predict whether they will have a recurrence event.
These files are in a format known as ARFF, which we saw in assignment 1. An ARFF file consists of three sections. The first is the comments, which begin with a ‘%’. The second is the @relation section, which describes each of the variables and the values they can take on. The third is the data itself, with each row representing one instance. I’ve provided you with a file called readARFF.py to process this. If you want to reuse your assignment 1 code, please do!
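For reference, here is a minimal sketch of what an ARFF file looks like. The attribute names and values below are illustrative (loosely modeled on the tennis data), not copied from the included files:

```
% Comments begin with a percent sign.
@relation tennis

@attribute outlook {sunny, overcast, rainy}
@attribute wind {weak, strong}
@attribute play {yes, no}

@data
sunny,weak,no
overcast,strong,yes
rainy,strong,no
```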
readARFF contains three functions:
- readArff, which takes as input a filehandle and returns two items:
- an attribute dictionary, which maps an attribute name to a tuple containing the column that attribute represents in the dataset and a list of the possible values it can take on.
- a list of lists containing the data itself.
- getAttrList. This takes as input the attribute dictionary and returns a list of the attribute names in the same order as the columns they represent.
- ZeroR. Our old friend. Returns the most likely classification in a dataset. You may find it useful as a comparator, or in cases where you are unable to split your dataset.
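Here’s a sketch of how these pieces might fit together, assuming the signatures described above (the filename is just an example; check readARFF.py for the exact names and return order):

```python
from readARFF import readArff, getAttrList, ZeroR

# Open the data file and hand the filehandle to readArff.
with open('tennis.arff') as fh:    # filename is illustrative
    attrDict, data = readArff(fh)

attrNames = getAttrList(attrDict)  # attribute names, in column order
print(attrNames)                   # e.g. ['outlook', ..., 'play']
print(data[0])                     # one instance: a list of values
print(ZeroR(data))                 # the most common classification
```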
A hint: list comprehensions are very helpful for this assignment. Often, you’ll need to pull out one or more columns from the data. So, for example, to get a list containing only the third column in a dataset where the last element is equal to some item ‘x’, you could do:
third = [d[2] for d in data if d[-1] == 'x']
Those of you who are familiar with pandas and would like to use dataframes for this assignment are welcome to do so; this is not required, though.
First, we’ll implement the basic decision tree algorithm.
- (10 points) The decision tree is easiest to build in a bottom-up fashion. To begin, we’ll need a method to compute entropy. It should take as input a list, such as [‘weak’, ‘strong’, ‘weak’, ‘weak’], and return a float indicating the entropy in this data. I’ve provided a function stub for you (a sketch appears after this list).
- Next, we’ll want to compute remainder. This will tell us, for a given attribute, how much information will remain if we choose to split on this attribute. I’ve written this one for you.
- (10 points) Once we know how to compute remainders, we need to be able to select an attribute. To do this, we just compute the remainder for each attribute and choose the one with the smallest remainder (this maximizes information gain). The function selectAttribute should take as input a list of lists, with each list being an instance. I’ve provided a stub for you.
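Here is a minimal sketch of how entropy and selectAttribute could be filled in. This is one possible implementation, not the provided code; the remainder below is only a stand-in so the sketch is self-contained, and you should use the version included with the assignment. As a check, entropy(['weak', 'strong', 'weak', 'weak']) comes out to about 0.811.

```python
from math import log2
from collections import Counter

def entropy(values):
    """Entropy of a list of labels: H = -sum over v of p(v) * log2(p(v))."""
    if not values:
        return 0.0
    total = len(values)
    return -sum((c / total) * log2(c / total)
                for c in Counter(values).values())

def remainder(data, col):
    """Expected entropy of the classification after splitting on column col.
    (A stand-in for the provided function, so this sketch runs on its own.)"""
    total = len(data)
    rem = 0.0
    for v in set(d[col] for d in data):
        subset = [d[-1] for d in data if d[col] == v]
        rem += (len(subset) / total) * entropy(subset)
    return rem

def selectAttribute(data):
    """Return the index of the column with the smallest remainder
    (equivalently, the largest information gain). The provided stub may
    expect an attribute name instead of an index; adjust to match."""
    cols = range(len(data[0]) - 1)  # the last column is the classification
    return min(cols, key=lambda c: remainder(data, c))
```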
Now we’re ready to think about building a tree. A tree is a recursive data structure consisting of a parent node with links to child nodes. I’ve provided a TreeNode class for you that does this. (You don’t need a separate Tree class.)
The TreeNode has the following data members:
- attribute: for non-leaf nodes, this indicates which attribute this node tests. For leaves, it is empty.
- value: for leaf nodes, this indicates the classification at this leaf. For non-leaf nodes, it is empty.
- children: a dictionary that maps values of the attribute being tested at this node to the appropriate child, which is also a TreeNode.
It also has methods to print itself and to test whether it is a leaf.
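The provided class will differ in its details, but the description above implies something along these lines (a sketch, not the actual skeleton; the print method is omitted):

```python
class TreeNode:
    def __init__(self, attribute='', value=''):
        self.attribute = attribute  # attribute tested here; empty for leaves
        self.value = value          # classification at a leaf; empty otherwise
        self.children = {}          # maps attribute value -> child TreeNode

    def isLeaf(self):
        # A node with no children is a leaf.
        return len(self.children) == 0
```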
- (10 points) Now we need a method that can build a tree; we will call it makeTree. It should take as input a dataset, a list of attribute names, the attribute dictionary, and a default value (a sketch follows the list below). It should work as follows:
- If the dataset has zero entropy (every instance has the same classification), we are done. Create a leaf node with value equal to the data’s classification and return it.
- If the dataset is empty, we have no data for this attribute value. Create a leaf node with the value set to the default value and return it.
- Otherwise, we have a non-leaf node. Use selectAttribute to find the attribute that maximizes gain. Then remove that column from the dataset and the list of attributes and, for each value of that attribute, call makeTree with the appropriate subset of the data and add the TreeNode that is returned to the children. Then return the TreeNode.
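Putting the three cases together, makeTree might be sketched as below. This assumes the entropy and selectAttribute sketches from earlier, the TreeNode sketch above, that attrDict maps each name to a (column, possible-values) tuple as described, and that ZeroR takes a dataset; your details may differ.

```python
def makeTree(data, attrNames, attrDict, default):
    # Case 2: no data for this branch; fall back to the default value.
    if len(data) == 0:
        return TreeNode(value=default)
    # Case 1: zero entropy, so every instance has the same classification.
    labels = [d[-1] for d in data]
    if entropy(labels) == 0:
        return TreeNode(value=labels[0])
    # Case 3: split on the attribute that maximizes information gain.
    col = selectAttribute(data)
    name = attrNames[col]
    node = TreeNode(attribute=name)
    newNames = attrNames[:col] + attrNames[col + 1:]
    newDefault = ZeroR(data)     # default for any empty child subsets
    for v in attrDict[name][1]:  # every possible value of this attribute
        subset = [d[:col] + d[col + 1:] for d in data if d[col] == v]
        node.children[v] = makeTree(subset, newNames, attrDict, newDefault)
    return node
```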
- (10 points) Now we know how to build a tree, but we also need to use it. To do this, you should implement the classify() method in TreeNode. classify should take as input the data instance to be classified and our attribute dictionary. This method is also recursive. If we are at a leaf, return the value of that leaf. Otherwise, check which attribute this node tests and follow the appropriate child. If there is no child for this value, return a default value.
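Here’s a sketch of classify, following the recursion just described. It is written as a method of the TreeNode sketched earlier, assumes the instance uses the same column layout as the training data, and takes the default as an extra parameter (the provided stub may handle the default differently):

```python
class TreeNode:
    # __init__ and the other members as sketched earlier ...

    def isLeaf(self):
        return len(self.children) == 0

    def classify(self, instance, attrDict, default=''):
        # Base case: a leaf holds the classification directly.
        if self.isLeaf():
            return self.value
        # Find which column this node's attribute occupies, then follow
        # the child matching the instance's value in that column.
        col = attrDict[self.attribute][0]
        child = self.children.get(instance[col])
        if child is None:
            return default  # no branch for this value; fall back
        return child.classify(instance, attrDict, default)
```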