Monday, July 26, 2010
Like a needle in a haystack
Predictive models are only effective if their predictions are correct most of the time. The past week was dedicated to evaluating the predictive capacity of the discriminant function we developed last week. To do this, we ran our test set through the function to obtain a predicted MDI classification for each case. At first glance, the model appeared to work well, predicting roughly the expected proportion of patients with higher versus lower MDIs.

The next step was to compare the predicted MDI classifications to the actual values in the test set. It was here that we ran into a problem: the MDI scores for the patients in the test set were missing. Either the patients had never returned for their follow-up, or their evaluation appointment had not yet taken place. To get around this, we did the next best thing: we reviewed their charts for any mention that the patient was developmentally delayed and classified them accordingly. Using this method of classification, our model made a correct prediction only about half the time.

That meant it was time to take a step back and re-evaluate which variables to use when we build our discriminant function. We've decided to cast a wide net and create a cross-correlation matrix of all the variables we have access to in the training set. The idea is that we will then pick out the variables that correlate strongly with MDI score and use those in our discriminant analysis. More number crunching to come next week!
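The chart-review comparison above boils down to a simple agreement check between predicted and chart-derived labels. Here is a minimal sketch of that check; the labels below are entirely made up for illustration (none of this is the actual patient data), with 1 standing for a delayed classification and 0 for typical development:

```python
import numpy as np

# Hypothetical predicted classifications from the discriminant function
# and the corresponding chart-review labels (1 = delayed, 0 = typical).
predicted = np.array([1, 0, 1, 1, 0, 0, 1, 0])
chart_review = np.array([1, 1, 0, 1, 0, 1, 0, 0])

# Accuracy: the fraction of cases where the model agrees with the chart review.
accuracy = np.mean(predicted == chart_review)
print(f"Accuracy: {accuracy:.2f}")  # prints "Accuracy: 0.50" for these labels
```

With these invented labels the model agrees with the chart review on 4 of 8 cases, which mirrors the roughly fifty-percent hit rate we saw.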
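The planned variable screen can also be sketched in a few lines: compute each candidate variable's correlation with MDI score and keep the ones above a cutoff. The variable names, the 0.3 cutoff, and the simulated data below are all illustrative assumptions, not our actual training set:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # hypothetical number of training-set patients

# Illustrative clinical variables; names and distributions are made up.
variables = {
    "birth_weight": rng.normal(3.2, 0.6, n),
    "gest_age": rng.normal(38.0, 2.0, n),
    "apgar_5min": rng.integers(4, 11, n).astype(float),
}
# Simulated MDI scores tied to gestational age so one variable stands out.
mdi = 50 + 3.0 * variables["gest_age"] + rng.normal(0, 2, n)

# Correlation of each candidate variable with MDI score.
correlations = {
    name: np.corrcoef(values, mdi)[0, 1] for name, values in variables.items()
}

# Keep variables whose absolute correlation exceeds a chosen cutoff.
selected = [name for name, r in correlations.items() if abs(r) > 0.3]
print(selected)
```

In this toy run only the variable that actually drives the simulated MDI survives the cutoff; on the real training set, the surviving variables would feed into the next round of discriminant analysis.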