Diabetic Nutritionist

Git:

Team:

Sarita Bhateja, Kavya Guruprasad and Nandini Goswami

Technologies:

Python, MySQL, scikit-learn

Problem Statement

According to statistics shared by the CDC (Centers for Disease Control and Prevention), more than 29 million Americans have diabetes, and one in four of them don't know it. Eighty-six million adults, more than one in three U.S. adults, have prediabetes. These numbers are alarming and call for increased focus on the dietary needs of these patients. We are developing a system that helps patients monitor their diet so that their blood sugar does not spike after they eat a particular food item. If required, the system can also suggest a better substitute.

Dataset Description

To implement this system, we merged two datasets: one containing all nutrient values and the other containing the glycemic index (GI) for each food name. The merged dataset contains food names, their nutrition values and their glycemic index values. A higher glycemic index value indicates that a food raises blood sugar more quickly, so such foods should be avoided by patients. The glycemic index is therefore the class label for our dataset.
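The merge described above could be sketched with pandas as follows. This is only an illustration: the column names (`food_name`, `carbs_g`, `fiber_g`, `gi`), the name-normalisation step and all values are assumptions, not the project's actual schema or data.

```python
import pandas as pd

# Placeholder rows for illustration only; not taken from the project datasets.
nutrients = pd.DataFrame({
    "food_name": ["Apple", "white rice ", "Lentils"],
    "carbs_g": [14.0, 28.0, 20.0],
    "fiber_g": [2.4, 0.4, 7.9],
})
glycemic = pd.DataFrame({
    "food_name": ["apple", "White Rice", "Banana"],
    "gi": [36, 73, 51],
})


def normalise(names):
    """Lower-case and trim food names so both sources share one join key."""
    return names.str.strip().str.lower()


nutrients["key"] = normalise(nutrients["food_name"])
glycemic["key"] = normalise(glycemic["food_name"])

# An inner join keeps only foods that have both nutrient values and a GI label.
merged = nutrients.merge(glycemic[["key", "gi"]], on="key", how="inner")
print(merged)
```

An inner join is used here so that every remaining row has a label to train on. Foods present in only one source are dropped.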

System Design

We have used the following machine learning and artificial intelligence techniques to develop the system.

Linear regression:
The linear regression model predicts glycemic index (GI) values from the food nutrients. The GI value ranges from 0 to 100 and is classified as low GI (0-55), medium GI (56-69) or high GI (70-100). We chose a linear regression model because GI takes continuous values. Using the GI value and the net carbohydrate content, we calculate the glycemic load (GL), which in turn helps us decide whether the food is appropriate for the patient. Using the GI values, we classify foods into low, medium and high glycemic index foods, which helps us predict better food substitutes for the patient.
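As a minimal sketch of the post-processing step, the snippet below maps a predicted GI onto the three bands and computes GL. It assumes the commonly used formula GL = GI × net carbohydrate (grams per serving) / 100. The function names and sample values are hypothetical and are not taken from the project code.

```python
def gi_category(gi):
    """Map a glycemic index value (0-100) to the low/medium/high bands."""
    if not 0 <= gi <= 100:
        raise ValueError("GI must be between 0 and 100")
    if gi <= 55:
        return "low"
    if gi <= 69:
        return "medium"
    return "high"


def glycemic_load(gi, net_carbs_g):
    """Glycemic load, assuming GL = GI * net carbohydrate (g) / 100."""
    return gi * net_carbs_g / 100.0


# Hypothetical regression output and serving size, for illustration only.
predicted_gi = 72.0
net_carbs_g = 11.0
print(gi_category(predicted_gi), glycemic_load(predicted_gi, net_carbs_g))
```

Because the regression output is continuous, a value between two integer bands (for example 55.5) falls into the higher band in this sketch.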
K-Means Clustering:
K-means is an unsupervised learning algorithm that groups similar data points. Food samples from the dataset that fall in the low GI range are given as input to the K-means algorithm. When a user inputs a food item, clustering is used to find foods with similar nutrients but a lower GI. The best result is returned according to the least distance. The number of clusters is given by k, and the elbow method is used to find the optimal value of k. In this method, we plot a graph with the number of clusters on the x-axis and the average distance to the centroid on the y-axis, and look for an elbow in the graph. The point where the line starts to flatten out is the chosen k. From the graph we obtained, it is clear that we need to choose k = 3. Giving this value of k as input to our clustering algorithm, we can plot the data points and their assigned clusters.
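The elbow search could look roughly like the sketch below, using scikit-learn's `KMeans`. The data is synthetic (three well-separated blobs standing in for nutrient vectors of low-GI foods), and the sketch records `inertia_`, the sum of squared distances to the nearest centroid, rather than the average distance mentioned above. Both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the nutrient features of low-GI foods.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.4, size=(30, 2)) for c in centers])


def elbow_inertias(X, k_values):
    """Fit K-means once per k and collect the within-cluster sum of squares."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias.append(model.inertia_)
    return inertias


k_values = range(1, 11)
inertias = elbow_inertias(X, k_values)
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 2))
```

Plotting `inertias` against `k_values` gives the elbow curve. The k after which the curve flattens is the candidate number of clusters.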
Case Based Reasoning:
Case-based reasoning involves four steps:
Retrieve: We retrieve food samples from the dataset and try to use their knowledge to solve new cases.
Reuse: Once a substitute is found, we try to reuse this knowledge again and again.
Revise: Once we find a solution, we try to confirm with the user whether the solution is good enough.
Retain: After the solution has been successfully adapted to the target problem, we store it in our database and use it for solving newer problems.
Thus, if a patient agrees to the food item suggested by the system, the combination is saved in the knowledge base, and if a similar case occurs in the future, the results are fetched from the knowledge base.
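The four steps above can be sketched as a small in-memory case base. The class name, method names and nutrient vectors are hypothetical, and a plain dictionary stands in for the MySQL knowledge base. Retrieval here uses the nearest low-GI food by Euclidean distance, which is an assumption about how "least distance" is measured.

```python
import math


class SubstituteCaseBase:
    """Toy retrieve/reuse/revise/retain loop for food substitutes."""

    def __init__(self, low_gi_foods):
        # low_gi_foods maps a food name to its nutrient vector.
        self.low_gi_foods = low_gi_foods
        # Retained cases: query food -> substitute the user accepted.
        self.cases = {}

    def retrieve(self, nutrients):
        """Retrieve: nearest low-GI food by Euclidean distance."""
        return min(
            self.low_gi_foods,
            key=lambda name: math.dist(nutrients, self.low_gi_foods[name]),
        )

    def suggest(self, food, nutrients):
        """Reuse a retained case if one exists, otherwise retrieve afresh."""
        if food in self.cases:
            return self.cases[food]
        return self.retrieve(nutrients)

    def feedback(self, food, substitute, accepted):
        """Revise via user confirmation; retain only accepted suggestions."""
        if accepted:
            self.cases[food] = substitute


# Placeholder nutrient vectors, for illustration only.
foods = {"lentils": (20.0, 9.0, 0.4), "apple": (14.0, 0.3, 0.2)}
case_base = SubstituteCaseBase(foods)
suggestion = case_base.suggest("white rice", (28.0, 2.7, 0.3))
case_base.feedback("white rice", suggestion, accepted=True)
print(suggestion)
```

Once a suggestion is accepted, later queries for the same food are answered from `cases` without recomputing distances.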

Evaluation and Results

We tried other ML algorithms, such as decision trees, before settling on linear regression and K-means, but the results were not satisfactory. When linear regression was run on the complete dataset, the results were not good enough to proceed. We ran the random forest algorithm to measure the importance of every feature, and the features with the least importance were removed. This process was repeated to identify when to stop removing features from the dataset, and it improved the performance of linear regression. The results are still not on par, but they are good enough to get started with the project.

The next algorithm, K-means, helps in the adaptation part and provides substitutes to the user. In this algorithm, the number of clusters is an important factor for the algorithm to converge. To identify the best number of clusters, we ran K-means with the number of clusters ranging from 1 to 10 and used the elbow method to choose the best k. The results were better with the number of clusters set to 3. The limited number of data points is a weakness, because K-means performs well with large datasets.

K-means with CBR for adaptation is working better than expected, and the results are quite interesting. When we tested against test data points in various scenarios, the results were fair. We also shared the results with a domain expert to get her insight. To her knowledge, the results are average and there is scope for improvement. The project has a lot of potential if given enough data points to analyse in depth, and the domain expert has a huge role to play in it.
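The feature-pruning loop described in this section might be sketched as below with scikit-learn. The data is synthetic, the feature names are placeholders, and the stopping rule (stop as soon as the cross-validated R² of linear regression drops) is an assumption, since the criterion actually used is not stated.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: only the first two of six features drive the target.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)
names = [f"f{i}" for i in range(6)]  # placeholder feature names


def cv_r2(X, y):
    """Mean 5-fold cross-validated R^2 of a linear regression."""
    return cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()


def prune_features(X, y, names, min_features=1):
    """Repeatedly drop the least important feature while the score holds."""
    keep = list(range(X.shape[1]))
    best_score = cv_r2(X[:, keep], y)
    while len(keep) > min_features:
        forest = RandomForestRegressor(n_estimators=100, random_state=0)
        forest.fit(X[:, keep], y)
        weakest = int(np.argmin(forest.feature_importances_))
        candidate = keep[:weakest] + keep[weakest + 1:]
        score = cv_r2(X[:, candidate], y)
        if score < best_score:  # assumed stopping rule
            break
        keep, best_score = candidate, score
    return [names[i] for i in keep], best_score


kept, score = prune_features(X, y, names)
print(kept, round(score, 3))
```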