Develop a solution to predict the potability of water based on different water quality metrics.
The software will be implemented in a handheld device.
Water For All is a global NGO specializing in providing access to potable water in disadvantaged areas.
Water For All wants to create a handheld device to analyze the potability of water.
The dataset has 10 columns and 3276 rows.
| Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 | Feature 6 | Feature 7 | Feature 8 | Feature 9 |
|---|---|---|---|---|---|---|---|---|
| pH | Hardness | Solids | Organic_carbon | Turbidity | Chloramines | Sulfate | Conductivity | Trihalomethanes |
| Target |
|---|
| Potability |
The dataset was found on Kaggle.
The target variable "Potability" is imbalanced. It was decided to keep the imbalance to better predict non-potable water, since non-potable water could be a risk for the consumer.
Is it possible to predict the potability of water based on the confidence intervals of the features for potable and non-potable datapoints?
We raised this question to check whether the hardware costs of the handheld device could be reduced by using filters on the samples instead of classification models.
The confidence interval ranges for the potable and non-potable datapoints overlap. It is not possible to predict the potability of water based on filters.
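The overlap check described above can be sketched as follows. This is a minimal sketch and not the project's actual analysis: it assumes a 95% confidence interval for the mean of a feature under a normal approximation (z = 1.96), the pH readings are made-up illustrative values rather than rows from the dataset, and the helper names `mean_confidence_interval` and `intervals_overlap` are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_confidence_interval(values, z=1.96):
    """Return (low, high) for the mean of `values` using a normal approximation."""
    m = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return (m - half_width, m + half_width)


def intervals_overlap(a, b):
    """True when the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Made-up pH readings for illustration only (not taken from the dataset).
ph_potable = [6.8, 7.1, 7.4, 6.9, 7.3, 7.0]
ph_non_potable = [6.5, 7.6, 7.2, 6.7, 7.5, 6.9]

ci_potable = mean_confidence_interval(ph_potable)
ci_non_potable = mean_confidence_interval(ph_non_potable)

# When the intervals overlap, a simple range filter on this feature
# cannot tell the two classes apart.
overlap = intervals_overlap(ci_potable, ci_non_potable)
print(f"potable CI: {ci_potable}")
print(f"non-potable CI: {ci_non_potable}")
print(f"overlap: {overlap}")
```

Repeating this check for each of the nine features would show which of them, if any, could act as a standalone filter.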
The scores that guide the decisions:
- Recall (sensitivity) is the ratio of correctly predicted positive (potable water) observations to all observations in the actual potable class.
- F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account.

The above scores were chosen because we want a low number of False Positives and a high number of True Positives.
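As a small illustration of how these two scores are computed from confusion-matrix counts, here is a minimal sketch. The counts are hypothetical and do not come from the project's results; the function names are chosen for this example only.

```python
def recall(tp, fn):
    """Share of actual positives (potable samples) that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 40 true positives, 10 false positives, 60 false negatives.
tp, fp, fn = 40, 10, 60

print(f"recall:    {recall(tp, fn):.3f}")
print(f"precision: {precision(tp, fp):.3f}")
print(f"f1:        {f1_score(tp, fp, fn):.3f}")
```

With these counts, few false positives keep precision high, but the many false negatives pull recall down, and F1 drops with it.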
- The scaler used for model selection was Standard Scaler.
| Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | Model 7 |
|---|---|---|---|---|---|---|
| Decision Tree Scaled | Decision Tree | KNN | SVM | Random Forest | Random Forest Scaled | Logistic Regression |
- Decision Tree Scaled got the best scores for Recall and F1 Score.
- The confusion matrices for Decision Tree Scaled and the unscaled Decision Tree are very similar.
- The KNN confusion matrix shows a low number of False Positives, which was one of the model goals, but the number of True Positives is too low. It will not be useful for the handheld device.
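The comparison above could be reproduced along the following lines. This is a sketch under stated assumptions, not the project's code: it assumes scikit-learn, uses a synthetic dataset from `make_classification` as a stand-in for the water data (9 features, imbalanced binary target), keeps default hyperparameters, and covers only three of the seven models.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the water data (assumption): 9 features, imbalanced target.
X, y = make_classification(
    n_samples=600, n_features=9, n_informative=5, weights=[0.6, 0.4], random_state=0
)

# A stratified split keeps the class imbalance in both train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Decision Tree Scaled": make_pipeline(
        StandardScaler(), DecisionTreeClassifier(random_state=0)
    ),
    "KNN Scaled": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Class 1 plays the role of "potable" in this sketch.
    results[name] = {
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
    print(f"{name}: recall={results[name]['recall']:.3f}, f1={results[name]['f1']:.3f}")
```

Decision tree splits depend only on the ordering of feature values, so standardization should change a tree's predictions little if at all. That is consistent with the similar confusion matrices for the scaled and unscaled decision trees.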
To see the presentation, click on the picture below.





