Interpretability is of paramount importance to many applications of machine learning in science and technology, as the outcomes of models will be used to guide future experimentation and decision making. However, many useful methods are uninterpretable, providing no indication of how important different features are in determining the final result. This includes unsupervised methods, where a common element is the use of distance metrics. The distance between two instances is computed from the pairwise differences between their feature vectors, and can be dominated by one or more features that are very far apart. Distance metrics are widely used throughout machine learning, and are central to methods such as clustering. By recording the features that dominate each distance over an entire data set, one can develop a feature importance profile that can be used to interpret unsupervised results. In this project you will devise a general way of measuring the influence of features in distance metrics to accumulate feature importance profiles, considering the Minkowski distances (including Manhattan, Euclidean and Chebyshev), Mahalanobis distance, Hamming distance, Levenshtein distance and Cosine distance. Your approach will be tested in the context of clustering, but will be general enough for future use in relevant regressors and classifiers (such as k-Nearest Neighbours, KNN), manifold learning and image recognition. Multi-dimensional data sets will be provided.
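To make the idea concrete, the sketch below shows one possible way to accumulate a feature importance profile for the Minkowski family of distances, where each feature's contribution |x_k - y_k|^p is an additive share of the p-th power of the distance. The function name and normalisation scheme are illustrative assumptions, not a prescribed design.

```python
import numpy as np

def feature_importance_profile(X, p=2):
    """Hypothetical sketch: fraction of total pairwise Minkowski
    distance (raised to the power p) attributable to each feature.

    X : (n_samples, n_features) array
    p : Minkowski order (1 = Manhattan, 2 = Euclidean)
    """
    n, d = X.shape
    contrib = np.zeros(d)
    for i in range(n):
        for j in range(i + 1, n):
            # |x_k - y_k|^p is feature k's additive share of the
            # p-th power of the distance between instances i and j
            contrib += np.abs(X[i] - X[j]) ** p
    # Normalise so the profile sums to 1 across features
    return contrib / contrib.sum()

# Toy example: the third feature has a much larger spread,
# so it dominates the accumulated profile
X = np.array([[0.0, 1.0, 0.0],
              [0.1, 1.1, 5.0],
              [0.2, 0.9, 10.0]])
profile = feature_importance_profile(X, p=2)
```

A real implementation would need a different decomposition for non-additive metrics such as Chebyshev (where one feature alone determines the distance) or Cosine; part of the project is deciding how to attribute influence consistently across these cases.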
To develop and test a general method for ranking the importance of features (coordinates) in distance metrics.
Python programming and experience in data science and machine learning are essential (such as COMP3720, COMP4660, COMP4670, COMP6670, COMP8420). Familiarity with platforms such as scikit-learn is desirable.
machine learning, interpretability, explainable AI, clustering, Python