An algorithm is a step-by-step procedure for performing a specific task or solving a problem: a sequence of instructions that leads to a defined outcome. Supervised machine learning algorithms are a particular family of algorithms: techniques that train models on labeled datasets, where each input is paired with the correct output. These algorithms learn to predict outcomes or classify data by finding relationships between the input features and the known labels. The supervised ML algorithms described below rely on three main techniques: attribute importance, classification, and regression.
Decision Tree = a popular ML technique used for classification and regression tasks. It works by recursively splitting the data into smaller subsets based on the values of input features, until each subset is dominated by a single outcome.
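The splitting step can be sketched in a few lines. The following is a minimal illustration (toy one-feature dataset, not any particular library's implementation) of how a tree chooses a split by minimizing Gini impurity:

```python
# Minimal split-selection sketch: one feature, Gini impurity criterion.
# Dataset and thresholds are illustrative, not from any real library.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Threshold on a single feature that minimizes weighted Gini impurity."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

xs = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
ys = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_split(xs, ys)
print(threshold, impurity)  # 3.0 0.0 — the split x <= 3.0 is pure
```

A full decision tree simply applies this search recursively to each resulting subset until a stopping criterion is met.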
Support Vector Machine (SVM) = a supervised ML algorithm used for both classification and regression tasks, though it is best known for its application to classification problems. Different versions of the SVM algorithm use different kernel functions to handle nonlinear data. SVM regression finds a function such that most data points lie within a narrow band around it, while SVM classification separates classes with the widest possible margin.
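To make the margin idea concrete, here is a hedged sketch of the classification side: it trains a linear separator using the hinge-loss subgradient until every point clears a margin of 1. A full SVM solver additionally maximizes that margin and supports kernels; the data, learning rate, and ±1 label convention below are illustrative.

```python
# Hinge-loss sketch of linear separation. Real SVM libraries solve a
# constrained optimization problem (often in the dual) to maximize the
# margin; this toy version only finds *a* separating hyperplane.

def train_linear_separator(points, labels, lr=0.1, max_epochs=1000):
    """labels must be +1 / -1; stops once every point clears margin 1."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        updated = False
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # hinge loss is active: nudge the hyperplane
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                updated = True
        if not updated:  # all points satisfy the margin
            break
    return w, b

points = [(1.0, 1.0), (2.0, 1.5), (8.0, 8.0), (9.0, 8.5)]
labels = [-1, -1, 1, 1]
w, b = train_linear_separator(points, labels)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
print([predict(p) for p in points])  # [-1, -1, 1, 1]
```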
Minimum Description Length (MDL) = a principle used in machine learning for model selection that applies the attribute importance technique. It balances the complexity of a model against its ability to accurately represent the data: the best model for a given dataset is the one that compresses the data the most, once the cost of describing both the model and the data given the model is taken into account.
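One way to see the principle in action is to score competing models by an approximate two-part description length: bits for the model plus bits for the residuals. The sketch below uses a common BIC-style approximation and hand-made data; it is an illustration of the idea, not the MDL attribute-importance implementation itself.

```python
import math

# Two-part MDL sketch: description length = model bits + data-given-model
# bits. The data cost is approximated from the residual variance (a
# BIC-style approximation); all numbers here are illustrative.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def description_length(n_params, residuals):
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    model_bits = (n_params / 2) * math.log2(n)  # cost of stating the model
    data_bits = (n / 2) * math.log2(rss / n)    # cost of the residuals
    return model_bits + data_bits

xs = list(range(10))
ys = [0.1, 2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 16.2, 17.9]  # roughly y = 2x

mean = sum(ys) / len(ys)
dl_const = description_length(1, [y - mean for y in ys])
a, b = fit_line(xs, ys)
dl_line = description_length(2, [y - (a * x + b) for x, y in zip(xs, ys)])
print(dl_line < dl_const)  # True: the line pays for its extra parameter
```

The linear model costs one more parameter but shrinks the residuals so much that its total description length is lower, so MDL selects it.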
Exponential Smoothing = a forecasting method for time series data. It assigns exponentially decreasing weights to older observations, which makes it particularly effective for datasets where recent data is more predictive than older data.
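The method reduces to one recurrence, s_t = α·x_t + (1 − α)·s_{t−1}, so each observation's weight shrinks by a factor of (1 − α) as it ages. A minimal sketch with an illustrative series and smoothing factor:

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s_t = alpha*x_t + (1-alpha)*s_{t-1}.
    The last smoothed value serves as the one-step-ahead forecast."""
    s = series[0]  # initialize with the first observation
    smoothed = [s]
    for x in series[1:]:
        s = alpha * x + (1 - alpha) * s
        smoothed.append(s)
    return smoothed

series = [10, 12, 11, 13]
smoothed = exponential_smoothing(series, alpha=0.5)
print(smoothed)  # [10, 11.0, 11.0, 12.0]
```

With α = 0.5, each observation carries twice the weight of the one before it; larger α tracks recent changes faster, smaller α smooths more aggressively.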
Explicit Semantic Analysis (ESA) = involves using predefined knowledge bases to represent the meaning of texts. It measures semantic similarity between texts by comparing their vector representations in a high-dimensional space. ESA is used for tasks such as information retrieval and document clustering, focusing on understanding the semantic content of text data.
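A toy sketch of the idea: represent each text as a vector of weights over predefined concepts, then compare texts by cosine similarity in that concept space. The two-concept "knowledge base" below is hand-made for illustration; real ESA derives its concept vectors from a large corpus such as Wikipedia.

```python
import math

# Hand-made toy "knowledge base": concept -> term weights. Real ESA uses
# thousands of concepts mined from a reference corpus.
concepts = {
    "astronomy": {"star": 1.0, "planet": 0.9, "telescope": 0.8},
    "cooking":   {"recipe": 1.0, "oven": 0.8, "flour": 0.7},
}

def esa_vector(text):
    """Project a text onto the concept space: one weight per concept."""
    words = text.lower().split()
    return [sum(weights.get(w, 0.0) for w in words)
            for weights in concepts.values()]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

a = esa_vector("the telescope showed a planet near a star")
b = esa_vector("a star and a planet")
c = esa_vector("a recipe with flour baked in the oven")
print(cosine(a, b) > cosine(a, c))  # True: a and b share astronomy concepts
```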
Random Forest = an ensemble learning method used for both classification and regression tasks. It builds multiple decision trees during training and outputs the mode (for classification) or the average prediction (for regression) of the individual trees.
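The ensemble recipe can be sketched with deliberately tiny trees: each "tree" here is a one-split stump trained on a bootstrap sample with a randomly chosen feature, and the forest predicts by majority vote. Dataset, seed, and forest size are illustrative; real random forests grow full trees and resample features at every split.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the toy run is reproducible

def stump_fit(X, y, feat):
    """Best threshold on one feature, scored by misclassification count."""
    best = None
    for t in sorted({row[feat] for row in X}):
        left = [yi for row, yi in zip(X, y) if row[feat] <= t]
        right = [yi for row, yi in zip(X, y) if row[feat] > t]
        if not left or not right:
            continue
        lmaj = Counter(left).most_common(1)[0][0]
        rmaj = Counter(right).most_common(1)[0][0]
        err = sum(v != lmaj for v in left) + sum(v != rmaj for v in right)
        if best is None or err < best[0]:
            best = (err, t, lmaj, rmaj)
    if best is None:  # degenerate bootstrap sample: fall back to majority class
        maj = Counter(y).most_common(1)[0][0]
        return (feat, X[0][feat], maj, maj)
    return (feat, best[1], best[2], best[3])

def forest_fit(X, y, n_trees=25):
    trees = []
    for _ in range(n_trees):
        idx = [random.randrange(len(X)) for _ in range(len(X))]  # bootstrap
        feat = random.randrange(len(X[0]))  # random feature for this tree
        trees.append(stump_fit([X[i] for i in idx], [y[i] for i in idx], feat))
    return trees

def forest_predict(trees, row):
    votes = [lm if row[f] <= t else rm for f, t, lm, rm in trees]
    return Counter(votes).most_common(1)[0][0]

X = [[1, 5], [2, 4], [1, 4], [8, 1], [9, 2], [8, 2]]
y = ["a", "a", "a", "b", "b", "b"]
trees = forest_fit(X, y)
print([forest_predict(trees, row) for row in X])
```

Individual stumps can be wrong on points near the boundary, but the majority vote washes those errors out — the core reason ensembles outperform single trees.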
CUR Matrix Decomposition = instead of producing abstract components, the CUR matrix decomposition technique selects real columns (attributes) from the original dataset. It highlights characteristics that most accurately reflect the data’s structure by employing statistical leverage to identify significant properties. Because the chosen attributes are taken directly from the source dataset, CUR is especially useful for interpretability.
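A hedged sketch of the leverage idea: compute the leading right singular vector by power iteration, then rank columns by their rank-1 leverage scores (the squared entries of that vector). Real CUR uses the top-k singular vectors and randomized column sampling; the matrix here is illustrative.

```python
import math

def top_right_singular_vector(A, iters=200):
    """Power iteration on A^T A to approximate the leading right singular vector."""
    n = len(A[0])
    v = [1.0] * n
    for _ in range(iters):
        Av = [sum(row[j] * v[j] for j in range(n)) for row in A]      # A v
        w = [sum(A[i][j] * Av[i] for i in range(len(A))) for j in range(n)]  # A^T (A v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def leverage_scores(A):
    """Rank-1 leverage: squared entries of the top right singular vector."""
    v = top_right_singular_vector(A)
    return [vj * vj for vj in v]

# Toy matrix: column 0 carries most of the variation, column 2 is near-constant.
A = [[1.0, 0.2, 1.0],
     [4.0, 0.1, 1.1],
     [9.0, 0.3, 0.9],
     [16.0, 0.2, 1.0]]
scores = leverage_scores(A)
best_col = max(range(3), key=lambda j: scores[j])
print(best_col)  # 0: the column that best explains the data's structure
```

Because the selected column is an actual attribute of the dataset rather than an abstract component, the result is directly interpretable.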
Naive Bayes = a probabilistic classifier based on Bayes’ theorem and the assumption that input attributes are conditionally independent. Despite this simplifying assumption, it works well in many real-world situations, particularly when dealing with categorical or high-dimensional data. Oracle provides an implementation of Naive Bayes that scales effectively to large datasets.
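A minimal count-based sketch (toy weather data and add-one smoothing, for illustration only): the class score is the prior P(class) multiplied by P(value | class) for each attribute, exactly the product the independence assumption licenses.

```python
from collections import Counter, defaultdict

# Toy data: (outlook, wind) -> play? Labels and rows are illustrative.
rows = [("sunny", "weak"), ("sunny", "strong"), ("rain", "weak"),
        ("rain", "strong"), ("overcast", "weak"), ("overcast", "strong")]
labels = ["yes", "no", "yes", "no", "yes", "yes"]

class_counts = Counter(labels)
value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
for row, c in zip(rows, labels):
    for i, v in enumerate(row):
        value_counts[(c, i)][v] += 1
# distinct values per feature, for Laplace (add-one) smoothing
vocab = [len({row[i] for row in rows}) for i in range(len(rows[0]))]

def predict(row):
    total = sum(class_counts.values())
    scores = {}
    for c, cc in class_counts.items():
        p = cc / total  # prior P(class)
        for i, v in enumerate(row):
            # P(value | class) with add-one smoothing to avoid zero counts
            p *= (value_counts[(c, i)][v] + 1) / (cc + vocab[i])
        scores[c] = p
    return max(scores, key=scores.get)

print(predict(("rain", "weak")), predict(("sunny", "strong")))  # yes no
```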
Generalized Linear Model (GLM) = for classification tasks, GLM typically implements logistic regression. It models the relationship between the input attributes and the probability of class membership using a linear combination of predictors passed through a link function. GLMs are valued for their statistical interpretability and the coefficient estimates they provide.
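A sketch of the classification case: logistic regression fit by batch gradient descent on the log-loss, with the sigmoid as the inverse link function. The data, learning rate, and epoch count are illustrative, and production GLMs use more sophisticated solvers.

```python
import math

def sigmoid(z):
    """Inverse of the logit link: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the log-loss; returns weights and bias."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for row, target in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)
            err = p - target  # gradient of log-loss wrt the linear score
            for i, xi in enumerate(row):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
preds = [1 if sigmoid(w[0] * row[0] + b) > 0.5 else 0 for row in X]
print(preds)  # [0, 0, 0, 1, 1, 1]
```

The learned weight and bias are directly interpretable: the sign and size of `w[0]` state how strongly the feature pushes the log-odds of class membership.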
Neural Network = neural networks consist of interconnected layers of artificial neurons that learn complex, nonlinear relationships in data. Oracle neural networks are well suited to problems where class boundaries are not linearly separable. However, they are generally less interpretable than simpler models such as decision trees or GLMs.
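To illustrate the nonlinear-boundary point, the network below computes XOR — a function no single linear model can represent — with two hidden neurons. The weights are set by hand for clarity; in practice they would be learned by backpropagation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2):
    """Two hidden neurons + one output neuron, weights chosen by hand to
    compute XOR. The hidden layer builds the nonlinear boundary."""
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)    # fires if x1 OR x2
    h2 = sigmoid(20 * x1 + 20 * x2 - 30)    # fires only if x1 AND x2
    return sigmoid(20 * h1 - 20 * h2 - 10)  # OR and NOT-AND = XOR

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(forward(a, b)))  # prints the XOR truth table
```

No single line in the plane separates {(0,1), (1,0)} from {(0,0), (1,1)}, which is exactly why the hidden layer (and the model's reduced interpretability) is the price of the nonlinear boundary.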
XGBoost = XGBoost is a gradient-boosted decision tree algorithm that builds models sequentially, with each new tree focusing on correcting errors made by previous ones. It is known for its high predictive accuracy and efficiency. Oracle supports XGBoost for complex classification tasks where performance is critical.
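A hedged sketch of the boosting core: each new one-split tree fits the residual errors of the current ensemble, scaled by a learning rate. This is plain gradient boosting with squared loss — XGBoost adds regularization, second-order gradients, and heavy systems optimization — and the data and hyperparameters are illustrative.

```python
def fit_stump(xs, residuals):
    """Best single split minimizing squared error of the two leaf means."""
    best = None
    for t in sorted(set(xs))[:-1]:  # never leave the right leaf empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=20, lr=0.3):
    """Each new stump fits the residuals of the current ensemble."""
    pred = [0.0] * len(xs)
    trees = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * tree(x) for tree in trees)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9]
model = boost(xs, ys)
print(model(1.5) < model(5.5))  # True: the ensemble learned the step
```

Because every tree corrects what the ensemble so far still gets wrong, the training error shrinks round by round — the sequential error-correction the entry describes.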
Overall, supervised machine learning algorithms are a foundational part of the field, focusing on learning patterns and relationships from labeled data. These algorithms aim to build predictive models that can generalize from known examples to new, unseen data. Supervised ML algorithms find applications across a wide range of sectors, including IT, business, marketing, healthcare, and more.