
Machine Learning Glossary

DNN

  • CNN (Convolutional Neural Network): A neural network that incorporates convolution operations. Here we explain what convolution is and why it is effective for image recognition. A CNN is built from Convolution Layers, each created by sliding a filter over the input and combining the information within the filter's region; one output map is produced per filter. Stacking these layers and connecting them with activation functions (ReLU, etc.) constructs the network. Convolution enables region-based rather than point-based feature extraction, making the model robust to image translation and deformation. It also enables extraction of features like edges that cannot be detected at the point level.
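The region-based extraction described above can be sketched as a single "valid" 2-D convolution in pure Python (the image and the edge filter below are made-up toy values):

```python
# Minimal sketch: one "valid" 2-D convolution, the operation a
# Convolution Layer applies at every filter position.
def conv2d(image, kernel):
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            row.append(sum(image[y + j][x + i] * kernel[j][i]
                           for j in range(kh) for i in range(kw)))
        out.append(row)
    return out

# A horizontal-difference filter responds only where pixel values change,
# i.e., at the vertical edge in this toy image.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
edge = [[1, -1]]
print(conv2d(image, edge))  # [[0, -1, 0], [0, -1, 0], [0, -1, 0]]
```

Note how the nonzero response appears only at the boundary column: an edge feature invisible at the single-pixel level.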

  • RNN (Recurrent Neural Network): While CNNs handle fixed-size two-dimensional rectangular image data, audio data is variable-length time-series data. To handle this variable-length data with neural networks, RNNs use a network structure where hidden layer values are fed back into the hidden layer at the next time step. RNNs suffer from vanishing or exploding gradients when trying to use data from long time periods ago, limiting them in practice to short-term dependencies.
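The recurrence above can be sketched with scalar weights (the weight values below are arbitrary toy assumptions):

```python
import math

# Minimal sketch of an RNN cell: the hidden state is fed back in at
# every step, h_t = tanh(w_x * x_t + w_h * h_{t-1}), so the same
# weights handle sequences of any length.
def rnn(xs, w_x=0.5, w_h=0.8):
    h = 0.0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
    return h

print(rnn([1.0, 2.0]))
print(rnn([1.0, 2.0, 0.5, -1.0]))  # a longer sequence needs no new weights
```

The repeated multiplication by w_h (and by tanh's derivative) across many steps is what makes gradients vanish or explode over long time spans.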

  • LSTM: LSTM (Long Short-Term Memory) is a powerful model that overcomes the shortcomings of RNN and can learn from long-term time-series data. Although it was published in 1997, it has recently gained rapid attention along with the deep learning boom. It is being applied to natural language processing and has started to achieve significant results.

  • Autoencoder: The core of an autoencoder is dimensionality reduction. An autoencoder is a type of neural network designed to learn compressed feature representations of information. The compression process is called the encoder, and the reconstruction process is called the decoder. Using an autoencoder to pre-estimate the initial weights of a network is called pre-training, but this is rarely used in practice anymore. While autoencoders aim to compress dimensions, their potential lies in the fact that they self-learn feature representations of the target through self-supervised learning where input and output are the same. They are now also proving effective in generative models.

  • GAN: GAN (Generative Adversarial Network) is a cutting-edge generative model that has attracted tremendous attention. The basic idea of GAN is simple, so let's explain with an analogy. Imagine two characters: a counterfeiter who makes fake bills and a police officer, as illustrated in the figure. The counterfeiter creates fake bills that resemble real currency. The police officer tries to detect the fakes. Poorly made fakes are easily caught by the police, but as the counterfeiter's skills improve and the fakes become more sophisticated, the police also work harder to distinguish them. Through this mutual competition, the fake bills eventually become indistinguishable from real currency.

  • SNN (Spiking Neural Network): Unlike conventional neural networks, this model focuses on the internal potential of neurons rather than firing frequency. It represents analog quantities through pulse timing, achieving higher biological fidelity. Its asynchronous operation makes high-speed processing a challenge.

  • NNA (Neural Network Accelerator): Dedicated hardware. Primarily accelerates matrix multiply-accumulate operations.


Machine Learning Frameworks and Libraries

  • PyTorch: A paradigm that dynamically constructs neural networks using Define-by-Run.

  • PyTorch Lightning: A PyTorch wrapper that significantly reduces the amount of PyTorch coding required.

  • Caffe: [C++ (Python, MATLAB also possible?)]. Suited for image recognition? Implemented in C++ with GPU support, enabling high-speed computation. With the catchphrase "Caffe is a community," its development community actively updates GitHub and provides many sample codes, making it recommended for beginners. The convolutional neural network image classification model that won first place in the large-scale image recognition competition ILSVRC in 2012 is readily available. Caffe is primarily developed by BVLC, the computer vision and machine learning research center at UC Berkeley. Yahoo Japan became a sponsor of the center in June 2014, supporting its research including Caffe development.

  • Caffe2: C++. A successor to Caffe led by Facebook, rather than by BVLC (the UC Berkeley center that primarily develops Caffe).

  • Chainer: [Python]. A library specialized for describing neural networks. Made in Japan by Preferred Networks, it has a very easy-to-use syntax. Its most notable feature is the ability to construct flexible computation graphs through Define by Run. Although it uses Python, computations are performed by numpy (implemented in C). Demand is mainly domestic in Japan. Memory-efficient in a characteristically Japanese way?

  • Keras: (Being integrated into TensorFlow's high-level API?) It enables incredibly simple implementation by just stacking layers on top of each other. With Keras's emergence, many Chainer users seemed to quickly switch over. Keras appeared as a wrapper for tensor computation libraries like Theano and TensorFlow. Since its role is to simplify notation, easy computation graph construction is naturally expected. Not only is graph construction easy, but training code can be implemented in just one line by passing conditions like number of training epochs as arguments.

  • TensorFlow: [Python/C++]. In practice, TensorFlow's role is to efficiently compute multi-dimensional array calculations and execute them as computation graphs. Since neural networks can be described as computation graphs, TensorFlow naturally excels at deep learning. However, it is actually a more general-purpose computation framework, not limited to deep learning. While TensorFlow has been adding various functions to support deep learning implementation, it has not always excelled in ease of notation compared to other frameworks. Its Define and Run approach means computation graphs cannot be changed during training. It tends to be heavy as it uses all available GPU/memory.

  • Theano: [Python] University of Montreal. Its features include deep learning, matrix operations, runtime C code generation and compilation, automatic differentiation, and GPU processing (requires CUDA). In some cases, it can compute faster than the numerical computing library NumPy. There is a vast amount of deep learning tutorials available. Theano itself is a computation library supporting automatic partial differentiation and GPU, not a dedicated deep learning package. It is very useful for those who want to understand the theory and implement from scratch. Many libraries have been developed based on Theano.

  • TensorFlow XLA: XLA (Accelerated Linear Algebra). It analyzes TensorFlow graphs created by users at runtime using JIT compilation technology. It generates specialized graphs based on actual dimensions and types at runtime, combines multiple operations, and produces binary code that can execute efficiently on CPUs, GPUs, or custom accelerators (such as Google's TPUs). Using tfcompile, TensorFlow representations can be converted to CPU executable code (without requiring TensorFlow Runtime).

  • cuDNN: A deep learning library published by NVIDIA.

  • Core ML: Apple

  • ARMComputeLibrary

  • Sonnet: DeepMind's TensorFlow-based neural network library.

  • Protocol Buffers: Used for data exchange between nodes. Data is serialized. Also commonly used for neural network data. tflite reportedly uses FlatBuffers.

  • ONNX (Open Neural Network Exchange): An open-source format for AI models. Proposed by Microsoft and Facebook. Supports TensorFlow through converters. Trained weights can be embedded in the model file.

  • NNEF (Neural Network Exchange Format): Version 1.0 (provisional specification) was announced by the Khronos Group on 2017/12/25. A format that absorbs differences in neural network file formats. The differences from ONNX are that it is text-based and led by a non-profit organization? Also works with OpenVX.

  • OpenVX: Khronos Group. Standardization of image recognition APIs. Along with OpenXR, covers VR, Vision, and NN-related areas.

  • NHWC: Num_samples x Height x Width x Channels data format. NHWC is the TensorFlow default and NCHW is the optimal format to use when training on NVIDIA GPUs using cuDNN. For performance, supporting both NHWC and NCHW may be desirable.
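Converting between the two layouts is a single axis permutation; the sketch below uses an arbitrary toy shape:

```python
import numpy as np

# Sketch: NHWC (TensorFlow default) <-> NCHW (the layout cuDNN prefers)
# is just a transpose of the axes.
nhwc = np.zeros((8, 32, 32, 3))     # batch, height, width, channels
nchw = nhwc.transpose(0, 3, 1, 2)   # batch, channels, height, width
print(nchw.shape)                   # (8, 3, 32, 32)
back = nchw.transpose(0, 2, 3, 1)   # and back again
print(back.shape)                   # (8, 32, 32, 3)
```

The transpose only changes strides, not data, so frameworks that support both layouts can often avoid a physical copy.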


Machine Learning Metrics Terminology

http://www.procrasist.com/entry/ml-metrics The logloss (Logarithm Loss) for binary classification doesn't just look at the classification result but takes the average of log values of the probability of belonging to each class (the classifier's confidence), thus examining the process leading to classification.
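The definition above can be sketched directly (the labels and probabilities below are hypothetical):

```python
import math

# Sketch: binary logloss averages -log(probability assigned to the true
# class), so it examines the classifier's confidence, not just the
# hard classification result.
def logloss(y_true, p_pred):
    return -sum(math.log(p if y == 1 else 1 - p)
                for y, p in zip(y_true, p_pred)) / len(y_true)

print(logloss([1, 0, 1], [0.9, 0.1, 0.8]))   # confident and correct: small
print(logloss([1, 0, 1], [0.6, 0.4, 0.55]))  # correct but unsure: larger
```

Both prediction sets classify every example correctly at a 0.5 threshold, yet their logloss differs -- exactly the "process leading to classification" the text describes.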

Function approximation: Training a neural network to approximate the behavior of a function.

  • Accuracy: The proportion of all predictions that match the correct answers. Not typically used in research papers.
  • Precision: Among predictions labeled as positive, the proportion that are actually positive. A term from information retrieval, along with recall. Sometimes used with one metric fixed.
  • Recall: Among actual positives, the proportion predicted as positive.
  • F-measure: An evaluation metric for prediction accuracy: the harmonic mean of Precision and Recall. The name may originate from a misremembering of E-measure. https://ci.nii.ac.jp/naid/110002939532 Micro-average and macro-average of precision: micro computes precision after summing counts across all categories; macro averages the per-category precisions. By analogy with economics, "macro" looks at indicators per country (category), while "micro" ignores categories and looks at fine-grained units.
  • Confusion matrix: True/False Positive/Negative. In character recognition, the matrix can be large enough to truly represent confusing situations.
  • Root Mean Squared Error (RMSE): The squaring is related to standard deviation.
  • Coefficient of Determination: Also called R2. Ranges from 0 to 1, with values closer to 1 being better. When predictions are perfect, the numerator becomes 0. The denominator serves as normalization to 1.

Confusion Matrix

True/False are adjectives describing whether the prediction was correct; note that samples with a positive ground truth are the TP and FN cells.

  • True Positive (TP): The number of actual positive samples correctly predicted as positive.
  • False Positive (FP): The number of actual negative samples incorrectly predicted as positive.
  • False Negative (FN): The number of actual positive samples incorrectly predicted as negative.
  • True Negative (TN): The number of actual negative samples correctly predicted as negative.
  • False Positive Rate: FP/(FP+TN) -- the proportion of actual negatives incorrectly predicted as positive. (The denominator is the total number of actual negatives.)
  • True Positive Rate: TP/(TP+FN) -- the proportion of actual positives correctly predicted as positive. (The denominator is the total number of actual positives.)
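The four cells and the two rates can be counted directly; the label vectors below are hypothetical:

```python
# Sketch: counting the confusion-matrix cells and the two rates.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))

tpr = tp / (tp + fn)   # denominator: all actual positives
fpr = fp / (fp + tn)   # denominator: all actual negatives
print(tp, fp, fn, tn)  # 3 1 1 3
print(tpr, fpr)        # 0.75 0.25
```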

ROC Curve

A plot with False Positive Rate on the horizontal axis and True Positive Rate on the vertical axis.

  • Characteristics of ROC curves: The ROC curve never decreases as you move to the right. A model that achieves a high true positive rate at a low false positive rate is better.
  • AUC (Area Under an ROC Curve): The area enclosed by the ROC curve, x-axis, and y-axis (the shaded region in the figure below) -- the larger this area, the better the model. A model with AUC close to 1 has high performance; when predictions are completely random, the AUC is 0.5, meaning the ROC curve is a straight line connecting the origin (0,0) and (1,1).
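The area can be sketched with the trapezoid rule over (FPR, TPR) points; the two point sets below are made up to illustrate the random baseline and a better model:

```python
# Sketch: AUC via the trapezoid rule over ROC points sorted by FPR.
def auc(points):
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

random_guess = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]  # the diagonal line
good_model   = [(0.0, 0.0), (0.1, 0.8), (1.0, 1.0)]  # high TPR at low FPR
print(auc(random_guess))  # 0.5
print(auc(good_model))
```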

Lift Chart

There are multiple definitions of lift charts, and there are few references defining the version used in DataRobot (what is sometimes called a lift chart is actually a different thing called a cumulative response curve). Lift charts are used to measure the accuracy of predictive models. They allow you to visually and quickly assess how much discriminative or predictive power a model's predicted values have, and when comparing multiple models, which has better accuracy. Slope of the actual line -- generally, steeper is better. Closeness of predicted and actual -- generally, closer is better.

  • Also sometimes called Cumulative Response Curve. The difference from ROC is that while the vertical axis remains True Positive Rate (TPR), the horizontal axis uses Positive Prediction Rate.

Ensemble Learning

https://www.codexa.net/what-is-ensemble-learning/ A method of combining weak learners to create a high-performance learner. There are three main approaches. For classification, majority voting of each model's predictions is commonly used; for regression, the average value is typically used.

  • Bagging: A method that trains weak learners in parallel and combines them (bagging = bootstrap + aggregating). Random Forest is an example. It generally reduces the variance of model predictions.
  • Boosting: A method that trains weak learners sequentially, combining them to make them stronger. Each successive learner prioritizes correctly classifying the data misclassified by the previous learner. It generally reduces the bias of model predictions. XGBoost is an example.
  • Stacking: A method of stacking models; it is advanced and complex. When used properly, it can balance both bias and variance.

  • Weak learner (low-performance learner) f(x): Accuracy greater than 0.5 = better than randomly returning +1 or -1. A learner with accuracy less than 0.5 can be made into one with accuracy greater than 0.5 by inverting its classification results. Note: Complex learners can be used, but simple learners with low computational cost are commonly used (decision stumps, decision trees, etc.). Identical classifiers do not improve performance (diversity is needed).
  • Bootstrap: Randomly selecting N data points with replacement from a dataset of N items. A technique used in statistics for estimating population statistics. It creates many slightly different datasets.
  • Bias and Variance: Bias is the average error between actual and predicted values; smaller values mean less error between predicted and true values. Variance indicates how spread out the predictions are; smaller values mean less spread. Bias and variance have a trade-off relationship.
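Bootstrap resampling as defined above is one line of stdlib Python; the dataset below is a hypothetical example:

```python
import random

# Sketch: a bootstrap sample draws N items *with replacement* from a
# dataset of N items, so each resample differs slightly from the original.
random.seed(0)
data = [2.0, 3.5, 4.1, 5.0, 6.2]
resamples = [[random.choice(data) for _ in data] for _ in range(3)]
means = [sum(s) / len(s) for s in resamples]
print(resamples)
print(means)  # the spread of these means estimates the estimator's variability
```

Bagging trains one weak learner per resample; the "slightly different datasets" are what give the ensemble its diversity.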

Boosting

Similar to bagging in that it extracts a subset of data, builds weak learners, and combines them at the end. The difference is that boosting uses the results of the previous iteration. Therefore, parallel processing is not possible.

  • AdaBoost: Updates by increasing the weights of misclassified values from the previous round. When "boosting" is mentioned without qualification, it often refers to this method.
  • MadaBoost / LogitBoost: Versions of AdaBoost with different loss functions. Appeared in "Machine Learning Illustrated" but rarely seen elsewhere.
  • Gradient Boosting: Boosting formulated as gradient descent on a loss function: each new weak learner is fit to the gradient of the current ensemble's loss (the residuals, in the case of squared loss). AdaBoost corresponds to the special case of exponential loss.
  • GBDT (Gradient Boosting Decision Tree): Gradient boosting that uses decision trees as weak learners. Sounds impressive.
  • XGBoost: A fast C++ implementation of Gradient Boosting. Reportedly 10x faster than the traditionally used GBT.

Serialization Formats Used in Machine Learning

https://www.sambaiz.net/article/46/

  • MessagePack: Can be used like JSON, but is faster and smaller.
  • Protocol Buffers: A serialization format for communication and persistence that defines structure using an Interface Definition Language (IDL), developed by Google. Also used in gRPC. Message types are defined in proto files (proto3). Code for various languages can be generated from these files. Used in all of Google's services and TensorFlow via protobuf.
  • FlatBuffers: Google's serialization format for performance-critical applications like games. Does not require parsing or unpacking before accessing data, but requires per-object memory allocation. Schema files are written instead of proto files. Used in TensorFlow Lite.

Types of Learning

  • Supervised learning: Training performed with known outcomes.

  • Unsupervised learning: Training performed without known outcomes -- reading the structure of data -- has a data mining aspect.

  • Reinforcement learning: A mechanism where taking actions based on data produces evaluations of those actions, which are then used to improve decision-making. For example, Q-learning.

  • Semi-supervised learning: Combines supervised and unsupervised learning. Uses both labeled and unlabeled data for training. Semi-supervised learning is well-suited for generating approximate functions and classifiers, with the surprising benefit of lower data collection costs. Within semi-supervised learning, transduction is a learning method that only makes predictions for unlabeled data within the given dataset.

  • Transduction (transductive inference): Attempts to predict new outputs for specific and fixed (test) examples directly from observed specific (training) examples. Compare deduction, induction, and transduction: induction infers general rules from observed inputs, while deduction derives specific conclusions from general rules.

  • Learning to learn: Aims to improve overall decision-making by performing multiple decisions together.

  • Deep reinforcement learning: Deep learning that provides feedback to trials (reinforcement learning), as in autonomous driving.

  • Backpropagation: Short for "backwards propagation of errors." Errors (and learning) propagate from output nodes back to earlier nodes. Technically, backpropagation calculates the gradient of the error with respect to the modifiable weights in the network. This gradient is most commonly used in stochastic gradient descent, a simple algorithm for minimizing error. For backpropagation, the network must have at least three layers (input, hidden, output). For hidden layers in multi-layer networks to represent meaningful functions, non-linear activation functions are required; a multi-layer network with only linear activation functions is equivalent to a single-layer network. Common non-linear activation functions include the logistic sigmoid, tanh, softmax, and Gaussian functions, but the ramp function max(0, x) (ReLU) is now considered the default choice for hidden layers.

  • Gradient descent: A type of optimizer (optimization algorithm).

    • Gradient Descent: Computes the sum of all errors before updating parameters. Slow computation.
    • Stochastic Gradient Descent (SGD): Randomly selects one sample from training data, computes the error, and updates parameters. Fast computation.
    • Mini-batch Stochastic Gradient Descent (Mini-batch SGD / MSGD): A middle ground between the above two -- randomly selects several data points from training data, computes the error, and updates parameters.
  • Optimization algorithms:

    • "Scaled conjugate gradient." The assumptions justifying the use of conjugate gradient methods only apply to batch learning, so this method cannot be used for online or mini-batch learning.
    • "Gradient descent." This method should be used for online or mini-batch learning. It can also be used for batch learning.
  • Epoch: "Learning rate decay (epoch)": When gradient descent is used in online or mini-batch learning, this is the number of epochs (p) or data passes of training samples needed to decrease the initial learning rate to the learning rate lower bound.

  • Activation function: A non-linear or identity function applied after a linear transformation in a neural network. Includes step function, linear combination, sigmoid function, softsign, softplus, ReLU (ramp function, max(0, x)).

  • Logistic regression: A type of statistical regression model for variables following a Bernoulli distribution. It is also a type of Generalized Linear Model (GLM) using logit as the link function. The model is equivalent to the simple perceptron published in 1958, but in scikit-learn, models that use stochastic gradient descent for optimization are called perceptrons, while those using coordinate descent or quasi-Newton methods are called logistic regression.

  • Batch learning: When there are N training data points x, batch learning uses all N data points, computes the average loss l for each data point, considers it as the overall loss L, defined as L(t,x;w)=1/N * sum(l(ti,xi;w)), and proceeds with training. Generally, training is stable, and since data fed into the neural network can be effectively computed simultaneously, it is faster than the following two methods.

  • Online learning: On the other hand, online learning (or "stochastic gradient method") randomly selects one xi from N data points x1, x2, ..., xN and directly uses the loss for that single data point as L: L(t,x;w) = l(ti,xi;w). This type of learning is typically unstable.

  • Mini-batch learning: A middle ground between batch learning and stochastic gradient method. From N total training data points, randomly select n (much less than N) data points and define the loss function as L(t,x;w) = 1/n * sum(l(ti,xi;w)), then train. In practice, this method is used in most situations, and when people say "stochastic gradient method" today, they often refer to mini-batch learning.

  • Label: The answer attached to data in supervised learning.
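The batch, online, and mini-batch loss definitions above can be sketched with a hypothetical 1-D squared loss l(t, x; w) = (t - w*x)^2 (all data values below are made up):

```python
import random

# Sketch: the three ways of defining the loss L over N training points.
def l(t, x, w):
    return (t - w * x) ** 2

xs = [1.0, 2.0, 3.0, 4.0]
ts = [2.1, 3.9, 6.2, 7.8]
w = 2.0

# Batch: average loss over all N points (stable, parallelizable).
L_batch = sum(l(t, x, w) for t, x in zip(ts, xs)) / len(xs)

# Online: loss of one randomly chosen point (fast, but noisy).
i = random.randrange(len(xs))
L_online = l(ts[i], xs[i], w)

# Mini-batch: average loss over n << N randomly chosen points.
idx = random.sample(range(len(xs)), 2)
L_mini = sum(l(ts[i], xs[i], w) for i in idx) / 2

print(L_batch, L_online, L_mini)
```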


Hardware

TPU / Cloud TPU: The original TPU was inference-only. Cloud TPU added support for training.

NPU: Huawei AI Processor

Neural Engine: Apple A11 Bionic

Qualcomm uses DSP instead of GPU?

Pixel Visual Core: Designed by Google for Pixel 2. HDR+ quality. Pixel Visual Core contains 8 "Image Processing Units (IPUs)" designed by Google. These IPUs support Halide, an open-source image processing language developed at MIT, and Google's machine learning library TensorFlow, with plans to use them for non-image-processing functions in the future.

Sensors: ToF (Time of Flight): Measures distance from the round-trip time (phase difference of pulses) of emitted laser light to the target.

LiDAR (Light Detection and Ranging): A radar system using laser light. It continuously emits laser beams and measures the three-dimensional positions of reflection points at high density and low cost. Among the sensors on self-driving cars like Google Car, LiDAR stands out the most. It rotates on the roof like a police car's rotating light, making it very conspicuous. This LiDAR is a sensor that instantly reads and digitizes the 3D spatial structure of the surrounding 360 degrees. The principle is the same "time-of-flight" method used by the second-generation Kinect. LiDAR's advantage as a sensor for autonomous driving is not only its ability to quickly and accurately perceive the surrounding 3D space. Conveniently, LiDAR can also read lane markers such as white lines painted on roads, since road markings are often painted with highly reflective paint, allowing LiDAR to include the reflectivity difference between white lines and road surfaces in the 3D data.

Millimeter-wave radar: While millimeter-wave radar, which uses radio waves, has lower resolution than LiDAR, it can detect objects regardless of weather conditions, with a range of up to 250m.

TrueDepth: A camera system combining the front camera, infrared camera, ambient light sensor, proximity sensor, and dot projector located at the top front of the iPhone X. After confirming that the user is looking at the screen with eyes open, it projects over 30,000 infrared dots onto the face and analyzes this data to create a depth map and 2D infrared image of the face.

DMI (Distance Measuring Instrument): An odometer that measures how far a vehicle has traveled by counting tire rotations.

IMU (Inertial Measurement Unit): A 6-axis sensor (3-axis accelerometer plus 3-axis gyroscope) used for inertial navigation.


EDA: Stands for Exploratory Data Analysis. A research phase.

Vanilla LSTM: A classical LSTM.

Sparse Modeling, Regularization

Ridge regression, Lasso regression, Elastic Net https://aizine.ai/ridge-lasso-elasticnet/##toc8

Kernel Methods

Gradient Methods

  • Coordinate descent and steepest descent
  • Mini-batch and stochastic gradient => May help avoid local minima

Singular Value Decomposition (SVD) and Low-Rank Matrix Approximation

By selecting only the top ranks from the singular value decomposition and reconstructing, you get a low-rank approximation. Is this different from low-rank matrix recovery? Low-rank matrix recovery also internally uses singular value decomposition. Given an unknown N_1 x N_2 matrix X_0, suppose we perform M linear observations, with the observation results y = (y_1, y_2, ..., y_M)^T, where y_i = <A_i, X_0> for i = 1, 2, ..., M and <X, Y> = tr(X^T Y). Assuming the observation matrices {A_1, A_2, ..., A_M} are known, we express the observation using the linear operator A: R^(N_1 x N_2) -> R^M as y = A(X_0). When the rank of the unknown matrix X_0 is known to be small, the matrix recovery problem is to estimate X_0 from y and A.
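The "keep only the top ranks" step is a few lines of numpy (the matrix below is a made-up example; keeping the top r singular values gives the best rank-r approximation by the Eckart-Young theorem):

```python
import numpy as np

# Sketch: truncated SVD as low-rank approximation.
A = np.array([[4.0, 0.0, 0.0],
              [0.0, 3.0, 0.0],
              [0.0, 0.0, 0.1]])
U, s, Vt = np.linalg.svd(A)
r = 2
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
print(np.round(A_r, 3))            # the small third singular value is dropped
print(np.linalg.matrix_rank(A_r))  # 2
```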

What is SVD?

Singular Value Decomposition (SVD) is mathematically a method for decomposing an M x N (M rows, N columns) matrix. In computer science, it is used to compress dimensions while preserving the features represented by matrices (in information retrieval, for example, matrices of word frequency per document) as much as possible. It is applied to image processing, natural language processing, search, and recommendation systems that deal with multi-dimensional features. Other dimensionality reduction techniques include Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF).

NMF (Non-negative Matrix Factorization)

https://abicky.net/2010/03/25/101719/ NMF is an algorithm that factorizes a non-negative matrix into two non-negative matrices. By doing so, it can clearly reveal the latent factors underlying the original matrix. https://qiita.com/kusano_t/items/4c0429778613bb4a336d
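A minimal sketch of NMF using the classic Lee-Seung multiplicative updates (the matrix V, the rank k, and the iteration count below are arbitrary toy choices):

```python
import numpy as np

# Sketch: factorize non-negative V ~ W @ H, keeping W and H non-negative.
# Multiplicative updates never make the Frobenius error increase.
rng = np.random.default_rng(0)
V = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 5.0]])
k = 2
W = rng.random((3, k)) + 0.1
H = rng.random((k, 3)) + 0.1
eps = 1e-9

err0 = np.linalg.norm(V - W @ H)
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
err = np.linalg.norm(V - W @ H)
print(err0, err)  # reconstruction error decreases; W, H stay non-negative
```

The non-negativity is what makes the factors interpretable as additive latent parts, unlike SVD whose factors can cancel via signs.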

Maximum Likelihood Estimation (MLE)

A method in statistics for point estimation of parameters of the probability distribution that the given data follows.
Logistic regression is a regression that uses maximum likelihood estimation.
When the amount of data is sufficiently large in Bayesian estimation, or when the posterior distribution is unaffected by the prior distribution, it is equivalent to maximum likelihood estimation.
=> Estimation is done by assuming the data is distributed with a certain probability around the function estimated by methods like least squares.
The EM algorithm is one solution method.

Gaussian Mixture Model

In k-means, each data point belongs to exactly one cluster, so the indicator variable is strictly 0-1, like r1=(0,1,0). In Gaussian mixture models, each data point still belongs to clusters, but the assignment becomes soft: the indicator is treated as a latent variable giving the probability (responsibility) of belonging to each cluster.

IoU: IoU stands for Intersection over Union. "Over" means "divided by." It is a metric that indicates "how much two regions overlap."

https://mathwords.net/iou
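For axis-aligned boxes, "intersection over union" is a short function (the boxes below are hypothetical, given as (x1, y1, x2, y2)):

```python
# Sketch: IoU = overlap area divided by union area.
def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # x-overlap
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # y-overlap
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7: small overlap
print(iou((0, 0, 1, 1), (0, 0, 1, 1)))  # 1.0: identical boxes
```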

Bayesian Network

Academically, it is a type of graphical model represented by a directed acyclic graph (DAG) where probability variables correspond to nodes and relationships between nodes correspond to directed edges. It can represent how prior beliefs about multiple events change with new evidence, which is why it is called a "Bayesian" network. https://www.synergy-marketing.co.jp/blog/introduction-bayesian-network

Markov Network

A Markov network is a Bayesian network with the arrows removed from the links.
While it has less expressive power than Bayesian networks, it has the advantage of easier inference computation, and is used in speech and image processing fields.

Naive Bayes (Simple Classifier)

A simple probabilistic classifier based on applying Bayes' theorem. Frequently used for text classification. Maximum likelihood estimation is used for parameter estimation.
"Naive" means "ridiculously simple." The simplest form of Bayesian network, with one parent and multiple independent children, is called Naive Bayes. It became famous for being used in spam filters at one point. Despite being simple, it has respectable classification performance, though it is also considered too simple to be called a Bayesian network. The true strength of Bayesian networks is demonstrated in complex models that consider multi-level parent-child relationships and interactions between children.

Decision Tree

A prediction model using leaves and roots.
Commonly used in data mining, where leaves represent classifications and branches represent the set of features leading to that classification.

K-Means

A distance-based clustering algorithm that assigns data to a predetermined number of clusters.
=> Unsupervised learning that clusters by distance from representative points, moves representative points to the centroid of the results, and repeats clustering.
K-nearest neighbors is supervised learning.
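The assign / move-to-centroid / repeat loop above can be sketched in 1-D (the points and initial representatives below are made up):

```python
# Sketch: 1-D k-means. Assign each point to the nearest representative,
# move each representative to its cluster's centroid, repeat.
def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

print(kmeans_1d([1.2, 0.8, 1.0, 9.0, 9.5, 8.5], [0.0, 5.0]))  # [1.0, 9.0]
```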

SVM

Support Vector Machine. A binary classifier.
Plots sample values on coordinates and finds the decision boundary that maximizes the distance from both the positive and negative example sets.

BERT (Bidirectional Encoder Representations from Transformers)

Achieved bidirectional learning through mask-based fill-in-the-blank tasks. "Attention is all you need"? Compared to Word2Vec, "BERT's generalizability was achieved by capturing information from a representational space one dimension larger than the symbolic space."

Bandit Algorithm

  • An algorithm that efficiently performs exploration and exploitation to maximize profit over a fixed period.
  • (In the context of reinforcement learning) An algorithm used by agents for efficient learning.
  • For simplicity, values below are binary => hit or miss, clicked or not, purchased or not.

Is accuracy ever the right metric?

=> It is OK when classes are balanced. But macro/micro F-measure may be safer.

What is the difference between micro-average and macro-average?

=> Micro-average first ignores categories, computes overall precision and recall, then calculates F-measure from those. Macro-average first computes the F-measure for each category, then takes their simple average. Micro-average is heavily influenced by categories with many examples, while macro-average is heavily influenced by categories with few examples.
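The difference shows up clearly with imbalanced categories; the counts below are hypothetical:

```python
# Sketch: micro- vs macro-averaged precision for two categories, one
# large (A) and one tiny (B).
tp = {"A": 90, "B": 1}
fp = {"A": 10, "B": 1}

# Micro: sum counts over all categories first, then compute precision.
micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
# Macro: compute per-category precision first, then average.
macro = sum(tp[c] / (tp[c] + fp[c]) for c in tp) / len(tp)

print(micro)  # 91/102, dominated by the large category A
print(macro)  # (0.9 + 0.5)/2, the tiny category B weighs equally
```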

Harmonic Mean

The average speed when traveling the same distance at different speeds for each direction. An average of efficiency that accounts for cost? "The harmonic mean is the average efficiency when performing the same task at different efficiencies, and both precision and recall are efficiencies for obtaining relevant documents." When replacing the reciprocal of electrical resistance (1/R) with the per-resistor average (n/R), you get the harmonic mean.
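The same-distance-different-speeds intuition can be checked directly (the speeds below are an arbitrary example):

```python
# Sketch: harmonic mean of two values, the formula behind both the
# average-speed example and the F-measure.
def harmonic_mean(a, b):
    return 2 * a * b / (a + b)

# Drive 1 km at 60 km/h, then 1 km at 20 km/h: total 2 km in
# (1/60 + 1/20) h, so the average speed is 30 km/h, not the
# arithmetic mean 40 km/h.
print(harmonic_mean(60, 20))    # 30.0
# F-measure = harmonic mean of precision and recall.
print(harmonic_mean(0.5, 1.0))  # punished toward the smaller value
```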

What is the origin of the F-measure? Just the next letter after E (for error)?

http://d.hatena.ne.jp/sleepy_yoshi/20110410/p1

Geometric Mean: A Pythagorean mean; the nth root of the product of n values. Appears in the geometric mean relation for right triangles (finding the altitude to the hypotenuse)?

Arithmetic Mean

Induction and Deduction

Machine learning via regression and data is inductive. Processing through theory is deductive.

Expected value: The mean. Represented by E(). In statistics, "mean" refers to arithmetic mean. "Average" is used more loosely to include median, etc.

Mean Absolute Error (MAE): Also called mean absolute error. Both RMSE and MAE are commonly used error metrics. Since RMSE squares the values inside the root, it tends to treat outliers (large deviations) as larger errors compared to MAE.

Root Mean Squared Error (RMSE): Looks like a standard deviation formula. Also called mean square root error, RMS Error, RMSD (Root Mean Square Deviation), etc.

MSE (Mean Squared Error): The mean of the squared errors; the formula resembles a variance. https://mathwords.net/rmsemae

MAPE (Mean Absolute Percentage Error): The mean of the absolute errors expressed as percentages of the true values. https://qiita.com/japanesebonobo/items/ad51cbbf36236b023df0
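
A small worked comparison on made-up numbers, showing how a single outlier inflates RMSE much more than MAE:

```python
import math

y_true = [10, 10, 10, 10]
y_pred = [11,  9, 10, 20]   # one large outlier error (10 -> 20)

errors = [p - t for p, t in zip(y_pred, y_true)]
mae = sum(abs(e) for e in errors) / len(errors)             # (1+1+0+10)/4 = 3.0
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # sqrt(102/4) ~ 5.05
mape = 100 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / len(errors)

print(mae, rmse, mape)  # 3.0 5.049... 30.0
```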

The logit function is the inverse of the logistic function.
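
A quick numerical check of the inverse relationship (minimal sketch):

```python
import math

def logistic(x):
    # Sigmoid: maps the real line to (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    # Log-odds: maps (0, 1) back to the real line.
    return math.log(p / (1.0 - p))

x = 1.7
print(logit(logistic(x)))  # recovers 1.7 (up to floating-point rounding)
```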

Chaos theory: Verified using the Lyapunov exponent. If the Lyapunov exponent is positive, the system has sensitivity to initial conditions characteristic of chaotic systems, making prediction in that system extremely difficult.
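
A numerical sketch using the logistic map x_{n+1} = r·x·(1−x), a standard toy example (not from the original text): the Lyapunov exponent is estimated as the long-run average of log|f′(x)| along an orbit. At r = 4 the known value is ln 2 ≈ 0.693 (positive, chaotic); at r = 2.5 the orbit settles to a fixed point and the exponent is negative.

```python
import math

def lyapunov(r, x0=0.1, n=100_000, burn_in=1_000):
    # Estimate the Lyapunov exponent of the logistic map as the mean of
    # log|f'(x)| = log|r * (1 - 2x)| along a trajectory.
    x = x0
    for _ in range(burn_in):          # discard transients
        x = r * x * (1 - x)
    total = 0.0
    for _ in range(n):
        total += math.log(abs(r * (1 - 2 * x)))   # local stretching rate
        x = r * x * (1 - x)
    return total / n

print(lyapunov(4.0))   # positive, near ln 2: sensitive to initial conditions
print(lyapunov(2.5))   # negative: predictable, orbit converges
```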

Cold start problem: In recommender systems, the difficulty of making good recommendations for new users or new items that have no interaction history yet.


Terminology:

VAE: Variational Autoencoder

Likelihood: Plausibility; how "likely" the observed data are under a given model or parameter values.

Machine learning: A mechanism that has machines read large amounts of training data and create rules for inference such as classification and judgment. The process can be broadly divided into two parts: "training" and "inference."

Training/Learning: The process of fitting a neural network (or other model) to data. "Learning" is the broader term; "training" refers more specifically to that fitting process. Inference: Applying the trained model to new data to produce predictions.

Pattern recognition: A collective term for speech recognition, image recognition, spatial recognition, etc. A type of natural information processing: it selects and extracts targets with certain rules or meaning from data containing miscellaneous information such as images and audio. "Pattern recognition originated in engineering, while machine learning arose from computer science. However, these research activities can be viewed as two aspects of the same field, and both have developed significantly over the past decade."

Formal Method: Formal methods express objects using mathematical concepts. Specifically, they represent objects primarily using "sets" and "relations" between sets and their elements. Then, when some operation is performed on multiple sets or relations, statements about whether those relations "hold" or "do not hold" are expressed as "logical propositions." The academic discipline that mathematically treats "logic" is called "mathematical logic," and formal methods have been primarily researched within it.

GEMM (General Matrix Multiply): The general matrix-multiplication routine from BLAS, C ← αAB + βC; the matrices need not be square.
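
A naive reference implementation of the operation C ← αAB + βC for non-square shapes (illustrative only; real GEMM kernels are heavily optimized for caches and SIMD):

```python
def gemm(alpha, A, B, beta, C):
    # A is m x k, B is k x n, C is m x n (lists of lists).
    m, k, n = len(A), len(B), len(B[0])
    out = [[beta * C[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for p in range(k):
            a = alpha * A[i][p]
            for j in range(n):
                out[i][j] += a * B[p][j]
    return out

A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
B = [[1, 0],
     [0, 1],
     [1, 1]]             # 3 x 2
C = [[1, 1],
     [1, 1]]             # 2 x 2
print(gemm(1.0, A, B, 0.0, C))   # with beta = 0, this is plain A @ B
```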

Tensor: A tensor is a generalization of linear quantities or linear geometric concepts that, given a choice of basis, can be represented as a multi-dimensional array. However, a tensor itself is an object defined independently of any particular coordinate system.

Tensor product: In mathematics, the tensor product is a concept for linearization that handles multilinearity in linear algebra, and is one of the operations that create new objects from known vector spaces, modules, and other objects. For any such objects, the tensor product is the most free bilinear multiplication.
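
For plain vectors, the tensor product reduces to the outer product; a small NumPy illustration (NumPy assumed available):

```python
import numpy as np

# Tensor (outer) product of two vectors: the (i, j) entry is u[i] * v[j].
# np.tensordot with axes=0 generalizes this to higher-rank tensors.
u = np.array([1, 2])
v = np.array([3, 4, 5])
T = np.tensordot(u, v, axes=0)      # same as np.outer(u, v) here

print(T)
# [[ 3  4  5]
#  [ 6  8 10]]
print(T.shape)   # (2, 3) -- shapes multiply: (2,) x (3,) -> (2, 3)
```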

Define by Run and Define and Run: In the former, the computation graph is built dynamically as the forward pass executes, so ordinary control flow can shape the graph (Chainer, PyTorch, etc.); in the latter, a static graph is defined first and then executed, which is more rigid (TensorFlow 1.x, etc.).

YOLO9000: Not a framework, but a state-of-the-art real-time object detection system capable of detecting over 9000 object categories.

seq2seq (Sequence to Sequence): Among text generation models using RNN-based neural networks, Sequence to Sequence (Seq2Seq) is well-known. Seq2Seq is a type of Encoder-Decoder model using RNNs, and can be used as a model for machine dialogue, machine translation, and more.

Word2Vec: A natural language processing technique proposed by Tomas Mikolov and colleagues at Google. It is a quantification method that represents words as vectors. For example, while the vocabulary used daily by Japanese speakers ranges from tens of thousands to hundreds of thousands of words, Word2Vec represents each word as a vector in roughly 200-dimensional space. This enables tasks that were previously impossible or hard to do well, such as computing word similarity and performing addition and subtraction on words, effectively capturing aspects of word "meaning."
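
A toy illustration of word-vector arithmetic with made-up 3-dimensional vectors (real Word2Vec embeddings are learned from corpora and ~200-dimensional; the vectors and words here are fabricated for demonstration):

```python
import numpy as np

# Hand-crafted toy vectors: the point is only that
# v(king) - v(man) + v(woman) lands nearest to v(queen) by cosine similarity.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "apple": np.array([0.1, 0.5, 0.5]),   # distractor
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w not in ("king", "man", "woman")),
           key=lambda w: cosine(vectors[w], target))
print(best)   # queen
```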

Symbolic logic: Assuming we can analyze natural language sentences through morphological analysis (dividing them into the smallest grammatical units) and syntactic analysis (analyzing grammatical structure according to a given language's grammar), can we discover the logical structure hidden in sentences? The discipline of converting sentences expressed in words into logical symbols and capturing them through their logical relationships is called "symbolic logic," originating in the 19th century. Its most fundamental branch is propositional logic, which treats propositions (individual statements) as minimal units, combines them with logical operators (conjunction, disjunction, etc.) into complex propositions, and computes their truth values. Propositional logic cannot formalize all valid reasoning, however (for example, arguments involving "all" or "some"). Predicate logic extends propositional logic by introducing predicates: propositions are decomposed into predicates and the objects they apply to. => Pre-DL AI technology. But close to Formal Methods?
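
A minimal sketch of propositional logic in code: enumerate every truth assignment and verify that modus ponens, ((p → q) ∧ p) → q, holds under all of them, i.e. is a tautology.

```python
from itertools import product

def implies(a, b):
    # Material implication: a -> b is equivalent to (not a) or b.
    return (not a) or b

# Check ((p -> q) and p) -> q over all four truth assignments.
tautology = all(
    implies(implies(p, q) and p, q)
    for p, q in product([False, True], repeat=2)
)
print(tautology)   # True
```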

Deduction and induction: Deduction chains "because XX, therefore YY" logic to draw conclusions. Induction draws conclusions by summarizing similarities from many observations (facts).

Statistics

Image processing related: Occlusion: In 3D space, in addition to up-down and left-right relationships, there are front-back relationships where objects in front can hide objects behind them. This is called occlusion.

Robotics: Robot Operating System (ROS)

Training datasets: PASCAL VOC: An image dataset with annotations (bounding boxes). http://host.robots.ox.ac.uk/pascal/VOC/

COCO - Common Objects in Context: A dataset with semantic segmentation information (pixel-level object recognition information that is more detailed than typical annotations). http://cocodataset.org/

CIFAR-10 / CIFAR-100: Datasets published by Alex Krizhevsky's group (of AlexNet fame). Image datasets for 10-class and 100-class classification. http://www.cs.toronto.edu/~kriz/cifar.html

ImageNet: A dataset of over 14 million images with semantic tags, which began development after being presented at CVPR 2009 by a Princeton University team. About 1 million images include bounding box annotations. http://www.image-net.org/