@prefix : .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
a owl:Ontology .
a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition "A **(2+1)D Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) used in convolutional neural networks for action recognition, operating over a spatiotemporal volume. As opposed to applying a [3D Convolution](https://paperswithcode.com/method/3d-convolution) over the entire volume, which can be computationally expensive and prone to overfitting, a (2+1)D convolution splits the computation into two convolutions: a spatial 2D convolution followed by a temporal 1D convolution." ;
skos:prefLabel "(2+1)D Convolution" .
:1-bitAdam a skos:Concept ;
dcterms:source ;
skos:definition "**1-bit Adam** is a [stochastic optimization](https://paperswithcode.com/methods/category/stochastic-optimization) technique that is a variant of [ADAM](https://paperswithcode.com/method/adam) with error-compensated 1-bit compression, based on the finding that Adam's variance term becomes stable at an early stage. First, vanilla Adam is used for a few epochs as a warm-up. After the warm-up stage, the compression stage starts and we stop updating the variance term $\\mathbf{v}$ and use it as a fixed preconditioner. At the compression stage, we communicate based on the momentum applied with error-compensated 1-bit compression. The momentums are quantized into 1-bit representation (the sign of each element). Accompanying the vector, a scaling factor is computed as $\\frac{\\text { magnitude of compensated gradient }}{\\text { magnitude of quantized gradient }}$. This scaling factor ensures that the compressed momentum has the same magnitude as the uncompressed momentum. This 1-bit compression could reduce the communication cost by $97 \\%$ and $94 \\%$ compared to the original float32 and float16 training, respectively." ;
skos:prefLabel "1-bit Adam" .
:1-bitLAMB a skos:Concept ;
dcterms:source ;
skos:definition """**1-bit LAMB** is a communication-efficient stochastic optimization technique which introduces a novel way to support adaptive layerwise learning rates even when communication is compressed. Learning from the insights behind [1-bit Adam](https://paperswithcode.com/method/1-bit-adam), it is a 2-stage algorithm which uses [LAMB](https://paperswithcode.com/method/lamb) (warmup stage) to “pre-condition” a communication compressed momentum SGD algorithm (compression stage). At the compression stage, where the original LAMB algorithm cannot be used to update the layerwise learning rates, 1-bit LAMB employs a novel way to adaptively scale layerwise learning rates based on information from both warmup and compression stages. As a result, 1-bit LAMB is able to achieve large batch optimization (LAMB)’s convergence speed under compressed communication.\r
\r
There are two major differences between 1-bit LAMB and the original LAMB:\r
\r
- During compression stage, 1-bit LAMB updates the layerwise learning rate based on a novel “reconstructed gradient” based on the compressed momentum. This makes 1-bit LAMB compatible with error compensation and be able to keep track of the training dynamic under compression.\r
- 1-bit LAMB also introduces extra stabilized soft thresholds when updating layerwise learning rate at compression stage, which makes training more stable under compression.""" ;
skos:prefLabel "1-bit LAMB" .
:1DCNN a skos:Concept ;
dcterms:source ;
skos:altLabel "1-Dimensional Convolutional Neural Networks" ;
skos:definition "1D Convolutional Neural Networks are similar to the well known and more established 2D Convolutional Neural Networks. 1D Convolutional Neural Networks are mainly used on text and 1D signals." ;
skos:prefLabel "1D CNN" .
:1cycle a skos:Concept ;
dcterms:source ;
skos:altLabel "1cycle learning rate scheduling policy" ;
skos:definition "" ;
skos:prefLabel "1cycle" .
:1x1Convolution a skos:Concept ;
dcterms:source ;
skos:definition """A **1 x 1 Convolution** is a [convolution](https://paperswithcode.com/method/convolution) with a kernel size of 1 that can be used for dimensionality reduction, efficient low-dimensional embeddings, and applying non-linearity after convolutions. It maps an input pixel with all its channels to an output pixel, which can be squeezed to a desired output depth. It can be viewed as an [MLP](https://paperswithcode.com/method/feedforward-network) looking at a particular pixel location.\r
\r
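The per-pixel mapping above can be sketched in NumPy (an illustrative sketch, not from the source; the `conv1x1` helper, shapes, and channel sizes are assumptions):

```python
import numpy as np

def conv1x1(x, w):
    # x: (H, W, C_in) feature map; w: (C_in, C_out) kernel weights.
    # Every output pixel is a linear map of that pixel's input channels,
    # so the whole operation is one matrix multiply over flattened pixels.
    H, W, C_in = x.shape
    y = x.reshape(H * W, C_in) @ w
    return y.reshape(H, W, -1)

x = np.random.rand(4, 4, 8)      # 8 input channels
w = np.random.rand(8, 3)         # squeeze to a desired depth of 3
print(conv1x1(x, w).shape)       # (4, 4, 3)
```
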
Image Credit: [http://deeplearning.ai](http://deeplearning.ai)""" ;
skos:prefLabel "1x1 Convolution" .
:2DDWT a skos:Concept ;
dcterms:source ;
skos:altLabel "2D Discrete Wavelet Transform" ;
skos:definition "" ;
skos:prefLabel "2D DWT" .
:3-Augment a skos:Concept ;
dcterms:source ;
skos:definition "" ;
skos:prefLabel "3-Augment" .
:3DConvolution a skos:Concept ;
rdfs:seeAlso ;
skos:definition """A **3D Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) where the kernel slides in 3 dimensions, as opposed to 2 dimensions with 2D convolutions. One example use case is medical imaging, where a model is constructed using 3D image slices. Additionally, video data has a temporal dimension on top of the spatial ones, making it suitable for this module. \r
\r
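The sliding-kernel idea can be sketched naively in NumPy (illustrative only; the `conv3d` helper and shapes are assumptions, and real implementations are vectorized and support channels and strides):

```python
import numpy as np

def conv3d(volume, kernel):
    # Naive 'valid' 3D convolution: the kernel slides along depth,
    # height and width of a (D, H, W) volume.
    D, H, W = volume.shape
    d, h, w = kernel.shape
    out = np.zeros((D - d + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+d, j:j+h, k:k+w] * kernel)
    return out

vol = np.random.rand(8, 16, 16)                # e.g. 8 stacked image slices
print(conv3d(vol, np.ones((3, 3, 3))).shape)   # (6, 14, 14)
```
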
Image: Lung nodule detection based on 3D convolutional neural networks, Fan et al""" ;
skos:prefLabel "3D Convolution" .
:3DDynamicSceneGraph a skos:Concept ;
dcterms:source ;
skos:definition "**3D Dynamic Scene Graph**, or **DSG**, is a representation that captures metric and semantic aspects of a dynamic environment. A DSG is a layered graph where nodes represent spatial concepts at different levels of abstraction, and edges represent spatio-temporal relations among nodes." ;
skos:prefLabel "3D Dynamic Scene Graph" .
:3DIS a skos:Concept ;
dcterms:source ;
skos:altLabel "3-dimensional interaction space" ;
skos:definition """A **trainable 3D interaction space** aims to capture the associations between the triplet components and helps the model recognize multiple triplets in the same frame.\r
\r
Source: [Nwoye et al.](https://arxiv.org/pdf/2007.05405v1.pdf)\r
\r
Image source: [Nwoye et al.](https://arxiv.org/pdf/2007.05405v1.pdf)""" ;
skos:prefLabel "3DIS" .
:3DResNet-RS a skos:Concept ;
dcterms:source ;
skos:definition """**3D ResNet-RS** is an architecture and scaling strategy for 3D ResNets for video recognition. The key additions are:\r
\r
- **3D ResNet-D stem**: The [ResNet-D](https://paperswithcode.com/method/resnet-d) stem is adapted to 3D inputs by using three consecutive [3D convolutional layers](https://paperswithcode.com/method/3d-convolution). The first convolutional layer employs a temporal kernel size of 5 while the remaining two convolutional layers employ a temporal kernel size of 1.\r
\r
- **3D Squeeze-and-Excitation**: [Squeeze-and-Excite](https://paperswithcode.com/method/squeeze-and-excitation-block) is adapted to spatio-temporal inputs by using a 3D [global average pooling](https://paperswithcode.com/method/global-average-pooling) operation for the squeeze operation. A SE ratio of 0.25 is applied in each 3D bottleneck block for all experiments.\r
\r
- **Self-gating**: A self-gating module is used in each 3D bottleneck block after the SE module.""" ;
skos:prefLabel "3D ResNet-RS" .
:3DSA a skos:Concept ;
dcterms:source ;
skos:altLabel "3 Dimensional Soft Attention" ;
skos:definition "" ;
skos:prefLabel "3D SA" .
:3DSSD a skos:Concept ;
dcterms:source ;
skos:definition "**3DSSD** is a point-based, single-stage 3D object detector. In this paradigm, all upsampling layers and the refinement stage, which are indispensable in all existing point-based methods, are abandoned to reduce the large computation cost. The authors propose a fusion sampling strategy in the downsampling process to make detection on less representative points feasible. A delicate box prediction network, including a candidate generation layer and an anchor-free regression head with a 3D center-ness assignment strategy, is designed to meet the demands of accuracy and speed." ;
skos:prefLabel "3DSSD" .
a skos:Concept ;
dcterms:source ;
skos:altLabel "Four-dimensional A-star" ;
skos:definition "The aim of 4D A* is to find the shortest path between two four-dimensional (4D) nodes of a 4D search space - a starting node and a target node - as long as a path exists. It achieves both optimality and completeness: the former because the returned path is the shortest possible, and the latter because, if a solution exists, the algorithm is guaranteed to find it." ;
skos:prefLabel "4D A*" .
:A2C a skos:Concept ;
dcterms:source ;
skos:definition """**A2C**, or **Advantage Actor Critic**, is a synchronous version of the [A3C](https://paperswithcode.com/method/a3c) policy gradient method. As an alternative to the asynchronous implementation of A3C, A2C is a synchronous, deterministic implementation that waits for each actor to finish its segment of experience before updating, averaging over all of the actors. This more effectively uses GPUs due to larger batch sizes.\r
\r
Image Credit: [OpenAI Baselines](https://openai.com/blog/baselines-acktr-a2c/)""" ;
skos:prefLabel "A2C" .
:A3C a skos:Concept ;
dcterms:source ;
skos:definition """**A3C**, **Asynchronous Advantage Actor Critic**, is a policy gradient algorithm in reinforcement learning that maintains a policy $\\pi\\left(a\\_{t}\\mid{s}\\_{t}; \\theta\\right)$ and an estimate of the value\r
function $V\\left(s\\_{t}; \\theta\\_{v}\\right)$. It operates in the forward view and uses a mix of $n$-step returns to update both the policy and the value-function. The policy and the value function are updated after every $t\\_{\\text{max}}$ actions or when a terminal state is reached. The update performed by the algorithm can be seen as $\\nabla\\_{\\theta{'}}\\log\\pi\\left(a\\_{t}\\mid{s\\_{t}}; \\theta{'}\\right)A\\left(s\\_{t}, a\\_{t}; \\theta, \\theta\\_{v}\\right)$ where $A\\left(s\\_{t}, a\\_{t}; \\theta, \\theta\\_{v}\\right)$ is an estimate of the advantage function given by:\r
\r
$$\\sum^{k-1}\\_{i=0}\\gamma^{i}r\\_{t+i} + \\gamma^{k}V\\left(s\\_{t+k}; \\theta\\_{v}\\right) - V\\left(s\\_{t}; \\theta\\_{v}\\right)$$\r
\r
where $k$ can vary from state to state and is upper-bounded by $t\\_{max}$.\r
\r
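The advantage estimate above can be sketched as a small helper (an illustrative sketch, not from the source; all names are assumptions):

```python
def nstep_advantage(rewards, value_s_t, bootstrap_value, gamma=0.99):
    # rewards: [r_t, ..., r_{t+k-1}] collected over k steps.
    # value_s_t: V(s_t); bootstrap_value: V(s_{t+k}) from the critic.
    k = len(rewards)
    n_step_return = sum(gamma**i * r for i, r in enumerate(rewards))
    n_step_return += gamma**k * bootstrap_value
    return n_step_return - value_s_t

# With gamma = 1: return is 1 + 1 + 1 + 0.5, baseline V(s_t) = 2.0.
print(nstep_advantage([1.0, 1.0, 1.0], 2.0, 0.5, gamma=1.0))  # 1.5
```
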
The critics in A3C learn the value function while multiple actors are trained in parallel and get synced with global parameters every so often. The gradients are accumulated as part of training for stability - this is like parallelized stochastic gradient descent.\r
\r
Note that while the parameters $\\theta$ of the policy and $\\theta\\_{v}$ of the value function are shown as being separate for generality, we always share some of the parameters in practice. We typically use a convolutional neural network that has one [softmax](https://paperswithcode.com/method/softmax) output for the policy $\\pi\\left(a\\_{t}\\mid{s}\\_{t}; \\theta\\right)$ and one linear output for the value function $V\\left(s\\_{t}; \\theta\\_{v}\\right)$, with all non-output layers shared.""" ;
skos:prefLabel "A3C" .
:ABC a skos:Concept ;
dcterms:source ;
skos:altLabel "Approximate Bayesian Computation" ;
skos:definition """A class of methods in Bayesian statistics where the posterior distribution is approximated via a rejection scheme over simulations, because the likelihood function is intractable.\r
\r
Different parameters get sampled and simulated. Then a distance function is calculated to measure the quality of the simulation compared to data from real observations. Only simulations that fall below a certain threshold get accepted.\r
\r
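The sample-simulate-reject loop described above can be sketched as follows (illustrative only; the `abc_rejection` helper and the toy Gaussian example are assumptions, not from the source):

```python
import random

def abc_rejection(prior_sample, simulate, distance, observed, eps, n_accept):
    # Keep prior draws whose simulated data fall within eps of the observation.
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_sample()
        if distance(simulate(theta), observed) < eps:
            accepted.append(theta)
    return accepted

# Toy example: infer the mean of a noisy measurement without a likelihood.
random.seed(0)
posterior = abc_rejection(
    prior_sample=lambda: random.uniform(-5, 5),
    simulate=lambda t: t + random.gauss(0, 0.1),
    distance=lambda a, b: abs(a - b),
    observed=1.0,
    eps=0.2,
    n_accept=200,
)
print(sum(posterior) / len(posterior))  # concentrates near 1.0
```
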
Image source: [Kulkarni et al.](https://www.umass.edu/nanofabrics/sites/default/files/PDF_0.pdf)""" ;
skos:prefLabel "ABC" .
:ABCNet a skos:Concept ;
dcterms:source ;
skos:altLabel "Adaptive Bezier-Curve Network" ;
skos:definition "**Adaptive Bezier-Curve Network**, or **ABCNet**, is an end-to-end framework for arbitrarily-shaped scene text spotting. It adaptively fits arbitrarily-shaped text with a parameterized Bezier curve. It also utilizes a feature alignment layer, [BezierAlign](https://paperswithcode.com/method/bezieralign), to calculate convolutional features of text instances in curved shapes. These features are then passed to a light-weight recognition head." ;
skos:prefLabel "ABCNet" .
:ACER a skos:Concept ;
dcterms:source ;
skos:definition """**ACER**, or **Actor Critic with Experience Replay**, is an actor-critic deep reinforcement learning agent with [experience replay](https://paperswithcode.com/method/experience-replay). It can be seen as an off-policy extension of [A3C](https://paperswithcode.com/method/a3c), where the off-policy estimator is made feasible by:\r
\r
- Using [Retrace](https://paperswithcode.com/method/retrace) Q-value estimation.\r
- Using truncated importance sampling with bias correction.\r
- Using a trust region policy optimization method.\r
- Using a [stochastic dueling network](https://paperswithcode.com/method/stochastic-dueling-network) architecture.""" ;
skos:prefLabel "ACER" .
:ACGPN a skos:Concept ;
dcterms:source ;
skos:altLabel "Adaptive Content Generating and Preserving Network" ;
skos:definition """**ACGPN**, or **Adaptive Content Generating and Preserving Network**, is a [generative adversarial network](https://www.paperswithcode.com/method/category/generative-adversarial-network) for virtual try-on clothing applications. \r
\r
In Step I, the Semantic Generation Module (SGM) takes the target clothing image $\\mathcal{T}\\_{c}$, the pose map $\\mathcal{M}\\_{p}$, and the fused body part mask $\\mathcal{M}^{F}$ as the input to predict the semantic layout and to output the synthesized body part mask $\\mathcal{M}^{S}\\_{\\omega}$ and the target clothing mask $\\mathcal{M}^{S}\\_{c}$.\r
\r
In Step II, the Clothes Warping Module (CWM) warps the target clothing image to $\\mathcal{T}^{R}\\_{c}$ according to the predicted semantic layout, where a second-order difference constraint is introduced to stabilize the warping process. \r
\r
In Steps III and IV, the Content Fusion Module (CFM) first produces the composited body part mask $\\mathcal{M}^{C}\\_{\\omega}$ using the original clothing mask $\\mathcal{M}\\_{c}$, the synthesized clothing mask $\\mathcal{M}^{S}\\_{c}$, the body part mask $\\mathcal{M}\\_{\\omega}$, and the synthesized body part mask $\\mathcal{M}\\_{\\omega}^{S}$, and then exploits a fusion network to generate the try-on images $\\mathcal{I}^{S}$ by utilizing the information $\\mathcal{T}^{R}\\_{c}$, $\\mathcal{M}^{S}\\_{c}$, and the body part image $I\\_{\\omega}$ from previous steps.""" ;
skos:prefLabel "ACGPN" .
:ACTKR a skos:Concept ;
dcterms:source ;
skos:definition """**ACKTR**, or **Actor Critic with Kronecker-factored Trust Region**, is an actor-critic method for reinforcement learning that applies [trust region optimization](https://paperswithcode.com/method/trpo) using a recently proposed Kronecker-factored approximation to the curvature. The method extends the framework of natural policy gradient and optimizes both the actor and the critic using Kronecker-factored approximate\r
curvature (K-FAC) with trust region.""" ;
skos:prefLabel "ACTKR" .
:ADELE a skos:Concept ;
dcterms:source ;
skos:altLabel "Adaptive Early-Learning Correction" ;
skos:definition "**ADELE**, or **Adaptive Early-Learning Correction**, is a method for segmentation from noisy annotations." ;
skos:prefLabel "ADELE" .
:ADMM a skos:Concept ;
skos:altLabel "Alternating Direction Method of Multipliers" ;
skos:definition """The **alternating direction method of multipliers** (**ADMM**) is an algorithm that solves convex optimization problems by breaking them into smaller pieces, each of which is then easier to handle. It takes the form of a decomposition-coordination procedure, in which the solutions to small\r
local subproblems are coordinated to find a solution to a large global problem. ADMM can be viewed as an attempt to blend the benefits of dual decomposition and augmented Lagrangian methods for constrained optimization. It turns out to be equivalent or closely related to many other algorithms\r
as well, such as Douglas-Rachford splitting from numerical analysis, Spingarn’s method of partial inverses, Dykstra’s alternating projections method, Bregman iterative algorithms for l1 problems in signal processing, proximal methods, and many others.\r
\r
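As a minimal illustration of the decomposition idea (not from the text source; the scalar lasso split and all names below are assumptions), ADMM for minimizing 0.5(x - a)^2 + lam*|z| subject to x = z alternates an easy quadratic x-update, a soft-threshold z-update, and a dual update that coordinates them:

```python
def soft_threshold(v, t):
    return max(v - t, 0.0) - max(-v - t, 0.0)

def admm_scalar_lasso(a, lam, rho=1.0, iters=100):
    # Split: f(x) = 0.5*(x - a)**2, g(z) = lam*|z|, constraint x == z.
    # Each subproblem is easy on its own; the dual variable u coordinates them.
    x = z = u = 0.0
    for _ in range(iters):
        x = (a + rho * (z - u)) / (1.0 + rho)   # minimize f plus coupling term
        z = soft_threshold(x + u, lam / rho)    # minimize g plus coupling term
        u += x - z                              # dual ascent on x == z
    return z

print(admm_scalar_lasso(3.0, lam=1.0))  # converges to 2.0
```
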
Text Source: [https://stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf](https://stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf)\r
\r
Image Source: [here](https://www.slideshare.net/derekcypang/alternating-direction)""" ;
skos:prefLabel "ADMM" .
:AE a skos:Concept ;
skos:altLabel "Autoencoders" ;
skos:definition """An **autoencoder** is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”. Along with the reduction side, a reconstructing side is learnt, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name. \r
\r
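As a minimal concrete instance (an illustrative sketch, not from the source: a *linear* autoencoder, whose optimal encoder/decoder pair is given in closed form by PCA via the SVD; all names and sizes below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 8))  # 8-D data on a 3-D subspace

# For a linear autoencoder, the optimal rank-k encoding/reconstruction pair
# is spanned by the top-k principal directions, so a 3-D code reconstructs
# this data (almost) exactly.
mean = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
k = 3

def encode(x):
    return (x - mean) @ Vt[:k].T   # 8-D input -> 3-D code

def decode(z):
    return z @ Vt[:k] + mean       # 3-D code -> 8-D reconstruction

X_hat = decode(encode(X))
print(np.mean((X - X_hat) ** 2))   # reconstruction error near 0
```
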
Extracted from: [Wikipedia](https://en.wikipedia.org/wiki/Autoencoder)\r
\r
Image source: [Wikipedia](https://en.wikipedia.org/wiki/Autoencoder#/media/File:Autoencoder_schema.png)""" ;
skos:prefLabel "AE" .
:AEDA a skos:Concept ;
dcterms:source ;
skos:altLabel "An Easier Data Augmentation" ;
skos:definition "**AEDA**, or **An Easier Data Augmentation**, is a data augmentation technique for text classification that only inserts various punctuation marks into the input sequence. AEDA preserves all the input information and does not mislead the network, since it keeps the word order intact while only shifting word positions to the right." ;
skos:prefLabel "AEDA" .
:AGCN a skos:Concept ;
dcterms:source ;
skos:altLabel "Adaptive Graph Convolutional Neural Networks" ;
skos:definition """AGCN is a novel spectral graph convolution network that feeds on original data of diverse graph structures.\r
\r
Image credit: [Adaptive Graph Convolutional Neural Networks](https://arxiv.org/pdf/1801.03226.pdf)""" ;
skos:prefLabel "AGCN" .
:AHAF a skos:Concept ;
dcterms:source ;
skos:altLabel "Adaptive Hybrid Activation Function" ;
skos:definition "Trainable activation function as a sigmoid-based generalization of ReLU, Swish and SiLU." ;
skos:prefLabel "AHAF" .
:ALAE a skos:Concept ;
dcterms:source ;
skos:altLabel "Adversarial Latent Autoencoder" ;
skos:definition "**ALAE**, or **Adversarial Latent Autoencoder**, is a type of autoencoder that attempts to overcome some of the limitations of [generative adversarial networks](https://paperswithcode.com/paper/generative-adversarial-networks). The architecture allows the latent distribution to be learned from data to address entanglement (A). The output data distribution is learned with an adversarial strategy (B). Thus, we retain the generative properties of GANs, as well as the ability to build on the recent advances in this area. For instance, we can include independent sources of stochasticity, which have proven essential for generating image details, or can leverage recent improvements on GAN loss functions, regularization, and hyperparameter tuning. Finally, to implement (A) and (B), AE reciprocity is imposed in the latent space (C). Therefore, we can avoid using reconstruction losses based on a simple $\\mathcal{l}\\_{2}$ norm that operates in data space, where they are often suboptimal, like for the image space. Since it works on the latent space, rather than autoencoding the data space, the approach is named Adversarial Latent Autoencoder (ALAE)." ;
skos:prefLabel "ALAE" .
:ALBEF a skos:Concept ;
dcterms:source ;
skos:definition "ALBEF introduces a contrastive loss to align the image and text representations before fusing them through cross-modal attention. This enables more grounded vision and language representation learning. ALBEF also doesn't require bounding box annotations. The model consists of an image encoder, a text encoder, and a multimodal encoder. The image-text contrastive loss helps to align the unimodal representations of an image-text pair before fusion. The image-text matching loss and a masked language modeling loss are applied to learn multimodal interactions between image and text. In addition, momentum distillation is used to generate pseudo-targets. This improves learning with noisy data." ;
skos:prefLabel "ALBEF" .
:ALBERT a skos:Concept ;
dcterms:source ;
skos:definition """**ALBERT** is a [Transformer](https://paperswithcode.com/method/transformer) architecture based on [BERT](https://paperswithcode.com/method/bert) but with far fewer parameters. It achieves this through two parameter reduction techniques. The first is a factorized embedding parameterization. By decomposing the large vocabulary embedding matrix into two small matrices, the size of the hidden layers is separated from the size of the vocabulary embedding. This makes it easier to grow the hidden size without significantly increasing the parameter size of the vocabulary embeddings. The second technique is cross-layer parameter sharing. This technique prevents the parameter count from growing with the depth of the network. \r
\r
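The saving from the factorized embedding parameterization can be checked with a quick back-of-the-envelope computation (the sizes below are illustrative assumptions, not taken from the text):

```python
# Untied embedding: one V x H matrix. Factorized: V x E plus E x H, with E << H.
V, E, H = 30000, 128, 4096   # vocab, embedding and hidden sizes (illustrative)
full = V * H
factorized = V * E + E * H
print(full, factorized)      # 122880000 4364288
```
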
Additionally, ALBERT utilises a self-supervised loss for sentence-order prediction (SOP). SOP primary focuses on inter-sentence coherence and is designed to address the ineffectiveness of the next sentence prediction (NSP) loss proposed in the original BERT.""" ;
skos:prefLabel "ALBERT" .
:ALCN a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:altLabel "Adaptive Locally Connected Neuron" ;
skos:definition """The **Adaptive Locally Connected Neuron (ALCN)** is a topology-aware, locally adaptive neuron:\r
\r
$$a = f\\:\\Bigg( \\sum_{i=1}^{m} w_{i}\\phi\\left( \\tau\\left(i\\right),\\Theta\\right) x_{i} + b \\Bigg)$$""" ;
skos:prefLabel "ALCN" .
:ALDA a skos:Concept ;
dcterms:source ;
skos:definition "**Adversarial-Learned Loss for Domain Adaptation** is a method for domain adaptation that combines adversarial learning with self-training. Specifically, the domain discriminator has to produce different corrected labels for different domains, while the feature generator aims to confuse the domain discriminator. The adversarial process finally leads to a proper confusion matrix on the target domain. In this way, ALDA takes the strengths of domain-adversarial learning and self-training based methods." ;
skos:prefLabel "ALDA" .
:ALDEN a skos:Concept ;
dcterms:source ;
skos:definition """**ALDEN**, or **Active Learning with DivErse iNterpretations**, is an active learning approach for text classification. With local interpretations in DNNs, ALDEN identifies linearly separable regions of samples. Then, it selects samples according to their diversity of local interpretations and queries their labels.\r
\r
Specifically, we first calculate the local interpretations in DNN for each sample as the gradient backpropagated from the final\r
predictions to the input features. Then, we use the most diverse interpretation of words in a sample to measure its diverseness. Accordingly, we select unlabeled samples with the maximally diverse interpretations for labeling and retrain the model with these\r
labeled samples.""" ;
skos:prefLabel "ALDEN" .
:ALI a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:altLabel "Adversarially Learned Inference" ;
skos:definition """**Adversarially Learned Inference (ALI)** is a generative modelling approach that casts the learning of both an inference machine (or encoder) and a deep directed generative model (or decoder) in a GAN-like adversarial framework. A discriminator is trained to discriminate joint samples of the data and the corresponding latent variable from the encoder (or approximate posterior) from joint samples from the decoder, while in opposition, the encoder and the decoder are trained together to fool the discriminator. Not only is the discriminator asked to distinguish synthetic samples from real data, but it is required to distinguish between two joint distributions over the data space and the latent variables.\r
\r
An ALI differs from a [GAN](https://paperswithcode.com/method/gan) in two ways:\r
\r
- The generator has two components: the encoder, $G\\_{z}\\left(\\mathbf{x}\\right)$, which maps data samples $x$ to $z$-space, and the decoder $G\\_{x}\\left(\\mathbf{z}\\right)$, which maps samples from the prior $p\\left(\\mathbf{z}\\right)$ (a source of noise) to the input space.\r
- The discriminator is trained to distinguish between joint pairs $\\left(\\mathbf{x}, \\tilde{\\mathbf{z}} = G\\_{\\mathbf{x}}\\left(\\mathbf{x}\\right)\\right)$ and $\\left(\\tilde{\\mathbf{x}} =\r
G\\_{x}\\left(\\mathbf{z}\\right), \\mathbf{z}\\right)$, as opposed to marginal samples $\\mathbf{x} \\sim q\\left(\\mathbf{x}\\right)$ and $\\tilde{\\mathbf{x}} \\sim p\\left(\\mathbf{x}\\right)$.""" ;
skos:prefLabel "ALI" .
:ALIGN a skos:Concept ;
dcterms:source ;
skos:definition "In the ALIGN method, visual and language representations are jointly trained from noisy image alt-text data. The image and text encoders are learned via a contrastive loss (formulated as a normalized softmax) that pushes the embeddings of matched image-text pairs together and pushes those of non-matched image-text pairs apart. The model learns to align the visual and language representations of image and text pairs using the contrastive loss. The representations can be used for vision-only or vision-language task transfer. Without any fine-tuning, ALIGN powers zero-shot visual classification and cross-modal search, including image-to-text search, text-to-image search, and even search with joint image+text queries." ;
skos:prefLabel "ALIGN" .
:ALIS a skos:Concept ;
dcterms:source ;
skos:altLabel "Aligning Latent and Image Spaces" ;
skos:definition "An infinite image generator which is based on a patch-wise, periodically equivariant generator." ;
skos:prefLabel "ALIS" .
:ALP-GMM a skos:Concept ;
dcterms:source ;
skos:altLabel "Absolute Learning Progress and Gaussian Mixture Models for Automatic Curriculum Learning" ;
skos:definition "ALP-GMM is an algorithm that learns to generate a learning curriculum for black-box reinforcement learning agents, whereby it sequentially samples parameters controlling a stochastic procedural generation of tasks or environments." ;
skos:prefLabel "ALP-GMM" .
:ALQandAMQ a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:altLabel "Gradient Quantization with Adaptive Levels/Multiplier" ;
skos:definition "Many communication-efficient variants of [SGD](https://paperswithcode.com/method/sgd) use gradient quantization schemes. These schemes are often heuristic and fixed over the course of training. We empirically observe that the statistics of gradients of deep models change during the training. Motivated by this observation, we introduce two adaptive quantization schemes, ALQ and AMQ. In both schemes, processors update their compression schemes in parallel by efficiently computing sufficient statistics of a parametric distribution. We improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups. Our adaptive methods are also significantly more robust to the choice of hyperparameters." ;
skos:prefLabel "ALQ and AMQ" .
:ALS a skos:Concept ;
dcterms:source ;
skos:altLabel "Adaptive Label Smoothing" ;
skos:definition "" ;
skos:prefLabel "ALS" .
:ALiBi a skos:Concept ;
dcterms:source ;
skos:altLabel "Attention with Linear Biases" ;
skos:definition """**ALiBi**, or **Attention with Linear Biases**, is a [positioning method](https://paperswithcode.com/methods/category/position-embeddings) that allows [Transformer](https://paperswithcode.com/methods/category/transformers) language models to consume, at inference time, sequences which are longer than the ones they were trained on. \r
\r
ALiBi does this without using actual position embeddings. Instead, when computing the attention between a certain key and query, ALiBi penalizes the attention value that that query can assign to the key depending on how far away the key and query are. So when a key and query are close by, the penalty is very low, and when they are far away, the penalty is very high. \r
\r
This method was motivated by the simple reasoning that words that are close-by matter much more than ones that are far away.\r
\r
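The distance-proportional penalty can be sketched as a bias matrix added to the attention scores (illustrative only; in the paper each head uses a fixed, geometrically decreasing slope, and the `alibi_bias` helper below is an assumption):

```python
import numpy as np

def alibi_bias(seq_len, slope):
    # Zero penalty for the current position, and a penalty that grows
    # linearly with the query-key distance; future keys are masked out.
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    bias = -slope * (i - j).astype(float)
    bias[j > i] = -np.inf             # causal mask
    return bias

print(alibi_bias(4, 0.5))
```
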
This method is as fast as the sinusoidal or absolute embedding methods (the fastest positioning methods there are). It outperforms those methods and Rotary embeddings when evaluating sequences that are longer than the ones the model was trained on (this is known as extrapolation).""" ;
skos:prefLabel "ALiBi" .
:AM a skos:Concept ;
dcterms:source ;
skos:altLabel "Attention Model" ;
skos:definition "" ;
skos:prefLabel "AM" .
:AMP a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:altLabel "Adversarial Model Perturbation" ;
skos:definition "**Adversarial Model Perturbation (AMP)** is based on the understanding that flat local minima of the empirical risk cause the model to generalize better. AMP improves generalization by minimizing the **AMP loss**, which is obtained from the empirical risk by applying the **worst** norm-bounded perturbation at each point in the parameter space." ;
skos:prefLabel "AMP" .
:AMSBound a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**AMSBound** is a variant of the [AMSGrad](https://paperswithcode.com/method/amsgrad) stochastic optimizer which is designed to be more robust to extreme learning rates. Dynamic bounds are employed on learning rates, where the lower and upper bound are initialized as zero and infinity respectively, and they both smoothly converge to a constant final step size. AMSBound can be regarded as an adaptive method at the beginning of training, and it gradually and smoothly transforms to [SGD](https://paperswithcode.com/method/sgd) (or SGD with momentum) as the time step increases. \r
\r
$$ g\\_{t} = \\nabla{f}\\_{t}\\left(x\\_{t}\\right) $$\r
\r
$$ m\\_{t} = \\beta\\_{1t}m\\_{t-1} + \\left(1-\\beta\\_{1t}\\right)g\\_{t} $$\r
\r
$$ v\\_{t} = \\beta\\_{2}v\\_{t-1} + \\left(1-\\beta\\_{2}\\right)g\\_{t}^{2}$$\r
\r
$$ \\hat{v}\\_{t} = \\max\\left(\\hat{v}\\_{t-1}, v\\_{t}\\right) \\text{ and } V\\_{t} = \\text{diag}\\left(\\hat{v}\\_{t}\\right) $$\r
\r
$$ \\eta = \\text{Clip}\\left(\\alpha/\\sqrt{V\\_{t}}, \\eta\\_{l}\\left(t\\right), \\eta\\_{u}\\left(t\\right)\\right) \\text{ and } \\eta\\_{t} = \\eta/\\sqrt{t} $$\r
\r
$$ x\\_{t+1} = \\Pi\\_{\\mathcal{F}, \\text{diag}\\left(\\eta\\_{t}^{-1}\\right)}\\left(x\\_{t} - \\eta\\_{t} \\odot m\\_{t} \\right) $$\r
\r
Where $\\alpha$ is the initial step size, and $\\eta_{l}$ and $\\eta_{u}$ are the lower and upper bound functions respectively.""" ;
skos:prefLabel "AMSBound" .
:AMSGrad a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**AMSGrad** is a stochastic optimization method that seeks to fix a convergence issue with [Adam](https://paperswithcode.com/method/adam) based optimizers. AMSGrad uses the maximum of past squared gradients \r
$v\\_{t}$ rather than the exponential average to update the parameters:\r
\r
$$m\\_{t} = \\beta\\_{1}m\\_{t-1} + \\left(1-\\beta\\_{1}\\right)g\\_{t} $$\r
\r
$$v\\_{t} = \\beta\\_{2}v\\_{t-1} + \\left(1-\\beta\\_{2}\\right)g\\_{t}^{2}$$\r
\r
$$ \\hat{v}\\_{t} = \\max\\left(\\hat{v}\\_{t-1}, v\\_{t}\\right) $$\r
\r
$$\\theta\\_{t+1} = \\theta\\_{t} - \\frac{\\eta}{\\sqrt{\\hat{v}_{t}} + \\epsilon}m\\_{t}$$""" ;
skos:prefLabel "AMSGrad" .
:APPO a skos:Concept ;
dcterms:source ;
skos:altLabel "Asynchronous Proximal Policy Optimization" ;
skos:definition "" ;
skos:prefLabel "APPO" .
:ARCH a skos:Concept ;
dcterms:source ;
skos:altLabel "Animatable Reconstruction of Clothed Humans" ;
skos:definition "**Animatable Reconstruction of Clothed Humans** is an end-to-end framework for accurate reconstruction of animation-ready 3D clothed humans from a monocular image. ARCH is a learned pose-aware model that produces detailed 3D rigged full-body human avatars from a single unconstrained RGB image. A Semantic Space and a Semantic Deformation Field are created using a parametric 3D body estimator. They allow the transformation of 2D/3D clothed humans into a canonical space, reducing ambiguities in geometry caused by pose variations and occlusions in training data. Detailed surface geometry and appearance are learned using an implicit function representation with spatial local features." ;
skos:prefLabel "ARCH" .
:ARM-Net a skos:Concept ;
dcterms:source ;
skos:definition "**ARM-Net** is an adaptive relation modeling network tailored for structured data, accompanied by ARMOR, a lightweight framework for relational data analytics built on ARM-Net. The key idea is to model feature interactions with cross features selectively and dynamically: the input features are first transformed into exponential space, and the interaction order and interaction weights are then determined adaptively for each cross feature. A novel sparse attention mechanism dynamically generates the interaction weights given the input tuple, so that cross features of arbitrary orders can be modeled explicitly while noisy features are filtered out selectively. During model inference, ARM-Net can specify the cross features being used for each prediction, for higher accuracy and better interpretability." ;
skos:prefLabel "ARM-Net" .
:ARMA a skos:Concept ;
dcterms:source ;
skos:altLabel "ARMA GNN" ;
skos:definition "The ARMA GNN layer implements a rational graph filter with a recursive approximation." ;
skos:prefLabel "ARMA" .
:ARShoe a skos:Concept ;
dcterms:source ;
skos:definition "**ARShoe** is a multi-branch network for pose estimation and segmentation tackling the \"try-on\" problem for augmented reality shoes. Consisting of an encoder and a decoder, the multi-branch network is trained to predict a keypoints [heatmap](https://paperswithcode.com/method/heatmap) (heatmap), a [PAFs](https://paperswithcode.com/method/pafs) heatmap (pafmap), and segmentation results (segmap) simultaneously. Post-processing is then performed to produce a smooth and realistic virtual try-on." ;
skos:prefLabel "ARShoe" .
:ARiA a skos:Concept ;
dcterms:source ;
skos:altLabel "Adaptive Richard's Curve Weighted Activation" ;
skos:definition "This work introduces a novel activation unit that can be efficiently employed in deep neural nets (DNNs) and performs significantly better than the traditional Rectified Linear Units ([ReLU](https://paperswithcode.com/method/relu)). The function developed is a two parameter version of the specialized Richard's Curve and we call it Adaptive Richard's Curve weighted Activation (ARiA). This function is non-monotonous, analogous to the newly introduced [Swish](https://paperswithcode.com/method/swish), however allows a precise control over its non-monotonous convexity by varying the hyper-parameters. We first demonstrate the mathematical significance of the two parameter ARiA followed by its application to benchmark problems such as MNIST, CIFAR-10 and CIFAR-100, where we compare the performance with ReLU and Swish units. Our results illustrate a significantly superior performance on all these datasets, making ARiA a potential replacement for ReLU and other activations in DNNs." ;
skos:prefLabel "ARiA" .
:ASAF a skos:Concept ;
dcterms:source ;
skos:altLabel "Adaptive Spline Activation Function" ;
skos:definition """Stefano Guarnieri, Francesco Piazza, and Aurelio Uncini \r
"Multilayer Feedforward Networks with Adaptive Spline Activation Function," \r
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 3, MAY 1999\r
\r
Abstract — In this paper, a new adaptive spline activation function neural network (ASNN) is presented. Due to the ASNN’s high representation capabilities, networks with a small number of interconnections can be trained to solve both pattern recognition and data processing real-time problems. The main idea is to use a Catmull–Rom cubic spline as the neuron’s activation function, which ensures a simple structure suitable for both software and hardware implementation. Experimental results demonstrate improvements in terms of generalization capability and of learning speed in both pattern recognition and data processing tasks.\r
\r
Index Terms — Adaptive activation functions, function shape autotuning, generalization, generalized sigmoidal functions, multilayer perceptron, neural networks, spline neural networks.""" ;
skos:prefLabel "ASAF" .
:ASFF a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:altLabel "Adaptively Spatial Feature Fusion" ;
skos:definition """**ASFF**, or **Adaptively Spatial Feature Fusion**, is a method for pyramidal feature fusion. It learns the way to spatially filter conflicting information to suppress inconsistency across different feature scales, thus improving the scale-invariance of features. \r
\r
ASFF enables the network to directly learn how to spatially filter features at other levels so that only useful information is kept for combination. For the features at a certain level, features of other levels are first integrated and resized into the same resolution and then trained to find the optimal fusion. At each spatial location, features at different levels are fused adaptively, *i.e.*, some features may be filtered out as they carry contradictory information at this location and some may dominate with more discriminative clues. ASFF offers several advantages: (1) as the operation of searching the optimal fusion is differentiable, it can be conveniently learned in back-propagation; (2) it is agnostic to the backbone model and can be applied to single-shot detectors that have a feature pyramid structure; and (3) its implementation is simple and the increased computational cost is marginal.\r
\r
Let $\\mathbf{x}_{ij}^{n\\rightarrow l}$ denote the feature vector at the position $(i,j)$ on the feature maps resized from level $n$ to level $l$. Following a feature resizing stage, we fuse the features at the corresponding level $l$ as follows:\r
\r
$$\r
\\mathbf{y}\\_{ij}^l = \\alpha^l_{ij} \\cdot \\mathbf{x}\\_{ij}^{1\\rightarrow l} + \\beta^l_{ij} \\cdot \\mathbf{x}\\_{ij}^{2\\rightarrow l} +\\gamma^l\\_{ij} \\cdot \\mathbf{x}\\_{ij}^{3\\rightarrow l},\r
$$\r
\r
where $\\mathbf{y}\\_{ij}^l$ implies the $(i,j)$-th vector of the output feature maps $\\mathbf{y}^l$ among channels. $\\alpha^l\\_{ij}$, $\\beta^l\\_{ij}$ and $\\gamma^l\\_{ij}$ refer to the spatial importance weights for the feature maps at three different levels to level $l$, which are adaptively learned by the network. Note that $\\alpha^l\\_{ij}$, $\\beta^l\\_{ij}$ and $\\gamma^l\\_{ij}$ can be simple scalar variables, which are shared across all the channels. Inspired by acnet, we force $\\alpha^l\\_{ij}+\\beta^l\\_{ij}+\\gamma^l\\_{ij}=1$ and $\\alpha^l\\_{ij},\\beta^l\\_{ij},\\gamma^l\\_{ij} \\in [0,1]$, and \r
\r
$$\r
\\alpha^l\\_{ij} = \\frac{e^{\\lambda^l\\_{\\alpha\\_{ij}}}}{e^{\\lambda^l\\_{\\alpha\\_{ij}}} + e^{\\lambda^l\\_{\\beta\\_{ij}}} + e^{\\lambda^l\\_{\\gamma\\_{ij}}}}.\r
$$\r
\r
Here $\\alpha^l\\_{ij}$, $\\beta^l\\_{ij}$ and $\\gamma^l\\_{ij}$ are defined by using the [softmax](https://paperswithcode.com/method/softmax) function with $\\lambda^l\\_{\\alpha_{ij}}$, $\\lambda^l\\_{\\beta_{ij}}$ and $\\lambda^l\\_{\\gamma_{ij}}$ as control parameters respectively. We use $1\\times1$ [convolution](https://paperswithcode.com/method/convolution) layers to compute the weight scalar maps $\\mathbf{\\lambda}^l_\\alpha$, $\\mathbf{\\lambda}^l\\_\\beta$ and $\\mathbf{\\lambda}^l\\_\\gamma$ from $\\mathbf{x}^{1\\rightarrow l}$, $\\mathbf{x}^{2\\rightarrow l}$ and $\\mathbf{x}^{3\\rightarrow l}$ respectively, and they can thus be learned through standard back-propagation.\r
\r
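The fusion at a single level can be sketched as follows (a hypothetical NumPy illustration; `lam_a`, `lam_b`, `lam_c` stand in for the outputs of the three $1\\times1$ convolutions, and shapes are assumed):

```python
import numpy as np

def asff_fuse(x1, x2, x3, lam_a, lam_b, lam_c):
    # x1, x2, x3: level-1..3 features resized to level l, shape (C, H, W)
    # lam_a, lam_b, lam_c: control-parameter maps from 1x1 convs, shape (H, W)
    lams = np.stack([lam_a, lam_b, lam_c])             # (3, H, W)
    lams = lams - lams.max(axis=0, keepdims=True)      # numerically stable softmax
    w = np.exp(lams) / np.exp(lams).sum(axis=0, keepdims=True)
    alpha, beta, gamma = w                             # sum to 1 at every (i, j)
    return alpha * x1 + beta * x2 + gamma * x3         # broadcast over channels
```
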
With this method, the features at all the levels are adaptively aggregated at each scale. The outputs are used for object detection following the same pipeline of [YOLOv3](https://paperswithcode.com/method/yolov3).""" ;
skos:prefLabel "ASFF" .
:ASLFeat a skos:Concept ;
dcterms:source ;
skos:definition "**ASLFeat** is a convolutional neural network for learning local features that uses deformable convolutional networks to densely estimate and apply local transformation. It also takes advantage of the inherent feature hierarchy to restore spatial resolution and low-level details for accurate keypoint localization. Finally, it uses a peakiness measurement to relate feature responses and derive more indicative detection scores." ;
skos:prefLabel "ASLFeat" .
:ASPP a skos:Concept ;
dcterms:source ;
skos:altLabel "Atrous Spatial Pyramid Pooling" ;
skos:definition "**Atrous Spatial Pyramid Pooling (ASPP)** is a semantic segmentation module for resampling a given feature layer at multiple rates prior to [convolution](https://paperswithcode.com/method/convolution). This amounts to probing the original image with multiple filters that have complementary effective fields of view, thus capturing objects as well as useful image context at multiple scales. Rather than actually resampling features, the mapping is implemented using multiple parallel atrous convolutional layers with different sampling rates." ;
skos:prefLabel "ASPP" .
:ASU a skos:Concept ;
dcterms:source ;
skos:altLabel "Amplifying Sine Unit" ;
skos:definition "The **Amplifying Sine Unit (ASU)** is an oscillatory activation function introduced in the 2023 paper \"Amplifying Sine Unit: An Oscillatory Activation Function for Deep Neural Networks to Recover Nonlinear Oscillations Efficiently\"." ;
skos:prefLabel "ASU" .
:ASVI a skos:Concept ;
dcterms:source ;
skos:altLabel "Automatic Structured Variational Inference" ;
skos:definition "**Automatic Structured Variational Inference (ASVI)** is a fully automated method for constructing structured variational families, inspired by the closed-form update in conjugate Bayesian models. These convex-update families incorporate the forward pass of the input probabilistic program and can therefore capture complex statistical dependencies. Convex-update families have the same space and time complexity as the input probabilistic program and are therefore tractable for a very large family of models including both continuous and discrete variables." ;
skos:prefLabel "ASVI" .
:ATMO a skos:Concept ;
dcterms:source ;
skos:altLabel "AdapTive Meta Optimizer" ;
skos:definition """**ATMO** (AdapTive Meta Optimizer) combines multiple optimization techniques such as [ADAM](https://paperswithcode.com/method/adam) and [SGD](https://paperswithcode.com/method/sgd), or PADAM; the method can be applied to any pair of optimizers.\r
\r
Source: [Combining Optimization Methods Using an Adaptive Meta Optimizer](https://www.mdpi.com/1999-4893/14/6/186)""" ;
skos:prefLabel "ATMO" .
:ATSS a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:altLabel "Adaptive Training Sample Selection" ;
skos:definition """**Adaptive Training Sample Selection**, or **ATSS**, is a method to automatically select positive and negative samples according to statistical characteristics of objects. It bridges the gap between anchor-based and anchor-free detectors. \r
\r
For each ground-truth box $g$ on the image, we first identify its candidate positive samples. As described in Lines $3$ to $6$, on each pyramid level, we select $k$ anchor boxes whose centers are closest to the center of $g$ based on L2 distance. Supposing there are $\\mathcal{L}$ feature pyramid levels, the ground-truth box $g$ will have $k\\times\\mathcal{L}$ candidate positive samples. After that, we compute the IoU between these candidates and the ground-truth $g$ as $\\mathcal{D}_g$ in Line $7$, whose mean and standard deviation are computed as $m_g$ and $v_g$ in Line $8$ and Line $9$. With these statistics, the IoU threshold for this ground-truth $g$ is obtained as $t_g=m_g+v_g$ in Line $10$. Finally, we select the candidates whose IoU is greater than or equal to the threshold $t_g$ as final positive samples in Lines $11$ to $15$. \r
\r
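A simplified sketch of this selection for a single ground-truth box (hypothetical NumPy code assuming axis-aligned boxes; the center-inside-box constraint and multi-assignment handling are omitted):

```python
import numpy as np

def iou(anchors, g):
    # anchors: (N, 4) boxes as (x1, y1, x2, y2); g: one box, shape (4,)
    ix1 = np.maximum(anchors[:, 0], g[0])
    iy1 = np.maximum(anchors[:, 1], g[1])
    ix2 = np.minimum(anchors[:, 2], g[2])
    iy2 = np.minimum(anchors[:, 3], g[3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (g[2] - g[0]) * (g[3] - g[1])
    return inter / (area_a + area_g - inter)

def atss_select(levels, g, k=9):
    # levels: list of (N_i, 4) anchor arrays, one array per pyramid level
    g_center = np.array([(g[0] + g[2]) / 2, (g[1] + g[3]) / 2])
    candidates = []
    for anchors in levels:
        centers = np.stack([(anchors[:, 0] + anchors[:, 2]) / 2,
                            (anchors[:, 1] + anchors[:, 3]) / 2], axis=1)
        d = np.linalg.norm(centers - g_center, axis=1)
        candidates.append(anchors[np.argsort(d)[:k]])  # k closest per level
    cand = np.concatenate(candidates)
    ious = iou(cand, g)
    t_g = ious.mean() + ious.std()                     # adaptive IoU threshold
    return cand[ious >= t_g]                           # final positive samples
```
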
Notably ATSS also limits the positive samples' center to the ground-truth box as shown in Line $12$. Besides, if an anchor box is assigned to multiple ground-truth boxes, the one with the highest IoU will be selected. The rest are negative samples.""" ;
skos:prefLabel "ATSS" .
:AUCC a skos:Concept ;
skos:altLabel "Area Under the ROC Curve for Clustering" ;
skos:definition "The area under the receiver operating characteristics (ROC) Curve, referred to as AUC, is a well-known performance measure in the supervised learning domain. Due to its compelling features, it has been employed in a number of studies to evaluate and compare the performance of different classifiers. In this work, we explore AUC as a performance measure in the unsupervised learning domain, more specifically, in the context of cluster analysis. In particular, we elaborate on the use of AUC as an internal/relative measure of clustering quality, which we refer to as Area Under the Curve for Clustering (AUCC). We show that the AUCC of a given candidate clustering solution has an expected value under a null model of random clustering solutions, regardless of the size of the dataset and, more importantly, regardless of the number or the (im)balance of clusters under evaluation. In addition, we elaborate on the fact that, in the context of internal/relative clustering validation as we consider, AUCC is actually a linear transformation of the Gamma criterion from Baker and Hubert (1975), for which we also formally derive a theoretical expected value for chance clusterings. We also discuss the computational complexity of these criteria and show that, while an ordinary implementation of Gamma can be computationally prohibitive and impractical for most real applications of cluster analysis, its equivalence with AUCC actually unveils a much more efficient algorithmic procedure. Our theoretical findings are supported by experimental results. These results show that, in addition to an effective and robust quantitative evaluation provided by AUCC, visual inspection of the ROC curves themselves can be useful to further assess a candidate clustering solution from a broader, qualitative perspective as well." ;
skos:prefLabel "AUCC" .
:AUCOResNet a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:altLabel "Auditory Cortex ResNet" ;
skos:definition "The **Auditory Cortex ResNet**, briefly AUCO ResNet, is a deep neural network architecture especially designed for end-to-end audio classification. It is inspired by the architectural organization of the rat's auditory cortex and includes further architectural innovations. The network outperforms state-of-the-art accuracies on a reference audio benchmark dataset without any kind of preprocessing, imbalanced-data handling or, most importantly, any kind of data augmentation." ;
skos:prefLabel "AUCO ResNet" .
:AVSlowFast a skos:Concept ;
dcterms:source ;
skos:altLabel "Audiovisual SlowFast Network" ;
skos:definition "**Audiovisual SlowFast Network**, or **AVSlowFast**, is an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are integrated with a Faster Audio pathway to model vision and sound in a unified representation. Audio and visual features are fused at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, [DropPathway](https://paperswithcode.com/method/droppathway) is used, which randomly drops the Audio pathway during training as an effective regularization technique. Inspired by prior studies in neuroscience, hierarchical audiovisual synchronization is performed to learn joint audiovisual features." ;
skos:prefLabel "AVSlowFast" .
:AWARE a skos:Concept ;
skos:altLabel "Attentive Walk-Aggregating Graph Neural Network" ;
skos:definition "We propose to theoretically and empirically examine the effect of incorporating weighting schemes into walk-aggregating GNNs. To this end, we propose a simple, interpretable, and end-to-end supervised GNN model, called AWARE (Attentive Walk-Aggregating GRaph Neural NEtwork), for graph-level prediction. AWARE aggregates the walk information by means of weighting schemes at distinct levels (vertex-, walk-, and graph-level) in a principled manner. By virtue of the incorporated weighting schemes at these different levels, AWARE can emphasize the information important for prediction while diminishing the irrelevant ones—leading to representations that can improve learning performance." ;
skos:prefLabel "AWARE" .
:AWD-LSTM a skos:Concept ;
dcterms:source ;
skos:altLabel "ASGD Weight-Dropped LSTM" ;
skos:definition "**ASGD Weight-Dropped LSTM**, or **AWD-LSTM**, is a type of recurrent neural network that employs [DropConnect](https://paperswithcode.com/method/dropconnect) for regularization, as well as [NT-ASGD](https://paperswithcode.com/method/nt-asgd) for optimization - non-monotonically triggered averaged [SGD](https://paperswithcode.com/method/sgd) - which returns an average of the weights from the most recent iterations. Additional regularization techniques employed include variable length backpropagation sequences, [variational dropout](https://paperswithcode.com/method/variational-dropout), [embedding dropout](https://paperswithcode.com/method/embedding-dropout), [weight tying](https://paperswithcode.com/method/weight-tying), independent embedding/hidden size, [activation regularization](https://paperswithcode.com/method/activation-regularization) and [temporal activation regularization](https://paperswithcode.com/method/temporal-activation-regularization)." ;
skos:prefLabel "AWD-LSTM" .
:AbsolutePositionEncodings a skos:Concept ;
dcterms:source ;
skos:definition """**Absolute Position Encodings** are a type of position embeddings for [Transformer](https://paperswithcode.com/method/transformer)-based models where positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d\\_{model}$ as the embeddings, so that the two can be summed. In the original implementation, sine and cosine functions of different frequencies are used:\r
\r
$$ \\text{PE}\\left(pos, 2i\\right) = \\sin\\left(pos/10000^{2i/d\\_{model}}\\right) $$\r
\r
$$ \\text{PE}\\left(pos, 2i+1\\right) = \\cos\\left(pos/10000^{2i/d\\_{model}}\\right) $$\r
\r
where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\\pi$ to $10000 \\cdot 2\\pi$. This function was chosen because the authors hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $\\text{PE}\\_{pos+k}$ can be represented as a linear function of $\\text{PE}\\_{pos}$.\r
\r
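The sinusoidal encodings can be generated directly from these formulas (a minimal NumPy sketch):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions: cosine
    return pe                                     # added to the input embeddings
```
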
""" ;
skos:prefLabel "Absolute Position Encodings" .
:AccoMontage a skos:Concept ;
dcterms:source ;
skos:definition "**AccoMontage** is a model for accompaniment arrangement, a type of music generation task involving intertwined constraints of melody, harmony, texture, and music structure. AccoMontage generates piano accompaniments for folk/pop songs based on a lead sheet (i.e. a melody with chord progression). It first retrieves phrase montages from a database while recombining them structurally using dynamic programming. Second, chords of the retrieved phrases are manipulated to match the lead sheet via style transfer. Lastly, the system offers controls over the generation process. In contrast to pure deep learning approaches, AccoMontage uses a hybrid pathway, in which rule-based optimization and deep learning are both leveraged." ;
skos:prefLabel "AccoMontage" .
:Accordion a skos:Concept ;
dcterms:source ;
skos:definition "**Accordion** is a gradient communication scheduling algorithm that is generic across models while imposing low computational overheads. Accordion inspects the change in the gradient norms to detect critical regimes and adjusts the communication schedule dynamically. Accordion works for both adjusting the gradient compression rate or the batch size without additional parameter tuning." ;
skos:prefLabel "Accordion" .
:AccumulatingEligibilityTrace a skos:Concept ;
skos:definition """An **Accumulating Eligibility Trace** is a type of [eligibility trace](https://paperswithcode.com/method/eligibility-trace) where the trace increments in an accumulative way. For the memory vector $\\textbf{e}\\_{t} \\in \\mathbb{R}^{b} \\geq \\textbf{0}$:\r
\r
$$\\mathbf{e\\_{0}} = \\textbf{0}$$\r
\r
$$\\textbf{e}\\_{t} = \\nabla{\\hat{v}}\\left(S\\_{t}, \\mathbf{\\theta}\\_{t}\\right) + \\gamma\\lambda\\textbf{e}\\_{t-1}$$""" ;
skos:prefLabel "Accumulating Eligibility Trace" .
:Accuracy-RobustnessArea\(ARA\) a skos:Concept ;
dcterms:source ;
skos:altLabel "Accuracy-Robustness Area" ;
skos:definition "In the space of adversarial perturbation against classifier accuracy, the ARA is the area between a classifier's curve and the straight line defined by a naive classifier's maximum accuracy. Intuitively, the ARA measures a combination of the classifier’s predictive power and its ability to overcome an adversary. Importantly, when contrasted against existing robustness metrics, the ARA takes into account the classifier’s performance against all adversarial examples, without bounding them by some arbitrary $\\epsilon$." ;
skos:prefLabel "Accuracy-Robustness Area (ARA)" .
:ActivationNormalization a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition "**Activation Normalization** is a type of normalization used for flow-based generative models; specifically it was introduced in the [GLOW](https://paperswithcode.com/method/glow) architecture. An ActNorm layer performs an affine transformation of the activations using a scale and bias parameter per channel, similar to [batch normalization](https://paperswithcode.com/method/batch-normalization). These parameters are initialized such that the post-actnorm activations per-channel have zero mean and unit variance given an initial minibatch of data. This is a form of data dependent initialization. After initialization, the scale and bias are treated as regular trainable parameters that are independent of the data." ;
skos:prefLabel "Activation Normalization" .
:ActivationRegularization a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**Activation Regularization (AR)**, or $L\\_{2}$ activation regularization, is regularization performed on activations as opposed to weights. It is usually used in conjunction with [RNNs](https://paperswithcode.com/methods/category/recurrent-neural-networks). It is defined as:\r
\r
$$\\alpha{L}\\_{2}\\left(m\\circ{h\\_{t}}\\right) $$\r
\r
where $m$ is a [dropout](https://paperswithcode.com/method/dropout) mask used by later parts of the model, $L\\_{2}$ is the $L\\_{2}$ norm, and $h_{t}$ is the output of an RNN at timestep $t$, and $\\alpha$ is a scaling coefficient. \r
\r
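As a concrete illustration (a hypothetical NumPy sketch; `m` is a dropout mask and `h` an RNN output at one timestep):

```python
import numpy as np

def activation_regularization(h, m, alpha=2.0):
    # alpha times the L2 norm of the elementwise product m * h
    return alpha * np.linalg.norm(m * h)
```
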
When applied to the output of a dense layer, AR penalizes activations that are substantially away from 0, encouraging activations to remain small.""" ;
skos:prefLabel "Activation Regularization" .
:ActiveConvolution a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition "An **Active Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) which does not have a fixed shape of the receptive field, and can be used to take more diverse forms of receptive fields for convolutions. Its shape can be learned through backpropagation during training. It can be seen as a generalization of convolution, with two advantages. First, it can define not only all conventional convolutions, but also convolutions with fractional pixel coordinates, so the shape of the convolution can be changed freely, which provides greater freedom to form CNN structures. Second, the shape of the convolution is learned while training and there is no need to tune it by hand." ;
skos:prefLabel "Active Convolution" .
:AdaBound a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**AdaBound** is a variant of the [Adam](https://paperswithcode.com/method/adam) stochastic optimizer which is designed to be more robust to extreme learning rates. Dynamic bounds are employed on learning rates, where the lower and upper bound are initialized as zero and infinity respectively, and they both smoothly converge to a constant final step size. AdaBound can be regarded as an adaptive method at the beginning of training, and thereafter it gradually and smoothly transforms to [SGD](https://paperswithcode.com/method/sgd) (or SGD with momentum) as the time step increases. \r
\r
$$ g\\_{t} = \\nabla{f}\\_{t}\\left(x\\_{t}\\right) $$\r
\r
$$ m\\_{t} = \\beta\\_{1t}m\\_{t-1} + \\left(1-\\beta\\_{1t}\\right)g\\_{t} $$\r
\r
$$ v\\_{t} = \\beta\\_{2}v\\_{t-1} + \\left(1-\\beta\\_{2}\\right)g\\_{t}^{2} \\text{ and } V\\_{t} = \\text{diag}\\left(v\\_{t}\\right) $$\r
\r
$$ \\hat{\\eta}\\_{t} = \\text{Clip}\\left(\\alpha/\\sqrt{V\\_{t}}, \\eta\\_{l}\\left(t\\right), \\eta\\_{u}\\left(t\\right)\\right) \\text{ and } \\eta\\_{t} = \\hat{\\eta}\\_{t}/\\sqrt{t} $$\r
\r
$$ x\\_{t+1} = \\Pi\\_{\\mathcal{F}, \\text{diag}\\left(\\eta\\_{t}^{-1}\\right)}\\left(x\\_{t} - \\eta\\_{t} \\odot m\\_{t} \\right) $$\r
\r
Where $\\alpha$ is the initial step size, and $\\eta_{l}$ and $\\eta_{u}$ are the lower and upper bound functions respectively.""" ;
skos:prefLabel "AdaBound" .
:AdaDelta a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**AdaDelta** is a stochastic optimization technique that provides a per-dimension learning rate for [SGD](https://paperswithcode.com/method/sgd). It is an extension of [Adagrad](https://paperswithcode.com/method/adagrad) that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, AdaDelta restricts the window of accumulated past gradients to a fixed size $w$.\r
\r
Instead of inefficiently storing $w$ previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average $E\\left[g^{2}\\right]\\_{t}$ at time step $t$ then depends only on the previous average and current gradient:\r
\r
$$E\\left[g^{2}\\right]\\_{t} = \\gamma{E}\\left[g^{2}\\right]\\_{t-1} + \\left(1-\\gamma\\right)g^{2}\\_{t}$$\r
\r
Usually $\\gamma$ is set to around $0.9$. Rewriting SGD updates in terms of the parameter update vector:\r
\r
$$ \\Delta\\theta_{t} = -\\eta\\cdot{g\\_{t, i}}$$\r
$$\\theta\\_{t+1} = \\theta\\_{t} + \\Delta\\theta_{t}$$\r
\r
AdaDelta takes the form:\r
\r
$$ \\Delta\\theta_{t} = -\\frac{\\eta}{\\sqrt{E\\left[g^{2}\\right]\\_{t} + \\epsilon}}g_{t} $$\r
\r
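Following the formulas above, a minimal NumPy sketch of one update (illustrative only, not a reference implementation):

```python
import numpy as np

def adadelta_step(theta, g, eg2, lr=1.0, gamma=0.9, eps=1e-8):
    # Decaying average of past squared gradients
    eg2 = gamma * eg2 + (1 - gamma) * g ** 2
    theta = theta - lr * g / np.sqrt(eg2 + eps)
    return theta, eg2
```
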
In the full method, the numerator $\\eta$ is itself replaced by a decaying RMS of the previous parameter updates, $\\text{RMS}\\left[\\Delta\\theta\\right]\\_{t-1}$, giving $\\Delta\\theta\\_{t} = -\\frac{\\text{RMS}\\left[\\Delta\\theta\\right]\\_{t-1}}{\\text{RMS}\\left[g\\right]\\_{t}}g\\_{t}$. This is why the main advantage of AdaDelta is that we do not need to set a default learning rate.""" ;
skos:prefLabel "AdaDelta" .
:AdaGPR a skos:Concept ;
dcterms:source ;
skos:definition "**AdaGPR** is an adaptive, layer-wise graph [convolution](https://paperswithcode.com/method/convolution) model. AdaGPR applies adaptive generalized Pageranks at each layer of a [GCNII](https://paperswithcode.com/method/gcnii) model by learning to predict the coefficients of generalized Pageranks using sparse solvers." ;
skos:prefLabel "AdaGPR" .
:AdaGrad a skos:Concept ;
rdfs:seeAlso ;
skos:definition """**AdaGrad** is a stochastic optimization method that adapts the learning rate to the parameters. It performs smaller updates for parameters associated with frequently occurring features, and larger updates for parameters associated with infrequently occurring features. In its update rule, Adagrad modifies the general learning rate $\\eta$ at each time step $t$ for every parameter $\\theta\\_{i}$ based on the past gradients for $\\theta\\_{i}$: \r
\r
$$ \\theta\\_{t+1, i} = \\theta\\_{t, i} - \\frac{\\eta}{\\sqrt{G\\_{t, ii} + \\epsilon}}g\\_{t, i} $$\r
\r
Here $G\\_{t} \\in \\mathbb{R}^{d \\times d}$ is a diagonal matrix where each diagonal element $G\\_{t,ii}$ is the sum of the squares of the gradients w.r.t. $\\theta\\_{i}$ up to time step $t$, and $\\epsilon$ is a smoothing term that avoids division by zero. The benefit of AdaGrad is that it eliminates the need to manually tune the learning rate; most implementations leave it at a default value of $0.01$. Its main weakness is the accumulation of the squared gradients in the denominator: since every added term is positive, the accumulated sum keeps growing during training, causing the learning rate to shrink until it becomes infinitesimally small.\r
\r
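A per-parameter sketch in NumPy (illustrative only; `G` holds the diagonal of the accumulated squared gradients):

```python
import numpy as np

def adagrad_step(theta, g, G, lr=0.01, eps=1e-8):
    G = G + g ** 2                            # accumulate squared gradients
    theta = theta - lr * g / np.sqrt(G + eps)
    return theta, G
```
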
""" ;
skos:prefLabel "AdaGrad" .
:AdaHessian a skos:Concept ;
dcterms:source ;
skos:altLabel "ADAHESSIAN" ;
skos:definition "**AdaHessian** is a second-order stochastic optimizer that incorporates an approximate Hessian diagonal, with spatial averaging and momentum, into adaptive parameter updates. It achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods, including variants of [ADAM](https://paperswithcode.com/method/adam). In particular, the authors perform extensive tests on CV, NLP, and recommendation system tasks and find that AdaHessian: (i) achieves 1.80%/1.45% higher accuracy on ResNets20/32 on Cifar10, and 5.55% higher accuracy on ImageNet as compared to ADAM; (ii) outperforms ADAMW for transformers by 0.27/0.33 BLEU score on IWSLT14/WMT14 and 1.8/1.0 PPL on PTB/Wikitext-103; and (iii) achieves 0.032% better score than [AdaGrad](https://paperswithcode.com/method/adagrad) for DLRM on the Criteo Ad Kaggle dataset. Importantly, the cost per iteration of AdaHessian is comparable to that of first-order methods, and it exhibits robustness towards its hyperparameters." ;
skos:prefLabel "AdaHessian" .
:AdaMod a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**AdaMod** is a stochastic optimizer that restricts adaptive learning rates with adaptive and momental upper bounds. The dynamic learning rate bounds are based on the exponential moving averages of the adaptive learning rates themselves, which smooth out unexpected large learning rates and stabilize the training of deep neural networks.\r
\r
\r
The weight updates are performed as:\r
\r
\r
$$ g\\_{t} = \\nabla{f}\\_{t}\\left(\\theta\\_{t-1}\\right) $$\r
\r
$$ m\\_{t} = \\beta\\_{1}m\\_{t-1} + \\left(1-\\beta\\_{1}\\right)g\\_{t} $$\r
\r
$$ v\\_{t} = \\beta\\_{2}v\\_{t-1} + \\left(1-\\beta\\_{2}\\right)g\\_{t}^{2} $$\r
\r
$$ \\hat{m}\\_{t} = m\\_{t} / \\left(1 - \\beta^{t}\\_{1}\\right)$$\r
\r
$$ \\hat{v}\\_{t} = v\\_{t} / \\left(1 - \\beta^{t}\\_{2}\\right)$$\r
\r
$$ \\eta\\_{t} = \\alpha\\_{t} / \\left(\\sqrt{\\hat{v}\\_{t}} + \\epsilon\\right) $$\r
\r
$$ s\\_{t} = \\beta\\_{3}s\\_{t-1} + (1-\\beta\\_{3})\\eta\\_{t} $$\r
\r
$$ \\hat{\\eta}\\_{t} = \\text{min}\\left(\\eta\\_{t}, s\\_{t}\\right) $$\r
\r
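These updates, including the final parameter step below, can be sketched in NumPy (an illustrative sketch with assumed defaults, not a reference implementation):

```python
import numpy as np

def adamod_step(theta, g, m, v, s, t, lr=0.1, b1=0.9, b2=0.999, b3=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)                 # bias correction
    v_hat = v / (1 - b2 ** t)
    eta = lr / (np.sqrt(v_hat) + eps)         # Adam-style step size
    s = b3 * s + (1 - b3) * eta               # momental average of step sizes
    eta_hat = np.minimum(eta, s)              # clip by the momental bound
    theta = theta - eta_hat * m_hat
    return theta, m, v, s
```
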
$$ \\theta\\_{t} = \\theta\\_{t-1} - \\hat{\\eta}\\_{t}\\hat{m}\\_{t} $$""" ;
skos:prefLabel "AdaMod" .
:AdaRNN a skos:Concept ;
dcterms:source ;
skos:definition "**AdaRNN** is an adaptive [RNN](https://paperswithcode.com/methods/category/recurrent-neural-networks) that learns an adaptive model through two modules: [Temporal Distribution Characterization](https://paperswithcode.com/method/temporal-distribution-characterization) (TDC) and [Temporal Distribution Matching](https://paperswithcode.com/method/temporal-distribution-matching) (TDM) algorithms. Firstly, to better characterize the distribution information in time-series, TDC splits the training data into $K$ most diverse periods that have a large distribution gap inspired by the principle of maximum entropy. After that, a temporal distribution matching (TDM) algorithm is used to dynamically reduce distribution divergence using a [RNN](https://paperswithcode.com/methods/category/recurrent-neural-networks)-based model." ;
skos:prefLabel "AdaRNN" .
:AdaShift a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**AdaShift** is a type of adaptive stochastic optimizer that decorrelates $v\\_{t}$ and $g\\_{t}$ in [Adam](https://paperswithcode.com/method/adam) by temporal shifting, i.e., using the temporally shifted gradient $g\\_{t−n}$ to calculate $v\\_{t}$. The authors argue that an inappropriate correlation between the gradient $g\\_{t}$ and the second-moment term $v\\_{t}$ exists in Adam, which results in a large gradient being likely to have a small step size while a small gradient may have a large step size, and that such biased step sizes are the fundamental cause of the non-convergence of Adam.\r
\r
The AdaShift updates, based on the idea of temporal independence between gradients, are as follows:\r
\r
$$ g\\_{t} = \\nabla{f\\_{t}}\\left(\\theta\\_{t}\\right) $$\r
\r
$$ m\\_{t} = \\sum^{n-1}\\_{i=0}\\beta^{i}\\_{1}g\\_{t-i}/\\sum^{n-1}\\_{i=0}\\beta^{i}\\_{1} $$\r
\r
Then for $i=1$ to $M$:\r
\r
$$ v\\_{t}\\left[i\\right] = \\beta\\_{2}v\\_{t-1}\\left[i\\right] + \\left(1-\\beta\\_{2}\\right)\\phi\\left(g^{2}\\_{t-n}\\left[i\\right]\\right) $$\r
\r
$$ \\theta\\_{t}\\left[i\\right] = \\theta\\_{t-1}\\left[i\\right] - \\alpha\\_{t}/\\sqrt{v\\_{t}\\left[i\\right]}\\cdot{m\\_{t}\\left[i\\right]} $$""" ;
skos:prefLabel "AdaShift" .
:AdaSmooth a skos:Concept ;
dcterms:source ;
skos:altLabel "Adaptive Smooth Optimizer" ;
skos:definition """**AdaSmooth** is a stochastic optimization technique that provides a per-dimension learning rate for [SGD](https://paperswithcode.com/method/sgd). It extends [Adagrad](https://paperswithcode.com/method/adagrad) and [AdaDelta](https://paperswithcode.com/method/adadelta), which seek to reduce Adagrad's aggressive, monotonically decreasing learning rate: instead of accumulating all past squared gradients, AdaDelta restricts the window of accumulated past gradients to a fixed size $w$, while AdaSmooth adaptively selects the size of the window.
\r
Given the window size $M$, the effective ratio is calculated by \r
\r
$$e_t = \\frac{s_t}{n_t}= \\frac{| x_t - x_{t-M}|}{\\sum_{i=0}^{M-1} | x_{t-i} - x_{t-1-i}|}\\\\\r
= \\frac{| \\sum_{i=0}^{M-1} \\Delta x_{t-1-i}|}{\\sum_{i=0}^{M-1} | \\Delta x_{t-1-i}|}.$$\r
\r
Given the effective ratio, the scaled smoothing constant is obtained by:\r
\r
$$c_t = (\\rho_2 - \\rho_1) \\times e_t + (1-\\rho_2).$$
\r
The running average $E\\left[g^{2}\\right]\\_{t}$ at time step $t$ then depends only on the previous average and current gradient:\r
\r
$$ E\\left[g^{2}\\right]\\_{t} = c_t^2 \\odot g_{t}^2 + \\left(1-c_t^2 \\right)\\odot E[g^2]_{t-1} $$\r
\r
Usually $\\rho_1$ is set to around $0.5$ and $\\rho_2$ to around $0.99$. The update step then follows:
\r
$$ \\Delta x_t = -\\frac{\\eta}{\\sqrt{E\\left[g^{2}\\right]\\_{t} + \\epsilon}} \\odot g_{t}, $$\r
\r
which is incorporated into the final update:\r
\r
$$x_{t+1} = x_{t} + \\Delta x_t.$$\r
\r
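The per-step computation above can be sketched in NumPy as follows (an illustrative sketch, not the reference implementation; `dx_history` holds the last `M` parameter updates, and all names are assumptions):

```python
import numpy as np

def adasmooth_step(x, grad, dx_history, eg2, eta=0.5, rho1=0.5, rho2=0.99, eps=1e-6):
    # dx_history: the last M updates Delta x, so their sum telescopes to x_t - x_{t-M}
    s = np.abs(np.sum(dx_history, axis=0))        # |x_t - x_{t-M}|
    n = np.sum(np.abs(dx_history), axis=0) + eps  # sum of |Delta x| over the window
    e = s / n                                     # effective ratio e_t in [0, 1]
    c = (rho2 - rho1) * e + (1.0 - rho2)          # scaled smoothing constant c_t
    eg2 = c**2 * grad**2 + (1.0 - c**2) * eg2     # running average E[g^2]_t
    dx = -eta * grad / np.sqrt(eg2 + eps)         # per-dimension update Delta x_t
    return x + dx, dx, eg2
```

A high effective ratio (consistent progress in one direction) pushes the smoothing constant toward `1 - rho1`, shortening the memory of the squared-gradient average; oscillation pushes it toward `1 - rho2`, lengthening it.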
The main advantages of AdaSmooth are its faster convergence rate and its insensitivity to hyperparameters.""" ;
skos:prefLabel "AdaSmooth" .
:AdaSqrt a skos:Concept ;
dcterms:source ;
skos:definition """**AdaSqrt** is a stochastic optimization technique that is motivated by the observation that methods like [Adagrad](https://paperswithcode.com/method/adagrad) and [Adam](https://paperswithcode.com/method/adam) can be viewed as relaxations of [Natural Gradient Descent](https://paperswithcode.com/method/natural-gradient-descent).\r
\r
The updates are performed as follows:\r
\r
$$ t \\leftarrow t + 1 $$\r
\r
$$ \\alpha\\_{t} \\leftarrow \\sqrt{t} $$\r
\r
$$ g\\_{t} \\leftarrow \\nabla\\_{\\theta}f\\left(\\theta\\_{t-1}\\right) $$\r
\r
$$ S\\_{t} \\leftarrow S\\_{t-1} + g\\_{t}^{2} $$\r
\r
$$ \\theta\\_{t+1} \\leftarrow \\theta\\_{t} - \\eta\\frac{\\alpha\\_{t}g\\_{t}}{S\\_{t} + \\epsilon} $$""" ;
skos:prefLabel "AdaSqrt" .
:Adabelief a skos:Concept ;
dcterms:source ;
skos:definition "" ;
skos:prefLabel "Adabelief" .
:Adafactor a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**Adafactor** is a stochastic optimization method based on [Adam](https://paperswithcode.com/method/adam) that reduces memory usage while retaining the empirical benefits of adaptivity. This is achieved through maintaining a factored representation of the squared gradient accumulator across training steps. Specifically, by tracking moving averages of the row and column sums of the squared gradients for matrix-valued variables, we are able to reconstruct a low-rank approximation of the exponentially smoothed accumulator at each training step that is optimal with respect to the generalized Kullback-Leibler divergence. For an $n \\times m$ matrix, this reduces the memory requirements from $O(n m)$ to $O(n + m)$. \r
\r
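The factored accumulator can be sketched as follows (a minimal NumPy illustration of the row/column-sum update and rank-1 reconstruction; names and the simplified update are assumptions, not the reference implementation):

```python
import numpy as np

def factored_second_moment(R, C, G, beta2=0.999):
    # R: running row sums (n,), C: running column sums (m,) of squared gradients.
    # Only these two vectors are stored, giving O(n + m) memory instead of O(n m).
    G2 = G ** 2
    R = beta2 * R + (1.0 - beta2) * G2.sum(axis=1)
    C = beta2 * C + (1.0 - beta2) * G2.sum(axis=0)
    V_hat = np.outer(R, C) / R.sum()   # rank-1 reconstruction of the accumulator
    return R, C, V_hat
```

The reconstruction is exact whenever the smoothed squared-gradient matrix is itself rank-1, and is otherwise the optimal approximation under the generalized Kullback-Leibler divergence.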
Instead of defining the optimization algorithm in terms of absolute step sizes {$\\alpha_t$}$\\_{t=1}^T$, the authors define the optimization algorithm in terms of relative step sizes {$\\rho_t$}$\\_{t=1}^T$, which get multiplied by the scale of the parameters. The scale of a parameter vector or matrix is defined as the root-mean-square of its components, lower-bounded by a small constant $\\epsilon_2$. The reason for this lower bound is to allow zero-initialized parameters to escape 0. \r
\r
Proposed hyperparameters are: $\\epsilon\\_{1} = 10^{-30}$, $\\epsilon\\_{2} = 10^{-3}$, $d=1$, $\\rho\\_{t} = \\min\\left(10^{-2}, \\frac{1}{\\sqrt{t}}\\right)$, $\\hat{\\beta}\\_{2\\_{t}} = 1 - t^{-0.8}$.""" ;
skos:prefLabel "Adafactor" .
:AdamW a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**AdamW** is a stochastic optimization method that modifies the typical implementation of weight decay in [Adam](https://paperswithcode.com/method/adam), by decoupling [weight decay](https://paperswithcode.com/method/weight-decay) from the gradient update. To see this, $L\\_{2}$ regularization in Adam is usually implemented with the below modification where $w\\_{t}$ is the rate of the weight decay at time $t$:\r
\r
$$ g\\_{t} = \\nabla{f\\left(\\theta\\_{t}\\right)} + w\\_{t}\\theta\\_{t}$$\r
\r
while AdamW adjusts the weight decay term to appear in the gradient update:\r
\r
$$ \\theta\\_{t+1, i} = \\theta\\_{t, i} - \\eta\\left(\\frac{1}{\\sqrt{\\hat{v}\\_{t} + \\epsilon}}\\cdot{\\hat{m}\\_{t}} + w\\_{t, i}\\theta\\_{t, i}\\right), \\forall{t}$$""" ;
skos:prefLabel "AdamW" .
:Adapter a skos:Concept ;
dcterms:source ;
skos:definition "" ;
skos:prefLabel "Adapter" .
:AdaptiveBins a skos:Concept ;
dcterms:source ;
skos:altLabel "Adaptive Bins" ;
skos:definition "" ;
skos:prefLabel "AdaptiveBins" .
:AdaptiveDropout a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**Adaptive Dropout** is a regularization technique that extends dropout by allowing the dropout probability to be different for different units. The intuition is that there may be hidden units that can individually make confident predictions for the presence or absence of an important feature or combination of features. [Dropout](https://paperswithcode.com/method/dropout) will ignore this confidence and drop the unit out 50% of the time. \r
\r
Denote the activity of unit $j$ in a deep neural network by $a\\_{j}$ and assume that its inputs are {$a\\_{i}: i < j$}. In dropout, $a\\_{j}$ is randomly set to zero with probability 0.5. Let $m\\_{j}$ be a binary variable that is used to mask, the activity $a\\_{j}$, so that its value is:\r
\r
$$ a\\_{j} = m\\_{j}g\\left(\\sum\\_{i: i<j} w\\_{j,i}a\\_{i} + b\\_{j}\\right) $$

where $g\\left(\\cdot\\right)$ is the activation function and $w\\_{j,i}$ and $b\\_{j}$ are the weights and bias of unit $j$.""" ;
skos:prefLabel "Adaptive Dropout" .
:AdaptiveFeaturePooling a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**Adaptive Feature Pooling** pools features from all levels for each proposal in object detection and fuses them for the following prediction. Each proposal is mapped to the different feature levels. Following the idea of [Mask R-CNN](https://paperswithcode.com/method/mask-r-cnn), [RoIAlign](https://paperswithcode.com/method/roi-align) is used to pool feature grids from each level. Then a fusion operation (element-wise max or sum) is utilized to fuse the feature grids from the different levels.
\r
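The fusion step can be sketched as follows (a minimal NumPy illustration; the input grids are assumed to be the per-level RoIAlign outputs for a single proposal, and the function name is an assumption):

```python
import numpy as np

def adaptive_feature_pool(level_grids, fusion='max'):
    # level_grids: per-level pooled RoI grids for one proposal, each of shape
    # (C, H, W), e.g. the RoIAlign output from every FPN level
    stacked = np.stack(level_grids)    # (num_levels, C, H, W)
    if fusion == 'max':
        return stacked.max(axis=0)     # element-wise max across levels
    return stacked.sum(axis=0)         # element-wise sum across levels
```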
The motivation for this technique is that in an [FPN](https://paperswithcode.com/method/fpn) proposals are assigned to different feature levels based on their size, which could be suboptimal: proposals with small size differences can be assigned to different levels, and the importance of features may not be strongly correlated with the level to which they belong.""" ;
skos:prefLabel "Adaptive Feature Pooling" .
:AdaptiveInputRepresentations a skos:Concept ;
dcterms:source ;
skos:definition "**Adaptive Input Embeddings** extend the [adaptive softmax](https://paperswithcode.com/method/adaptive-softmax) to input word representations. The factorization assigns more capacity to frequent words and reduces the capacity for less frequent words with the benefit of reducing overfitting to rare words." ;
skos:prefLabel "Adaptive Input Representations" .
:AdaptiveInstanceNormalization a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**Adaptive Instance Normalization** is a normalization method that aligns the mean and variance of the content features with those of the style features. \r
\r
[Instance Normalization](https://paperswithcode.com/method/instance-normalization) normalizes the input to a single style specified by the affine parameters; Adaptive Instance Normalization (AdaIN) is an extension. In AdaIN, we receive a content input $x$ and a style input $y$, and we simply align the channel-wise mean and variance of $x$ to match those of $y$. Unlike [Batch Normalization](https://paperswithcode.com/method/batch-normalization), Instance Normalization or [Conditional Instance Normalization](https://paperswithcode.com/method/conditional-instance-normalization), AdaIN has no learnable affine parameters. Instead, it adaptively computes the affine parameters from the style input:
\r
$$\r
\\textrm{AdaIN}(x, y)= \\sigma(y)\\left(\\frac{x-\\mu(x)}{\\sigma(x)}\\right)+\\mu(y)\r
$$""" ;
skos:prefLabel "Adaptive Instance Normalization" .
:AdaptiveLoss a skos:Concept ;
dcterms:source ;
skos:altLabel "Adaptive Robust Loss" ;
skos:definition "The Robust Loss is a generalization of the Cauchy/Lorentzian, Geman-McClure, Welsch/Leclerc, generalized Charbonnier, Charbonnier/pseudo-Huber/L1-L2, and L2 loss functions. By introducing robustness as a continuous parameter, the loss function allows algorithms built around robust loss minimization to be generalized, which improves performance on basic vision tasks such as registration and clustering. Interpreting the loss as the negative log of a univariate density yields a general probability distribution that includes normal and Cauchy distributions as special cases. This probabilistic interpretation enables the training of neural networks in which the robustness of the loss automatically adapts itself during training, which improves performance on learning-based tasks such as generative image synthesis and unsupervised monocular depth estimation, without requiring any manual parameter tuning." ;
skos:prefLabel "Adaptive Loss" .
:AdaptiveMasking a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**Adaptive Masking** is a type of attention mechanism that allows a model to learn its own context size to attend over. For each head in [Multi-Head Attention](https://paperswithcode.com/method/multi-head-attention), a masking function is added to control the span of the attention. A masking function is a non-increasing function that maps a distance to a value in $\\left[0, 1\\right]$. Adaptive masking uses the following soft masking function $m\\_{z}$, parametrized by a real value $z$ in $\\left[0, S\\right]$:
\r
$$ m\\_{z}\\left(x\\right) = \\min\\left[\\max\\left[\\frac{1}{R}\\left(R+z-x\\right), 0\\right], 1\\right] $$\r
\r
where $R$ is a hyper-parameter that controls its softness. This piecewise-linear function equals $1$ for distances up to $z$ and decays linearly to $0$ between $z$ and $z+R$. The soft masking function is inspired by [Jernite et al. (2017)](https://arxiv.org/abs/1611.06188). The attention weights are then computed on the masked span:
\r
$$ a\\_{tr} = \\frac{m\\_{z}\\left(t-r\\right)\\exp\\left(s\\_{tr}\\right)}{\\sum^{t-1}\\_{q=t-S}m\\_{z}\\left(t-q\\right)\\exp\\left(s\\_{tq}\\right)}$$\r
\r
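The soft mask and the masked attention weights can be sketched as follows (an illustrative NumPy sketch under the notation above; function names are assumptions):

```python
import numpy as np

def soft_mask(x, z, R):
    # m_z(x) = min(max((R + z - x) / R, 0), 1): equal to 1 up to distance z,
    # decaying linearly over [z, z + R], and 0 beyond z + R
    return np.clip((R + z - x) / R, 0.0, 1.0)

def masked_attention(scores, z, R):
    # scores: similarities s_{tq} for the S most recent positions, oldest first,
    # so the distances t - q run from S down to 1
    S = scores.shape[-1]
    dist = np.arange(S, 0, -1)
    w = soft_mask(dist, z, R) * np.exp(scores - scores.max())
    return w / w.sum()
```

Subtracting `scores.max()` before exponentiating is standard numerical stabilization and cancels in the normalized ratio.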
An $\\ell\\_{1}$ penalty on the parameters $z\\_{i}$ of each attention head $i$ of the model is added to the loss function:
\r
$$ L = - \\log{P}\\left(w\\_{1}, \\dots, w\\_{T}\\right) + \\frac{\\lambda}{M}\\sum\\_{i}z\\_{i} $$\r
\r
where $\\lambda > 0$ is the regularization hyperparameter and $M$ is the number of heads in each layer. This formulation is differentiable in the parameters $z\\_{i}$, which are learnt jointly with the rest of the model.""" ;
skos:prefLabel "Adaptive Masking" .
:AdaptiveNMS a skos:Concept ;
dcterms:source ;
skos:definition "**Adaptive Non-Maximum Suppression** is a non-maximum suppression algorithm that applies a dynamic suppression threshold to an instance according to the target density. The motivation is to find an NMS algorithm that works well for pedestrian detection in a crowd. Intuitively, a high NMS threshold keeps more crowded instances while a low NMS threshold wipes out more false positives. The adaptive-NMS thus applies a dynamic suppression strategy, where the threshold rises as instances gather and occlude each other and decays when instances appear separately. To this end, an auxiliary and learnable sub-network is designed to predict the adaptive NMS threshold for each instance." ;
skos:prefLabel "Adaptive NMS" .
:AdaptiveSoftmax a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**Adaptive Softmax** is a speedup technique for the computation of probability distributions over words. The adaptive [softmax](https://paperswithcode.com/method/softmax) is inspired by the class-based [hierarchical softmax](https://paperswithcode.com/method/hierarchical-softmax), where the word classes are built to minimize the computation time. Adaptive softmax achieves efficiency by explicitly taking into account the computation time of matrix-multiplication on parallel systems and combining it with a few important observations, namely keeping a shortlist of frequent words in the root node\r
and reducing the capacity of rare words.""" ;
skos:prefLabel "Adaptive Softmax" .
:AdaptiveSpanTransformer a skos:Concept ;
dcterms:source ;
skos:definition """The **Adaptive Attention Span Transformer** is a Transformer that utilises an improvement to the self-attention layer called [adaptive masking](https://paperswithcode.com/method/adaptive-masking) that allows the model to choose its own context size. This results in a network where each attention layer gathers information on its own context, and allows scaling to input sequences of more than 8k tokens.
\r
The proposal is based on the observation that, with the dense attention of a traditional [Transformer](https://paperswithcode.com/method/transformer), each attention head shares the same attention span $S$ (attending over the full context), yet many attention heads can specialize to a more local context while others attend over the longer sequence. This motivates a variant of self-attention that allows the model to choose its own context size (adaptive masking).""" ;
skos:prefLabel "Adaptive Span Transformer" .
:AdaptivelySparseTransformer a skos:Concept ;
dcterms:source ;
skos:definition "The **Adaptively Sparse Transformer** is a type of [Transformer](https://paperswithcode.com/method/transformer) in which the softmax in the attention heads is replaced by $\\alpha$-entmax, a differentiable generalization of softmax that can assign exactly zero weight, with the parameter $\\alpha$ learned per head so that each head adapts its own degree of sparsity." ;
skos:prefLabel "Adaptively Sparse Transformer" .
:AdditiveAttention a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**Additive Attention**, also known as **Bahdanau Attention**, uses a one-hidden layer feed-forward network to calculate the attention alignment score:\r
\r
$$f_{att}\\left(\\textbf{h}_{i}, \\textbf{s}\\_{j}\\right) = v\\_{a}^{T}\\tanh\\left(\\textbf{W}\\_{a}\\left[\\textbf{h}\\_{i};\\textbf{s}\\_{j}\\right]\\right)$$\r
\r
where $\\textbf{v}\\_{a}$ and $\\textbf{W}\\_{a}$ are learned attention parameters. Here $\\textbf{h}$ refers to the encoder hidden states and $\\textbf{s}$ to the decoder hidden states. The function above is thus a type of alignment score function. We can use a matrix of alignment scores to show the correlation between source and target words, as the Figure to the right shows.
\r
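The score function can be sketched as follows (illustrative NumPy; the shapes and names are assumptions, not a reference implementation):

```python
import numpy as np

def additive_scores(H, s, W_a, v_a):
    # H: encoder hidden states, shape (n, d_h); s: one decoder state, shape (d_s,).
    # Returns f_att(h_i, s) = v_a^T tanh(W_a [h_i; s]) for every encoder position i.
    concat = np.concatenate([H, np.tile(s, (H.shape[0], 1))], axis=1)  # (n, d_h + d_s)
    return np.tanh(concat @ W_a.T) @ v_a                               # (n,)
```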
Within a neural network, once we have the alignment scores, we calculate the final attention weights using a [softmax](https://paperswithcode.com/method/softmax) over these alignment scores (ensuring they sum to 1).""" ;
skos:prefLabel "Additive Attention" .
:AdvProp a skos:Concept ;
dcterms:source ;
skos:definition "**AdvProp** is an adversarial training scheme which treats adversarial examples as additional training examples, to prevent overfitting. Key to the method is the usage of a separate auxiliary batch norm for adversarial examples, as they have a different underlying distribution to normal examples." ;
skos:prefLabel "AdvProp" .
:AdversarialColorEnhancement a skos:Concept ;
dcterms:source ;
skos:definition "**Adversarial Color Enhancement** is an approach to generating unrestricted adversarial images by optimizing a color filter via gradient descent." ;
skos:prefLabel "Adversarial Color Enhancement" .
:AdversarialSoftAdvantageFitting\(ASAF\) a skos:Concept ;
dcterms:source ;
skos:definition "" ;
skos:prefLabel "Adversarial Soft Advantage Fitting (ASAF)" .
:AdversarialSolarization a skos:Concept ;
dcterms:source ;
skos:definition "" ;
skos:prefLabel "Adversarial Solarization" .
:AffCorrs a skos:Concept ;
rdfs:seeAlso ;
skos:altLabel "Affordance Correspondence" ;
skos:definition "Method for one-shot visual search of object parts / one-shot semantic part correspondence. Given a single reference image of an object with annotated affordance regions, it segments semantically corresponding parts within a target scene. AffCorrs is used to find corresponding affordances both for intra- and inter-class one-shot part segmentation." ;
skos:prefLabel "AffCorrs" .
:AffineCoupling a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**Affine Coupling** is a method for implementing a normalizing flow (where we stack a sequence of invertible bijective transformation functions). Affine coupling is one of these bijective transformation functions. Specifically, it is an example of a reversible transformation where the forward function, the reverse function and the log-determinant are computationally efficient. For the forward function, we split the input dimension into two parts:\r
\r
$$ \\mathbf{x}\\_{a}, \\mathbf{x}\\_{b} = \\text{split}\\left(\\mathbf{x}\\right) $$\r
\r
The second part stays the same $\\mathbf{x}\\_{b} = \\mathbf{y}\\_{b}$, while the first part $\\mathbf{x}\\_{a}$ undergoes an affine transformation, where the parameters for this transformation are learnt using the second part $\\mathbf{x}\\_{b}$ being put through a neural network. Together we have:\r
\r
$$ \\left(\\log{\\mathbf{s}}, \\mathbf{t}\\right) = \\text{NN}\\left(\\mathbf{x}\\_{b}\\right) $$
\r
$$ \\mathbf{s} = \\exp\\left(\\log{\\mathbf{s}}\\right) $$\r
\r
$$ \\mathbf{y}\\_{a} = \\mathbf{s} \\odot \\mathbf{x}\\_{a} + \\mathbf{t} $$\r
\r
$$ \\mathbf{y}\\_{b} = \\mathbf{x}\\_{b} $$\r
\r
$$ \\mathbf{y} = \\text{concat}\\left(\\mathbf{y}\\_{a}, \\mathbf{y}\\_{b}\\right) $$\r
\r
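The forward and inverse passes above can be sketched as follows (a minimal NumPy sketch; `nn` stands in for the learned network that outputs the pair `(log s, t)`, and all names are assumptions):

```python
import numpy as np

def coupling_forward(x, nn):
    # Split the input; the second half passes through unchanged and also
    # parameterizes the affine transform applied to the first half.
    xa, xb = np.split(x, 2)
    log_s, t = nn(xb)
    ya = np.exp(log_s) * xa + t
    log_det = np.sum(log_s)            # log-determinant of the Jacobian
    return np.concatenate([ya, xb]), log_det

def coupling_inverse(y, nn):
    ya, yb = np.split(y, 2)
    log_s, t = nn(yb)                  # yb equals xb, so the parameters match
    xa = (ya - t) * np.exp(-log_s)
    return np.concatenate([xa, yb])
```

Because the inverse reuses the untouched half to recompute the same `(log s, t)`, the transformation is exactly invertible regardless of how complex `nn` is.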
Image: [GLOW](https://paperswithcode.com/method/glow)""" ;
skos:prefLabel "Affine Coupling" .
:AffineOperator a skos:Concept ;
dcterms:source ;
skos:definition """The **Affine Operator** is an affine transformation layer introduced in the [ResMLP](https://paperswithcode.com/method/resmlp) architecture. It replaces the [layer normalization](https://paperswithcode.com/method/layer-normalization) used in [Transformer based networks](https://paperswithcode.com/methods/category/transformers); this is possible because the ResMLP has no [self-attention layers](https://paperswithcode.com/method/scaled), which makes training more stable and allows a simpler affine transformation.
\r
The affine operator is defined as:\r
\r
$$ \\operatorname{Aff}_{\\mathbf{\\alpha}, \\mathbf{\\beta}}(\\mathbf{x})=\\operatorname{Diag}(\\mathbf{\\alpha}) \\mathbf{x}+\\mathbf{\\beta} $$\r
\r
where $\\alpha$ and $\\beta$ are learnable weight vectors. This operation only rescales and shifts the input element-wise. It has several advantages over other normalization operations: first, as opposed to Layer Normalization, it has no cost at inference time, since it can be absorbed into the adjacent linear layer. Second, as opposed to [BatchNorm](https://paperswithcode.com/method/batch-normalization) and Layer Normalization, the Aff operator does not depend on batch statistics.""" ;
skos:prefLabel "Affine Operator" .
:AggMo a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**Aggregated Momentum (AggMo)** is a variant of the [classical momentum](https://paperswithcode.com/method/sgd-with-momentum) stochastic optimizer which maintains several velocity vectors with different $\\beta$ parameters and averages them when updating the parameters. It resolves the problem of choosing a momentum parameter by taking a linear combination of multiple momentum buffers: each of the $K$ momentum buffers has its own discount factor $\\beta^{(i)}$, collected in a vector $\\beta \\in \\mathbb{R}^{K}$, and the buffers are averaged for the update. The update rule is:
\r
$$ \\textbf{v}\\_{t}^{\\left(i\\right)} = \\beta^{(i)}\\textbf{v}\\_{t-1}^{\\left(i\\right)} - \\nabla\\_{\\theta}f\\left(\\mathbf{\\theta}\\_{t-1}\\right) $$\r
\r
$$ \\mathbf{\\theta\\_{t}} = \\mathbf{\\theta\\_{t-1}} + \\frac{\\gamma\\_{t}}{K}\\sum^{K}\\_{i=1}\\textbf{v}\\_{t}^{\\left(i\\right)} $$\r
\r
where $v^{\\left(i\\right)}_{0} = 0$ for each $i$. The vector $\\beta = \\left[\\beta^{(1)}, \\ldots, \\beta^{(K)}\\right]$ collects the damping factors.""" ;
skos:prefLabel "AggMo" .
:AgglomerativeContextualDecomposition a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition "**Agglomerative Contextual Decomposition (ACD)** is an interpretability method that produces hierarchical interpretations for a single prediction made by a neural network, by scoring interactions and building them into a tree. Given a prediction from a trained neural network, ACD produces a hierarchical clustering of the input features, along with the contribution of each cluster to the final prediction. This hierarchy is optimized to identify clusters of features that the DNN learned are predictive." ;
skos:prefLabel "Agglomerative Contextual Decomposition" .
:AggregatedLearning a skos:Concept ;
dcterms:source ;
skos:definition "**Aggregated Learning (AgrLearn)** is a vector-quantization approach to learning neural network classifiers. It builds on an equivalence between information-bottleneck (IB) learning and IB quantization and exploits the power of vector quantization, which is well known in information theory." ;
skos:prefLabel "Aggregated Learning" .
:AgingEvolution a skos:Concept ;
dcterms:source ;
skos:definition """**Aging Evolution**, or **Regularized Evolution**, is an evolutionary algorithm for [neural architecture search](https://paperswithcode.com/method/neural-architecture-search). Whereas in tournament selection, the best architectures are kept, in aging evolution we associate each genotype with an age, and bias the tournament selection to choose\r
the younger genotypes. In the context of architecture search, aging evolution allows us to explore the search space more, instead of zooming in on good models too early, as non-aging evolution would.""" ;
skos:prefLabel "Aging Evolution" .
:AlexNet a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition "**AlexNet** is a classic convolutional neural network architecture. It consists of convolutions, [max pooling](https://paperswithcode.com/method/max-pooling) and dense layers as the basic building blocks. Grouped convolutions are used in order to fit the model across two GPUs." ;
skos:prefLabel "AlexNet" .
:AlignPS a skos:Concept ;
dcterms:source ;
skos:altLabel "Feature-Aligned Person Search Network" ;
skos:definition "**AlignPS**, or **Feature-Aligned Person Search Network**, is an anchor-free framework for efficient person search. The model employs the typical architecture of an anchor-free detection model (i.e., [FCOS](https://paperswithcode.com/method/fcos)). An aligned feature aggregation (AFA) module is designed to make the model focus more on the re-id subtask. Specifically, AFA reshapes some building blocks of [FPN](https://paperswithcode.com/method/fpn) to overcome the issues of region and scale misalignment in re-id feature learning. A [deformable convolution](https://paperswithcode.com/method/deformable-convolution) is exploited to make the re-id embeddings adaptively aligned with the foreground regions. A feature fusion scheme is designed to better aggregate features from different FPN levels, which makes the re-id features more robust to scale variations. The training procedures of re-id and detection are also optimized to place more emphasis on generating robust re-id embeddings." ;
skos:prefLabel "AlignPS" .
:All-AttentionLayer a skos:Concept ;
dcterms:source ;
skos:definition "An **All-Attention Layer** is an attention module and layer for transformers that merges the self-attention and feedforward sublayers into a single unified attention layer. As opposed to the two-step mechanism of the [Transformer](https://paperswithcode.com/method/transformer) layer, it directly builds its representation from the context and a persistent memory block without going through a feedforward transformation. The additional persistent memory block stores, in the form of key-value vectors, information that does not depend on the context. In terms of parameters, these persistent key-value vectors replace the feedforward sublayer." ;
skos:prefLabel "All-Attention Layer" .
:AlphaFold a skos:Concept ;
dcterms:source ;
skos:definition """AlphaFold is a deep learning based algorithm for accurate protein structure prediction. AlphaFold incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.\r
\r
Description from: [Highly accurate protein structure prediction with AlphaFold](https://paperswithcode.com/paper/highly-accurate-protein-structure-prediction)\r
\r
Image credit: [DeepMind](https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology)""" ;
skos:prefLabel "AlphaFold" .
:AlphaStar a skos:Concept ;
rdfs:seeAlso ;
skos:altLabel "DeepMind AlphaStar" ;
skos:definition """**AlphaStar** is a reinforcement learning agent for tackling the game of Starcraft II. It learns a policy $\\pi\\_{\\theta}\\left(a\\_{t}\\mid{s\\_{t}}, z\\right) = P\\left[a\\_{t}\\mid{s\\_{t}}, z\\right]$ using a neural network for parameters $\\theta$ that receives observations $s\\_{t} = \\left(o\\_{1:t}, a\\_{1:t-1}\\right)$ as inputs and chooses actions as outputs. Additionally, the policy conditions on a statistic $z$ that summarizes a strategy sampled from human data such as a build order [1].\r
\r
AlphaStar combines several architectural components to incorporate different types of features. Observations of player and enemy units are processed with a [Transformer](https://paperswithcode.com/method/transformer). Scatter connections are used to integrate spatial and non-spatial information. The temporal sequence of observations is processed by a core [LSTM](https://paperswithcode.com/method/lstm). Minimap features are extracted with a Residual Network. To manage the combinatorial action space, the agent uses an autoregressive policy and a recurrent [pointer network](https://paperswithcode.com/method/pointer-net).
\r
The agent is trained first with supervised learning from human replays. Parameters are subsequently trained using reinforcement learning that maximizes the win rate against opponents. The RL algorithm is based on a policy-gradient algorithm similar to actor-critic, with updates performed asynchronously and off-policy. To deal with this, a combination of $TD\\left(\\lambda\\right)$ and [V-trace](https://paperswithcode.com/method/v-trace) is used, as well as a new self-imitation algorithm (UPGO).
\r
Lastly, to address game-theoretic challenges, AlphaStar is trained with league training to try to approximate a fictitious self-play (FSP) setting which avoids cycles by computing a best response against a uniform mixture of all previous policies. The league of potential opponents includes a diverse range of agents, including policies from current and previous agents.\r
\r
Image Credit: [Yekun Chai](https://ychai.uk/notes/2019/07/21/RL/DRL/Decipher-AlphaStar-on-StarCraft-II/)\r
\r
#### References\r
1. Chai, Yekun. "AlphaStar: Grandmaster level in StarCraft II Explained." (2019). [https://ychai.uk/notes/2019/07/21/RL/DRL/Decipher-AlphaStar-on-StarCraft-II/](https://ychai.uk/notes/2019/07/21/RL/DRL/Decipher-AlphaStar-on-StarCraft-II/)\r
\r
#### Code Implementation\r
1. https://github.com/opendilab/DI-star""" ;
skos:prefLabel "AlphaStar" .
:AlphaZero a skos:Concept ;
dcterms:source ;
skos:definition "**AlphaZero** is a reinforcement learning agent for playing board games such as Go, chess, and shogi. " ;
skos:prefLabel "AlphaZero" .
:AltCLIP a skos:Concept ;
dcterms:source ;
skos:definition "**AltCLIP** is a bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, its text encoder is switched for the pretrained multilingual text encoder XLM-R, and languages and image representations are aligned by a two-stage training schema consisting of teacher learning and contrastive learning. AltCLIP sets new state-of-the-art performance on tasks including ImageNet-CN, Flickr30k-CN, and COCO-CN, while remaining very close to CLIP on almost all other tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding. Models and code are available at https://github.com/FlagAI-Open/FlagAI." ;
skos:prefLabel "AltCLIP" .
:AltDiffusion a skos:Concept ;
dcterms:source ;
skos:definition "**AltDiffusion** is a bilingual text-to-image diffusion model built on the bilingual representation model **AltCLIP**: it pairs a diffusion image generator with AltCLIP's multilingual text encoder, extending text-to-image generation beyond English. Models and code are available at https://github.com/FlagAI-Open/FlagAI." ;
skos:prefLabel "AltDiffusion" .
:AlterNet a skos:Concept ;
dcterms:source ;
skos:definition "" ;
skos:prefLabel "AlterNet" .
:AmoebaNet a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition "**AmoebaNet** is a convolutional neural network found through regularized evolution architecture search. The search space is NASNet, which specifies a space of image classifiers with a fixed outer structure: a feed-forward stack of [Inception-like modules](https://paperswithcode.com/method/inception-module) called cells. The discovered architecture is shown to the right." ;
skos:prefLabel "AmoebaNet" .
:AnnealingSNNL a skos:Concept ;
dcterms:source ;
skos:altLabel "Soft Nearest Neighbor Loss with Annealing Temperature" ;
skos:definition "" ;
skos:prefLabel "Annealing SNNL" .
:Anti-AliasDownsampling a skos:Concept ;
dcterms:source ;
skos:definition "**Anti-Alias Downsampling (AA)** aims to improve the shift-equivariance of deep networks. Max-pooling is inherently composed of two operations: the first densely evaluates the max operator, and the second is naive subsampling. AA inserts a low-pass filter between them to achieve practical anti-aliasing in any existing strided layer, such as a strided [convolution](https://paperswithcode.com/method/convolution). The smoothing factor can be adjusted by changing the blur kernel filter size, where a larger filter size results in increased blur." ;
skos:prefLabel "Anti-Alias Downsampling" .
:AnycostGAN a skos:Concept ;
dcterms:source ;
skos:definition "**Anycost GAN** is a type of generative adversarial network for image synthesis and editing. Given an input image, we project it into the latent space with encoder $E$ and backward optimization. We can modify the latent code with user input to edit the image. During editing, a sub-generator of small cost is used for fast and interactive preview; during idle time, the full cost generator renders the final, high-quality output. The outputs from the full and sub-generators are visually consistent during projection and editing." ;
skos:prefLabel "Anycost GAN" .
:Ape-X a skos:Concept ;
dcterms:source ;
skos:definition """**Ape-X** is a distributed architecture for deep reinforcement learning. The algorithm decouples acting from learning: the actors interact with their own instances of the environment by selecting actions according to a shared neural network, and accumulate the resulting experience in a shared [experience replay](https://paperswithcode.com/method/experience-replay) memory; the learner replays samples of experience and updates the neural network. The architecture relies on [prioritized experience replay](https://paperswithcode.com/method/prioritized-experience-replay) to focus only on the most significant data generated by the actors.\r
\r
In contrast to Gorila, Ape-X uses a shared, centralized replay memory and, instead of sampling uniformly, prioritizes samples so that the most useful data is drawn more often. All communication with the centralized replay is batched, increasing efficiency and throughput at the cost of some latency.
By learning off-policy, Ape-X can combine data from many distributed actors: giving the different actors different exploration policies broadens the diversity of the experience they jointly encounter.""" ;
skos:prefLabel "Ape-X" .
:Ape-XDPG a skos:Concept ;
dcterms:source ;
skos:definition "**Ape-X DPG** combines [DDPG](https://paperswithcode.com/method/ddpg) with distributed [prioritized experience replay](https://paperswithcode.com/method/prioritized-experience-replay) through the [Ape-X](https://paperswithcode.com/method/ape-x) architecture." ;
skos:prefLabel "Ape-X DPG" .
:Ape-XDQN a skos:Concept ;
dcterms:source ;
skos:definition "**Ape-X DQN** is a variant of a [DQN](https://paperswithcode.com/method/dqn) with some components of [Rainbow-DQN](https://paperswithcode.com/method/rainbow-dqn) that utilizes distributed [prioritized experience replay](https://paperswithcode.com/method/prioritized-experience-replay) through the [Ape-X](https://paperswithcode.com/method/ape-x) architecture." ;
skos:prefLabel "Ape-X DQN" .
:Apollo a skos:Concept ;
dcterms:source ;
skos:altLabel "Adaptive Parameter-wise Diagonal Quasi-Newton Method" ;
skos:definition "" ;
skos:prefLabel "Apollo" .
:ArcFace a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:altLabel "Additive Angular Margin Loss" ;
skos:definition """**ArcFace**, or **Additive Angular Margin Loss**, is a loss function used in face recognition tasks. The [softmax](https://paperswithcode.com/method/softmax) is traditionally used in these tasks. However, the softmax loss function does not explicitly optimise the feature embeddings to enforce higher similarity for intra-class samples and diversity for inter-class samples, which results in a performance gap for deep face recognition under large intra-class appearance variations.
\r
The ArcFace loss transforms the logits $W^{T}\\_{j}x\\_{i} = || W\\_{j} || \\text{ } || x\\_{i} || \\cos\\theta\\_{j}$,\r
where $\\theta\\_{j}$ is the angle between the weight $W\\_{j}$ and the feature $x\\_{i}$. The individual weight $ || W\\_{j} || = 1$ is fixed by $l\\_{2}$ normalization. The embedding feature $ ||x\\_{i} ||$ is fixed by $l\\_{2}$ normalization and re-scaled to $s$. The normalisation step on features and weights makes the predictions only depend on the angle between the feature and the weight. The learned embedding\r
features are thus distributed on a hypersphere with a radius of $s$. Finally, an additive angular margin penalty $m$ is added between $x\\_{i}$ and $W\\_{y\\_{i}}$ to simultaneously enhance the intra-class compactness and inter-class discrepancy. Since the proposed additive angular margin penalty is\r
equal to the geodesic distance margin penalty in the normalised hypersphere, the method is named ArcFace:\r
\r
$$ L\\_{3} = -\\frac{1}{N}\\sum^{N}\\_{i=1}\\log\\frac{e^{s\\left(\\cos\\left(\\theta\\_{y\\_{i}} + m\\right)\\right)}}{e^{s\\left(\\cos\\left(\\theta\\_{y\\_{i}} + m\\right)\\right)} + \\sum^{n}\\_{j=1, j \\neq y\\_{i}}e^{s\\cos\\theta\\_{j}}} $$\r
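As a concrete sketch, the loss above can be written in NumPy (an illustrative sketch, not the authors' reference implementation; the helper names `arcface_logits` and `arcface_loss` and the defaults $s=64$, $m=0.5$ follow common practice and are assumptions):

```python
import numpy as np

def arcface_logits(x, W, y, s=64.0, m=0.5):
    # x: (N, d) embeddings; W: (d, n) class weight vectors; y: (N,) labels.
    # l2-normalise features and weights so the logits are pure cosines.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    W = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos = np.clip(x @ W, -1.0, 1.0)
    theta = np.arccos(cos)
    rows = np.arange(len(y))
    logits = cos.copy()
    # additive angular margin m on the target class only
    logits[rows, y] = np.cos(theta[rows, y] + m)
    return s * logits

def arcface_loss(x, W, y, s=64.0, m=0.5):
    # numerically stable softmax cross-entropy over the margin-adjusted logits
    z = arcface_logits(x, W, y, s, m)
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()
```

Because the margin shrinks the target logit, the ArcFace loss is always at least as large as the plain (scaled) softmax loss obtained with $m=0$.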
\r
The authors select face images from 8 different identities containing enough samples (around 1,500 images/class) to train 2-D feature embedding networks with the softmax and ArcFace loss, respectively. As the Figure shows, the softmax loss provides roughly separable feature embedding\r
but produces noticeable ambiguity in decision boundaries, while the proposed ArcFace loss can obviously enforce a more evident gap between the nearest classes.\r
\r
Other alternatives to enforce intra-class compactness and inter-class distance include [Supervised Contrastive Learning](https://arxiv.org/abs/2004.11362).""" ;
skos:prefLabel "ArcFace" .
:Assemble-ResNet a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition "**Assemble-ResNet** is a modification to the [ResNet](https://paperswithcode.com/method/resnet) architecture with several tweaks including using [ResNet-D](https://paperswithcode.com/method/resnet-d), channel attention, [anti-alias downsampling](https://paperswithcode.com/method/anti-alias-downsampling), and Big Little Networks." ;
skos:prefLabel "Assemble-ResNet" .
:AssociativeLSTM a skos:Concept ;
dcterms:source ;
skos:definition """An **Associative LSTM** combines an [LSTM](https://paperswithcode.com/method/lstm) with ideas from Holographic Reduced Representations (HRRs) to enable key-value storage of data. HRRs use a “binding” operator to implement key-value\r
binding between two vectors (the key and its associated content). They natively implement associative arrays; as a byproduct, they can also easily implement stacks, queues, or lists.""" ;
skos:prefLabel "Associative LSTM" .
:AsynchronousInteractionAggregation a skos:Concept ;
dcterms:source ;
skos:definition "**Asynchronous Interaction Aggregation**, or **AIA**, is a network that leverages different interactions to boost action detection. There are two key designs in it: one is the Interaction Aggregation structure (IA) adopting a uniform paradigm to model and integrate multiple types of interaction; the other is the Asynchronous Memory Update algorithm (AMU) that enables us to achieve better performance by modeling very long-term interaction dynamically." ;
skos:prefLabel "Asynchronous Interaction Aggregation" .
:AttLWB a skos:Concept ;
dcterms:source ;
skos:altLabel "Attentional Liquid Warping Block" ;
skos:definition "**Attentional Liquid Warping Block**, or **AttLWB**, is a module for human image synthesis GANs that propagates the source information - such as texture, style, color and face identity - in both image and feature spaces to the synthesized reference. It firstly learns similarities of the global features among all multiple sources features, and then it fuses the multiple sources features by a linear combination of the learned similarities and the multiple sources in the feature spaces. Finally, to better propagate the source identity (style, color, and texture) into the global stream, the fused source features are warped to the global stream by [Spatially-Adaptive Normalization](https://paperswithcode.com/method/spade) (SPADE)." ;
skos:prefLabel "AttLWB" .
:Attention-augmentedConvolution a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**Attention-augmented Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) with a two-dimensional relative self-attention mechanism that can replace convolutions as a stand-alone computational primitive for image classification. It employs [scaled dot-product attention](https://paperswithcode.com/method/scaled) and [multi-head attention](https://paperswithcode.com/method/multi-head-attention) as with [Transformers](https://paperswithcode.com/method/transformer).
\r
It works by concatenating convolutional and attentional feature maps. To see this, consider an original convolution operator with kernel size $k$, $F\\_{in}$ input filters and $F\\_{out}$ output filters. The corresponding attention augmented convolution can be written as
\r
$$\\text{AAConv}\\left(X\\right) = \\text{Concat}\\left[\\text{Conv}(X), \\text{MHA}(X)\\right] $$\r
\r
$X$ originates from an input tensor of shape $\\left(H, W, F\\_{in}\\right)$. This is flattened to become $X \\in \\mathbb{R}^{HW \\times F\\_{in}}$ which is passed into a multi-head attention module, as well as a convolution (see above).\r
\r
Similarly to the convolution, the attention augmented convolution 1) is equivariant to translation and 2) can readily operate on inputs of different spatial dimensions.""" ;
skos:prefLabel "Attention-augmented Convolution" .
:AttentionDropout a skos:Concept ;
rdfs:seeAlso ;
skos:definition """**Attention Dropout** is a type of [dropout](https://paperswithcode.com/method/dropout) used in attention-based architectures, where elements are randomly dropped out of the [softmax](https://paperswithcode.com/method/softmax) in the attention equation. For example, for scaled-dot product attention, we would drop elements from the first term:\r
\r
$$ {\\text{Attention}}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^{T}}{\\sqrt{d_k}}\\right)V $$""" ;
skos:prefLabel "Attention Dropout" .
:AttentionFeatureFilters a skos:Concept ;
dcterms:source ;
skos:definition "An attention mechanism for content-based filtering of multi-level features. For example, recurrent features obtained by forward and backward passes of a bidirectional RNN block can be combined using attention feature filters, with unprocessed input features/embeddings as queries and recurrent features as keys/values." ;
skos:prefLabel "Attention Feature Filters" .
:AttentionFreeTransformer a skos:Concept ;
dcterms:source ;
skos:definition """**Attention Free Transformer**, or **AFT**, is an efficient variant of a [multi-head attention module](https://paperswithcode.com/method/multi-head-attention) that eschews [dot-product self-attention](https://paperswithcode.com/method/scaled). In an AFT layer, the key and value are first combined with a set of learned position biases, and the result is multiplied with the query in an element-wise fashion. This new operation has memory complexity linear w.r.t. both the context size and the dimension of features, making it compatible with both large inputs and large model sizes.
\r
Given the input $X$, AFT first linearly transforms it into $Q=X W^{Q}, K=X W^{K}, V=X W^{V}$, then performs the following operation:
\r
$$\r
Y=f(X) ; Y\\_{t}=\\sigma\\_{q}\\left(Q\\_{t}\\right) \\odot \\frac{\\sum\\_{t^{\\prime}=1}^{T} \\exp \\left(K\\_{t^{\\prime}}+w\\_{t, t^{\\prime}}\\right) \\odot V\\_{t^{\\prime}}}{\\sum\\_{t^{\\prime}=1}^{T} \\exp \\left(K\\_{t^{\\prime}}+w\\_{t, t^{\\prime}}\\right)}\r
$$\r
\r
where $\\odot$ is the element-wise product; $\\sigma\\_{q}$ is the nonlinearity applied to the query with default being sigmoid; $w \\in R^{T \\times T}$ is the learned pair-wise position biases.\r
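For concreteness, the equation above can be sketched directly in NumPy (an unbatched, single-head sketch; the name `aft_full` mirrors the paper's AFT-full variant, but the code is an illustrative assumption rather than the reference implementation):

```python
import numpy as np

def aft_full(Q, K, V, w):
    # AFT-full for a single sequence: Q, K, V are (T, d); w is the (T, T)
    # matrix of learned pair-wise position biases.
    # exp(K_{t'} + w_{t,t'}) for every target t / source t' pair: (T, T, d)
    weights = np.exp(K[None, :, :] + w[:, :, None])
    num = (weights * V[None, :, :]).sum(axis=1)   # weighted values, (T, d)
    den = weights.sum(axis=1)                     # normaliser, (T, d)
    sigma_q = 1.0 / (1.0 + np.exp(-Q))            # sigmoid nonlinearity on the query
    return sigma_q * (num / den)

rng = np.random.default_rng(0)
T, d = 4, 8
Y = aft_full(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
             rng.normal(size=(T, d)), rng.normal(size=(T, T)))
```

Note that no $T \times T \times d$ attention tensor needs to be stored in an optimized implementation; the broadcast here is only for readability.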
\r
Explained in words, for each target position $t$, AFT performs a weighted average of values, the result of which is combined with the query with element-wise multiplication. In particular, the weighting is simply composed of the keys and a set of learned pair-wise position biases. This provides the immediate advantage of not needing to compute and store the expensive attention matrix, while maintaining the global interactions between query and values as MHA does.""" ;
skos:prefLabel "Attention Free Transformer" .
:AttentionGate a skos:Concept ;
dcterms:source ;
skos:definition """An **Attention Gate** focuses on targeted regions while suppressing feature activations in irrelevant regions.
Given the input feature map $X$ and the gating signal $G\\in \\mathbb{R}^{C'\\times H\\times W}$ which is collected at a coarse scale and contains contextual information, the attention gate uses additive attention to obtain the gating coefficient. Both the input $X$ and the gating signal are first linearly mapped to an $\\mathbb{R}^{F\\times H\\times W}$ dimensional space, and then the output is squeezed in the channel domain to produce a spatial attention weight map $ S \\in \\mathbb{R}^{1\\times H\\times W}$. The overall process can be written as\r
\\begin{align}\r
S &= \\sigma(\\varphi(\\delta(\\phi_x(X)+\\phi_g(G))))\r
\\end{align}\r
\\begin{align}\r
Y &= S X\r
\\end{align}\r
where $\\varphi$, $\\phi_x$ and $\\phi_g$ are linear transformations implemented as $1\\times 1$ convolutions. \r
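A minimal NumPy sketch of the two equations above, assuming $\\delta$ is a ReLU and $\\sigma$ a sigmoid (as in the original Attention U-Net) and implementing the $1\\times 1$ convolutions as channel-mixing matrices (`Wx`, `Wg`, `Wpsi` are hypothetical parameter names):

```python
import numpy as np

def attention_gate(X, G, Wx, Wg, Wpsi):
    # X: (C, H, W) input features; G: (Cg, H, W) gating signal.
    # Wx: (F, C), Wg: (F, Cg), Wpsi: (F,) play the role of 1x1 convolutions.
    relu = lambda z: np.maximum(z, 0.0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    fx = np.tensordot(Wx, X, axes=([1], [0]))      # phi_x(X): (F, H, W)
    fg = np.tensordot(Wg, G, axes=([1], [0]))      # phi_g(G): (F, H, W)
    # squeeze the channel dimension to a spatial attention map S: (H, W)
    S = sigmoid(np.tensordot(Wpsi, relu(fx + fg), axes=([0], [0])))
    return S[None, :, :] * X                       # gate every channel of X
```

Since $S$ lies in $(0, 1)$, the gate can only attenuate features, never amplify them.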
\r
Owing to its lightweight design, the attention gate substantially enhances the representational power of the model without a significant increase in computing cost or number of model parameters. It is general and modular, making it simple to use in various CNN models.""" ;
skos:prefLabel "Attention Gate" .
:AttentionMesh a skos:Concept ;
dcterms:source ;
skos:definition "**Attention Mesh** is a neural network architecture for 3D face mesh prediction that attends to semantically meaningful regions. Specifically, region-specific heads are employed that transform the feature maps with spatial transformers." ;
skos:prefLabel "Attention Mesh" .
:AttentionalLiquidWarpingGAN a skos:Concept ;
dcterms:source ;
skos:definition "**Attentional Liquid Warping GAN** is a type of generative adversarial network for human image synthesis that combines an [AttLWB](https://paperswithcode.com/method/attlwb) block with a 3D body mesh recovery module that disentangles pose and shape. To preserve the source information, such as texture, style, color, and face identity, the Attentional Liquid Warping GAN with AttLWB propagates the source information in both image and feature spaces to the synthesized reference." ;
skos:prefLabel "Attentional Liquid Warping GAN" .
:AttentiveNormalization a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition "**Attentive Normalization** generalizes the common affine transformation component in the vanilla feature normalization. Instead of learning a single affine transformation, AN learns a mixture of affine transformations and utilizes their weighted-sum as the final affine transformation applied to re-calibrate features in an instance-specific way. The weights are learned by leveraging feature attention." ;
skos:prefLabel "Attentive Normalization" .
:Attribute2Font a skos:Concept ;
dcterms:source ;
skos:definition "**Attribute2Font** is a model that automatically creates fonts by synthesizing visually pleasing glyph images according to user-specified attributes and their corresponding values. Specifically, Attribute2Font is trained to perform font style transfer between any two fonts conditioned on their attribute values. After training, the model can generate glyph images in accordance with an arbitrary set of font attribute values. A unit named Attribute Attention Module is designed to make those generated glyph images better embody the prominent font attributes. A semi-supervised learning scheme is also introduced to exploit a large number of unlabeled fonts." ;
skos:prefLabel "Attribute2Font" .
:AugMix a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition "**AugMix** mixes augmented images through linear interpolations. It is therefore similar to [Mixup](https://paperswithcode.com/method/mixup), but it mixes augmented versions of the same image rather than different images." ;
skos:prefLabel "AugMix" .
:AugmentedSBERT a skos:Concept ;
dcterms:source ;
skos:definition "**Augmented SBERT** is a data augmentation strategy for pairwise sentence scoring that uses a [BERT](https://paperswithcode.com/method/bert) cross-encoder to improve the performance of the [SBERT](https://paperswithcode.com/method/sbert) bi-encoders. Given a pre-trained, well-performing cross-encoder, we sample sentence pairs according to a certain sampling strategy and label these using the cross-encoder. We call these weakly labeled examples the silver dataset, and they are merged with the gold training dataset. We then train the bi-encoder on this extended training dataset." ;
skos:prefLabel "Augmented SBERT" .
:Auto-Classifier a skos:Concept ;
dcterms:source ;
skos:definition "" ;
skos:prefLabel "Auto-Classifier" .
:AutoAugment a skos:Concept ;
dcterms:source ;
skos:definition """**AutoAugment** is an automated approach to find data augmentation policies from data. It formulates the problem of finding the best augmentation policy as a discrete search problem. It consists of two components: a search algorithm and a search space. \r
\r
At a high level, the search algorithm (implemented as a controller RNN) samples a data augmentation policy $S$, which has information about what image processing operation to use, the probability of using the operation in each batch, and the magnitude of the operation. The policy $S$ is used to train a neural network with a fixed architecture, whose validation accuracy $R$ is sent back to update the controller. Since $R$ is not differentiable, the controller will be updated by policy gradient methods. \r
\r
The operations used are from PIL, a popular Python image library: all functions in PIL that accept an image as input and output an image. It additionally uses two other augmentation techniques: [Cutout](https://paperswithcode.com/method/cutout) and SamplePairing. The operations searched over are ShearX/Y, TranslateX/Y, Rotate, AutoContrast, Invert, Equalize, Solarize, Posterize, Contrast, Color, Brightness, Sharpness, Cutout and Sample Pairing.""" ;
skos:prefLabel "AutoAugment" .
:AutoDropout a skos:Concept ;
dcterms:source ;
skos:definition """**AutoDropout** automates the process of designing [dropout](https://paperswithcode.com/method/dropout) patterns using a [Transformer](https://paperswithcode.com/method/transformer) based controller. In this method, a controller learns to generate a dropout pattern at every channel and layer of a target network, such as a [ConvNet](https://paperswithcode.com/methods/category/convolutional-neural-networks) or a Transformer. The target network is then trained with the dropped-out pattern, and its resulting validation performance is used as a signal for the controller to learn from. The resulting pattern is applied to a convolutional output channel, which is a common building block of image recognition models.\r
\r
The controller network generates the tokens to describe the configurations of the dropout pattern. The tokens are generated like words in a language model. For every layer in a ConvNet, a group of 8 tokens needs to be generated to create a dropout pattern. These 8 tokens are generated sequentially. In the figure above, size, stride, and repeat indicate the size and the tiling of the pattern; rotate, shear_x, and shear_y specify the geometric transformations of the pattern; share_c is a binary deciding whether a pattern is applied to all $C$ channels; and residual is a binary deciding whether the pattern is applied to the residual branch as well. If we need $L$ dropout patterns, the controller will generate $8L$ decisions.""" ;
skos:prefLabel "AutoDropout" .
:AutoEncoder a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """An **Autoencoder** is a bottleneck architecture that turns a high-dimensional input into a latent low-dimensional code (encoder), and then performs a reconstruction of the input with this latent code (the decoder).\r
\r
Image: [Michael Massi](https://en.wikipedia.org/wiki/Autoencoder#/media/File:Autoencoder_schema.png)""" ;
skos:prefLabel "AutoEncoder" .
:AutoGAN a skos:Concept ;
dcterms:source ;
skos:definition "**AutoGAN** introduces [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) (NAS) to generative adversarial networks (GANs). The marriage of NAS and GANs faces unique challenges. AutoGAN defines a search space for generator architectural variations and uses an RNN controller to guide the search, with parameter sharing and dynamic-resetting to accelerate the process. The Inception score is adopted as the reward, and a multi-level search strategy is introduced to perform NAS in a progressive way." ;
skos:prefLabel "AutoGAN" .
:AutoGL a skos:Concept ;
dcterms:source ;
skos:altLabel "Automated Graph Learning" ;
skos:definition "Automated graph learning is a method that aims at discovering the best hyper-parameter and neural architecture configuration for different graph tasks/data without manual design." ;
skos:prefLabel "AutoGL" .
:AutoInt a skos:Concept ;
dcterms:source ;
skos:definition "**AutoInt** is a deep tabular learning method that models high-order feature interactions of input features. AutoInt can be applied to both numerical and categorical input features. Specifically, both the numerical and categorical features are mapped into the same low-dimensional space. Afterwards, a multi-head self-attentive neural network with residual connections is proposed to explicitly model the feature interactions in the low-dimensional space. With different layers of the multi-head self-attentive neural networks, different orders of feature combinations of input features can be modeled." ;
skos:prefLabel "AutoInt" .
:AutoML-Zero a skos:Concept ;
dcterms:source ;
skos:definition """**AutoML-Zero** is an AutoML technique that aims to search a fine-grained space simultaneously for the model, optimization procedure, initialization, and so on, permitting much less human design and even allowing the discovery of non-neural-network algorithms. It represents ML algorithms as computer programs composed of three component functions, Setup, Predict, and Learn, that perform initialization, prediction and learning. The instructions in these functions apply basic mathematical operations on a small memory. The operation and memory addresses used by each instruction are free parameters in the search space, as is the size of the component functions. While this reduces expert design, the consequent sparsity means that [random search](https://paperswithcode.com/method/random-search) cannot make enough progress. To overcome this difficulty, the authors use small proxy tasks and migration techniques to build an optimized infrastructure capable of searching through 10,000 models/second/cpu core.
\r
Evolutionary methods can find solutions in the AutoML-Zero search space despite its enormous\r
size and sparsity. The authors show that by randomly modifying the programs and periodically selecting the best performing ones on given tasks/datasets, AutoML-Zero discovers reasonable algorithms. They start from empty programs and, using data labeled by “teacher” neural networks with random weights, demonstrate that evolution can discover neural networks trained by gradient descent. Following this, they minimize bias toward known algorithms by switching to binary classification tasks extracted from CIFAR-10 and allowing a larger set of possible operations. This discovers interesting techniques like multiplicative interactions, normalized gradient and weight averaging. Finally, they show it is possible for evolution to adapt the algorithm to the type of task provided. For example, [dropout](https://paperswithcode.com/method/dropout)-like operations emerge when the task needs regularization and learning rate decay appears when the task requires faster convergence.""" ;
skos:prefLabel "AutoML-Zero" .
:AutoSmart a skos:Concept ;
dcterms:source ;
skos:definition "**AutoSmart** is an AutoML framework for temporal relational data. The framework integrates automatic data processing, table merging, feature engineering, and model tuning with a time and memory control unit." ;
skos:prefLabel "AutoSmart" .
:AutoSync a skos:Concept ;
dcterms:source ;
skos:definition "**AutoSync** is a pipeline for automatically optimizing synchronization strategies, given model structures and resource specifications, in data-parallel distributed machine learning. By factorizing the synchronization strategy with respect to each trainable building block of a DL model, we can construct a valid and large strategy space spanned by multiple factors. AutoSync efficiently navigates the space and locates the optimal strategy. AutoSync leverages domain knowledge about synchronization systems to reduce the search space, and is equipped with a domain adaptive simulator, which combines principled communication modeling and data-driven ML models, to estimate the runtime of strategy proposals without launching real distributed execution." ;
skos:prefLabel "AutoSync" .
:AutoTinyBERT a skos:Concept ;
dcterms:source ;
skos:definition "**AutoTinyBERT** is an efficient [BERT](https://paperswithcode.com/method/bert) variant found through neural architecture search. Specifically, one-shot learning is used to obtain a big Super Pretrained Language Model (SuperPLM), where the objectives of pre-training or task-agnostic BERT distillation are used. Then, given a specific latency constraint, an evolutionary algorithm is run on the SuperPLM to search for optimal architectures. Finally, we extract the corresponding sub-models based on the optimal architectures and further train these models." ;
skos:prefLabel "AutoTinyBERT" .
:AuxiliaryBatchNormalization a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition "**Auxiliary Batch Normalization** is a type of regularization used in adversarial training schemes. The idea is that adversarial examples should have separate [batch normalization](https://paperswithcode.com/method/batch-normalization) components from the clean examples, as they have different underlying statistics." ;
skos:prefLabel "Auxiliary Batch Normalization" .
:AuxiliaryClassifier a skos:Concept ;
skos:definition "**Auxiliary Classifiers** are a type of architectural component that seeks to improve the convergence of very deep networks. They are classifier heads attached to layers before the end of the network. The motivation is to push useful gradients to the lower layers, making them immediately useful and improving convergence during training by combating the vanishing gradient problem. They are notably used in the Inception family of convolutional neural networks." ;
skos:prefLabel "Auxiliary Classifier" .
:AveragePooling a skos:Concept ;
skos:definition """**Average Pooling** is a pooling operation that calculates the average value for patches of a feature map, and uses it to create a downsampled (pooled) feature map. It is usually used after a convolutional layer. It adds a small amount of translation invariance - meaning translating the image by a small amount does not significantly affect the values of most pooled outputs. It extracts features more smoothly than [Max Pooling](https://paperswithcode.com/method/max-pooling), whereas max pooling extracts more pronounced features like edges.\r
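For the common non-overlapping case, the operation reduces to a reshape-and-mean; a minimal NumPy sketch (the helper name `average_pool2d` is an assumption, not a library function):

```python
import numpy as np

def average_pool2d(x, k):
    # Non-overlapping k x k average pooling over a 2-D feature map.
    H, W = x.shape
    assert H % k == 0 and W % k == 0, "spatial dims must be divisible by k"
    # Group the map into k x k blocks, then average within each block.
    return x.reshape(H // k, k, W // k, k).mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
pooled = average_pool2d(x, 2)   # -> [[2.5, 4.5], [10.5, 12.5]]
```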
\r
Image Source: [here](https://www.researchgate.net/figure/Illustration-of-Max-Pooling-and-Average-Pooling-Figure-2-above-shows-an-example-of-max_fig2_333593451)""" ;
skos:prefLabel "Average Pooling" .
:AxialAttention a skos:Concept ;
dcterms:source ;
skos:definition """**Axial Attention** is a simple generalization of self-attention that naturally aligns with the multiple dimensions of the tensors in both the encoding and the decoding settings. It was first proposed as criss-cross attention in [CCNet](https://paperswithcode.com/method/ccnet) [1], which harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture the full-image dependencies. Ho et al. [2] extend CCNet to process multi-dimensional data. The proposed structure of the layers allows the vast majority of the context to be computed in parallel during decoding without introducing any independence assumptions. It serves as the basic building block for developing self-attention-based autoregressive models for high-dimensional data tensors, e.g., Axial Transformers. It has been applied in [AlphaFold](https://paperswithcode.com/method/alphafold) [3] for interpreting protein sequences.
\r
[1] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, Wenyu Liu. CCNet: Criss-Cross Attention for Semantic Segmentation. ICCV, 2019.\r
\r
[2] Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, Tim Salimans. Axial Attention in Multidimensional Transformers. arXiv:1912.12180
\r
[3] Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Jul 15:1-1.""" ;
skos:prefLabel "Axial Attention" .
:BAGUA a skos:Concept ;
dcterms:source ;
skos:definition "**BAGUA** is a communication framework whose design goal is to provide a system abstraction that is both flexible and modular to support state-of-the-art system relaxation techniques of distributed training. The abstraction goes beyond parameter server and Allreduce paradigms, and provides a collection of MPI-style collective operations to facilitate communications with different precision and centralization strategies." ;
skos:prefLabel "BAGUA" .
:BAM a skos:Concept ;
dcterms:source ;
skos:altLabel "Bottleneck Attention Module" ;
skos:definition """Park et al. proposed the **Bottleneck Attention Module (BAM)** to efficiently improve the representational capability of networks. It uses dilated convolutions to enlarge the receptive field of the spatial attention sub-module, and builds a bottleneck structure, as suggested by ResNet, to save computational cost.
\r
For a given input feature map $X$, BAM infers the channel attention $s_c \\in \\mathbb{R}^C$ and spatial attention $s_s\\in \\mathbb{R}^{H\\times W}$ in two parallel streams, then sums the two attention maps after resizing both branch outputs to $\\mathbb{R}^{C\\times H \\times W}$. The channel attention branch, like an SE block, applies global average pooling to the feature map to aggregate global information, and then uses an MLP with channel dimensionality reduction. In order to utilize contextual information effectively, the spatial attention branch combines a bottleneck structure and dilated convolutions. Overall, BAM can be written as\r
\\begin{align}\r
s_c &= \\text{BN}(W_2(W_1\\text{GAP}(X)+b_1)+b_2)\r
\\end{align}\r
\r
\\begin{align}\r
s_s &= \\text{BN}(Conv_2^{1 \\times 1}(DC_2^{3\\times 3}(DC_1^{3 \\times 3}(Conv_1^{1 \\times 1}(X)))))
\\end{align}\r
\\begin{align}\r
s &= \\sigma(\\text{Expand}(s_s)+\\text{Expand}(s_c)) \r
\\end{align}\r
\\begin{align}\r
Y &= s X+X\r
\\end{align}\r
where $W_i$, $b_i$ denote weights and biases of fully connected layers respectively, $Conv_{1}^{1\\times 1}$ and $Conv_{2}^{1\\times 1}$ are convolution layers used for channel reduction. $DC_i^{3\\times 3}$ denotes a dilated convolution with $3\\times 3$ kernel, applied to utilize contextual information effectively. $\\text{Expand}$ expands the attention maps $s_s$ and $s_c$ to $\\mathbb{R}^{C\\times H\\times W}$.\r
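Under simplifying assumptions (BatchNorm omitted, the dilated-conv bottleneck of the spatial branch reduced to a single channel-mixing step, and hypothetical parameter names `W1`, `b1`, `W2`, `b2`, `w_s`), the combination logic of the four equations can be sketched in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bam(X, W1, b1, W2, b2, w_s):
    # X: (C, H, W) feature map.
    # Channel branch: global average pooling -> bottleneck MLP (C -> C/r -> C).
    C, H, Wd = X.shape
    gap = X.reshape(C, -1).mean(axis=1)              # (C,)
    s_c = W2 @ (W1 @ gap + b1) + b2                  # (C,)  (BatchNorm omitted)
    # Spatial branch: a 1x1 channel-mixing conv standing in for the paper's
    # dilated-conv bottleneck, producing a (H, W) map.
    s_s = np.tensordot(w_s, X, axes=([0], [0]))
    # Broadcast both maps to (C, H, W), combine, and apply the residual.
    s = sigmoid(s_s[None, :, :] + s_c[:, None, None])
    return s * X + X
```

With all parameters at zero, the gate is sigmoid(0) = 0.5, so the block reduces to 1.5X; a handy sanity check for the residual path.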
\r
BAM can emphasize or suppress features in both the spatial and channel dimensions, improving representational power. The dimensionality reduction applied to both attention branches enables it to be integrated with any convolutional neural network at little extra computational cost. However, although dilated convolutions enlarge the receptive field effectively, BAM still fails to capture long-range contextual information or to encode cross-domain relationships.""" ;
skos:prefLabel "BAM" .
:BART a skos:Concept ;
dcterms:source ;
skos:definition "**BART** is a [denoising autoencoder](https://paperswithcode.com/method/denoising-autoencoder) for pretraining sequence-to-sequence models. It is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard [Transformer](https://paperswithcode.com/method/transformer)-based seq2seq/NMT architecture with a bidirectional encoder (like [BERT](https://paperswithcode.com/method/bert)) and a left-to-right decoder (like [GPT](https://paperswithcode.com/method/gpt)). This means the encoder's attention mask is fully visible, like BERT, and the decoder's attention mask is causal, like [GPT2](https://paperswithcode.com/method/gpt-2)." ;
skos:prefLabel "BART" .
:BASE a skos:Concept ;
dcterms:source ;
skos:altLabel "Balanced Selection" ;
skos:definition "" ;
skos:prefLabel "BASE" .
:BASNet a skos:Concept ;
dcterms:source ;
skos:altLabel "Boundary-Aware Segmentation Network" ;
skos:definition """**BASNet**, or **Boundary-Aware Segmentation Network**, is an image segmentation architecture comprising a predict-refine architecture and a hybrid loss, for highly accurate image segmentation. The predict-refine architecture consists of a densely supervised encoder-decoder network and a residual refinement module, which are respectively used to predict and refine a segmentation probability map. The hybrid loss is a combination of the binary cross entropy, structural similarity and intersection-over-union losses, which guide the network to learn three-level (i.e., pixel-, patch- and map-level) hierarchy representations.""" ;
skos:prefLabel "BASNet" .
:BERT a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**BERT**, or Bidirectional Encoder Representations from Transformers, improves upon standard [Transformers](http://paperswithcode.com/method/transformer) by removing the unidirectionality constraint by using a *masked language model* (MLM) pre-training objective. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, BERT uses a *next sentence prediction* task that jointly pre-trains text-pair representations. \r
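
The MLM corruption can be sketched as follows. In the BERT paper, selected tokens are replaced with `[MASK]` 80% of the time, with a random token 10% of the time, and left unchanged 10% of the time; the toy vocabulary here is purely illustrative:

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["the", "cat", "dog", "sat", "on", "mat"]  # hypothetical vocabulary

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    # Select ~15% of positions; labels[i] holds the original token at
    # selected positions (the prediction targets), and None elsewhere.
    rng = rng or random.Random(0)
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                out[i] = MASK               # 80%: mask
            elif r < 0.9:
                out[i] = rng.choice(TOY_VOCAB)  # 10%: random token
            # remaining 10%: keep the original token
    return out, labels
```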
\r
There are two steps in BERT: *pre-training* and *fine-tuning*. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they\r
are initialized with the same pre-trained parameters.""" ;
skos:prefLabel "BERT" .
:BIDeN a skos:Concept ;
dcterms:source ;
skos:altLabel "Blind Image Decomposition Network" ;
skos:definition """**BIDeN**, or **Blind Image Decomposition Network**, is a model for blind image decomposition, which requires separating a superimposed image into constituent underlying images in a blind setting, that is, both the source components involved in mixing as well as the mixing mechanism are unknown. For example, rain may consist of multiple components, such as rain streaks, raindrops, snow, and haze. \r
\r
The Figure shows an example where $N = 4, L = 2, x = {a, b, c, d}$, and $I = {1, 3}$. $a, c$ are selected then passed to the mixing function $f$, and outputs the mixed input image $z$, which is $f\\left(a, c\\right)$ here. The generator consists of an encoder $E$ with three branches and multiple heads $H$. $\\bigotimes$ denotes the concatenation operation. Depth and receptive field of each branch is different to capture multiple scales of features. Each specified head points to the corresponding source component, and the number of heads varies with the maximum number of source components N. All reconstructed images $\\left(a', c'\\right)$ and their corresponding real images $\\left(a, c\\right)$ are sent to an unconditional discriminator. The discriminator also predicts the source components of the input image $z$. The outputs from other heads $\\left(b', d'\\right)$ do not contribute to the optimization.""" ;
skos:prefLabel "BIDeN" .
:BIMAN a skos:Concept ;
dcterms:source ;
skos:definition "**BIMAN**, or **Bot Identification by commit Message, commit Association, and author Name**, is a technique to detect bots that commit code. It comprises three methods that consider independent aspects of the commits made by a particular author: 1) Commit Message: Identify if commit messages are being generated from templates; 2) Commit Association: Predict if an author is a bot using a random forest model, with features related to files and projects associated with the commits as predictors; and 3) Author Name: Match the author’s name and email to common bot patterns." ;
skos:prefLabel "BIMAN" .
:BLANC a skos:Concept ;
dcterms:source ;
skos:definition "**BLANC** is an automatic estimation approach for document summary quality. The goal is to measure the functional performance of a summary with an objective, reproducible, and fully automated method. BLANC achieves this by measuring the performance boost gained by a pre-trained language model with access to a document summary while carrying out its language understanding task on the document's text." ;
skos:prefLabel "BLANC" .
:BLIP a skos:Concept ;
dcterms:source ;
skos:altLabel "BLIP: Bootstrapping Language-Image Pre-training" ;
skos:definition "**BLIP**, or **Bootstrapping Language-Image Pre-training**, is a Vision-Language Pre-training (VLP) framework that transfers flexibly to both vision-language understanding and generation tasks, whereas most prior pre-trained models excel at only one of the two. BLIP effectively utilizes noisy web image-text pairs by bootstrapping the captions: a captioner generates synthetic captions and a filter removes the noisy ones. It achieves state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score), and also demonstrates strong generalization when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released at https://github.com/salesforce/BLIP." ;
skos:prefLabel "BLIP" .
:BLOOM a skos:Concept ;
dcterms:source ;
skos:definition """**BLOOM** is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of\r
sources in 46 natural and 13 programming languages (59 in total).""" ;
skos:prefLabel "BLOOM" .
:BLOOMZ a skos:Concept ;
dcterms:source ;
skos:definition "**BLOOMZ** is a multitask prompted finetuning (MTF) variant of BLOOM." ;
skos:prefLabel "BLOOMZ" .
:BP-Transformer a skos:Concept ;
dcterms:source ;
skos:definition """The **BP-Transformer (BPT)** is a type of [Transformer](https://paperswithcode.com/method/transformer) that is motivated by the need to find a better balance between capability and computational complexity for self-attention. The architecture partitions the input sequence into different multi-scale spans via binary partitioning (BP). It incorporates an inductive bias of attending the context information from fine-grain to coarse-grain as the relative distance increases. The farther the context information is, the coarser its representation is.\r
BPT can be regarded as a graph neural network whose nodes are the multi-scale spans. A token node can attend the smaller-scale span for the closer context and the larger-scale span for the longer-distance context. The representations of nodes are updated with [Graph Self-Attention](https://paperswithcode.com/method/graph-self-attention).""" ;
skos:prefLabel "BP-Transformer" .
:BPE a skos:Concept ;
dcterms:source ;
skos:altLabel "Byte Pair Encoding" ;
skos:definition """**Byte Pair Encoding**, or **BPE**, is a subword segmentation algorithm that encodes rare and unknown words as sequences of subword units. The intuition is that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations).\r
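
The merge-learning loop can be sketched as follows, operating on a word-frequency dictionary whose keys are space-separated symbol sequences with `</w>` marking word ends (names and the toy corpus are illustrative):

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Merge symbol-by-symbol so matches never span symbol boundaries.
    new_vocab = {}
    for word, freq in vocab.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[" ".join(out)] = freq
    return new_vocab

def learn_bpe(vocab, num_merges):
    # Repeatedly merge the most frequent adjacent pair into a new symbol.
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
    return vocab
```

For example, starting from `{"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}`, early merges produce subword units such as `es`, `est` and `low`.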
\r
[Lei Mao](https://leimao.github.io/blog/Byte-Pair-Encoding/) has a detailed blog post that explains how this works.""" ;
skos:prefLabel "BPE" .
:BRepNet a skos:Concept ;
dcterms:source ;
skos:definition "**BRepNet** is a neural network for CAD applications. It is designed to operate directly on B-rep data structures, avoiding the need to approximate the model as meshes or point clouds. BRepNet defines convolutional kernels with respect to oriented coedges in the data structure. In the neighborhood of each coedge, a small collection of faces, edges and coedges can be identified and patterns in the feature vectors from these entities detected by specific learnable parameters." ;
skos:prefLabel "BRepNet" .
:BS-Net a skos:Concept ;
dcterms:source ;
skos:definition """**BS-Net** is an architecture for COVID-19 severity prediction based on clinical data from different modalities. The architecture comprises 1) a shared multi-task feature extraction backbone, 2) a lung segmentation branch, 3) an original registration mechanism that acts as a ”multi-resolution feature alignment” block operating on the encoding backbone, and 4) a multi-regional classification part for the final six-valued score estimation. 

All these blocks act together in the final training thanks to a loss specifically created for this task. This loss, which comprises a differentiable version of the target discrete metric, also guarantees performance robustness. The learning phase operates in a weakly-supervised fashion, because difficulties and pitfalls in the visual interpretation of the disease signs on CXRs (spanning from subtle findings to heavy lung impairment), together with the lack of detailed localization information, produce unavoidable inter-rater variability among radiologists in assigning scores.
\r
Specifically the architectural details are:\r
\r
- The input image is processed with a convolutional backbone; the authors opt for a [ResNet](https://paperswithcode.com/method/resnet)-18.\r
- Segmentation is performed by a nested version of [U-Net](https://paperswithcode.com/method/u-net) (U-Net++).\r
- Alignment is estimated from the segmentation probability map produced by the U-Net++ decoder, through a [spatial transformer network](https://paperswithcode.com/method/spatial-transformer) able to estimate the spatial transform matrix needed to center, rotate, and correctly zoom the lungs. After alignment at various scales, features are forwarded to a [ROIPool](https://paperswithcode.com/method/roi-pooling). 
- The alignment block is pre-trained on the synthetic alignment dataset in a weakly-supervised setting, using a Dice loss.\r
- The scoring head uses [FPNs](https://paperswithcode.com/method/fpn) for the combination of multi-scale feature maps. The multiresolution feature aligner produces input feature maps that are well focused on the specific area of interest. Eventually, the output of the FPN layer flows in a series of convolutional blocks to retrieve the output map. The classification is performed by a final [Global Average Pooling](https://paperswithcode.com/method/global-average-pooling) layer and a [SoftMax](https://paperswithcode.com/method/softmax) activation.\r
- The Loss function used for training is a sparse categorical cross entropy (SCCE) with a (differentiable) mean absolute error contribution.""" ;
skos:prefLabel "BS-Net" .
:BTF a skos:Concept ;
dcterms:source ;
skos:altLabel "Back to the Feature" ;
skos:definition "" ;
skos:prefLabel "BTF" .
:BTmPG a skos:Concept ;
dcterms:source ;
skos:definition """**BTmPG**, or **Back-Translation guided multi-round Paraphrase Generation**, is a multi-round paraphrase generation method that leverages back-translation to guide the paraphrase model during training and generates paraphrases in a multi-round process. The model regards paraphrase generation as a monolingual translation task. Given a paraphrase pair $\\left(S\\_{0}, P\\right)$, where $S\\_{0}$ is the original/source sentence and $P$ is the target paraphrase given in the dataset: in the first round of generation, we send $S\\_{0}$ into a paraphrase model to generate a paraphrase $S\\_{1}$. In the second round, we use $S\\_{1}$ as the input of the model to generate a new paraphrase $S\\_{2}$. And so forth: in the $i$-th round, we send $S\\_{i-1}$ into the paraphrase model to generate $S\\_{i}$.
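
The multi-round loop itself is straightforward; a sketch with a stand-in for the trained paraphrase model:

```python
def multi_round_paraphrase(s0, paraphrase_model, rounds=3):
    # S_i = model(S_{i-1}): each round feeds the previous paraphrase back in.
    s, history = s0, []
    for _ in range(rounds):
        s = paraphrase_model(s)
        history.append(s)
    return history
```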
""" ;
skos:prefLabel "BTmPG" .
:BYOL a skos:Concept ;
dcterms:source ;
skos:altLabel "Bootstrap Your Own Latent" ;
skos:definition """BYOL (Bootstrap Your Own Latent) is a new approach to self-supervised learning. BYOL’s goal is to learn a representation $y_θ$ which can then be used for downstream tasks. BYOL uses two neural networks to learn: the online and target networks. The online network is defined by a set of weights $θ$ and is comprised of three stages: an encoder $f_θ$, a projector $g_θ$ and a predictor $q_θ$. The target network has the same architecture\r
as the online network, but uses a different set of weights $ξ$. The target network provides the regression\r
targets to train the online network, and its parameters $ξ$ are an exponential moving average of the\r
online parameters $θ$.\r
\r
Given the architecture diagram on the right, BYOL minimizes a similarity loss between $q_θ(z_θ)$ and $sg(z'_{ξ})$, where $θ$ are the trained weights, $ξ$ are an exponential moving average of $θ$ and $sg$ means stop-gradient. At the end of training, everything but $f_θ$ is discarded, and $y_θ$ is used as the image representation.
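
The two defining updates can be sketched in NumPy (representing parameters as a dict of arrays is a simplification; $τ = 0.996$ is the paper's base decay rate):

```python
import numpy as np

def ema_update(xi, theta, tau=0.996):
    # Target parameters track an exponential moving average of online ones.
    return {k: tau * xi[k] + (1.0 - tau) * theta[k] for k in xi}

def byol_loss(q_theta_z, z_prime_xi):
    # Normalized MSE between the prediction and the (stop-gradient) target
    # projection; equal to 2 - 2 * cosine_similarity.
    q = q_theta_z / np.linalg.norm(q_theta_z)
    z = z_prime_xi / np.linalg.norm(z_prime_xi)
    return 2.0 - 2.0 * float(np.dot(q, z))
```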
\r
Source: [Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning](https://paperswithcode.com/paper/bootstrap-your-own-latent-a-new-approach-to-1)\r
\r
Image credit: [Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning](https://paperswithcode.com/paper/bootstrap-your-own-latent-a-new-approach-to-1)""" ;
skos:prefLabel "BYOL" .
:BalancedFeaturePyramid a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**Balanced Feature Pyramid** is a feature pyramid module. It differs from approaches like [FPNs](https://paperswithcode.com/method/fpn) that integrate multi-level features using lateral connections. Instead, the BFP strengthens the multi-level features using the same deeply integrated balanced semantic features. The pipeline is shown in the Figure to the right. It consists of four steps: rescaling, integrating, refining and strengthening.
\r
Features at resolution level $l$ are denoted as $C\\_{l}$. The number of multi-level features is denoted as $L$. The indexes of involved lowest and highest levels are denoted as $l\\_{min}$ and $l\\_{max}$. In the Figure, $C\\_{2}$ has the highest resolution. To integrate multi-level features and preserve their semantic hierarchy at the same time, we first resize the multi-level features {$C\\_{2}, C\\_{3}, C\\_{4}, C\\_{5}$} to an intermediate size, i.e., the same size as $C\\_{4}$, with interpolation and max-pooling respectively. Once the features are rescaled, the balanced semantic features are obtained by simple averaging as:\r
\r
$$ C = \\frac{1}{L}\\sum^{l\\_{max}}\\_{l=l\\_{min}}C\\_{l} $$\r
\r
The obtained features are then rescaled using the same but reverse procedure to strengthen the original features. Each resolution obtains equal information from others in this procedure. Note that this procedure does not contain any parameter. The authors observe improvement with this nonparametric method, proving the effectiveness of the information flow. \r
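
A parameter-free NumPy sketch of the rescale–average–strengthen pipeline (using nearest-neighbour resizing in both directions for simplicity, where the paper uses interpolation and max-pooling; square feature maps are assumed):

```python
import numpy as np

def resize_nn(x, size):
    # Nearest-neighbour resize of a square (C, H, W) map to (C, size, size).
    C, H, W = x.shape
    rows = np.arange(size) * H // size
    cols = np.arange(size) * W // size
    return x[:, rows][:, :, cols]

def balanced_semantic_feature(features, mid_size):
    # Rescale every level to the intermediate size, then average:
    # C = (1 / L) * sum_l C_l
    return sum(resize_nn(f, mid_size) for f in features) / len(features)

def strengthen(features, mid_size):
    # Rescale the balanced feature back and add it to each original level.
    c_bal = balanced_semantic_feature(features, mid_size)
    return [f + resize_nn(c_bal, f.shape[1]) for f in features]
```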
\r
The balanced semantic features can be further refined to be more discriminative. The authors found that both direct refinement with convolutions and the non-local module work well, but the non-local module works in a more stable way. Therefore, embedded Gaussian non-local attention is utilized by default. The refining step helps enhance the integrated features and further improve the results.
\r
With this method, features from low-level to high-level are aggregated at the same time. The outputs\r
{$P\\_{2}, P\\_{3}, P\\_{4}, P\\_{5}$} are used for object detection following the same pipeline in FPN.""" ;
skos:prefLabel "Balanced Feature Pyramid" .
:BalancedL1Loss a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**Balanced L1 Loss** is a loss function used for the object detection task. Classification and localization problems are solved simultaneously under the guidance of a multi-task loss since\r
[Fast R-CNN](https://paperswithcode.com/method/fast-r-cnn), defined as:\r
\r
$$ L\\_{p,u,t\\_{u},v} = L\\_{cls}\\left(p, u\\right) + \\lambda\\left[u \\geq 1\\right]L\\_{loc}\\left(t^{u}, v\\right) $$\r
\r
$L\\_{cls}$ and $L\\_{loc}$ are objective functions corresponding to recognition and localization respectively. Predictions and targets in $L\\_{cls}$ are denoted as $p$ and $u$. $t\\_{u}$ is the corresponding regression results with class $u$. $v$ is the regression target. $\\lambda$ is used for tuning the loss weight under multi-task learning. We call samples with a loss greater than or equal to 1.0 outliers. The other samples are called inliers.\r
\r
A natural solution for balancing the involved tasks is to tune their loss weights. However, owing to the unbounded regression targets, directly raising the weight of the localization loss will make the model more sensitive to outliers. These outliers, which can be regarded as hard samples, will produce excessively large gradients that are harmful to the training process. The inliers, which can be regarded as easy samples, contribute little gradient to the overall gradients compared with the outliers. To be more specific, inliers contribute only 30% of the gradients on average per sample compared with outliers. Considering these issues, the authors introduced the balanced L1 loss, which is denoted as $L\\_{b}$.
\r
Balanced L1 loss is derived from the conventional smooth L1 loss, in which an inflection point is set to separate inliers from outliers, and the large gradients produced by outliers are clipped to a maximum value of 1.0, as shown by the dashed lines in the Figure to the right. The key idea of balanced L1 loss is to promote the crucial regression gradients, i.e. gradients from inliers (accurate samples), to rebalance the involved samples and tasks, thus achieving a more balanced training within classification, overall localization and accurate localization. The localization loss $L\\_{loc}$ using balanced L1 loss is defined as:
\r
$$ L\\_{loc} = \\sum\\_{i\\in{x,y,w,h}}L\\_{b}\\left(t^{u}\\_{i}-v\\_{i}\\right) $$\r
\r
The Figure to the right shows that the balanced L1 loss increases the gradients of inliers under the control of a factor denoted as $\\alpha$. A small $\\alpha$ increases more gradient for inliers, but the gradients of outliers are not influenced. Besides, an overall promotion magnification controlled by $\\gamma$ is also brought in for tuning the upper bound of regression errors, which can help the objective function better balance the involved tasks. The two factors that control different aspects are mutually enhanced to reach a more balanced training. $b$ is used to ensure $L\\_{b}\\left(x = 1\\right)$ has the same value for both formulations in the equation below.
\r
By integrating the gradient formulation above, we can get the balanced L1 loss as:\r
\r
$$ L\\_{b}\\left(x\\right) = \\frac{\\alpha}{b}\\left(b|x| + 1\\right)ln\\left(b|x| + 1\\right) - \\alpha|x| \\text{ if } |x| < 1$$\r
\r
$$ L\\_{b}\\left(x\\right) = \\gamma|x| + C \\text{ otherwise } $$\r
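
A NumPy sketch of this two-branch loss, deriving $b$ from $\\alpha\\text{ln}\\left(b + 1\\right) = \\gamma$ and $C$ from continuity at $|x| = 1$:

```python
import numpy as np

def balanced_l1(x, alpha=0.5, gamma=1.5):
    # b satisfies alpha * ln(b + 1) = gamma.
    b = np.exp(gamma / alpha) - 1.0
    # C makes the two branches agree at |x| = 1.
    C = (alpha / b) * (b + 1.0) * np.log(b + 1.0) - alpha - gamma
    ax = np.abs(x)
    inner = (alpha / b) * (b * ax + 1.0) * np.log(b * ax + 1.0) - alpha * ax
    outer = gamma * ax + C
    return np.where(ax < 1.0, inner, outer)
```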
\r
in which the parameters $\\gamma$, $\\alpha$, and $b$ are constrained by $\\alpha\\text{ln}\\left(b + 1\\right) = \\gamma$. The default parameters are set as $\\alpha = 0.5$ and $\\gamma = 1.5$.""" ;
skos:prefLabel "Balanced L1 Loss" .
:BarlowTwins a skos:Concept ;
dcterms:source ;
skos:definition "**Barlow Twins** is a self-supervised learning method that applies redundancy-reduction — a principle first proposed in neuroscience — to self supervised learning. The objective function measures the cross-correlation matrix between the embeddings of two identical networks fed with distorted versions of a batch of samples, and tries to make this matrix close to the identity. This causes the embedding vectors of distorted version of a sample to be similar, while minimizing the redundancy between the components of these vectors. Barlow Twins does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly it benefits from very high-dimensional output vectors." ;
skos:prefLabel "Barlow Twins" .
:BaseBoosting a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """In the setting of multi-target regression, base boosting permits us to incorporate prior knowledge into the learning mechanism of gradient boosting (or Newton boosting, etc.). Namely, from the vantage of statistics, base boosting is a way of building the following additive expansion in a set of elementary basis functions:\r
\\begin{equation}\r
h_{j}(X ; \\{ \\alpha_{j}, \\theta_{j} \\}) = X_{j} + \\sum_{k=1}^{K_{j}} \\alpha_{j,k} b(X ; \\theta_{j,k}),\r
\\end{equation}\r
where \r
$X$ is an example from the domain $\\mathcal{X},$\r
$\\{\\alpha_{j}, \\theta_{j}\\} = \\{\\alpha_{j,1},\\dots, \\alpha_{j,K_{j}},\\theta_{j,1},\\dots,\\theta_{j,K_{j}}\\}$ collects the expansion coefficients and parameter sets,\r
$X_{j}$ is the image of $X$ under the $j$th coordinate function (a prediction from a user-specified model),\r
$K_{j}$ is the number of basis functions in the linear sum,\r
$b(X; \\theta_{j,k})$ is a real-valued function of the example $X,$ characterized by a parameter set $\\theta_{j,k}.$\r
\r
The aforementioned additive expansion differs from the \r
[standard additive expansion](https://projecteuclid.org/download/pdf_1/euclid.aos/1013203451):\r
\\begin{equation}\r
h_{j}(X ; \\{ \\alpha_{j}, \\theta_{j}\\}) = \\alpha_{j, 0} + \\sum_{k=1}^{K_{j}} \\alpha_{j,k} b(X ; \\theta_{j,k}),\r
\\end{equation}\r
as it replaces the constant offset value $\\alpha_{j, 0}$ with a prediction from a user-specified model. In essence, this modification permits us to incorporate prior knowledge into the for loop of gradient boosting, as the for loop proceeds to build the linear sum by computing residuals that depend upon predictions from the user-specified model instead of the optimal constant model: $\\mbox{argmin} \\sum_{i=1}^{m_{train}} \\ell_{j}(Y_{j}^{(i)}, c),$ where $m_{train}$ denotes the number of training examples, $\\ell_{j}$ denotes a single-target loss function, and $c \\in \\mathbb{R}$ denotes a real number, e.g, $\\mbox{argmin} \\sum_{i=1}^{m_{train}} (Y_{j}^{(i)} - c)^{2} = \\frac{\\sum_{i=1}^{m_{train}} Y_{j}^{(i)}}{m_{train}}.$""" ;
skos:prefLabel "Base Boosting" .
:BasicVSR a skos:Concept ;
dcterms:source ;
skos:definition "**BasicVSR** is a video super-resolution pipeline including optical flow and [residual blocks](https://paperswithcode.com/method/residual-connection). It adopts a typical bidirectional recurrent network. The upsampling module $U$ contains multiple [pixel-shuffle](https://paperswithcode.com/method/pixelshuffle) and convolutions. In the Figure, red and blue colors represent the backward and forward propagations, respectively. The propagation branches contain only generic components. $S, W$, and $R$ refer to the flow estimation module, spatial warping module, and residual blocks, respectively." ;
skos:prefLabel "BasicVSR" .
:BatchChannelNormalization a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition "**Batch-Channel Normalization**, or **BCN**, uses batch knowledge to prevent channel-normalized models from getting too close to \"elimination singularities\". Elimination singularities correspond to the points on the training trajectory where neurons become consistently deactivated. They cause degenerate manifolds in the loss landscape which will slow down training and harm model performances." ;
skos:prefLabel "Batch-Channel Normalization" .
:BatchFormer a skos:Concept ;
dcterms:source ;
skos:altLabel "Batch Transformer" ;
skos:definition "**BatchFormer**, or **Batch Transformer**, is a module that applies a [Transformer](https://paperswithcode.com/method/transformer) along the batch dimension, learning to explore sample relationships within each mini-batch for robust representation learning." ;
skos:prefLabel "BatchFormer" .
:BatchNormalization a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**Batch Normalization** aims to reduce internal covariate shift, and in doing so aims to accelerate the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs. Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows for use of much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for [Dropout](https://paperswithcode.com/method/dropout).\r
\r
We apply a batch normalization layer as follows for a minibatch $\\mathcal{B}$:\r
\r
$$ \\mu\\_{\\mathcal{B}} = \\frac{1}{m}\\sum^{m}\\_{i=1}x\\_{i} $$\r
\r
$$ \\sigma^{2}\\_{\\mathcal{B}} = \\frac{1}{m}\\sum^{m}\\_{i=1}\\left(x\\_{i}-\\mu\\_{\\mathcal{B}}\\right)^{2} $$\r
\r
$$ \\hat{x}\\_{i} = \\frac{x\\_{i} - \\mu\\_{\\mathcal{B}}}{\\sqrt{\\sigma^{2}\\_{\\mathcal{B}}+\\epsilon}} $$\r
\r
$$ y\\_{i} = \\gamma\\hat{x}\\_{i} + \\beta = \\text{BN}\\_{\\gamma, \\beta}\\left(x\\_{i}\\right) $$\r
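
These four steps can be sketched in NumPy for a minibatch of feature vectors:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (m, d) minibatch; gamma, beta: (d,) learnable scale and shift.
    mu = x.mean(axis=0)                    # per-feature minibatch mean
    var = x.var(axis=0)                    # per-feature minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift
```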
\r
Where $\\gamma$ and $\\beta$ are learnable parameters.""" ;
skos:prefLabel "Batch Normalization" .
:BatchNuclear-normMaximization a skos:Concept ;
dcterms:source ;
skos:definition "**Batch Nuclear-norm Maximization** is an approach for aiding classification in label insufficient situations. It involves maximizing the nuclear-norm of the batch output matrix. The nuclear-norm of a matrix is an upper bound of the Frobenius-norm of the matrix. Maximizing nuclear-norm ensures large Frobenius-norm of the batch matrix, which leads to increased discriminability. The nuclear-norm of the batch matrix is also a convex approximation of the matrix rank, which refers to the prediction diversity." ;
skos:prefLabel "Batch Nuclear-norm Maximization" .
:Batchboost a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition "**Batchboost** is a variation on [MixUp](https://paperswithcode.com/method/mixup) that instead of mixing just two images, mixes many images together." ;
skos:prefLabel "Batchboost" .
:BayesianREX a skos:Concept ;
dcterms:source ;
skos:altLabel "Bayesian Reward Extrapolation" ;
skos:definition "**Bayesian Reward Extrapolation** is a Bayesian reward learning algorithm that scales to high-dimensional imitation learning problems by pre-training a low-dimensional feature encoding via self-supervised tasks and then leveraging preferences over demonstrations to perform fast Bayesian inference." ;
skos:prefLabel "Bayesian REX" .
:Beta-VAE a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**Beta-VAE** is a type of variational autoencoder that seeks to discover disentangled latent factors. It modifies [VAEs](https://paperswithcode.com/method/vae) with an adjustable hyperparameter $\\beta$ that balances latent channel capacity and independence constraints with reconstruction accuracy. The idea is to maximize the probability of generating the real data while keeping the distance between the real and estimated distributions small, under a threshold $\\epsilon$. We can use the Karush-Kuhn-Tucker (KKT) conditions to write this as a single equation:
\r
$$ \\mathcal{F}\\left(\\theta, \\phi, \\beta; \\mathbf{x}, \\mathbf{z}\\right) = \\mathbb{E}\\_{q\\_{\\phi}\\left(\\mathbf{z}|\\mathbf{x}\\right)}\\left[\\log{p}\\_{\\theta}\\left(\\mathbf{x}\\mid\\mathbf{z}\\right)\\right] - \\beta\\left[D\\_{KL}\\left(q\\_{\\phi}\\left(\\mathbf{z}\\mid\\mathbf{x}\\right)||p\\left(\\mathbf{z}\\right)\\right) - \\epsilon\\right]$$
\r
where the KKT multiplier $\\beta$ is the regularization coefficient that constrains the capacity of the latent channel $\\mathbf{z}$ and puts implicit independence pressure on the learnt posterior due to the isotropic nature of the Gaussian prior $p\\left(\\mathbf{z}\\right)$.\r
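
For a diagonal Gaussian posterior and a unit Gaussian prior, the KL term has a closed form, so the (negated, to-be-minimized) objective can be sketched as:

```python
import numpy as np

def beta_vae_loss(recon_log_lik, mu, log_var, beta=4.0):
    # Negated Beta-VAE objective for q(z|x) = N(mu, diag(exp(log_var)))
    # and unit Gaussian prior p(z); beta=4.0 is an illustrative setting.
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return -recon_log_lik + beta * kl
```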
\r
We write this again using the complementary slackness assumption to get the Beta-VAE formulation:\r
\r
$$ \\mathcal{F}\\left(\\theta, \\phi, \\beta; \\mathbf{x}, \\mathbf{z}\\right) \\geq \\mathcal{L}\\left(\\theta, \\phi, \\beta; \\mathbf{x}, \\mathbf{z}\\right) = \\mathbb{E}\\_{q\\_{\\phi}\\left(\\mathbf{z}|\\mathbf{x}\\right)}\\left[\\log{p}\\_{\\theta}\\left(\\mathbf{x}\\mid\\mathbf{z}\\right)\\right] - \\beta{D}\\_{KL}\\left(q\\_{\\phi}\\left(\\mathbf{z}\\mid\\mathbf{x}\\right)||p\\left(\\mathbf{z}\\right)\\right)$$""" ;
skos:prefLabel "Beta-VAE" .
:BezierAlign a skos:Concept ;
dcterms:source ;
skos:definition """**BezierAlign** is a feature sampling method for arbitrarily-shaped scene text recognition that exploits parameterization nature of a compact Bezier curve bounding box. Unlike RoIAlign, the shape of sampling grid of BezierAlign is not rectangular. Instead, each column of the arbitrarily-shaped grid is orthogonal to the Bezier curve boundary of the text. The sampling points have equidistant interval in width and height, respectively, which are bilinear interpolated with respect to the coordinates.\r
\r
Formally given an input feature map and Bezier curve control points, we concurrently process all the output pixels of the rectangular output feature map with size $h\\_{\\text {out }} \\times w\\_{\\text {out }}$. Taking pixel $g\\_{i}$ with position $\\left(g\\_{i w}, g\\_{i h}\\right)$ (from output feature map) as an example, we calculate $t$ by:\r
\r
$$\r
t=\\frac{g\\_{i w}}{w\\_{o u t}}\r
$$\r
\r
We then calculate the point $tp$ on the upper Bezier curve boundary and the point $bp$ on the lower Bezier curve boundary. Using $tp$ and $bp$, we can linearly index the sampling point $op$ by:
\r
$$\r
op=bp \\cdot \\frac{g\\_{i h}}{h\\_{\\text {out }}}+tp \\cdot\\left(1-\\frac{g\\_{i h}}{h\\_{\\text {out }}}\\right)\r
$$\r
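
The sampling of a single output pixel can be sketched as follows (cubic Bezier curves with `(4, 2)` control-point arrays are assumed; names are illustrative):

```python
import numpy as np

def bezier(t, p):
    # Evaluate a cubic Bezier curve at parameter t in [0, 1];
    # p is a (4, 2) array of control points.
    c = np.array([(1 - t) ** 3,
                  3 * t * (1 - t) ** 2,
                  3 * t ** 2 * (1 - t),
                  t ** 3])
    return c @ p

def sample_point(g_w, g_h, w_out, h_out, top_ctrl, bot_ctrl):
    # t = g_iw / w_out indexes the horizontal position; tp and bp are the
    # boundary points, and op linearly interpolates between them.
    t = g_w / w_out
    tp, bp = bezier(t, top_ctrl), bezier(t, bot_ctrl)
    r = g_h / h_out
    return bp * r + tp * (1 - r)
```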
\r
With the position of $op$, we can easily apply bilinear interpolation to calculate the result. Comparisons among previous sampling methods and BezierAlign are shown in the Figure.""" ;
skos:prefLabel "BezierAlign" .
:Bi-attention a skos:Concept ;
dcterms:source ;
skos:altLabel "Bilinear Attention" ;
skos:definition "Bi-attention employs the attention-in-attention (AiA) mechanism to capture second-order statistical information: the outer point-wise channel attention vectors are computed from the output of the inner channel attention." ;
skos:prefLabel "Bi-attention" .
:Bi3D a skos:Concept ;
dcterms:source ;
skos:definition "**Bi3D** is a stereo depth estimation framework that estimates depth via a series of binary classifications. Rather than testing if objects are at a particular depth *D*, as existing stereo methods do, it classifies them as being closer or farther than *D*. It takes the stereo pair and a disparity $d\\_{i}$ and produces a confidence map, which can be thresholded to yield the binary segmentation. To estimate depth on $N + 1$ quantization levels we run this network $N$ times and maximize the probability in Equation 8 (see paper). To estimate continuous depth, whether full or selective, we run the [SegNet](https://paperswithcode.com/method/segnet) block of Bi3DNet for each disparity level and work directly on the confidence volume." ;
skos:prefLabel "Bi3D" .
:BiDet a skos:Concept ;
dcterms:source ;
skos:definition "**BiDet** is a binarized neural network learning method for efficient object detection. Conventional network binarization methods directly quantize the weights and activations in one-stage or two-stage detectors with constrained representational capacity, so that the information redundancy in the networks causes numerous false positives and degrades the performance significantly. On the contrary, BiDet fully utilizes the representational capacity of the binary neural networks for object detection by redundancy removal, through which the detection precision is enhanced with alleviated false positives. Specifically, the information bottleneck (IB) principle is generalized to object detection, where the amount of information in the high-level feature maps is constrained and the mutual information between the feature maps and object detection is maximized." ;
skos:prefLabel "BiDet" .
:BiFPN a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """A **BiFPN**, or **Weighted Bi-directional Feature Pyramid Network**, is a type of feature pyramid network which allows easy and fast multi-scale feature fusion. It incorporates the multi-level feature fusion idea from [FPN](https://paperswithcode.com/method/fpn), [PANet](https://paperswithcode.com/method/panet) and [NAS-FPN](https://paperswithcode.com/method/nas-fpn) that enables information to flow in both the top-down and bottom-up directions, while using regular and efficient connections. It also utilizes a fast normalized fusion technique. Traditional approaches usually treat all features input to the FPN equally, even those with different resolutions. However, input features at different resolutions often have unequal contributions to the output features. Thus, the BiFPN adds an additional weight for each input feature allowing the network to learn the importance of each. All regular convolutions are also replaced with less expensive depthwise separable convolutions.\r
\r
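The fast normalized fusion mentioned above can be sketched in plain Python. This is a toy version on flat lists to show the normalization scheme; `fast_normalized_fusion` and its shapes are illustrative, not the paper's implementation:

```python
# Sketch of BiFPN-style fast normalized fusion: learnable scalar weights
# are kept non-negative (via ReLU) and normalized by their sum plus a
# small epsilon, which is cheaper than a softmax over the weights.

def fast_normalized_fusion(features, weights, eps=1e-4):
    w = [max(0.0, wi) for wi in weights]            # ReLU keeps weights >= 0
    total = sum(w) + eps                            # normalizer
    fused = [sum(wi * f[i] for wi, f in zip(w, features))
             for i in range(len(features[0]))]
    return [fi / total for fi in fused]
```

For example, fusing two maps with equal weights averages them, up to the epsilon term.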
Compared with PANet, which adds an extra bottom-up path for information flow at the expense of more computation, BiFPN optimizes these cross-scale connections by removing nodes with a single input edge, adding an extra edge from the original input to the output node when they are on the same level, and treating each bidirectional path as one feature network layer (repeated several times for more high-level feature fusion).""" ;
skos:prefLabel "BiFPN" .
:BiGAN a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:altLabel "Bidirectional GAN" ;
skos:definition """A **BiGAN**, or **Bidirectional GAN**, is a type of generative adversarial network where the generator not only maps latent samples to generated data, but also has an inverse mapping from data to the latent representation. The motivation is to make a type of GAN that can learn rich representations for use in applications like unsupervised learning.\r
\r
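A toy sketch of the pairing the BiGAN discriminator trains on; `E`, `G`, and `discriminator_batch` are stand-in callables for illustration, not real networks:

```python
# Toy sketch of BiGAN's joint discriminator inputs: real data is paired
# with its encoded latent, generated data with the latent that produced it.

def discriminator_batch(xs, zs, E, G):
    real = [((x, E(x)), 1) for x in xs]    # (x, E(x)) labeled real
    fake = [((G(z), z), 0) for z in zs]    # (G(z), z) labeled fake
    return real + fake
```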
In addition to the generator $G$ from the standard [GAN](https://paperswithcode.com/method/gan) framework, BiGAN includes an encoder $E$ which maps data $\\mathbf{x}$ to latent representations $\\mathbf{z}$. The BiGAN discriminator $D$ discriminates not only in data space ($\\mathbf{x}$ versus $G\\left(\\mathbf{z}\\right)$), but jointly in data and latent space (tuples $\\left(\\mathbf{x}, E\\left(\\mathbf{x}\\right)\\right)$ versus $\\left(G\\left(\\mathbf{z}\\right), \\mathbf{z}\\right)$), where the latent component is either an encoder output $E\\left(\\mathbf{x}\\right)$ or a generator input $\\mathbf{z}$.""" ;
skos:prefLabel "BiGAN" .
:BiGCN a skos:Concept ;
dcterms:source ;
skos:altLabel "Bi-Directional Graph Convolutional Network" ;
skos:definition "" ;
skos:prefLabel "BiGCN" .
:BiGG a skos:Concept ;
dcterms:source ;
skos:definition "**BiGG** is an autoregressive model for generative modeling of sparse graphs. It exploits sparsity to avoid generating the full adjacency matrix, reducing the graph generation time complexity to $O((n + m)\\log n)$. Furthermore, during training this autoregressive model can be parallelized with $O(\\log n)$ synchronization stages, which makes it much more efficient than other autoregressive models that require $\\Omega(n)$ stages. The approach is based on three key elements: (1) an $O(\\log n)$ process for generating each edge using a binary tree data structure, inspired by R-MAT; (2) a tree-structured autoregressive model for generating the set of edges associated with each node; and (3) an autoregressive model defined over the sequence of nodes." ;
skos:prefLabel "BiGG" .
:BiGRU a skos:Concept ;
skos:altLabel "Bidirectional GRU" ;
skos:definition """A **Bidirectional GRU**, or **BiGRU**, is a sequence processing model that consists of two [GRUs](https://paperswithcode.com/method/gru): one taking the input in a forward direction, and the other in a backward direction. It is a bidirectional recurrent neural network whose cells use only reset and update gates.\r
\r
Image Source: *Rana R (2016). Gated Recurrent Unit (GRU) for Emotion Classification from Noisy Speech.*""" ;
skos:prefLabel "BiGRU" .
:BiLSTM a skos:Concept ;
skos:altLabel "Bidirectional LSTM" ;
skos:definition """A **Bidirectional LSTM**, or **biLSTM**, is a sequence processing model that consists of two LSTMs: one taking the input in a forward direction, and the other in a backwards direction. BiLSTMs effectively increase the amount of information available to the network, improving the context available to the algorithm (e.g. knowing what words immediately follow *and* precede a word in a sentence).\r
\r
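The bidirectional wiring can be sketched generically; here `step` is a placeholder for an LSTM cell, mapping (state, input) to (new state, output), and `bidirectional` is an illustrative name:

```python
# Sketch of bidirectional sequence processing: one recurrent pass runs
# forward, another runs backward, and their per-step outputs are
# concatenated so every position sees both past and future context.

def bidirectional(seq, step, init_state):
    def run(inputs):
        state, outputs = init_state, []
        for x in inputs:
            state, out = step(state, x)
            outputs.append(out)
        return outputs
    fwd = run(seq)
    bwd = run(seq[::-1])[::-1]                # reverse, run, re-align
    return [f + b for f, b in zip(fwd, bwd)]  # concat per time step
```

In practice deep learning frameworks provide this wrapper directly (e.g. a `bidirectional` flag on recurrent layers).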
Image Source: Modelling Radiological Language with Bidirectional Long Short-Term Memory Networks, Cornegruta et al""" ;
skos:prefLabel "BiLSTM" .
:BiSeNetV2 a skos:Concept ;
dcterms:source ;
skos:definition "**BiSeNet V2** is a two-pathway architecture for real-time semantic segmentation. One pathway is designed to capture the spatial details with wide channels and shallow layers, called Detail Branch. In contrast, the other pathway is introduced to extract the categorical semantics with narrow channels and deep layers, called Semantic Branch. The Semantic Branch simply requires a large receptive field to capture semantic context, while the detail information can be supplied by the Detail Branch. Therefore, the Semantic Branch can be made very lightweight with fewer channels and a fast-downsampling strategy. Both types of feature representation are merged to construct a stronger and more comprehensive feature representation." ;
skos:prefLabel "BiSeNet V2" .
:Big-LittleModule a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition "**Big-Little Modules** are blocks for image models with two branches: one is a block from a deep model, and the other is its shallower counterpart. They were proposed as part of the [BigLittle-Net](https://paperswithcode.com/method/big-little-net) architecture. The two branches are fused with a linear combination using unit weights. The branches are known as the Big-Branch (more layers and channels, at low resolution) and the Little-Branch (fewer layers and channels, at high resolution)." ;
skos:prefLabel "Big-Little Module" .
:Big-LittleNet a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**Big-Little Net** is a convolutional neural network architecture for learning multi-scale feature representations. This is achieved by using a multi-branch network, which has different computational complexity at different branches with different resolutions. Through frequent merging of features from branches at distinct scales, the model obtains multi-scale features while using less computation.\r
\r
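The merge step can be sketched on 1-D toy features (illustrative only, not the paper's code): the low-resolution Big-Branch output is upsampled to the Little-Branch resolution, then the two are summed with unit weights.

```python
# Toy sketch of Big-Little feature fusion on 1-D features: upsample the
# low-resolution Big-Branch output, then combine with unit weights.

def nearest_upsample(x, factor):
    return [v for v in x for _ in range(factor)]   # repeat each value

def big_little_fuse(big, little):
    factor = len(little) // len(big)
    up = nearest_upsample(big, factor)
    return [b + l for b, l in zip(up, little)]     # 1*big + 1*little
```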
It consists of Big-Little Modules, which have two branches: one is a block from a deep model, and the other is its shallower counterpart. The two branches are fused with a linear combination using unit weights. The branches are known as the Big-Branch (more layers and channels at low resolution) and the Little-Branch (fewer layers and channels at high resolution).""" ;
skos:prefLabel "Big-Little Net" .
:BigBiGAN a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition "**BigBiGAN** is a type of [BiGAN](https://paperswithcode.com/method/bigan) with a [BigGAN](https://paperswithcode.com/method/biggan) image generator. The authors initially used [ResNet](https://paperswithcode.com/method/resnet) as a baseline for the encoder $\\mathcal{E}$, followed by a 4-layer MLP with skip connections, but found in experiments that RevNets outperformed ResNets as network width increased, so they opted for a RevNet-based encoder in the final architecture." ;
skos:prefLabel "BigBiGAN" .
:BigBird a skos:Concept ;
dcterms:source ;
skos:definition """**BigBird** is a [Transformer](https://paperswithcode.com/method/transformer) with a sparse attention mechanism that reduces the quadratic dependency of self-attention to linear in the number of tokens. BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. In particular, BigBird consists of three main parts:\r
\r
- A set of $g$ global tokens attending to all parts of the sequence.\r
- All tokens attending to a set of $w$ local neighboring tokens.\r
- All tokens attending to a set of $r$ random tokens.\r
\r
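The three patterns combine into a sparse boolean attention mask, sketched here with toy parameters (this is illustrative; the paper uses an efficient block-sparse implementation):

```python
# Sketch of BigBird's sparse attention pattern as a boolean mask:
# the first num_global tokens attend everywhere (and are attended by all),
# every token sees a local window of width 2*window + 1, and each token
# additionally attends to a few randomly chosen positions.
import random

def bigbird_mask(n, num_global=1, window=1, num_random=1, seed=0):
    rng = random.Random(seed)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < num_global or j < num_global:   # global tokens
                mask[i][j] = True
            elif abs(i - j) <= window:             # local window
                mask[i][j] = True
        for j in rng.sample(range(n), num_random): # random tokens
            mask[i][j] = True
    return mask
```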
This leads to a high-performing attention mechanism that scales to much longer sequence lengths (up to 8x what was previously possible).""" ;
skos:prefLabel "BigBird" .
:BigGAN a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**BigGAN** is a type of generative adversarial network that was designed for scaling generation to high-resolution, high-fidelity images. It includes a number of incremental changes and innovations. The baseline and incremental changes are:\r
\r
- Using [SAGAN](https://paperswithcode.com/method/sagan) as a baseline, with spectral normalization for G and D, and using [TTUR](https://paperswithcode.com/method/ttur).\r
- Using a Hinge Loss [GAN](https://paperswithcode.com/method/gan) objective.\r
- Using class-[conditional batch normalization](https://paperswithcode.com/method/conditional-batch-normalization) to provide class information to G (but with a linear projection rather than an MLP).\r
- Using a [projection discriminator](https://paperswithcode.com/method/projection-discriminator) to provide class information to D.\r
- Evaluating with an EWMA of G's weights, similar to ProGAN.\r
\r
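The EWMA evaluation trick in the last baseline point can be sketched as maintaining a shadow copy of G's weights (a toy element-wise version over flat lists; `ema_update` is an illustrative name):

```python
# Sketch of evaluating with an exponential (weighted) moving average of
# the generator's weights: after each training step the shadow copy moves
# slightly toward the current weights, and evaluation uses the shadow.

def ema_update(shadow, weights, decay=0.9999):
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]
```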
The innovations are:\r
\r
- Increasing batch sizes, which has a big effect on the Inception Score of the model.\r
- Increasing the width in each layer leads to a further Inception Score improvement.\r
- Adding skip connections from the latent variable $z$ to further layers helps performance.\r
- A new variant of [Orthogonal Regularization](https://paperswithcode.com/method/orthogonal-regularization).""" ;
skos:prefLabel "BigGAN" .
:BigGAN-deep a skos:Concept ;
dcterms:source ;
rdfs:seeAlso ;
skos:definition """**BigGAN-deep** is a deeper version (4x) of [BigGAN](https://paperswithcode.com/method/biggan). The main difference is a slightly differently designed [residual block](https://paperswithcode.com/method/residual-block). Here the $z$ vector is concatenated with the conditional vector without splitting it into chunks. It is also based on residual blocks with bottlenecks. BigGAN-deep uses a different strategy than BigGAN, aimed at preserving identity throughout the skip connections. In G, where the number of channels needs to be reduced, BigGAN-deep simply retains the first group of channels and drops the rest to produce the required number of channels. In D, where the number of channels should be increased, BigGAN-deep passes the input channels through unperturbed and concatenates them with the remaining channels produced by a 1 × 1 [convolution](https://paperswithcode.com/method/convolution). As far as the network configuration is concerned, the discriminator is an exact reflection of the generator.
\r
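The channel handling described above can be sketched on toy channel lists (illustrative only; in the real model the extra channels come from a 1 × 1 convolution, and the names here are hypothetical):

```python
# Sketch of BigGAN-deep's skip-connection channel handling:
# reduction keeps the first group of channels and drops the rest;
# expansion passes the input through unperturbed and concatenates
# newly produced channels.

def reduce_channels(channels, out_channels):
    return channels[:out_channels]          # retain first group, drop rest

def increase_channels(channels, new_channels):
    return channels + new_channels          # concat input with new channels
```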
There are two blocks at each resolution (BigGAN uses one), and as a result BigGAN-deep is four times deeper than BigGAN. Despite their increased depth, the BigGAN-deep models have significantly fewer parameters, mainly due to the bottleneck structure of their residual blocks.""" ;
skos:prefLabel "BigGAN-deep" .
:BilateralGrid a skos:Concept ;
dcterms:source