@prefix : . @prefix dcterms: . @prefix owl: . @prefix rdfs: . @prefix skos: . a owl:Ontology . a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "A **(2+1)D Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) used in convolutional neural networks for action recognition, applied over a spatiotemporal volume. As opposed to applying a [3D Convolution](https://paperswithcode.com/method/3d-convolution) over the entire volume, which can be computationally expensive and lead to overfitting, a (2+1)D convolution splits computation into two convolutions: a spatial 2D convolution followed by a temporal 1D convolution." ; skos:prefLabel "(2+1)D Convolution" . :1-bitAdam a skos:Concept ; dcterms:source ; skos:definition "**1-bit Adam** is a [stochastic optimization](https://paperswithcode.com/methods/category/stochastic-optimization) technique that is a variant of [ADAM](https://paperswithcode.com/method/adam) with error-compensated 1-bit compression, based on the finding that Adam's variance term becomes stable at an early stage. First, vanilla Adam is used for a few epochs as a warm-up. After the warm-up stage, the compression stage starts and we stop updating the variance term $\\mathbf{v}$ and use it as a fixed preconditioner. At the compression stage, we communicate based on the momentum applied with error-compensated 1-bit compression. The momentums are quantized into a 1-bit representation (the sign of each element). Accompanying the vector, a scaling factor is computed as $\\frac{\\text { magnitude of compensated gradient }}{\\text { magnitude of quantized gradient }}$. This scaling factor ensures that the compressed momentum has the same magnitude as the uncompressed momentum. This 1-bit compression can reduce the communication cost by $97 \\%$ and $94 \\%$ compared to the original float32 and float16 training, respectively." ; skos:prefLabel "1-bit Adam" . 
:1-bitLAMB a skos:Concept ; dcterms:source ; skos:definition """**1-bit LAMB** is a communication-efficient stochastic optimization technique which introduces a novel way to support adaptive layerwise learning rates even when communication is compressed. Learning from the insights behind [1-bit Adam](https://paperswithcode.com/method/1-bit-adam), it is a 2-stage algorithm which uses [LAMB](https://paperswithcode.com/method/lamb) (warmup stage) to “pre-condition” a communication-compressed momentum SGD algorithm (compression stage). At the compression stage, where the original LAMB algorithm cannot be used to update the layerwise learning rates, 1-bit LAMB employs a novel way to adaptively scale layerwise learning rates based on information from both the warmup and compression stages. As a result, 1-bit LAMB is able to achieve large batch optimization (LAMB)’s convergence speed under compressed communication.\r \r There are two major differences between 1-bit LAMB and the original LAMB:\r \r - During the compression stage, 1-bit LAMB updates the layerwise learning rate based on a novel “reconstructed gradient” derived from the compressed momentum. This makes 1-bit LAMB compatible with error compensation and able to keep track of the training dynamics under compression.\r - 1-bit LAMB also introduces extra stabilized soft thresholds when updating the layerwise learning rate at the compression stage, which makes training more stable under compression.""" ; skos:prefLabel "1-bit LAMB" . :1DCNN a skos:Concept ; dcterms:source ; skos:altLabel "1-Dimensional Convolutional Neural Networks" ; skos:definition "1D Convolutional Neural Networks are similar to the well-known and more established 2D Convolutional Neural Networks. 1D Convolutional Neural Networks are mainly used on text and 1D signals." ; skos:prefLabel "1D CNN" . :1cycle a skos:Concept ; dcterms:source ; skos:altLabel "1cycle learning rate scheduling policy" ; skos:definition "" ; skos:prefLabel "1cycle" . 
:1x1Convolution a skos:Concept ; dcterms:source ; skos:definition """A **1 x 1 Convolution** is a [convolution](https://paperswithcode.com/method/convolution) with some special properties in that it can be used for dimensionality reduction, efficient low-dimensional embeddings, and applying non-linearity after convolutions. It maps an input pixel with all its channels to an output pixel which can be squeezed to a desired output depth. It can be viewed as an [MLP](https://paperswithcode.com/method/feedforward-network) looking at a particular pixel location.\r \r Image Credit: [http://deeplearning.ai](http://deeplearning.ai)""" ; skos:prefLabel "1x1 Convolution" . :2DDWT a skos:Concept ; dcterms:source ; skos:altLabel "2D Discrete Wavelet Transform" ; skos:definition "" ; skos:prefLabel "2D DWT" . :3-Augment a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "3-Augment" . :3DConvolution a skos:Concept ; rdfs:seeAlso ; skos:definition """A **3D Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) where the kernel slides in 3 dimensions, as opposed to 2 dimensions with 2D convolutions. One example use case is medical imaging, where a model is constructed using 3D image slices. Additionally, video-based data has an extra temporal dimension over images, making it suitable for this module.\r \r Image: Lung nodule detection based on 3D convolutional neural networks, Fan et al""" ; skos:prefLabel "3D Convolution" . :3DDynamicSceneGraph a skos:Concept ; dcterms:source ; skos:definition "**3D Dynamic Scene Graph**, or **DSG**, is a representation that captures metric and semantic aspects of a dynamic environment. A DSG is a layered graph where nodes represent spatial concepts at different levels of abstraction, and edges represent spatio-temporal relations among nodes." ; skos:prefLabel "3D Dynamic Scene Graph" . 
:3DIS a skos:Concept ; dcterms:source ; skos:altLabel "3-dimensional interaction space" ; skos:definition """A **trainable 3D interaction space** aims to capture the associations between the triplet components and helps model the recognition of multiple triplets in the same frame.\r \r Source: [Nwoye et al.](https://arxiv.org/pdf/2007.05405v1.pdf)\r \r Image source: [Nwoye et al.](https://arxiv.org/pdf/2007.05405v1.pdf)""" ; skos:prefLabel "3DIS" . :3DResNet-RS a skos:Concept ; dcterms:source ; skos:definition """**3D ResNet-RS** is an architecture and scaling strategy for 3D ResNets for video recognition. The key additions are:\r \r - **3D ResNet-D stem**: The [ResNet-D](https://paperswithcode.com/method/resnet-d) stem is adapted to 3D inputs by using three consecutive [3D convolutional layers](https://paperswithcode.com/method/3d-convolution). The first convolutional layer employs a temporal kernel size of 5 while the remaining two convolutional layers employ a temporal kernel size of 1.\r \r - **3D Squeeze-and-Excitation**: [Squeeze-and-Excite](https://paperswithcode.com/method/squeeze-and-excitation-block) is adapted to spatio-temporal inputs by using a 3D [global average pooling](https://paperswithcode.com/method/global-average-pooling) operation for the squeeze operation. A SE ratio of 0.25 is applied in each 3D bottleneck block for all experiments.\r \r - **Self-gating**: A self-gating module is used in each 3D bottleneck block after the SE module.""" ; skos:prefLabel "3D ResNet-RS" . :3DSA a skos:Concept ; dcterms:source ; skos:altLabel "3 Dimensional Soft Attention" ; skos:definition "" ; skos:prefLabel "3D SA" . :3DSSD a skos:Concept ; dcterms:source ; skos:definition "**3DSSD** is a point-based 3D single-stage object detector. In this paradigm, all upsampling layers and the refinement stage, which are indispensable in all existing point-based methods, are abandoned to reduce the large computation cost. 
The authors propose a fusion sampling strategy in the downsampling process to make detection on less representative points feasible. A delicate box prediction network, including a candidate generation layer and an anchor-free regression head with a 3D center-ness assignment strategy, is designed to meet the demands of both accuracy and speed." ; skos:prefLabel "3DSSD" . a skos:Concept ; dcterms:source ; skos:altLabel "Four-dimensional A-star" ; skos:definition "The aim of 4D A* is to find the shortest path between two four-dimensional (4D) nodes of a 4D search space - a starting node and a target node - as long as a path exists. It achieves both optimality and completeness. The former is because the path is the shortest possible, and the latter because if a solution exists the algorithm is guaranteed to find it." ; skos:prefLabel "4D A*" . :A2C a skos:Concept ; dcterms:source ; skos:definition """**A2C**, or **Advantage Actor Critic**, is a synchronous version of the [A3C](https://paperswithcode.com/method/a3c) policy gradient method. As an alternative to the asynchronous implementation of A3C, A2C is a synchronous, deterministic implementation that waits for each actor to finish its segment of experience before updating, averaging over all of the actors. This more effectively uses GPUs due to larger batch sizes.\r \r Image Credit: [OpenAI Baselines](https://openai.com/blog/baselines-acktr-a2c/)""" ; skos:prefLabel "A2C" . :A3C a skos:Concept ; dcterms:source ; skos:definition """**A3C**, **Asynchronous Advantage Actor Critic**, is a policy gradient algorithm in reinforcement learning that maintains a policy $\\pi\\left(a\\_{t}\\mid{s}\\_{t}; \\theta\\right)$ and an estimate of the value function $V\\left(s\\_{t}; \\theta\\_{v}\\right)$. It operates in the forward view and uses a mix of $n$-step returns to update both the policy and the value function. The policy and the value function are updated after every $t\\_{\\text{max}}$ actions or when a terminal state is reached. 
The update performed by the algorithm can be seen as $\\nabla\\_{\\theta{'}}\\log\\pi\\left(a\\_{t}\\mid{s\\_{t}}; \\theta{'}\\right)A\\left(s\\_{t}, a\\_{t}; \\theta, \\theta\\_{v}\\right)$ where $A\\left(s\\_{t}, a\\_{t}; \\theta, \\theta\\_{v}\\right)$ is an estimate of the advantage function given by:\r \r $$\\sum^{k-1}\\_{i=0}\\gamma^{i}r\\_{t+i} + \\gamma^{k}V\\left(s\\_{t+k}; \\theta\\_{v}\\right) - V\\left(s\\_{t}; \\theta\\_{v}\\right)$$\r \r where $k$ can vary from state to state and is upper-bounded by $t\\_{max}$.\r \r The critics in A3C learn the value function while multiple actors are trained in parallel and get synced with global parameters every so often. The gradients are accumulated as part of training for stability - this is like parallelized stochastic gradient descent.\r \r Note that while the parameters $\\theta$ of the policy and $\\theta\\_{v}$ of the value function are shown as being separate for generality, we always share some of the parameters in practice. We typically use a convolutional neural network that has one [softmax](https://paperswithcode.com/method/softmax) output for the policy $\\pi\\left(a\\_{t}\\mid{s}\\_{t}; \\theta\\right)$ and one linear output for the value function $V\\left(s\\_{t}; \\theta\\_{v}\\right)$, with all non-output layers shared.""" ; skos:prefLabel "A3C" . :ABC a skos:Concept ; dcterms:source ; skos:altLabel "Approximate Bayesian Computation" ; skos:definition """Class of methods in Bayesian Statistics where the posterior distribution is approximated over a rejection scheme on simulations because the likelihood function is intractable.\r \r Different parameters get sampled and simulated. Then a distance function is calculated to measure the quality of the simulation compared to data from real observations. Only simulations that fall below a certain threshold get accepted.\r \r Image source: [Kulkarni et al.](https://www.umass.edu/nanofabrics/sites/default/files/PDF_0.pdf)""" ; skos:prefLabel "ABC" . 
:ABCNet a skos:Concept ; dcterms:source ; skos:altLabel "Adaptive Bezier-Curve Network" ; skos:definition "**Adaptive Bezier-Curve Network**, or **ABCNet**, is an end-to-end framework for arbitrarily-shaped scene text spotting. It adaptively fits arbitrarily-shaped text with a parameterized Bezier curve. It also utilizes a feature alignment layer, [BezierAlign](https://paperswithcode.com/method/bezieralign), to calculate convolutional features of text instances in curved shapes. These features are then passed to a light-weight recognition head." ; skos:prefLabel "ABCNet" . :ACER a skos:Concept ; dcterms:source ; skos:definition """**ACER**, or **Actor Critic with Experience Replay**, is an actor-critic deep reinforcement learning agent with [experience replay](https://paperswithcode.com/method/experience-replay). It can be seen as an off-policy extension of [A3C](https://paperswithcode.com/method/a3c), where the off-policy estimator is made feasible by:\r \r - Using [Retrace](https://paperswithcode.com/method/retrace) Q-value estimation.\r - Using truncated importance sampling with bias correction.\r - Using a trust region policy optimization method.\r - Using a [stochastic dueling network](https://paperswithcode.com/method/stochastic-dueling-network) architecture.""" ; skos:prefLabel "ACER" . :ACGPN a skos:Concept ; dcterms:source ; skos:altLabel "Adaptive Content Generating and Preserving Network" ; skos:definition """**ACGPN**, or **Adaptive Content Generating and Preserving Network**, is a [generative adversarial network](https://www.paperswithcode.com/method/category/generative-adversarial-network) for virtual try-on clothing applications. 
\r \r In Step I, the Semantic Generation Module (SGM) takes the target clothing image $\\mathcal{T}\\_{c}$, the pose map $\\mathcal{M}\\_{p}$, and the fused body part mask $\\mathcal{M}^{F}$ as the input to predict the semantic layout and to output the synthesized body part mask $\\mathcal{M}^{S}\\_{\\omega}$ and the target clothing mask $\\mathcal{M}^{S}\\_{c}$.\r \r In Step II, the Clothes Warping Module (CWM) warps the target clothing image to $\\mathcal{T}^{R}\\_{c}$ according to the predicted semantic layout, where a second-order difference constraint is introduced to stabilize the warping process. \r \r In Steps III and IV, the Content Fusion Module (CFM) first produces the composited body part mask $\\mathcal{M}^{C}\\_{\\omega}$ using the original clothing mask $\\mathcal{M}\\_{c}$, the synthesized clothing mask $\\mathcal{M}^{S}\\_{c}$, the body part mask $\\mathcal{M}\\_{\\omega}$, and the synthesized body part mask $\\mathcal{M}\\_{\\omega}^{S}$, and then exploits a fusion network to generate the try-on images $\\mathcal{I}^{S}$ by utilizing the information $\\mathcal{T}^{R}\\_{c}$, $\\mathcal{M}^{S}\\_{c}$, and the body part image $I\\_{\\omega}$ from previous steps.""" ; skos:prefLabel "ACGPN" . :ACKTR a skos:Concept ; dcterms:source ; skos:definition """**ACKTR**, or **Actor Critic with Kronecker-factored Trust Region**, is an actor-critic method for reinforcement learning that applies [trust region optimization](https://paperswithcode.com/method/trpo) using a recently proposed Kronecker-factored approximation to the curvature. The method extends the framework of natural policy gradient and optimizes both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with trust region.""" ; skos:prefLabel "ACKTR" . :ADELE a skos:Concept ; dcterms:source ; skos:altLabel "Adaptive Early-Learning Correction" ; skos:definition "Adaptive Early-Learning Correction for Segmentation from Noisy Annotations" ; skos:prefLabel "ADELE" . 
:ADMM a skos:Concept ; skos:altLabel "Alternating Direction Method of Multipliers" ; skos:definition """The **alternating direction method of multipliers** (**ADMM**) is an algorithm that solves convex optimization problems by breaking them into smaller pieces, each of which is then easier to handle. It takes the form of a decomposition-coordination procedure, in which the solutions to small local subproblems are coordinated to find a solution to a large global problem. ADMM can be viewed as an attempt to blend the benefits of dual decomposition and augmented Lagrangian methods for constrained optimization. It turns out to be equivalent or closely related to many other algorithms as well, such as Douglas-Rachford splitting from numerical analysis, Spingarn’s method of partial inverses, Dykstra’s alternating projections method, Bregman iterative algorithms for l1 problems in signal processing, proximal methods, and many others.\r \r Text Source: [https://stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf](https://stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf)\r \r Image Source: [here](https://www.slideshare.net/derekcypang/alternating-direction)""" ; skos:prefLabel "ADMM" . :AE a skos:Concept ; skos:altLabel "Autoencoders" ; skos:definition """An **autoencoder** is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”. Along with the reduction side, a reconstructing side is learnt, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name.\r \r Extracted from: [Wikipedia](https://en.wikipedia.org/wiki/Autoencoder)\r \r Image source: [Wikipedia](https://en.wikipedia.org/wiki/Autoencoder#/media/File:Autoencoder_schema.png)""" ; skos:prefLabel "AE" . 
:AEDA a skos:Concept ; dcterms:source ; skos:altLabel "An Easier Data Augmentation" ; skos:definition "**AEDA**, or **An Easier Data Augmentation**, is a type of data augmentation technique for text classification which consists only of inserting various punctuation marks into the input sequence. AEDA preserves all the input information and does not mislead the network, since it keeps the word order intact while only changing word positions by shifting words to the right." ; skos:prefLabel "AEDA" . :AGCN a skos:Concept ; dcterms:source ; skos:altLabel "Adaptive Graph Convolutional Neural Networks" ; skos:definition """AGCN is a novel spectral graph convolution network that feeds on original data of diverse graph structures.\r \r Image credit: [Adaptive Graph Convolutional Neural Networks](https://arxiv.org/pdf/1801.03226.pdf)""" ; skos:prefLabel "AGCN" . :AHAF a skos:Concept ; dcterms:source ; skos:altLabel "Adaptive Hybrid Activation Function" ; skos:definition "Trainable activation function as a sigmoid-based generalization of ReLU, Swish and SiLU." ; skos:prefLabel "AHAF" . :ALAE a skos:Concept ; dcterms:source ; skos:altLabel "Adversarial Latent Autoencoder" ; skos:definition "**ALAE**, or **Adversarial Latent Autoencoder**, is a type of autoencoder that attempts to overcome some of the limitations of [generative adversarial networks](https://paperswithcode.com/paper/generative-adversarial-networks). The architecture allows the latent distribution to be learned from data to address entanglement (A). The output data distribution is learned with an adversarial strategy (B). Thus, we retain the generative properties of GANs, as well as the ability to build on the recent advances in this area. For instance, we can include independent sources of stochasticity, which have proven essential for generating image details, or can leverage recent improvements on GAN loss functions, regularization, and hyperparameter tuning. 
Finally, to implement (A) and (B), AE reciprocity is imposed in the latent space (C). Therefore, we can avoid using reconstruction losses based on a simple $\\ell\\_{2}$ norm that operates in the data space, where they are often suboptimal, as for the image space. Since it works on the latent space, rather than autoencoding the data space, the approach is named Adversarial Latent Autoencoder (ALAE)." ; skos:prefLabel "ALAE" . :ALBEF a skos:Concept ; dcterms:source ; skos:definition "ALBEF introduces a contrastive loss to align the image and text representations before fusing them through cross-modal attention. This enables more grounded vision and language representation learning. ALBEF also doesn't require bounding box annotations. The model consists of an image encoder, a text encoder, and a multimodal encoder. The image-text contrastive loss helps to align the unimodal representations of an image-text pair before fusion. The image-text matching loss and a masked language modeling loss are applied to learn multimodal interactions between image and text. In addition, momentum distillation is used to generate pseudo-targets. This improves learning with noisy data." ; skos:prefLabel "ALBEF" . :ALBERT a skos:Concept ; dcterms:source ; skos:definition """**ALBERT** is a [Transformer](https://paperswithcode.com/method/transformer) architecture based on [BERT](https://paperswithcode.com/method/bert) but with much fewer parameters. It achieves this through two parameter reduction techniques. The first is a factorized embedding parameterization. By decomposing the large vocabulary embedding matrix into two small matrices, the size of the hidden layers is separated from the size of the vocabulary embedding. This makes it easier to grow the hidden size without significantly increasing the parameter size of the vocabulary embeddings. The second technique is cross-layer parameter sharing. This technique prevents the parameters from growing with the depth of the network. 
\r \r Additionally, ALBERT utilises a self-supervised loss for sentence-order prediction (SOP). SOP primarily focuses on inter-sentence coherence and is designed to address the ineffectiveness of the next sentence prediction (NSP) loss proposed in the original BERT.""" ; skos:prefLabel "ALBERT" . :ALCN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Adaptive Locally Connected Neuron" ; skos:definition """The **Adaptive Locally Connected Neuron (ALCN)** is a topology-aware and locally adaptive focusing neuron:\r \r $$a = f\\:\\Bigg( \\sum_{i=1}^{m} w_{i}\\phi\\left( \\tau\\left(i\\right),\\Theta\\right) x_{i} + b \\Bigg)$$""" ; skos:prefLabel "ALCN" . :ALDA a skos:Concept ; dcterms:source ; skos:definition "**Adversarial-Learned Loss for Domain Adaptation** is a method for domain adaptation that combines adversarial learning with self-training. Specifically, the domain discriminator has to produce different corrected labels for different domains, while the feature generator aims to confuse the domain discriminator. The adversarial process finally leads to a proper confusion matrix on the target domain. In this way, ALDA takes the strengths of domain-adversarial learning and self-training based methods." ; skos:prefLabel "ALDA" . :ALDEN a skos:Concept ; dcterms:source ; skos:definition """**ALDEN**, or **Active Learning with DivErse iNterpretations**, is an active learning approach for text classification. With local interpretations in DNNs, ALDEN identifies linearly separable regions of samples. Then, it selects samples according to their diversity of local interpretations and queries their labels.\r \r Specifically, we first calculate the local interpretations in the DNN for each sample as the gradient backpropagated from the final predictions to the input features. Then, we use the most diverse interpretation of words in a sample to measure its diverseness. 
Accordingly, we select unlabeled samples with the maximally diverse interpretations for labeling and retrain the model with these labeled samples.""" ; skos:prefLabel "ALDEN" . :ALI a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Adversarially Learned Inference" ; skos:definition """**Adversarially Learned Inference (ALI)** is a generative modelling approach that casts the learning of both an inference machine (or encoder) and a deep directed generative model (or decoder) in a GAN-like adversarial framework. A discriminator is trained to discriminate joint samples of the data and the corresponding latent variable from the encoder (or approximate posterior) from joint samples from the decoder, while in opposition, the encoder and the decoder are trained together to fool the discriminator. Not only is the discriminator asked to distinguish synthetic samples from real data; it is also required to distinguish between two joint distributions over the data space and the latent variables.\r \r ALI differs from a [GAN](https://paperswithcode.com/method/gan) in two ways:\r \r - The generator has two components: the encoder, $G\\_{z}\\left(\\mathbf{x}\\right)$, which maps data samples $\\mathbf{x}$ to $z$-space, and the decoder $G\\_{x}\\left(\\mathbf{z}\\right)$, which maps samples from the prior $p\\left(\\mathbf{z}\\right)$ (a source of noise) to the input space.\r - The discriminator is trained to distinguish between joint pairs $\\left(\\mathbf{x}, \\tilde{\\mathbf{z}} = G\\_{z}\\left(\\mathbf{x}\\right)\\right)$ and $\\left(\\tilde{\\mathbf{x}} = G\\_{x}\\left(\\mathbf{z}\\right), \\mathbf{z}\\right)$, as opposed to marginal samples $\\mathbf{x} \\sim q\\left(\\mathbf{x}\\right)$ and $\\tilde{\\mathbf{x}} \\sim p\\left(\\mathbf{x}\\right)$.""" ; skos:prefLabel "ALI" . :ALIGN a skos:Concept ; dcterms:source ; skos:definition "In the ALIGN method, visual and language representations are jointly trained from noisy image alt-text data. 
The image and text encoders are learned via a contrastive loss (formulated as normalized softmax) that pushes the embeddings of the matched image-text pair together and pushes those of non-matched image-text pairs apart. The model learns to align visual and language representations of the image and text pairs using the contrastive loss. The representations can be used for vision-only or vision-language task transfer. Without any fine-tuning, ALIGN powers zero-shot visual classification and cross-modal search including image-to-text search, text-to-image search and even search with joint image+text queries." ; skos:prefLabel "ALIGN" . :ALIS a skos:Concept ; dcterms:source ; skos:altLabel "Aligning Latent and Image Spaces" ; skos:definition "An infinite image generator which is based on a patch-wise, periodically equivariant generator." ; skos:prefLabel "ALIS" . :ALP-GMM a skos:Concept ; dcterms:source ; skos:altLabel "Absolute Learning Progress and Gaussian Mixture Models for Automatic Curriculum Learning" ; skos:definition "ALP-GMM is an algorithm that learns to generate a learning curriculum for black box reinforcement learning agents, whereby it sequentially samples parameters controlling a stochastic procedural generation of tasks or environments." ; skos:prefLabel "ALP-GMM" . :ALQandAMQ a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Gradient Quantization with Adaptive Levels/Multiplier" ; skos:definition "Many communication-efficient variants of [SGD](https://paperswithcode.com/method/sgd) use gradient quantization schemes. These schemes are often heuristic and fixed over the course of training. We empirically observe that the statistics of gradients of deep models change during the training. Motivated by this observation, we introduce two adaptive quantization schemes, ALQ and AMQ. In both schemes, processors update their compression schemes in parallel by efficiently computing sufficient statistics of a parametric distribution. 
We improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups. Our adaptive methods are also significantly more robust to the choice of hyperparameters." ; skos:prefLabel "ALQ and AMQ" . :ALS a skos:Concept ; dcterms:source ; skos:altLabel "Adaptive Label Smoothing" ; skos:definition "" ; skos:prefLabel "ALS" . :ALiBi a skos:Concept ; dcterms:source ; skos:altLabel "Attention with Linear Biases" ; skos:definition """**ALiBi**, or **Attention with Linear Biases**, is a [positioning method](https://paperswithcode.com/methods/category/position-embeddings) that allows [Transformer](https://paperswithcode.com/methods/category/transformers) language models to consume, at inference time, sequences which are longer than the ones they were trained on. \r \r ALiBi does this without using actual position embeddings. Instead, when computing the attention between a certain key and query, ALiBi penalizes the attention value that that query can assign to the key depending on how far away the key and query are. So when a key and query are close by, the penalty is very low, and when they are far away, the penalty is very high. \r \r This method was motivated by the simple reasoning that words that are close by matter much more than ones that are far away.\r \r This method is as fast as the sinusoidal or absolute embedding methods (the fastest positioning methods there are). It outperforms those methods and Rotary embeddings when evaluating sequences that are longer than the ones the model was trained on (this is known as extrapolation).""" ; skos:prefLabel "ALiBi" . :AM a skos:Concept ; dcterms:source ; skos:altLabel "Attention Model" ; skos:definition "" ; skos:prefLabel "AM" . :AMP a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Adversarial Model Perturbation" ; skos:definition "Based on the understanding that flat local minima of the empirical risk cause the model to generalize better, 
Adversarial Model Perturbation (AMP) improves generalization via minimizing the **AMP loss**, which is obtained from the empirical risk by applying the **worst** norm-bounded perturbation on each point in the parameter space." ; skos:prefLabel "AMP" . :AMSBound a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**AMSBound** is a variant of the [AMSGrad](https://paperswithcode.com/method/amsgrad) stochastic optimizer which is designed to be more robust to extreme learning rates. Dynamic bounds are employed on learning rates, where the lower and upper bound are initialized as zero and infinity respectively, and they both smoothly converge to a constant final step size. AMSBound can be regarded as an adaptive method at the beginning of training, and it gradually and smoothly transforms to [SGD](https://paperswithcode.com/method/sgd) (or with momentum) as time step increases. \r \r $$ g\\_{t} = \\nabla{f}\\_{t}\\left(x\\_{t}\\right) $$\r \r $$ m\\_{t} = \\beta\\_{1t}m\\_{t-1} + \\left(1-\\beta\\_{1t}\\right)g\\_{t} $$\r \r $$ v\\_{t} = \\beta\\_{2}v\\_{t-1} + \\left(1-\\beta\\_{2}\\right)g\\_{t}^{2}$$\r \r $$ \\hat{v}\\_{t} = \\max\\left(\\hat{v}\\_{t-1}, v\\_{t}\\right) \\text{ and } V\\_{t} = \\text{diag}\\left(\\hat{v}\\_{t}\\right) $$\r \r $$ \\eta = \\text{Clip}\\left(\\alpha/\\sqrt{V\\_{t}}, \\eta\\_{l}\\left(t\\right), \\eta\\_{u}\\left(t\\right)\\right) \\text{ and } \\eta\\_{t} = \\eta/\\sqrt{t} $$\r \r $$ x\\_{t+1} = \\Pi\\_{\\mathcal{F}, \\text{diag}\\left(\\eta\\_{t}^{-1}\\right)}\\left(x\\_{t} - \\eta\\_{t} \\odot m\\_{t} \\right) $$\r \r Where $\\alpha$ is the initial step size, and $\\eta_{l}$ and $\\eta_{u}$ are the lower and upper bound functions respectively.""" ; skos:prefLabel "AMSBound" . :AMSGrad a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**AMSGrad** is a stochastic optimization method that seeks to fix a convergence issue with [Adam](https://paperswithcode.com/method/adam) based optimizers. 
AMSGrad uses the maximum of past squared gradients \r $v\\_{t}$ rather than the exponential average to update the parameters:\r \r $$m\\_{t} = \\beta\\_{1}m\\_{t-1} + \\left(1-\\beta\\_{1}\\right)g\\_{t} $$\r \r $$v\\_{t} = \\beta\\_{2}v\\_{t-1} + \\left(1-\\beta\\_{2}\\right)g\\_{t}^{2}$$\r \r $$ \\hat{v}\\_{t} = \\max\\left(\\hat{v}\\_{t-1}, v\\_{t}\\right) $$\r \r $$\\theta\\_{t+1} = \\theta\\_{t} - \\frac{\\eta}{\\sqrt{\\hat{v}_{t}} + \\epsilon}m\\_{t}$$""" ; skos:prefLabel "AMSGrad" . :APPO a skos:Concept ; dcterms:source ; skos:altLabel "Asynchronous Proximal Policy Optimization" ; skos:definition "" ; skos:prefLabel "APPO" . :ARCH a skos:Concept ; dcterms:source ; skos:altLabel "Animatable Reconstruction of Clothed Humans" ; skos:definition "**Animatable Reconstruction of Clothed Humans** is an end-to-end framework for accurate reconstruction of animation-ready 3D clothed humans from a monocular image. ARCH is a learned pose-aware model that produces detailed 3D rigged full-body human avatars from a single unconstrained RGB image. A Semantic Space and a Semantic Deformation Field are created using a parametric 3D body estimator. They allow the transformation of 2D/3D clothed humans into a canonical space, reducing ambiguities in geometry caused by pose variations and occlusions in training data. Detailed surface geometry and appearance are learned using an implicit function representation with spatial local features." ; skos:prefLabel "ARCH" . :ARM-Net a skos:Concept ; dcterms:source ; skos:definition "ARM-Net is an adaptive relation modeling network tailored for structured data, and a lightweight framework ARMOR based on ARM-Net for relational data analytics. The key idea is to model feature interactions with cross features selectively and dynamically, by first transforming the input features into exponential space, and then determining the interaction order and interaction weights adaptively for each cross feature. 
The authors propose a novel sparse attention mechanism to dynamically generate the interaction weights given the input tuple, so that we can explicitly model cross features of arbitrary orders with noisy features filtered selectively. Then during model inference, ARM-Net can specify the cross features being used for each prediction for higher accuracy and better interpretability." ; skos:prefLabel "ARM-Net" . :ARMA a skos:Concept ; dcterms:source ; skos:altLabel "ARMA GNN" ; skos:definition "The ARMA GNN layer implements a rational graph filter with a recursive approximation." ; skos:prefLabel "ARMA" . :ARShoe a skos:Concept ; dcterms:source ; skos:definition "**ARShoe** is a multi-branch network for pose estimation and segmentation tackling the \"try-on\" problem for augmented reality shoes. Consisting of an encoder and a decoder, the multi-branch network is trained to predict keypoints [heatmap](https://paperswithcode.com/method/heatmap) (heatmap), [PAFs](https://paperswithcode.com/method/pafs) heatmap (pafmap), and segmentation results (segmap) simultaneously. Post processes are then performed for a smooth and realistic virtual try-on." ; skos:prefLabel "ARShoe" . :ARiA a skos:Concept ; dcterms:source ; skos:altLabel "Adaptive Richard's Curve Weighted Activation" ; skos:definition "This work introduces a novel activation unit that can be efficiently employed in deep neural nets (DNNs) and performs significantly better than the traditional Rectified Linear Units ([ReLU](https://paperswithcode.com/method/relu)). The function developed is a two parameter version of the specialized Richard's Curve and we call it Adaptive Richard's Curve weighted Activation (ARiA). This function is non-monotonous, analogous to the newly introduced [Swish](https://paperswithcode.com/method/swish), however allows a precise control over its non-monotonous convexity by varying the hyper-parameters. 
We first demonstrate the mathematical significance of the two parameter ARiA followed by its application to benchmark problems such as MNIST, CIFAR-10 and CIFAR-100, where we compare the performance with ReLU and Swish units. Our results illustrate a significantly superior performance on all these datasets, making ARiA a potential replacement for ReLU and other activations in DNNs." ; skos:prefLabel "ARiA" . :ASAF a skos:Concept ; dcterms:source ; skos:altLabel "Adaptive Spline Activation Function" ; skos:definition """Stefano Guarnieri, Francesco Piazza, and Aurelio Uncini \r "Multilayer Feedforward Networks with Adaptive Spline Activation Function," \r IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 3, MAY 1999\r \r Abstract — In this paper, a new adaptive spline activation function neural network (ASNN) is presented. Due to the ASNN’s high representation capabilities, networks with a small number of interconnections can be trained to solve both pattern recognition and data processing real-time problems. The main idea is to use a Catmull–Rom cubic spline as the neuron’s activation function, which ensures a simple structure suitable for both software and hardware implementation. Experimental results demonstrate improvements in terms of generalization capability\r and of learning speed in both pattern recognition and data processing tasks.\r Index Terms— Adaptive activation functions, function shape autotuning, generalization, generalized sigmoidal functions, multilayer\r perceptron, neural networks, spline neural networks.""" ; skos:prefLabel "ASAF" . :ASFF a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Adaptively Spatial Feature Fusion" ; skos:definition """**ASFF**, or **Adaptively Spatial Feature Fusion**, is a method for pyramidal feature fusion. It learns the way to spatially filter conflictive information to suppress inconsistency across different feature scales, thus improving the scale-invariance of features. 
\r \r ASFF enables the network to directly learn how to spatially filter features at other levels so that only useful information is kept for combination. For the features at a certain level, features of other levels are first integrated and resized into the same resolution and then trained to find the optimal fusion. At each spatial location, features at different levels are fused adaptively, *i.e.*, some features may be filtered out as they carry contradictory information at this location and some may dominate with more discriminative clues. ASFF offers several advantages: (1) as the operation of searching for the optimal fusion is differentiable, it can be conveniently learned in back-propagation; (2) it is agnostic to the backbone model and it is applied to single-shot detectors that have a feature pyramid structure; and (3) its implementation is simple and the increased computational cost is marginal.\r \r Let $\mathbf{x}_{ij}^{n\rightarrow l}$ denote the feature vector at the position $(i,j)$ on the feature maps resized from level $n$ to level $l$. Following a feature resizing stage, we fuse the features at the corresponding level $l$ as follows:\r \r $$\r \mathbf{y}\_{ij}^l = \alpha^l_{ij} \cdot \mathbf{x}\_{ij}^{1\rightarrow l} + \beta^l_{ij} \cdot \mathbf{x}\_{ij}^{2\rightarrow l} +\gamma^l\_{ij} \cdot \mathbf{x}\_{ij}^{3\rightarrow l},\r $$\r \r where $\mathbf{y}\_{ij}^l$ denotes the $(i,j)$-th vector of the output feature maps $\mathbf{y}^l$ among channels. $\alpha^l\_{ij}$, $\beta^l\_{ij}$ and $\gamma^l\_{ij}$ refer to the spatial importance weights for the feature maps at three different levels to level $l$, which are adaptively learned by the network. Note that $\alpha^l\_{ij}$, $\beta^l\_{ij}$ and $\gamma^l\_{ij}$ can be simple scalar variables, which are shared across all the channels. 
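The fusion step above can be sketched as follows, assuming the three inputs have already been resized to a common resolution. The shapes, the demo data and the softmax over control logits (described next) are illustrative:

```python
import numpy as np

def asff_fuse(x1, x2, x3, lam1, lam2, lam3):
    # Fuse three feature maps (H, W, C), already resized to one resolution.
    # Per-location weights alpha/beta/gamma come from a softmax over the
    # control logits lam_i (H, W) and are shared across channels.
    logits = np.stack([lam1, lam2, lam3])                # (3, H, W)
    w = np.exp(logits - logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)                 # sums to 1 per location
    return w[0][..., None] * x1 + w[1][..., None] * x2 + w[2][..., None] * x3

rng = np.random.default_rng(0)
x1, x2, x3 = [rng.normal(size=(4, 4, 8)) for _ in range(3)]
y = asff_fuse(x1, x2, x3, *[rng.normal(size=(4, 4)) for _ in range(3)])
```

Because the weights sum to one at every location, fusing three identical maps returns the map unchanged.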
Inspired by acnet, we force $\\alpha^l\\_{ij}+\\beta^l\\_{ij}+\\gamma^l\\_{ij}=1$ and $\\alpha^l\\_{ij},\\beta^l\\_{ij},\\gamma^l\\_{ij} \\in [0,1]$, and \r \r $$\r \\alpha^l_{ij} = \\frac{e^{\\lambda^l\\_{\\alpha\\_{ij}}}}{e^{\\lambda^l\\_{\\alpha_{ij}}} + e^{\\lambda^l\\_{\\beta_{ij}\r }} + e^{\\lambda^l\\_{\\gamma_{ij}}}}.\r $$\r \r Here $\\alpha^l\\_{ij}$, $\\beta^l\\_{ij}$ and $\\gamma^l\\_{ij}$ are defined by using the [softmax](https://paperswithcode.com/method/softmax) function with $\\lambda^l\\_{\\alpha_{ij}}$, $\\lambda^l\\_{\\beta_{ij}}$ and $\\lambda^l\\_{\\gamma_{ij}}$ as control parameters respectively. We use $1\\times1$ [convolution](https://paperswithcode.com/method/convolution) layers to compute the weight scalar maps $\\mathbf{\\lambda}^l_\\alpha$, $\\mathbf{\\lambda}^l\\_\\beta$ and $\\mathbf{\\lambda}^l\\_\\gamma$ from $\\mathbf{x}^{1\\rightarrow l}$, $\\mathbf{x}^{2\\rightarrow l}$ and $\\mathbf{x}^{3\\rightarrow l}$ respectively, and they can thus be learned through standard back-propagation.\r \r With this method, the features at all the levels are adaptively aggregated at each scale. The outputs are used for object detection following the same pipeline of [YOLOv3](https://paperswithcode.com/method/yolov3).""" ; skos:prefLabel "ASFF" . :ASLFeat a skos:Concept ; dcterms:source ; skos:definition "**ASLFeat** is a convolutional neural network for learning local features that uses deformable convolutional networks to densely estimate and apply local transformation. It also takes advantage of the inherent feature hierarchy to restore spatial resolution and low-level details for accurate keypoint localization. Finally, it uses a peakiness measurement to relate feature responses and derive more indicative detection scores." ; skos:prefLabel "ASLFeat" . 
:ASPP a skos:Concept ; dcterms:source ; skos:altLabel "Atrous Spatial Pyramid Pooling" ; skos:definition "**Atrous Spatial Pyramid Pooling (ASPP)** is a semantic segmentation module for resampling a given feature layer at multiple rates prior to [convolution](https://paperswithcode.com/method/convolution). This amounts to probing the original image with multiple filters that have complementary effective fields of view, thus capturing objects as well as useful image context at multiple scales. Rather than actually resampling features, the mapping is implemented using multiple parallel atrous convolutional layers with different sampling rates." ; skos:prefLabel "ASPP" . :ASU a skos:Concept ; dcterms:source ; skos:altLabel "Amplifying Sine Unit: An Oscillatory Activation Function for Deep Neural Networks to Recover Nonlinear Oscillations Efficiently" ; skos:definition "2023" ; skos:prefLabel "ASU" . :ASVI a skos:Concept ; dcterms:source ; skos:altLabel "Automatic Structured Variational Inference" ; skos:definition "**Automatic Structured Variational Inference (ASVI)** is a fully automated method for constructing structured variational families, inspired by the closed-form update in conjugate Bayesian models. These convex-update families incorporate the forward pass of the input probabilistic program and can therefore capture complex statistical dependencies. Convex-update families have the same space and time complexity as the input probabilistic program and are therefore tractable for a very large family of models including both continuous and discrete variables." ; skos:prefLabel "ASVI" . :ATMO a skos:Concept ; dcterms:source ; skos:altLabel "AdapTive Meta Optimizer" ; skos:definition """This method combines multiple optimization techniques like [ADAM](https://paperswithcode.com/method/adam) and [SGD](https://paperswithcode.com/method/sgd) or PADAM. 
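One way to combine two optimizers is a convex mix of their parameter updates. This is a hedged sketch, not the exact formulation from the paper; the mixing weight `lam`, the learning rates and the quadratic demo are all assumptions:

```python
import numpy as np

def atmo_step(theta, g, m, v, t, lam=0.5, lr_sgd=0.05, lr_adam=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # Convex combination of an SGD update and an Adam update; lam weighs SGD.
    t += 1
    m = beta1 * m + (1 - beta1) * g            # Adam first moment
    v = beta2 * v + (1 - beta2) * g ** 2       # Adam second moment
    adam_dir = (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + eps)
    theta = theta - (lam * lr_sgd * g + (1 - lam) * lr_adam * adam_dir)
    return theta, m, v, t

# demo: minimize f(x) = x^2 starting from x = 2
x = np.array([2.0])
m = np.zeros(1)
v = np.zeros(1)
t = 0
for _ in range(500):
    x, m, v, t = atmo_step(x, 2.0 * x, m, v, t)
```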
This method can be applied to any couple of optimizers.\r \r Image credit: [Combining Optimization Methods Using an Adaptive Meta Optimizer](https://www.mdpi.com/1999-4893/14/6/186)""" ; skos:prefLabel "ATMO" . :ATSS a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Adaptive Training Sample Selection" ; skos:definition """**Adaptive Training Sample Selection**, or **ATSS**, is a method to automatically select positive and negative samples according to statistical characteristics of object. It bridges the gap between anchor-based and anchor-free detectors. \r \r For each ground-truth box $g$ on the image, we first find out its candidate positive samples. As described in Line $3$ to $6$, on each pyramid level, we select $k$ anchor boxes whose center are closest to the center of $g$ based on L2 distance. Supposing there are $\\mathcal{L}$ feature pyramid levels, the ground-truth box $g$ will have $k\\times\\mathcal{L}$ candidate positive samples. After that, we compute the IoU between these candidates and the ground-truth $g$ as $\\mathcal{D}_g$ in Line $7$, whose mean and standard deviation are computed as $m_g$ and $v_g$ in Line $8$ and Line $9$. With these statistics, the IoU threshold for this ground-truth $g$ is obtained as $t_g=m_g+v_g$ in Line $10$. Finally, we select these candidates whose IoU are greater than or equal to the threshold $t_g$ as final positive samples in Line $11$ to $15$. \r \r Notably ATSS also limits the positive samples' center to the ground-truth box as shown in Line $12$. Besides, if an anchor box is assigned to multiple ground-truth boxes, the one with the highest IoU will be selected. The rest are negative samples.""" ; skos:prefLabel "ATSS" . :AUCC a skos:Concept ; skos:altLabel "Area Under the ROC Curve for Clustering" ; skos:definition "The area under the receiver operating characteristics (ROC) Curve, referred to as AUC, is a well-known performance measure in the supervised learning domain. 
Due to its compelling features, it has been employed in a number of studies to evaluate and compare the performance of different classifiers. In this work, we explore AUC as a performance measure in the unsupervised learning domain, more specifically, in the context of cluster analysis. In particular, we elaborate on the use of AUC as an internal/relative measure of clustering quality, which we refer to as Area Under the Curve for Clustering (AUCC). We show that the AUCC of a given candidate clustering solution has an expected value under a null model of random clustering solutions, regardless of the size of the dataset and, more importantly, regardless of the number or the (im)balance of clusters under evaluation. In addition, we elaborate on the fact that, in the context of internal/relative clustering validation as we consider, AUCC is actually a linear transformation of the Gamma criterion from Baker and Hubert (1975), for which we also formally derive a theoretical expected value for chance clusterings. We also discuss the computational complexity of these criteria and show that, while an ordinary implementation of Gamma can be computationally prohibitive and impractical for most real applications of cluster analysis, its equivalence with AUCC actually unveils a much more efficient algorithmic procedure. Our theoretical findings are supported by experimental results. These results show that, in addition to an effective and robust quantitative evaluation provided by AUCC, visual inspection of the ROC curves themselves can be useful to further assess a candidate clustering solution from a broader, qualitative perspective as well." ; skos:prefLabel "AUCC" . :AUCOResNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Auditory Cortex ResNet" ; skos:definition "The Auditory Cortex ResNet, briefly AUCO ResNet, is proposed and tested. It is a deep neural network architecture especially designed for audio classification trained end-to-end. 
It is inspired by the architectural organization of the rat's auditory cortex, together with further architectural innovations. The network surpasses state-of-the-art accuracies on a reference audio benchmark dataset without any kind of preprocessing, imbalanced data handling and, most importantly, any kind of data augmentation." ; skos:prefLabel "AUCO ResNet" . :AVSlowFast a skos:Concept ; dcterms:source ; skos:altLabel "Audiovisual SlowFast Network" ; skos:definition "**Audiovisual SlowFast Network**, or **AVSlowFast**, is an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are integrated with a Faster Audio pathway to model vision and sound in a unified representation. Audio and visual features are fused at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, [DropPathway](https://paperswithcode.com/method/droppathway) is used, which randomly drops the Audio pathway during training as an effective regularization technique. Inspired by prior studies in neuroscience, hierarchical audiovisual synchronization is performed to learn joint audiovisual features." ; skos:prefLabel "AVSlowFast" . :AWARE a skos:Concept ; skos:altLabel "Attentive Walk-Aggregating Graph Neural Network" ; skos:definition "We propose to theoretically and empirically examine the effect of incorporating weighting schemes into walk-aggregating GNNs. To this end, we propose a simple, interpretable, and end-to-end supervised GNN model, called AWARE (Attentive Walk-Aggregating GRaph Neural NEtwork), for graph-level prediction. AWARE aggregates the walk information by means of weighting schemes at distinct levels (vertex-, walk-, and graph-level) in a principled manner. 
By virtue of the incorporated weighting schemes at these different levels, AWARE can emphasize the information important for prediction while diminishing the irrelevant information, leading to representations that can improve learning performance." ; skos:prefLabel "AWARE" . :AWD-LSTM a skos:Concept ; dcterms:source ; skos:altLabel "ASGD Weight-Dropped LSTM" ; skos:definition "**ASGD Weight-Dropped LSTM**, or **AWD-LSTM**, is a type of recurrent neural network that employs [DropConnect](https://paperswithcode.com/method/dropconnect) for regularization, as well as [NT-ASGD](https://paperswithcode.com/method/nt-asgd) for optimization - non-monotonically triggered averaged [SGD](https://paperswithcode.com/method/sgd) - which returns an average of the weights from recent iterations. Additional regularization techniques employed include variable length backpropagation sequences, [variational dropout](https://paperswithcode.com/method/variational-dropout), [embedding dropout](https://paperswithcode.com/method/embedding-dropout), [weight tying](https://paperswithcode.com/method/weight-tying), independent embedding/hidden size, [activation regularization](https://paperswithcode.com/method/activation-regularization) and [temporal activation regularization](https://paperswithcode.com/method/temporal-activation-regularization)." ; skos:prefLabel "AWD-LSTM" . :AbsolutePositionEncodings a skos:Concept ; dcterms:source ; skos:definition """**Absolute Position Encodings** are a type of position embedding for [Transformer](https://paperswithcode.com/method/transformer)-based models, where positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d\_{model}$ as the embeddings, so that the two can be summed. 
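A minimal sketch of the sinusoidal scheme spelled out next; the function name and demo sizes are illustrative:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Even dimensions get sin, odd dimensions get cos, per the formulas below.
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model // 2)
    angle = pos / np.power(10000.0, 2.0 * i / d_model)
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(50, 16)   # one row per position, ready to add to embeddings
```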
In the original implementation, sine and cosine functions of different frequencies are used:\r \r $$ \text{PE}\left(pos, 2i\right) = \sin\left(pos/10000^{2i/d\_{model}}\right) $$\r \r $$ \text{PE}\left(pos, 2i+1\right) = \cos\left(pos/10000^{2i/d\_{model}}\right) $$\r \r where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. This function was chosen because the authors hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $\text{PE}\_{pos+k}$ can be represented as a linear function of $\text{PE}\_{pos}$.\r \r Image Source: [D2L.ai](https://d2l.ai/chapter_attention-mechanisms/self-attention-and-positional-encoding.html)""" ; skos:prefLabel "Absolute Position Encodings" . :AccoMontage a skos:Concept ; dcterms:source ; skos:definition "**AccoMontage** is a model for accompaniment arrangement, a type of music generation task involving intertwined constraints of melody, harmony, texture, and music structure. AccoMontage generates piano accompaniments for folk/pop songs based on a lead sheet (i.e. a melody with chord progression). It first retrieves phrase montages from a database while recombining them structurally using dynamic programming. Second, chords of the retrieved phrases are manipulated to match the lead sheet via style transfer. Lastly, the system offers controls over the generation process. In contrast to pure deep learning approaches, AccoMontage uses a hybrid pathway, in which rule-based optimization and deep learning are both leveraged." ; skos:prefLabel "AccoMontage" . :Accordion a skos:Concept ; dcterms:source ; skos:definition "**Accordion** is a gradient communication scheduling algorithm that is generic across models while imposing low computational overheads. 
Accordion inspects the change in the gradient norms to detect critical regimes and adjusts the communication schedule dynamically. Accordion can adjust both the gradient compression rate and the batch size without additional parameter tuning." ; skos:prefLabel "Accordion" . :AccumulatingEligibilityTrace a skos:Concept ; skos:definition """An **Accumulating Eligibility Trace** is a type of [eligibility trace](https://paperswithcode.com/method/eligibility-trace) where the trace increments in an accumulative way. For the memory vector $\textbf{e}\_{t} \in \mathbb{R}^{b} \geq \textbf{0}$:\r \r $$\mathbf{e\_{0}} = \textbf{0}$$\r \r $$\textbf{e}\_{t} = \nabla{\hat{v}}\left(S\_{t}, \mathbf{\theta}\_{t}\right) + \gamma\lambda\textbf{e}\_{t-1}$$""" ; skos:prefLabel "Accumulating Eligibility Trace" . :Accuracy-RobustnessArea\(ARA\) a skos:Concept ; dcterms:source ; skos:altLabel "Accuracy-Robustness Area" ; skos:definition "In the space of adversarial perturbation against classifier accuracy, the ARA is the area between a classifier's curve and the straight line defined by a naive classifier's maximum accuracy. Intuitively, the ARA measures a combination of the classifier’s predictive power and its ability to overcome an adversary. Importantly, when contrasted against existing robustness metrics, the ARA takes into account the classifier’s performance against all adversarial examples, without bounding them by some arbitrary $\epsilon$." ; skos:prefLabel "Accuracy-Robustness Area (ARA)" . :ActivationNormalization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Activation Normalization** is a type of normalization used for flow-based generative models; specifically it was introduced in the [GLOW](https://paperswithcode.com/method/glow) architecture. 
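A minimal sketch of the per-channel affine transformation with data-dependent initialization from the first batch. The class name and NumPy usage are illustrative, and the log-determinant bookkeeping needed in a real flow is omitted:

```python
import numpy as np

class ActNorm:
    # Per-channel affine y = s * x + b. On the first batch, s and b are set
    # so outputs have zero mean and unit variance per channel; afterwards
    # they would be trained like ordinary parameters.
    def __init__(self):
        self.initialized = False

    def __call__(self, x):                     # x: (batch, channels)
        if not self.initialized:
            mu = x.mean(axis=0)
            sigma = x.std(axis=0)
            self.s = 1.0 / sigma
            self.b = -mu / sigma
            self.initialized = True
        return self.s * x + self.b

rng = np.random.default_rng(1)
layer = ActNorm()
y = layer(rng.normal(loc=3.0, scale=2.0, size=(256, 4)))
```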
An ActNorm layer performs an affine transformation of the activations using a scale and bias parameter per channel, similar to [batch normalization](https://paperswithcode.com/method/batch-normalization). These parameters are initialized such that the post-actnorm activations per-channel have zero mean and unit variance given an initial minibatch of data. This is a form of data-dependent initialization. After initialization, the scale and bias are treated as regular trainable parameters that are independent of the data." ; skos:prefLabel "Activation Normalization" . :ActivationRegularization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Activation Regularization (AR)**, or $L\_{2}$ activation regularization, is regularization performed on activations as opposed to weights. It is usually used in conjunction with [RNNs](https://paperswithcode.com/methods/category/recurrent-neural-networks). It is defined as:\r \r $$\alpha{L}\_{2}\left(m\circ{h\_{t}}\right) $$\r \r where $m$ is a [dropout](https://paperswithcode.com/method/dropout) mask used by later parts of the model, $L\_{2}$ is the $L\_{2}$ norm, $h_{t}$ is the output of an RNN at timestep $t$, and $\alpha$ is a scaling coefficient. \r \r When applied to the output of a dense layer, AR penalizes activations that are substantially away from 0, encouraging activations to remain small.""" ; skos:prefLabel "Activation Regularization" . :ActiveConvolution a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "An **Active Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) which does not have a fixed shape of the receptive field, and can be used to take more diverse forms of receptive fields for convolutions. Its shape can be learned through backpropagation during training. It can be seen as a generalization of convolution; it can define not only all conventional convolutions, but also convolutions with fractional pixel coordinates. 
First, we can freely change the shape of the convolution, which provides greater freedom to form CNN structures. Second, the shape of the convolution is learned while training, so there is no need to tune it by hand." ; skos:prefLabel "Active Convolution" . :AdaBound a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**AdaBound** is a variant of the [Adam](https://paperswithcode.com/method/adam) stochastic optimizer which is designed to be more robust to extreme learning rates. Dynamic bounds are employed on learning rates, where the lower and upper bound are initialized as zero and infinity respectively, and they both smoothly converge to a constant final step size. AdaBound can be regarded as an adaptive method at the beginning of training, and thereafter it gradually and smoothly transforms to [SGD](https://paperswithcode.com/method/sgd) (or with momentum) as the time step increases. \r \r $$ g\_{t} = \nabla{f}\_{t}\left(x\_{t}\right) $$\r \r $$ m\_{t} = \beta\_{1t}m\_{t-1} + \left(1-\beta\_{1t}\right)g\_{t} $$\r \r $$ v\_{t} = \beta\_{2}v\_{t-1} + \left(1-\beta\_{2}\right)g\_{t}^{2} \text{ and } V\_{t} = \text{diag}\left(v\_{t}\right) $$\r \r $$ \hat{\eta}\_{t} = \text{Clip}\left(\alpha/\sqrt{V\_{t}}, \eta\_{l}\left(t\right), \eta\_{u}\left(t\right)\right) \text{ and } \eta\_{t} = \hat{\eta}\_{t}/\sqrt{t} $$\r \r $$ x\_{t+1} = \Pi\_{\mathcal{F}, \text{diag}\left(\eta\_{t}^{-1}\right)}\left(x\_{t} - \eta\_{t} \odot m\_{t} \right) $$\r \r Where $\alpha$ is the initial step size, and $\eta_{l}$ and $\eta_{u}$ are the lower and upper bound functions respectively.""" ; skos:prefLabel "AdaBound" . :AdaDelta a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**AdaDelta** is a stochastic optimization technique that provides a per-dimension learning rate method for [SGD](https://paperswithcode.com/method/sgd). 
It is an extension of [Adagrad](https://paperswithcode.com/method/adagrad) that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to a fixed size $w$.\r \r Instead of inefficiently storing $w$ previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average $E\\left[g^{2}\\right]\\_{t}$ at time step $t$ then depends only on the previous average and current gradient:\r \r $$E\\left[g^{2}\\right]\\_{t} = \\gamma{E}\\left[g^{2}\\right]\\_{t-1} + \\left(1-\\gamma\\right)g^{2}\\_{t}$$\r \r Usually $\\gamma$ is set to around $0.9$. Rewriting SGD updates in terms of the parameter update vector:\r \r $$ \\Delta\\theta_{t} = -\\eta\\cdot{g\\_{t, i}}$$\r $$\\theta\\_{t+1} = \\theta\\_{t} + \\Delta\\theta_{t}$$\r \r AdaDelta takes the form:\r \r $$ \\Delta\\theta_{t} = -\\frac{\\eta}{\\sqrt{E\\left[g^{2}\\right]\\_{t} + \\epsilon}}g_{t} $$\r \r The main advantage of AdaDelta is that we do not need to set a default learning rate.""" ; skos:prefLabel "AdaDelta" . :AdaGPR a skos:Concept ; dcterms:source ; skos:definition "**AdaGPR** is an adaptive, layer-wise graph [convolution](https://paperswithcode.com/method/convolution) model. AdaGPR applies adaptive generalized Pageranks at each layer of a [GCNII](https://paperswithcode.com/method/gcnii) model by learning to predict the coefficients of generalized Pageranks using sparse solvers." ; skos:prefLabel "AdaGPR" . :AdaGrad a skos:Concept ; rdfs:seeAlso ; skos:definition """**AdaGrad** is a stochastic optimization method that adapts the learning rate to the parameters. It performs smaller updates for parameters associated with frequently occurring features, and larger updates for parameters associated with infrequently occurring features. 
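A minimal sketch of this per-parameter scaling, with the update rule spelled out next; the accumulator name, learning rate and quadratic demo are illustrative:

```python
import numpy as np

def adagrad_step(theta, g, G, lr=0.1, eps=1e-8):
    # Accumulate squared gradients per parameter; parameters with a large
    # accumulated history receive proportionally smaller steps.
    G = G + g ** 2
    theta = theta - lr * g / (np.sqrt(G) + eps)
    return theta, G

# demo: minimize f(theta) = sum(theta ** 2)
theta = np.array([1.0, -2.0])
G = np.zeros(2)
for _ in range(2000):
    theta, G = adagrad_step(theta, 2.0 * theta, G)
```

Note that `G` only ever grows, which is exactly the weakness discussed below: the effective learning rate decays monotonically.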
In its update rule, Adagrad modifies the general learning rate $\eta$ at each time step $t$ for every parameter $\theta\_{i}$ based on the past gradients for $\theta\_{i}$: \r \r $$ \theta\_{t+1, i} = \theta\_{t, i} - \frac{\eta}{\sqrt{G\_{t, ii} + \epsilon}}g\_{t, i} $$\r \r The benefit of AdaGrad is that it eliminates the need to manually tune the learning rate; most leave it at a default value of $0.01$. Its main weakness is the accumulation of the squared gradients in the denominator. Since every added term is positive, the accumulated sum keeps growing during training, causing the learning rate to shrink and eventually become infinitesimally small.\r \r Image: [Alec Radford](https://twitter.com/alecrad)""" ; skos:prefLabel "AdaGrad" . :AdaHessian a skos:Concept ; dcterms:source ; skos:altLabel "ADAHESSIAN" ; skos:definition "AdaHessian achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods, including variants of [ADAM](https://paperswithcode.com/method/adam). In particular, we perform extensive tests on CV, NLP, and recommendation system tasks and find that AdaHessian: (i) achieves 1.80%/1.45% higher accuracy on ResNets20/32 on Cifar10, and 5.55% higher accuracy on ImageNet as compared to ADAM; (ii) outperforms ADAMW for transformers by 0.27/0.33 BLEU score on IWSLT14/WMT14 and 1.8/1.0 PPL on PTB/Wikitext-103; and (iii) achieves 0.032% better score than [AdaGrad](https://paperswithcode.com/method/adagrad) for DLRM on the Criteo Ad Kaggle dataset. Importantly, we show that the cost per iteration of AdaHessian is comparable to first-order methods, and that it exhibits robustness towards its hyperparameters." ; skos:prefLabel "AdaHessian" . :AdaMod a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**AdaMod** is a stochastic optimizer that restricts adaptive learning rates with adaptive and momental upper bounds. 
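A minimal sketch of the bounding mechanism, following the update equations given below; the variable names, learning rate and quadratic demo are illustrative:

```python
import numpy as np

def adamod_step(theta, g, m, v, s, t, lr=0.01,
                beta1=0.9, beta2=0.999, beta3=0.999, eps=1e-8):
    # Adam step whose per-parameter learning rates are clipped from above
    # by their own exponential moving average (the momental bound).
    t += 1
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    eta = lr / (np.sqrt(v_hat) + eps)
    s = beta3 * s + (1 - beta3) * eta          # moving average of step sizes
    eta_hat = np.minimum(eta, s)               # bound unexpectedly large rates
    theta = theta - eta_hat * m_hat
    return theta, m, v, s, t

# demo: minimize f(x) = x^2 starting from x = 1
x = np.array([1.0])
m = np.zeros(1)
v = np.zeros(1)
s = np.zeros(1)
t = 0
for _ in range(3000):
    x, m, v, s, t = adamod_step(x, 2.0 * x, m, v, s, t)
```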
The dynamic learning rate bounds are based on the exponential moving averages of the adaptive learning rates themselves, which smooth out unexpected large learning rates and stabilize the training of deep neural networks.\r \r \r The weight updates are performed as:\r \r \r $$ g\\_{t} = \\nabla{f}\\_{t}\\left(\\theta\\_{t-1}\\right) $$\r \r $$ m\\_{t} = \\beta\\_{1}m\\_{t-1} + \\left(1-\\beta\\_{1}\\right)g\\_{t} $$\r \r $$ v\\_{t} = \\beta\\_{2}v\\_{t-1} + \\left(1-\\beta\\_{2}\\right)g\\_{t}^{2} $$\r \r $$ \\hat{m}\\_{t} = m\\_{t} / \\left(1 - \\beta^{t}\\_{1}\\right)$$\r \r $$ \\hat{v}\\_{t} = v\\_{t} / \\left(1 - \\beta^{t}\\_{2}\\right)$$\r \r $$ \\eta\\_{t} = \\alpha\\_{t} / \\left(\\sqrt{\\hat{v}\\_{t}} + \\epsilon\\right) $$\r \r $$ s\\_{t} = \\beta\\_{3}s\\_{t-1} + (1-\\beta\\_{3})\\eta\\_{t} $$\r \r $$ \\hat{\\eta}\\_{t} = \\text{min}\\left(\\eta\\_{t}, s\\_{t}\\right) $$\r \r $$ \\theta\\_{t} = \\theta\\_{t-1} - \\hat{\\eta}\\_{t}\\hat{m}\\_{t} $$""" ; skos:prefLabel "AdaMod" . :AdaRNN a skos:Concept ; dcterms:source ; skos:definition "**AdaRNN** is an adaptive [RNN](https://paperswithcode.com/methods/category/recurrent-neural-networks) that learns an adaptive model through two modules: [Temporal Distribution Characterization](https://paperswithcode.com/method/temporal-distribution-characterization) (TDC) and [Temporal Distribution Matching](https://paperswithcode.com/method/temporal-distribution-matching) (TDM) algorithms. Firstly, to better characterize the distribution information in time-series, TDC splits the training data into $K$ most diverse periods that have a large distribution gap inspired by the principle of maximum entropy. After that, a temporal distribution matching (TDM) algorithm is used to dynamically reduce distribution divergence using a [RNN](https://paperswithcode.com/methods/category/recurrent-neural-networks)-based model." ; skos:prefLabel "AdaRNN" . 
:AdaShift a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**AdaShift** is a type of adaptive stochastic optimizer that decorrelates $v\\_{t}$ and $g\\_{t}$ in [Adam](https://paperswithcode.com/method/adam) by temporal shifting, i.e., using temporally shifted gradient $g\\_{t−n}$ to calculate $v\\_{t}$. The authors argue that an inappropriate correlation between gradient $g\\_{t}$ and the second-moment term $v\\_{t}$ exists in Adam, which results in a large gradient being likely to have a small step size while a small gradient may have a large step size. The authors argue that such biased step sizes are the fundamental cause of non-convergence of Adam.\r \r The AdaShift updates, based on the idea of temporal independence between gradients, are as follows:\r \r $$ g\\_{t} = \\nabla{f\\_{t}}\\left(\\theta\\_{t}\\right) $$\r \r $$ m\\_{t} = \\sum^{n-1}\\_{i=0}\\beta^{i}\\_{1}g\\_{t-i}/\\sum^{n-1}\\_{i=0}\\beta^{i}\\_{1} $$\r \r Then for $i=1$ to $M$:\r \r $$ v\\_{t}\\left[i\\right] = \\beta\\_{2}v\\_{t-1}\\left[i\\right] + \\left(1-\\beta\\_{2}\\right)\\phi\\left(g^{2}\\_{t-n}\\left[i\\right]\\right) $$\r \r $$ \\theta\\_{t}\\left[i\\right] = \\theta\\_{t-1}\\left[i\\right] - \\alpha\\_{t}/\\sqrt{v\\_{t}\\left[i\\right]}\\cdot{m\\_{t}\\left[i\\right]} $$""" ; skos:prefLabel "AdaShift" . :AdaSmooth a skos:Concept ; dcterms:source ; skos:altLabel "Adaptive Smooth Optimizer" ; skos:definition """**AdaSmooth** is a stochastic optimization technique that allows for per-dimension learning rate method for [SGD](https://paperswithcode.com/method/sgd). It is an extension of [Adagrad](https://paperswithcode.com/method/adagrad) and [AdaDelta](https://paperswithcode.com/method/adadelta) that seek to reduce its aggressive, monotonically decreasing learning rate. 
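A minimal sketch of the adaptive-window idea, assuming the effective-ratio formulation given below; the function name, defaults and quadratic demo are illustrative assumptions:

```python
import numpy as np
from collections import deque

def adasmooth_run(x0, grad, steps=500, M=10, lr=0.1,
                  rho1=0.5, rho2=0.99, eps=1e-6):
    # The smoothing constant c is scaled by the effective ratio: net movement
    # over total movement across the last M updates. Steady one-directional
    # progress gives a ratio near 1 (short memory); oscillation gives a
    # ratio near 0 (long memory), matching the formulas below.
    x = np.array(x0, dtype=float)
    E = np.zeros_like(x)                      # running average of g^2
    deltas = deque(maxlen=M)                  # last M parameter updates
    for _ in range(steps):
        g = grad(x)
        if deltas:
            net = np.abs(np.sum(list(deltas), axis=0))
            total = np.sum(np.abs(list(deltas)), axis=0) + eps
            e = net / total                   # effective ratio in [0, 1]
        else:
            e = np.ones_like(x)
        c = (rho2 - rho1) * e + (1.0 - rho2)
        E = c ** 2 * g ** 2 + (1.0 - c ** 2) * E
        dx = -lr * g / np.sqrt(E + eps)
        x = x + dx
        deltas.append(dx)
    return x

x_final = adasmooth_run([2.0], lambda x: 2.0 * x)
```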
Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to a fixed size $w$, while AdaSmooth adaptively selects the size of the window.\r \r Given the window size $M$, the effective ratio is calculated by \r \r $$e_t = \\frac{s_t}{n_t}= \\frac{| x_t - x_{t-M}|}{\\sum_{i=0}^{M-1} | x_{t-i} - x_{t-1-i}|}\\\\\r = \\frac{| \\sum_{i=0}^{M-1} \\Delta x_{t-1-i}|}{\\sum_{i=0}^{M-1} | \\Delta x_{t-1-i}|}.$$\r \r Given the effective ratio, the scaled smoothing constant is obtained by:\r \r $$c_t = ( \\rho_2- \\rho_1) \\times e_t + (1-\\rho_2).$$\r \r The running average $E\\left[g^{2}\\right]\\_{t}$ at time step $t$ then depends only on the previous average and the current gradient:\r \r $$ E\\left[g^{2}\\right]\\_{t} = c_t^2 \\odot g_{t}^2 + \\left(1-c_t^2 \\right)\\odot E[g^2]_{t-1} $$\r \r Usually $\\rho_1$ is set to around $0.5$ and $\\rho_2$ is set to around $0.99$. The update step then follows:\r \r $$ \\Delta x_t = -\\frac{\\eta}{\\sqrt{E\\left[g^{2}\\right]\\_{t} + \\epsilon}} \\odot g_{t}, $$\r \r which is incorporated into the final update:\r \r $$x_{t+1} = x_{t} + \\Delta x_t.$$\r \r The main advantage of AdaSmooth is its faster convergence rate and insensitivity to hyperparameters.""" ; skos:prefLabel "AdaSmooth" . 
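The AdaSmooth step can be sketched in NumPy as follows; the history bookkeeping and the no-history fallback are illustrative assumptions of this sketch, not details from the paper:

```python
import numpy as np

def adasmooth_step(x, grad, eg2, dx_hist, M=10, rho1=0.5, rho2=0.99,
                   lr=0.01, eps=1e-8):
    """One AdaSmooth step; dx_hist is a list of recent parameter deltas."""
    if dx_hist:
        hist = np.stack(dx_hist[-M:])
        # effective ratio: net movement over total movement in the window
        e = np.abs(hist.sum(axis=0)) / (np.abs(hist).sum(axis=0) + eps)
    else:
        e = np.ones_like(x)  # no history yet: use the least-smoothing constant
    c = (rho2 - rho1) * e + (1 - rho2)         # scaled smoothing constant
    eg2 = c**2 * grad**2 + (1 - c**2) * eg2    # running average of squared grads
    dx = -lr / np.sqrt(eg2 + eps) * grad
    dx_hist.append(dx)
    return x + dx, eg2
```

A large effective ratio (steady progress in one direction) pushes the smoothing constant toward $\rho_2 - \rho_1 + (1-\rho_2)$, weighting recent gradients more; oscillation pushes it toward $1-\rho_2$, smoothing more heavily.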
:AdaSqrt a skos:Concept ; dcterms:source ; skos:definition """**AdaSqrt** is a stochastic optimization technique that is motivated by the observation that methods like [Adagrad](https://paperswithcode.com/method/adagrad) and [Adam](https://paperswithcode.com/method/adam) can be viewed as relaxations of [Natural Gradient Descent](https://paperswithcode.com/method/natural-gradient-descent).\r \r The updates are performed as follows:\r \r $$ t \\leftarrow t + 1 $$\r \r $$ \\alpha\\_{t} \\leftarrow \\sqrt{t} $$\r \r $$ g\\_{t} \\leftarrow \\nabla\\_{\\theta}f\\left(\\theta\\_{t-1}\\right) $$\r \r $$ S\\_{t} \\leftarrow S\\_{t-1} + g\\_{t}^{2} $$\r \r $$ \\theta\\_{t+1} \\leftarrow \\theta\\_{t} + \\eta\\frac{\\alpha\\_{t}g\\_{t}}{S\\_{t} + \\epsilon} $$""" ; skos:prefLabel "AdaSqrt" . :Adabelief a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "Adabelief" . :Adafactor a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Adafactor** is a stochastic optimization method based on [Adam](https://paperswithcode.com/method/adam) that reduces memory usage while retaining the empirical benefits of adaptivity. This is achieved through maintaining a factored representation of the squared gradient accumulator across training steps. Specifically, by tracking moving averages of the row and column sums of the squared gradients for matrix-valued variables, we are able to reconstruct a low-rank approximation of the exponentially smoothed accumulator at each training step that is optimal with respect to the generalized Kullback-Leibler divergence. For an $n \\times m$ matrix, this reduces the memory requirements from $O(n m)$ to $O(n + m)$. \r \r Instead of defining the optimization algorithm in terms of absolute step sizes {$\\alpha_t$}$\\_{t=1}^T$, the authors define the optimization algorithm in terms of relative step sizes {$\\rho_t$}$\\_{t=1}^T$, which get multiplied by the scale of the parameters. 
The scale of a parameter vector or matrix is defined as the root-mean-square of its components, lower-bounded by a small constant $\\epsilon_2$. The reason for this lower bound is to allow zero-initialized parameters to escape 0. \r \r Proposed hyperparameters are: $\\epsilon\\_{1} = 10^{-30}$, $\\epsilon\\_{2} = 10^{-3}$, $d=1$, $\\rho\\_{t} = \\min\\left(10^{-2}, \\frac{1}{\\sqrt{t}}\\right)$, $\\hat{\\beta}\\_{2\\_{t}} = 1 - t^{-0.8}$.""" ; skos:prefLabel "Adafactor" . :AdamW a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**AdamW** is a stochastic optimization method that modifies the typical implementation of weight decay in [Adam](https://paperswithcode.com/method/adam) by decoupling [weight decay](https://paperswithcode.com/method/weight-decay) from the gradient update. To see this, note that $L\\_{2}$ regularization in Adam is usually implemented with the modification below, where $w\\_{t}$ is the rate of the weight decay at time $t$:\r \r $$ g\\_{t} = \\nabla{f\\left(\\theta\\_{t}\\right)} + w\\_{t}\\theta\\_{t}$$\r \r while AdamW instead moves the weight decay term into the parameter update:\r \r $$ \\theta\\_{t+1, i} = \\theta\\_{t, i} - \\eta\\left(\\frac{1}{\\sqrt{\\hat{v}\\_{t} + \\epsilon}}\\cdot{\\hat{m}\\_{t}} + w\\_{t, i}\\theta\\_{t, i}\\right), \\forall{t}$$""" ; skos:prefLabel "AdamW" . :Adapter a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "Adapter" . :AdaptiveBins a skos:Concept ; dcterms:source ; skos:altLabel "Adaptive Bins" ; skos:definition "" ; skos:prefLabel "AdaptiveBins" . :AdaptiveDropout a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Adaptive Dropout** is a regularization technique that extends dropout by allowing the dropout probability to be different for different units. The intuition is that there may be hidden units that can individually make confident predictions for the presence or absence of an important feature or combination of features. 
[Dropout](https://paperswithcode.com/method/dropout) will ignore this confidence and drop the unit out 50% of the time. \r \r Denote the activity of unit $j$ in a deep neural network by $a\\_{j}$ and assume that its inputs are {$a\\_{i}: i < j$}. In dropout, $a\\_{j}$ is randomly set to zero with probability 0.5. Let $m\\_{j}$ be a binary variable that is used to mask the activity $a\\_{j}$, so that its value is:\r \r $$ a\\_{j} = m\\_{j}g\\left(\\sum\\_{i: i < j} w\\_{j,i}a\\_{i} + b\\_{j}\\right) $$\r \r where $g\\left(\\cdot\\right)$ is the activation function and $w\\_{j,i}$, $b\\_{j}$ are the weights and bias. In adaptive dropout, the probability for $m\\_{j}$ is not fixed at 0.5 but is itself computed from the unit's inputs by an overlaid network.""" ; skos:prefLabel "Adaptive Dropout" . :AdaptiveFeaturePooling a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Adaptive Feature Pooling** pools features from all levels for each proposal in object detection and fuses them for the following prediction. For each proposal, we map them to different feature levels. Following the idea of [Mask R-CNN](https://paperswithcode.com/method/mask-r-cnn), [RoIAlign](https://paperswithcode.com/method/roi-align) is used to pool feature grids from each level. Then a fusion operation (element-wise max or sum) is utilized to fuse feature grids from different levels.\r \r The motivation for this technique is that in an [FPN](https://paperswithcode.com/method/fpn) we assign proposals to different feature levels based on the size of proposals, which could be suboptimal if proposals with small differences are assigned to different levels, or if the importance of features is not strongly correlated to the level to which they belong.""" ; skos:prefLabel "Adaptive Feature Pooling" . :AdaptiveInputRepresentations a skos:Concept ; dcterms:source ; skos:definition "**Adaptive Input Embeddings** extend the [adaptive softmax](https://paperswithcode.com/method/adaptive-softmax) to input word representations. The factorization assigns more capacity to frequent words and reduces the capacity for less frequent words, with the benefit of reducing overfitting to rare words." ; skos:prefLabel "Adaptive Input Representations" . 
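The decoupling in AdamW is easy to see in code; a minimal single-tensor NumPy sketch (the function name and default hyperparameters are illustrative assumptions):

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update: weight decay is applied to the parameters directly
    in the update, rather than being added to the gradient."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * (m_hat / np.sqrt(v_hat + eps) + weight_decay * theta)
    return theta, m, v
```

Note that the decay term never touches `m` or `v`; in L2-regularized Adam it would be folded into `grad` and thus rescaled by the adaptive denominator, which is exactly what AdamW avoids.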
:AdaptiveInstanceNormalization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Adaptive Instance Normalization** is a normalization method that aligns the mean and variance of the content features with those of the style features. \r \r [Instance Normalization](https://paperswithcode.com/method/instance-normalization) normalizes the input to a single style specified by the affine parameters; Adaptive Instance Normalization extends this to arbitrary given styles. In AdaIN, we receive a content input $x$ and a style input $y$, and we simply align the channel-wise mean and variance of $x$ to match those of $y$. Unlike [Batch Normalization](https://paperswithcode.com/method/batch-normalization), Instance Normalization or [Conditional Instance Normalization](https://paperswithcode.com/method/conditional-instance-normalization), AdaIN has no learnable affine parameters. Instead, it adaptively computes the affine parameters from the style input:\r \r $$\r \\textrm{AdaIN}(x, y)= \\sigma(y)\\left(\\frac{x-\\mu(x)}{\\sigma(x)}\\right)+\\mu(y)\r $$""" ; skos:prefLabel "Adaptive Instance Normalization" . :AdaptiveLoss a skos:Concept ; dcterms:source ; skos:altLabel "Adaptive Robust Loss" ; skos:definition "The Robust Loss is a generalization of the Cauchy/Lorentzian, Geman-McClure, Welsch/Leclerc, generalized Charbonnier, Charbonnier/pseudo-Huber/L1-L2, and L2 loss functions. By introducing robustness as a continuous parameter, the loss function allows algorithms built around robust loss minimization to be generalized, which improves performance on basic vision tasks such as registration and clustering. Interpreting the loss as the negative log of a univariate density yields a general probability distribution that includes normal and Cauchy distributions as special cases. 
This probabilistic interpretation enables the training of neural networks in which the robustness of the loss automatically adapts itself during training, which improves performance on learning-based tasks such as generative image synthesis and unsupervised monocular depth estimation, without requiring any manual parameter tuning." ; skos:prefLabel "Adaptive Loss" . :AdaptiveMasking a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Adaptive Masking** is a type of attention mechanism that allows a model to learn its own context size to attend over. For each head in [Multi-Head Attention](https://paperswithcode.com/method/multi-head-attention), a masking function is added to control the span of the attention. A masking function is a non-increasing function that maps a\r distance to a value in $\\left[0, 1\\right]$. Adaptive masking takes the following soft masking function $m\\_{z}$ parametrized by a real value $z$ in $\\left[0, S\\right]$:\r \r $$ m\\_{z}\\left(x\\right) = \\min\\left[\\max\\left[\\frac{1}{R}\\left(R+z-x\\right), 0\\right], 1\\right] $$\r \r where $R$ is a hyper-parameter that controls its softness; the mask is thus a piecewise-linear function of the distance. This soft masking function is inspired by [Jernite et al. (2017)](https://arxiv.org/abs/1611.06188). The attention weights are then computed on the masked span:\r \r $$ a\\_{tr} = \\frac{m\\_{z}\\left(t-r\\right)\\exp\\left(s\\_{tr}\\right)}{\\sum^{t-1}\\_{q=t-S}m\\_{z}\\left(t-q\\right)\\exp\\left(s\\_{tq}\\right)}$$\r \r An $\\ell\\_{1}$ penalization on the parameters $z\\_{i}$ for each attention head $i$ of the model is added to the loss function:\r \r $$ L = - \\log{P}\\left(w\\_{1}, \\dots, w\\_{T}\\right) + \\frac{\\lambda}{M}\\sum\\_{i}z\\_{i} $$\r \r where $\\lambda > 0$ is the regularization hyperparameter, and $M$ is the number of heads in each\r layer. 
This formulation is differentiable in the parameters $z\\_{i}$, and learnt jointly with the rest of the model.""" ; skos:prefLabel "Adaptive Masking" . :AdaptiveNMS a skos:Concept ; dcterms:source ; skos:definition "**Adaptive Non-Maximum Suppression** is a non-maximum suppression algorithm that applies a dynamic suppression threshold to an instance according to the target density. The motivation is to find an NMS algorithm that works well for pedestrian detection in a crowd. Intuitively, a high NMS threshold keeps more crowded instances while a low NMS threshold wipes out more false positives. The adaptive-NMS thus applies a dynamic suppression strategy, where the threshold rises as instances gather and occlude each other and decays when instances appear separately. To this end, an auxiliary and learnable sub-network is designed to predict the adaptive NMS threshold for each instance." ; skos:prefLabel "Adaptive NMS" . :AdaptiveSoftmax a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Adaptive Softmax** is a speedup technique for the computation of probability distributions over words. The adaptive [softmax](https://paperswithcode.com/method/softmax) is inspired by the class-based [hierarchical softmax](https://paperswithcode.com/method/hierarchical-softmax), where the word classes are built to minimize the computation time. Adaptive softmax achieves efficiency by explicitly taking into account the computation time of matrix-multiplication on parallel systems and combining it with a few important observations, namely keeping a shortlist of frequent words in the root node\r and reducing the capacity of rare words.""" ; skos:prefLabel "Adaptive Softmax" . 
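The soft masking function and the masked attention weights translate directly to NumPy; a small sketch (function and variable names are ours):

```python
import numpy as np

def soft_mask(x, z, R=32):
    """m_z(x) = min(max((R + z - x) / R, 0), 1): equal to 1 for distances
    up to z, a linear ramp of width R, and 0 beyond z + R."""
    return np.clip((R + z - x) / R, 0.0, 1.0)

def masked_attention(scores, distances, z, R=32):
    """Attention weights over a span, down-weighted by the soft mask
    before softmax normalization."""
    w = soft_mask(distances, z, R) * np.exp(scores - scores.max())
    return w / w.sum()
```

Because `soft_mask` is piecewise-linear in `z`, the learned span parameter receives gradients whenever a distance falls inside the ramp, which is what makes the span trainable.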
:AdaptiveSpanTransformer a skos:Concept ; dcterms:source ; skos:definition """The **Adaptive Attention Span Transformer** is a Transformer that utilises an improvement to the self-attention layer called [adaptive masking](https://paperswithcode.com/method/adaptive-masking), which allows the model to choose its own context size. This results in a network where each attention layer gathers information from its own context, allowing scaling to input sequences of more than 8k tokens.\r \r The proposal is based on the observation that, with the dense attention of a traditional [Transformer](https://paperswithcode.com/method/transformer), each attention head shares the same attention span $S$ (attending over the full context), yet many attention heads can specialize in more local context while others attend over the longer sequence. This motivates a variant of self-attention that allows the model to choose its own context size (adaptive masking).""" ; skos:prefLabel "Adaptive Span Transformer" . :AdaptivelySparseTransformer a skos:Concept ; dcterms:source ; skos:definition "The **Adaptively Sparse Transformer** is a type of [Transformer](https://paperswithcode.com/method/transformer)." ; skos:prefLabel "Adaptively Sparse Transformer" . :AdditiveAttention a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Additive Attention**, also known as **Bahdanau Attention**, uses a one-hidden-layer feed-forward network to calculate the attention alignment score:\r \r $$f_{att}\\left(\\textbf{h}_{i}, \\textbf{s}\\_{j}\\right) = v\\_{a}^{T}\\tanh\\left(\\textbf{W}\\_{a}\\left[\\textbf{h}\\_{i};\\textbf{s}\\_{j}\\right]\\right)$$\r \r where $\\textbf{v}\\_{a}$ and $\\textbf{W}\\_{a}$ are learned attention parameters. Here $\\textbf{h}$ refers to the hidden states of the encoder, and $\\textbf{s}$ to the hidden states of the decoder. The function above is thus a type of alignment score function. 
We can use a matrix of alignment scores to show the correlation between source and target words, as the Figure to the right shows.\r \r Within a neural network, once we have the alignment scores, we calculate the final attention weights using a [softmax](https://paperswithcode.com/method/softmax) function over these alignment scores (ensuring they sum to 1).""" ; skos:prefLabel "Additive Attention" . :AdvProp a skos:Concept ; dcterms:source ; skos:definition "**AdvProp** is an adversarial training scheme which treats adversarial examples as additional examples, to prevent overfitting. Key to the method is the use of a separate auxiliary batch norm for adversarial examples, as they have different underlying distributions from normal examples." ; skos:prefLabel "AdvProp" . :AdversarialColorEnhancement a skos:Concept ; dcterms:source ; skos:definition "**Adversarial Color Enhancement** is an approach to generating unrestricted adversarial images by optimizing a color filter via gradient descent." ; skos:prefLabel "Adversarial Color Enhancement" . :AdversarialSoftAdvantageFitting\(ASAF\) a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "Adversarial Soft Advantage Fitting (ASAF)" . :AdversarialSolarization a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "Adversarial Solarization" . :AffCorrs a skos:Concept ; rdfs:seeAlso ; skos:altLabel "Affordance Correspondence" ; skos:definition "**AffCorrs** is a method for one-shot visual search of object parts, i.e. one-shot semantic part correspondence. Given a single reference image of an object with annotated affordance regions, it segments semantically corresponding parts within a target scene. AffCorrs is used to find corresponding affordances both for intra- and inter-class one-shot part segmentation." ; skos:prefLabel "AffCorrs" . 
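The alignment score above can be sketched in a few lines of NumPy (the toy shapes and random inputs are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def additive_attention(h, s, W, v):
    """Additive/Bahdanau attention: score each encoder state h_i against a
    decoder state s with v^T tanh(W [h_i; s]), then softmax the scores.
    h: (T, d_h) encoder states; s: (d_s,) decoder state."""
    concat = np.concatenate([h, np.tile(s, (h.shape[0], 1))], axis=1)  # (T, d_h + d_s)
    scores = np.tanh(concat @ W.T) @ v      # (T,) alignment scores
    e = np.exp(scores - scores.max())       # numerically stable softmax
    return e / e.sum()

T, dh, ds, da = 5, 4, 4, 8
h = rng.normal(size=(T, dh)); s = rng.normal(size=ds)
W = rng.normal(size=(da, dh + ds)); v = rng.normal(size=da)
weights = additive_attention(h, s, W, v)
```

The resulting `weights` form one row of the alignment-score matrix mentioned above: a distribution over source positions for a single decoder step.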
:AffineCoupling a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Affine Coupling** is a method for implementing a normalizing flow (where we stack a sequence of invertible bijective transformation functions). Affine coupling is one of these bijective transformation functions. Specifically, it is an example of a reversible transformation where the forward function, the reverse function and the log-determinant are computationally efficient. For the forward function, we split the input dimension into two parts:\r \r $$ \\mathbf{x}\\_{a}, \\mathbf{x}\\_{b} = \\text{split}\\left(\\mathbf{x}\\right) $$\r \r The second part stays the same $\\mathbf{x}\\_{b} = \\mathbf{y}\\_{b}$, while the first part $\\mathbf{x}\\_{a}$ undergoes an affine transformation, where the parameters for this transformation are learnt using the second part $\\mathbf{x}\\_{b}$ being put through a neural network. Together we have:\r \r $$ \\left(\\log{\\mathbf{s}, \\mathbf{t}}\\right) = \\text{NN}\\left(\\mathbf{x}\\_{b}\\right) $$\r \r $$ \\mathbf{s} = \\exp\\left(\\log{\\mathbf{s}}\\right) $$\r \r $$ \\mathbf{y}\\_{a} = \\mathbf{s} \\odot \\mathbf{x}\\_{a} + \\mathbf{t} $$\r \r $$ \\mathbf{y}\\_{b} = \\mathbf{x}\\_{b} $$\r \r $$ \\mathbf{y} = \\text{concat}\\left(\\mathbf{y}\\_{a}, \\mathbf{y}\\_{b}\\right) $$\r \r Image: [GLOW](https://paperswithcode.com/method/glow)""" ; skos:prefLabel "Affine Coupling" . :AffineOperator a skos:Concept ; dcterms:source ; skos:definition """The **Affine Operator** is an affine transformation layer introduced in the [ResMLP](https://paperswithcode.com/method/resmlp) architecture. 
This replaces [layer normalization](https://paperswithcode.com/method/layer-normalization), as used in [Transformer based networks](https://paperswithcode.com/methods/category/transformers); this is possible since in ResMLP there are no [self-attention layers](https://paperswithcode.com/method/scaled), which makes training more stable and allows a simpler affine transformation.\r \r The affine operator is defined as:\r \r $$ \\operatorname{Aff}_{\\mathbf{\\alpha}, \\mathbf{\\beta}}(\\mathbf{x})=\\operatorname{Diag}(\\mathbf{\\alpha}) \\mathbf{x}+\\mathbf{\\beta} $$\r \r where $\\alpha$ and $\\beta$ are learnable weight vectors. This operation only rescales and shifts the input element-wise. It has several advantages over other normalization operations: first, as opposed to Layer Normalization, it has no cost at inference time, since it can be absorbed into the adjacent linear layer. Second, as opposed to [BatchNorm](https://paperswithcode.com/method/batch-normalization) and Layer Normalization, the Aff operator does not depend on batch statistics.""" ; skos:prefLabel "Affine Operator" . :AggMo a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Aggregated Momentum (AggMo)** is a variant of the [classical momentum](https://paperswithcode.com/method/sgd-with-momentum) stochastic optimizer which maintains several velocity vectors with different $\\beta$ parameters. AggMo averages the velocity vectors when updating the parameters. It resolves the problem of choosing a momentum parameter by taking a linear combination of multiple momentum buffers. Each of the $K$ momentum buffers has a different discount factor, collected in $\\beta \\in \\mathbb{R}^{K}$, and these are averaged for the update. 
The update rule is:\r \r $$ \\textbf{v}\\_{t}^{\\left(i\\right)} = \\beta^{(i)}\\textbf{v}\\_{t-1}^{\\left(i\\right)} - \\nabla\\_{\\theta}f\\left(\\mathbf{\\theta}\\_{t-1}\\right) $$\r \r $$ \\mathbf{\\theta\\_{t}} = \\mathbf{\\theta\\_{t-1}} + \\frac{\\gamma\\_{t}}{K}\\sum^{K}\\_{i=1}\\textbf{v}\\_{t}^{\\left(i\\right)} $$\r \r where $\\textbf{v}^{\\left(i\\right)}\\_{0} = \\mathbf{0}$ for each $i$. The vector $\\beta = \\left[\\beta^{(1)}, \\ldots, \\beta^{(K)}\\right]$ collects the dampening factors.""" ; skos:prefLabel "AggMo" . :AgglomerativeContextualDecomposition a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Agglomerative Contextual Decomposition (ACD)** is an interpretability method that produces hierarchical interpretations for a single prediction made by a neural network, by scoring interactions and building them into a tree. Given a prediction from a trained neural network, ACD produces a hierarchical clustering of the input features, along with the contribution of each cluster to the final prediction. This hierarchy is optimized to identify clusters of features that the DNN learned are predictive." ; skos:prefLabel "Agglomerative Contextual Decomposition" . :AggregatedLearning a skos:Concept ; dcterms:source ; skos:definition "**Aggregated Learning (AgrLearn)** is a vector-quantization approach to learning neural network classifiers. It builds on an equivalence between IB learning and IB quantization and exploits the power of vector quantization, which is well known in information theory." ; skos:prefLabel "Aggregated Learning" . :AgingEvolution a skos:Concept ; dcterms:source ; skos:definition """**Aging Evolution**, or **Regularized Evolution**, is an evolutionary algorithm for [neural architecture search](https://paperswithcode.com/method/neural-architecture-search). 
Whereas in tournament selection, the best architectures are kept, in aging evolution we associate each genotype with an age, and bias the tournament selection to choose\r the younger genotypes. In the context of architecture search, aging evolution allows us to explore the search space more, instead of zooming in on good models too early, as non-aging evolution would.""" ; skos:prefLabel "Aging Evolution" . :AlexNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**AlexNet** is a classic convolutional neural network architecture. It consists of convolutions, [max pooling](https://paperswithcode.com/method/max-pooling) and dense layers as the basic building blocks. Grouped convolutions are used in order to fit the model across two GPUs." ; skos:prefLabel "AlexNet" . :AlignPS a skos:Concept ; dcterms:source ; skos:altLabel "Feature-Aligned Person Search Network" ; skos:definition "**AlignPS**, or **Feature-Aligned Person Search Network**, is an anchor-free framework for efficient person search. The model employs the typical architecture of an anchor-free detection model (i.e., [FCOS](https://paperswithcode.com/method/fcos)). An aligned feature aggregation (AFA) module is designed to make the model focus more on the re-id subtask. Specifically, AFA reshapes some building blocks of [FPN](https://paperswithcode.com/method/fpn) to overcome the issues of region and scale misalignment in re-id feature learning. A [deformable convolution](https://paperswithcode.com/method/deformable-convolution) is exploited to make the re-id embeddings adaptively aligned with the foreground regions. A feature fusion scheme is designed to better aggregate features from different FPN levels, which makes the re-id features more robust to scale variations. The training procedures of re-id and detection are also optimized to place more emphasis on generating robust re-id embeddings." ; skos:prefLabel "AlignPS" . 
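The aging tournament described above can be sketched in a few lines of Python; the bit-string "search space" below is a hypothetical stand-in for a real architecture space, used only to make the loop runnable:

```python
import collections
import random

def aging_evolution(fitness, random_genotype, mutate,
                    population_size=20, sample_size=5, cycles=200, seed=0):
    """Aging (regularized) evolution: each cycle, the best of a random
    sample is mutated and appended, and the OLDEST genotype is removed
    (plain tournament selection would instead keep the best and drop
    the worst)."""
    rnd = random.Random(seed)
    population = collections.deque(random_genotype(rnd)
                                   for _ in range(population_size))
    history = list(population)
    for _ in range(cycles):
        sample = rnd.sample(list(population), sample_size)
        parent = max(sample, key=fitness)
        child = mutate(parent, rnd)
        population.append(child)
        history.append(child)
        population.popleft()  # age out the oldest, even if it is the fittest
    return max(history, key=fitness)

# Toy "search": genotype is a 12-bit string, fitness counts ones.
def random_bits(rnd, n=12):
    return tuple(rnd.randint(0, 1) for _ in range(n))

def flip_one_bit(g, rnd):
    i = rnd.randrange(len(g))
    return g[:i] + (1 - g[i],) + g[i + 1:]

best = aging_evolution(sum, random_bits, flip_one_bit)
```

The single behavioral difference from non-aging tournament selection is the `popleft()` line: good genotypes survive only by producing good children before they age out, which is what gives the method its regularizing effect.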
:All-AttentionLayer a skos:Concept ; dcterms:source ; skos:definition "An **All-Attention Layer** is an attention module and layer for transformers that merges the self-attention and feedforward sublayers into a single unified attention layer. As opposed to the two-step mechanism of the [Transformer](https://paperswithcode.com/method/transformer) layer, it directly builds its representation from the context and a persistent memory block without going through a feedforward transformation. The additional persistent memory block stores, in the form of key-value vectors, information that does not depend on the context. In terms of parameters, these persistent key-value vectors replace the feedforward sublayer." ; skos:prefLabel "All-Attention Layer" . :AlphaFold a skos:Concept ; dcterms:source ; skos:definition """AlphaFold is a deep learning based algorithm for accurate protein structure prediction. AlphaFold incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.\r \r Description from: [Highly accurate protein structure prediction with AlphaFold](https://paperswithcode.com/paper/highly-accurate-protein-structure-prediction)\r \r Image credit: [DeepMind](https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology)""" ; skos:prefLabel "AlphaFold" . :AlphaStar a skos:Concept ; rdfs:seeAlso ; skos:altLabel "DeepMind AlphaStar" ; skos:definition """**AlphaStar** is a reinforcement learning agent for tackling the game of Starcraft II. It learns a policy $\\pi\\_{\\theta}\\left(a\\_{t}\\mid{s\\_{t}}, z\\right) = P\\left[a\\_{t}\\mid{s\\_{t}}, z\\right]$ using a neural network for parameters $\\theta$ that receives observations $s\\_{t} = \\left(o\\_{1:t}, a\\_{1:t-1}\\right)$ as inputs and chooses actions as outputs. 
Additionally, the policy conditions on a statistic $z$ that summarizes a strategy sampled from human data such as a build order [1].\r \r AlphaStar uses numerous types of architecture to incorporate different types of features. Observations of player and enemy units are processed with a [Transformer](https://paperswithcode.com/method/transformer). Scatter connections are used to integrate spatial and non-spatial information. The temporal sequence of observations is processed by a core [LSTM](https://paperswithcode.com/method/lstm). Minimap features are extracted with a Residual Network. To manage the combinatorial action space, the agent uses an autoregressive policy and a recurrent [pointer network](https://paperswithcode.com/method/pointer-net).\r \r The agent is trained first with supervised learning from human replays. Parameters are subsequently trained using reinforcement learning that maximizes the win rate against opponents. The RL algorithm is based on a policy-gradient algorithm similar to actor-critic. Updates are performed asynchronously and off-policy. To deal with this, a combination of $TD\\left(\\lambda\\right)$ and [V-trace](https://paperswithcode.com/method/v-trace) are used, as well as a new self-imitation algorithm (UPGO).\r \r Lastly, to address game-theoretic challenges, AlphaStar is trained with league training to try to approximate a fictitious self-play (FSP) setting which avoids cycles by computing a best response against a uniform mixture of all previous policies. The league of potential opponents includes a diverse range of agents, including policies from current and previous agents.\r \r Image Credit: [Yekun Chai](https://ychai.uk/notes/2019/07/21/RL/DRL/Decipher-AlphaStar-on-StarCraft-II/)\r \r #### References\r 1. Chai, Yekun. "AlphaStar: Grandmaster level in StarCraft II Explained." (2019). 
[https://ychai.uk/notes/2019/07/21/RL/DRL/Decipher-AlphaStar-on-StarCraft-II/](https://ychai.uk/notes/2019/07/21/RL/DRL/Decipher-AlphaStar-on-StarCraft-II/)\r \r #### Code Implementation\r 1. https://github.com/opendilab/DI-star""" ; skos:prefLabel "AlphaStar" . :AlphaZero a skos:Concept ; dcterms:source ; skos:definition "**AlphaZero** is a reinforcement learning agent for playing board games such as Go, chess, and shogi." ; skos:prefLabel "AlphaZero" . :AltCLIP a skos:Concept ; dcterms:source ; skos:definition "In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flickr30k-CN, and COCO-CN. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding. Our models and code are available at https://github.com/FlagAI-Open/FlagAI." ; skos:prefLabel "AltCLIP" . :AltDiffusion a skos:Concept ; dcterms:source ; skos:definition "In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. 
We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flickr30k-CN, and COCO-CN. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding. Our models and code are available at https://github.com/FlagAI-Open/FlagAI." ; skos:prefLabel "AltDiffusion" . :AlterNet a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "AlterNet" . :AmoebaNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**AmoebaNet** is a convolutional neural network found through regularized evolution architecture search. The search space is NASNet, which specifies a space of image classifiers with a fixed outer structure: a feed-forward stack of [Inception-like modules](https://paperswithcode.com/method/inception-module) called cells. The discovered architecture is shown to the right." ; skos:prefLabel "AmoebaNet" . :AnnealingSNNL a skos:Concept ; dcterms:source ; skos:altLabel "Soft Nearest Neighbor Loss with Annealing Temperature" ; skos:definition "" ; skos:prefLabel "Annealing SNNL" . :Anti-AliasDownsampling a skos:Concept ; dcterms:source ; skos:definition "**Anti-Alias Downsampling (AA)** aims to improve the shift-equivariance of deep networks. Max-pooling is inherently composed of two operations: the first is to densely evaluate the max operator, and the second is naive subsampling. AA inserts a low-pass filter between them to achieve practical anti-aliasing in any existing strided layer such as strided [convolution](https://paperswithcode.com/method/convolution). The smoothing factor can be adjusted by changing the blur kernel filter size, where a larger filter size results in increased blur." ; skos:prefLabel "Anti-Alias Downsampling" . 
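The blur-then-subsample idea behind anti-alias downsampling can be illustrated with a 1-D NumPy sketch; the [1, 2, 1]/4 binomial kernel and edge padding are illustrative choices of this sketch:

```python
import numpy as np

def blur_downsample(x, stride=2):
    """Anti-aliased downsampling: dense low-pass blur with a [1, 2, 1]/4
    binomial kernel, followed by subsampling by `stride`."""
    kernel = np.array([1.0, 2.0, 1.0]) / 4.0
    padded = np.pad(x, 1, mode="edge")             # keep length under 'valid' conv
    blurred = np.convolve(padded, kernel, mode="valid")
    return blurred[::stride]

def max_blur_pool(x, stride=2):
    """Anti-aliased max pooling: densely evaluate the max over adjacent
    pairs, then downsample with the blur instead of naive subsampling."""
    dense_max = np.maximum(x[:-1], x[1:])          # densely evaluated max
    return blur_downsample(dense_max, stride)
```

Because the low-pass filter is applied before subsampling, a one-sample shift of the input changes the output smoothly rather than abruptly, which is the shift-equivariance improvement described above.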
:AnycostGAN a skos:Concept ; dcterms:source ; skos:definition "**Anycost GAN** is a type of generative adversarial network for image synthesis and editing. Given an input image, we project it into the latent space with encoder $E$ and backward optimization. We can modify the latent code with user input to edit the image. During editing, a sub-generator of small cost is used for fast and interactive preview; during idle time, the full cost generator renders the final, high-quality output. The outputs from the full and sub-generators are visually consistent during projection and editing." ; skos:prefLabel "Anycost GAN" . :Ape-X a skos:Concept ; dcterms:source ; skos:definition """**Ape-X** is a distributed architecture for deep reinforcement learning. The algorithm decouples acting from learning: the actors interact with their own instances of the environment by selecting actions according to a shared neural network, and accumulate the resulting experience in a shared [experience replay](https://paperswithcode.com/method/experience-replay) memory; the learner replays samples of experience and updates the neural network. The architecture relies on [prioritized experience replay](https://paperswithcode.com/method/prioritized-experience-replay) to focus only on the most significant data generated by the actors.\r \r In contrast to Gorila, Ape-X uses a shared, centralized replay memory, and instead of sampling\r uniformly, it samples by priority so that the most useful data is drawn more often. All communications are batched with the centralized replay, increasing efficiency and throughput at the cost of some latency. \r By learning off-policy, Ape-X can combine data from many distributed actors; giving the different actors different exploration policies broadens the diversity of the experience they jointly encounter.""" ; skos:prefLabel "Ape-X" . 
:Ape-XDPG a skos:Concept ; dcterms:source ; skos:definition "**Ape-X DPG** combines [DDPG](https://paperswithcode.com/method/ddpg) with distributed [prioritized experience replay](https://paperswithcode.com/method/prioritized-experience-replay) through the [Ape-X](https://paperswithcode.com/method/ape-x) architecture." ; skos:prefLabel "Ape-X DPG" . :Ape-XDQN a skos:Concept ; dcterms:source ; skos:definition "**Ape-X DQN** is a variant of a [DQN](https://paperswithcode.com/method/dqn) with some components of [Rainbow-DQN](https://paperswithcode.com/method/rainbow-dqn) that utilizes distributed [prioritized experience replay](https://paperswithcode.com/method/prioritized-experience-replay) through the [Ape-X](https://paperswithcode.com/method/ape-x) architecture." ; skos:prefLabel "Ape-X DQN" . :Apollo a skos:Concept ; dcterms:source ; skos:altLabel "Adaptive Parameter-wise Diagonal Quasi-Newton Method" ; skos:definition "" ; skos:prefLabel "Apollo" . :ArcFace a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Additive Angular Margin Loss" ; skos:definition """**ArcFace**, or **Additive Angular Margin Loss**, is a loss function used in face recognition tasks. The [softmax](https://paperswithcode.com/method/softmax) is traditionally used in these tasks. However, the softmax loss function does not explicitly optimise the feature embedding to enforce higher similarity for intraclass samples and diversity for inter-class samples, which results in a performance gap for deep face recognition under large intra-class appearance variations. \r \r The ArcFace loss transforms the logits $W^{T}\\_{j}x\\_{i} = || W\\_{j} || \\text{ } || x\\_{i} || \\cos\\theta\\_{j}$,\r where $\\theta\\_{j}$ is the angle between the weight $W\\_{j}$ and the feature $x\\_{i}$. The individual weight $ || W\\_{j} || = 1$ is fixed by $l\\_{2}$ normalization. 
The embedding feature $ ||x\\_{i} ||$ is fixed by $l\\_{2}$ normalization and re-scaled to $s$. The normalisation step on features and weights makes the predictions only depend on the angle between the feature and the weight. The learned embedding\r features are thus distributed on a hypersphere with a radius of $s$. Finally, an additive angular margin penalty $m$ is added between $x\\_{i}$ and $W\\_{y\\_{i}}$ to simultaneously enhance the intra-class compactness and inter-class discrepancy. Since the proposed additive angular margin penalty is\r equal to the geodesic distance margin penalty in the normalised hypersphere, the method is named ArcFace:\r \r $$ L\\_{3} = -\\frac{1}{N}\\sum^{N}\\_{i=1}\\log\\frac{e^{s\\left(\\cos\\left(\\theta\\_{y\\_{i}} + m\\right)\\right)}}{e^{s\\left(\\cos\\left(\\theta\\_{y\\_{i}} + m\\right)\\right)} + \\sum^{n}\\_{j=1, j \\neq y\\_{i}}e^{s\\cos\\theta\\_{j}}} $$\r \r The authors select face images from 8 different identities containing enough samples (around 1,500 images/class) to train 2-D feature embedding networks with the softmax and ArcFace loss, respectively. As the Figure shows, the softmax loss provides roughly separable feature embedding\r but produces noticeable ambiguity in decision boundaries, while the proposed ArcFace loss can obviously enforce a more evident gap between the nearest classes.\r \r Other alternatives to enforce intra-class compactness and inter-class distance include [Supervised Contrastive Learning](https://arxiv.org/abs/2004.11362).""" ; skos:prefLabel "ArcFace" . :Assemble-ResNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Assemble-ResNet** is a modification to the [ResNet](https://paperswithcode.com/method/resnet) architecture with several tweaks including using [ResNet-D](https://paperswithcode.com/method/resnet-d), channel attention, [anti-alias downsampling](https://paperswithcode.com/method/anti-alias-downsampling), and Big Little Networks." 
; skos:prefLabel "Assemble-ResNet" . :AssociativeLSTM a skos:Concept ; dcterms:source ; skos:definition """An **Associative LSTM** combines an [LSTM](https://paperswithcode.com/method/lstm) with ideas from Holographic Reduced Representations (HRRs) to enable key-value storage of data. HRRs use a “binding” operator to implement key-value\r binding between two vectors (the key and its associated content). They natively implement associative arrays; as a byproduct, they can also easily implement stacks, queues, or lists.""" ; skos:prefLabel "Associative LSTM" . :AsynchronousInteractionAggregation a skos:Concept ; dcterms:source ; skos:definition "**Asynchronous Interaction Aggregation**, or **AIA**, is a network that leverages different interactions to boost action detection. There are two key designs in it: one is the Interaction Aggregation structure (IA) adopting a uniform paradigm to model and integrate multiple types of interaction; the other is the Asynchronous Memory Update algorithm (AMU) that enables us to achieve better performance by modeling very long-term interaction dynamically." ; skos:prefLabel "Asynchronous Interaction Aggregation" . :AttLWB a skos:Concept ; dcterms:source ; skos:altLabel "Attentional Liquid Warping Block" ; skos:definition "**Attentional Liquid Warping Block**, or **AttLWB**, is a module for human image synthesis GANs that propagates the source information - such as texture, style, color and face identity - in both image and feature spaces to the synthesized reference. It firstly learns similarities of the global features among all multiple sources features, and then it fuses the multiple sources features by a linear combination of the learned similarities and the multiple sources in the feature spaces. 
Finally, to better propagate the source identity (style, color, and texture) into the global stream, the fused source features are warped to the global stream by [Spatially-Adaptive Normalization](https://paperswithcode.com/method/spade) (SPADE)." ; skos:prefLabel "AttLWB" . :Attention-augmentedConvolution a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Attention-augmented Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) with a two-dimensional relative self-attention mechanism that can replace convolutions as a stand-alone computational primitive for image classification. It employs [scaled-dot product attention](https://paperswithcode.com/method/scaled) and [multi-head attention](https://paperswithcode.com/method/multi-head-attention) as with [Transformers](https://paperswithcode.com/method/transformer).\r \r It works by concatenating convolutional and attentional feature maps. To see this, consider an original convolution operator with kernel size $k$, $F\\_{in}$ input filters and $F\\_{out}$ output filters. The corresponding attention augmented convolution can be written as\r \r $$\\text{AAConv}\\left(X\\right) = \\text{Concat}\\left[\\text{Conv}(X), \\text{MHA}(X)\\right] $$\r \r $X$ originates from an input tensor of shape $\\left(H, W, F\\_{in}\\right)$. This is flattened to become $X \\in \\mathbb{R}^{HW \\times F\\_{in}}$ which is passed into a multi-head attention module, as well as a convolution (see above).\r \r Similarly to the convolution, the attention augmented convolution 1) is equivariant to translation and 2) can readily operate on inputs of different spatial dimensions.""" ; skos:prefLabel "Attention-augmented Convolution" . 
:AttentionDropout a skos:Concept ; rdfs:seeAlso ; skos:definition """**Attention Dropout** is a type of [dropout](https://paperswithcode.com/method/dropout) used in attention-based architectures, where elements are randomly dropped out of the [softmax](https://paperswithcode.com/method/softmax) in the attention equation. For example, for scaled-dot product attention, we would drop elements from the first term:\r \r $$ {\\text{Attention}}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^{T}}{\\sqrt{d_k}}\\right)V $$""" ; skos:prefLabel "Attention Dropout" . :AttentionFeatureFilters a skos:Concept ; dcterms:source ; skos:definition "An attention mechanism for content-based filtering of multi-level features. For example, recurrent features obtained by forward and backward passes of a bidirectional RNN block can be combined using attention feature filters, with unprocessed input features/embeddings as queries and recurrent features as keys/values." ; skos:prefLabel "Attention Feature Filters" . :AttentionFreeTransformer a skos:Concept ; dcterms:source ; skos:definition """**Attention Free Transformer**, or **AFT**, is an efficient variant of a [multi-head attention module](https://paperswithcode.com/method/multi-head-attention) that eschews [dot product self attention](https://paperswithcode.com/method/scaled). In an AFT layer, the key and value are first combined with a set of learned position biases, the result of which is multiplied with the query in an element-wise fashion. This new operation has a memory complexity linear w.r.t. 
both the context size and the dimension of features, making it compatible with both large input and model sizes.\r \r Given the input $X$, AFT first linearly transforms them into $Q=X W^{Q}, K=X W^{K}, V=X W^{V}$, then performs the following operation:\r \r $$\r Y=f(X) ; Y\\_{t}=\\sigma\\_{q}\\left(Q\\_{t}\\right) \\odot \\frac{\\sum\\_{t^{\\prime}=1}^{T} \\exp \\left(K\\_{t^{\\prime}}+w\\_{t, t^{\\prime}}\\right) \\odot V\\_{t^{\\prime}}}{\\sum\\_{t^{\\prime}=1}^{T} \\exp \\left(K\\_{t^{\\prime}}+w\\_{t, t^{\\prime}}\\right)}\r $$\r \r where $\\odot$ is the element-wise product; $\\sigma\\_{q}$ is the nonlinearity applied to the query with the default being sigmoid; $w \\in R^{T \\times T}$ is the matrix of learned pair-wise position biases.\r \r Explained in words, for each target position $t$, AFT performs a weighted average of values, the result of which is combined with the query with element-wise multiplication. In particular, the weighting is simply composed of the keys and a set of learned pair-wise position biases. This provides the immediate advantage of not needing to compute and store the expensive attention matrix, while maintaining the global interactions between query and values as MHA does.""" ; skos:prefLabel "Attention Free Transformer" . :AttentionGate a skos:Concept ; dcterms:source ; skos:definition """The attention gate focuses on targeted regions while suppressing feature activations in irrelevant regions.\r Given the input feature map $X$ and the gating signal $G\\in \\mathbb{R}^{C'\\times H\\times W}$ which is collected at a coarse scale and contains contextual information, the attention gate uses additive attention to obtain the gating coefficient. Both the input $X$ and the gating signal are first linearly mapped to an $\\mathbb{R}^{F\\times H\\times W}$ dimensional space, and then the output is squeezed in the channel domain to produce a spatial attention weight map $ S \\in \\mathbb{R}^{1\\times H\\times W}$. 
The overall process can be written as\r \\begin{align}\r S &= \\sigma(\\varphi(\\delta(\\phi_x(X)+\\phi_g(G))))\r \\end{align}\r \\begin{align}\r Y &= S X\r \\end{align}\r where $\\varphi$, $\\phi_x$ and $\\phi_g$ are linear transformations implemented as $1\\times 1$ convolutions. \r \r The attention gate guides the model's attention to important regions while suppressing feature activation in unrelated areas. It substantially enhances the representational power of the model without a significant increase in computing cost or number of model parameters due to its lightweight design. It is general and modular, making it simple to use in various CNN models.""" ; skos:prefLabel "Attention Gate" . :AttentionMesh a skos:Concept ; dcterms:source ; skos:definition "**Attention Mesh** is a neural network architecture for 3D face mesh prediction that uses attention to focus on semantically meaningful regions. Specifically, region-specific heads are employed that transform the feature maps with spatial transformers." ; skos:prefLabel "Attention Mesh" . :AttentionalLiquidWarpingGAN a skos:Concept ; dcterms:source ; skos:definition "**Attentional Liquid Warping GAN** is a type of generative adversarial network for human image synthesis that utilizes an [AttLWB](https://paperswithcode.com/method/attlwb) block on top of a 3D body mesh recovery module that disentangles pose and shape. To preserve the source information, such as texture, style, color, and face identity, the Attentional Liquid Warping GAN with AttLWB propagates the source information in both image and feature spaces to the synthesized reference." ; skos:prefLabel "Attentional Liquid Warping GAN" . :AttentiveNormalization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Attentive Normalization** generalizes the common affine transformation component in the vanilla feature normalization. 
Instead of learning a single affine transformation, AN learns a mixture of affine transformations and utilizes their weighted-sum as the final affine transformation applied to re-calibrate features in an instance-specific way. The weights are learned by leveraging feature attention." ; skos:prefLabel "Attentive Normalization" . :Attribute2Font a skos:Concept ; dcterms:source ; skos:definition "**Attribute2Font** is a model that automatically creates fonts by synthesizing visually pleasing glyph images according to user-specified attributes and their corresponding values. Specifically, Attribute2Font is trained to perform font style transfer between any two fonts conditioned on their attribute values. After training, the model can generate glyph images in accordance with an arbitrary set of font attribute values. A unit named Attribute Attention Module is designed to make those generated glyph images better embody the prominent font attributes. A semi-supervised learning scheme is also introduced to exploit a large number of unlabeled fonts." ; skos:prefLabel "Attribute2Font" . :AugMix a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "AugMix mixes augmented images through linear interpolations. Consequently, it is like [Mixup](https://paperswithcode.com/method/mixup) but instead mixes augmented versions of the same image." ; skos:prefLabel "AugMix" . :AugmentedSBERT a skos:Concept ; dcterms:source ; skos:definition "**Augmented SBERT** is a data augmentation strategy for pairwise sentence scoring that uses a [BERT](https://paperswithcode.com/method/bert) cross-encoder to improve the performance for the [SBERT](https://paperswithcode.com/method/sbert) bi-encoders. Given a pre-trained, well-performing cross-encoder, we sample sentence pairs according to a certain sampling strategy and label these using the cross-encoder. We call these weakly labeled examples the silver dataset and they will be merged with the gold training dataset. 
We then train the bi-encoder on this extended training dataset." ; skos:prefLabel "Augmented SBERT" . :Auto-Classifier a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "Auto-Classifier" . :AutoAugment a skos:Concept ; dcterms:source ; skos:definition """**AutoAugment** is an automated approach to find data augmentation policies from data. It formulates the problem of finding the best augmentation policy as a discrete search problem. It consists of two components: a search algorithm and a search space. \r \r At a high level, the search algorithm (implemented as a controller RNN) samples a data augmentation policy $S$, which has information about what image processing operation to use, the probability of using the operation in each batch, and the magnitude of the operation. The policy $S$ is used to train a neural network with a fixed architecture, whose validation accuracy $R$ is sent back to update the controller. Since $R$ is not differentiable, the controller will be updated by policy gradient methods. \r \r The operations used are from PIL, a popular Python image library: all functions in PIL that accept an image as input and output an image. It additionally uses two other augmentation techniques: [Cutout](https://paperswithcode.com/method/cutout) and SamplePairing. The operations searched over are ShearX/Y, TranslateX/Y, Rotate, AutoContrast, Invert, Equalize, Solarize, Posterize, Contrast, Color, Brightness, Sharpness, Cutout and Sample Pairing.""" ; skos:prefLabel "AutoAugment" . :AutoDropout a skos:Concept ; dcterms:source ; skos:definition """**AutoDropout** automates the process of designing [dropout](https://paperswithcode.com/method/dropout) patterns using a [Transformer](https://paperswithcode.com/method/transformer) based controller. 
In this method, a controller learns to generate a dropout pattern at every channel and layer of a target network, such as a [ConvNet](https://paperswithcode.com/methods/category/convolutional-neural-networks) or a Transformer. The target network is then trained with the dropped-out pattern, and its resulting validation performance is used as a signal for the controller to learn from. The resulting pattern is applied to a convolutional output channel, which is a common building block of image recognition models.\r \r The controller network generates the tokens to describe the configurations of the dropout pattern. The tokens are generated like words in a language model. For every layer in a ConvNet, a group of 8 tokens needs to be made to create a dropout pattern. These 8 tokens are generated sequentially. In the figure above, size, stride, and repeat indicate the size and the tiling of the pattern; rotate, shear_x, and shear_y specify the geometric transformations of the pattern; share_c is a binary deciding whether a pattern is applied to all $C$ channels; and residual is a binary deciding whether the pattern is applied to the residual branch as well. If we need $L$ dropout patterns, the controller will generate $8L$ decisions.""" ; skos:prefLabel "AutoDropout" . :AutoEncoder a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """An **Autoencoder** is a bottleneck architecture that turns a high-dimensional input into a latent low-dimensional code (encoder), and then performs a reconstruction of the input with this latent code (the decoder).\r \r Image: [Michael Massi](https://en.wikipedia.org/wiki/Autoencoder#/media/File:Autoencoder_schema.png)""" ; skos:prefLabel "AutoEncoder" . :AutoGAN a skos:Concept ; dcterms:source ; skos:definition "[Neural architecture search](https://paperswithcode.com/method/neural-architecture-search) (NAS) has witnessed prevailing success in image classification and (very recently) segmentation tasks. 
In this paper, we present the first preliminary study on introducing the NAS algorithm to generative adversarial networks (GANs), dubbed AutoGAN. The marriage of NAS and GANs faces its unique challenges. We define the search space for the generator architectural variations and use an RNN controller to guide the search, with parameter sharing and dynamic-resetting to accelerate the process. Inception score is adopted as the reward, and a multi-level search strategy is introduced to perform NAS in a progressive way." ; skos:prefLabel "AutoGAN" . :AutoGL a skos:Concept ; dcterms:source ; skos:altLabel "Automated Graph Learning" ; skos:definition "Automated graph learning is a method that aims at discovering the best hyper-parameter and neural architecture configuration for different graph tasks/data without manual design." ; skos:prefLabel "AutoGL" . :AutoInt a skos:Concept ; dcterms:source ; skos:definition "**AutoInt** is a deep tabular learning method that models high-order feature interactions of input features. AutoInt can be applied to both numerical and categorical input features. Specifically, both the numerical and categorical features are mapped into the same low-dimensional space. Afterwards, a multi-head self-attentive neural network with residual connections is proposed to explicitly model the feature interactions in the low-dimensional space. With different layers of the multi-head self-attentive neural networks, different orders of feature combinations of input features can be modeled." ; skos:prefLabel "AutoInt" . :AutoML-Zero a skos:Concept ; dcterms:source ; skos:definition """**AutoML-Zero** is an AutoML technique that aims to search a fine-grained space simultaneously for the model, optimization procedure, initialization, and so on, permitting much less human-design and even allowing the discovery of non-neural network algorithms. 
It represents ML algorithms as computer programs comprised of three component functions, Setup, Predict, and Learn, that perform initialization, prediction, and learning. The instructions in these functions apply basic mathematical operations on a small memory. The operation and memory addresses used by each instruction are free parameters in the search space, as is the size of the component functions. While this reduces expert design, the consequent sparsity means that [random search](https://paperswithcode.com/method/random-search) cannot make enough progress. To overcome this difficulty, the authors use small proxy tasks and migration techniques to build an optimized infrastructure capable of searching through 10,000 models/second/cpu core.\r \r Evolutionary methods can find solutions in the AutoML-Zero search space despite its enormous\r size and sparsity. The authors show that by randomly modifying the programs and periodically selecting the best performing ones on given tasks/datasets, AutoML-Zero discovers reasonable algorithms. Starting from empty programs and using data labeled by “teacher” neural networks with random weights, they demonstrate that evolution can discover neural networks trained by gradient descent. Following this, they minimize bias toward known algorithms by switching to binary classification tasks extracted from CIFAR-10 and allowing a larger set of possible operations. This discovers interesting techniques like multiplicative interactions, normalized gradients, and weight averaging. Finally, they show it is possible for evolution to adapt the algorithm to the type of task provided. For example, [dropout](https://paperswithcode.com/method/dropout)-like operations emerge when the task needs regularization and learning rate decay appears when the task requires faster convergence.""" ; skos:prefLabel "AutoML-Zero" . :AutoSmart a skos:Concept ; dcterms:source ; skos:definition "**AutoSmart** is an AutoML framework for temporal relational data. 
The framework includes automatic data processing, table merging, feature engineering, and model tuning, integrated with a time and memory control unit." ; skos:prefLabel "AutoSmart" . :AutoSync a skos:Concept ; dcterms:source ; skos:definition "**AutoSync** is a pipeline for automatically optimizing synchronization strategies, given model structures and resource specifications, in data-parallel distributed machine learning. By factorizing the synchronization strategy with respect to each trainable building block of a DL model, we can construct a valid and large strategy space spanned by multiple factors. AutoSync efficiently navigates the space and locates the optimal strategy. AutoSync leverages domain knowledge about synchronization systems to reduce the search space, and is equipped with a domain adaptive simulator, which combines principled communication modeling and data-driven ML models, to estimate the runtime of strategy proposals without launching real distributed execution." ; skos:prefLabel "AutoSync" . :AutoTinyBERT a skos:Concept ; dcterms:source ; skos:definition "**AutoTinyBERT** is an efficient [BERT](https://paperswithcode.com/method/bert) variant found through neural architecture search. Specifically, one-shot learning is used to obtain a big Super Pretrained Language Model (SuperPLM), where the objectives of pre-training or task-agnostic BERT distillation are used. Then, given a specific latency constraint, an evolutionary algorithm is run on the SuperPLM to search optimal architectures. Finally, we extract the corresponding sub-models based on the optimal architectures and further train these models." ; skos:prefLabel "AutoTinyBERT" . :AuxiliaryBatchNormalization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Auxiliary Batch Normalization** is a type of regularization used in adversarial training schemes. 
The idea is that adversarial examples should have separate [batch normalization](https://paperswithcode.com/method/batch-normalization) components from the clean examples, as they have different underlying statistics." ; skos:prefLabel "Auxiliary Batch Normalization" . :AuxiliaryClassifier a skos:Concept ; skos:definition "**Auxiliary Classifiers** are a type of architectural component that seek to improve the convergence of very deep networks. They are classifier heads we attach to layers before the end of the network. The motivation is to push useful gradients to the lower layers to make them immediately useful and improve the convergence during training by combatting the vanishing gradient problem. They are notably used in the Inception family of convolutional neural networks." ; skos:prefLabel "Auxiliary Classifier" . :AveragePooling a skos:Concept ; skos:definition """**Average Pooling** is a pooling operation that calculates the average value for patches of a feature map, and uses it to create a downsampled (pooled) feature map. It is usually used after a convolutional layer. It adds a small amount of translation invariance - meaning translating the image by a small amount does not significantly affect the values of most pooled outputs. It extracts features more smoothly than [Max Pooling](https://paperswithcode.com/method/max-pooling), whereas max pooling extracts more pronounced features like edges.\r \r Image Source: [here](https://www.researchgate.net/figure/Illustration-of-Max-Pooling-and-Average-Pooling-Figure-2-above-shows-an-example-of-max_fig2_333593451)""" ; skos:prefLabel "Average Pooling" . :AxialAttention a skos:Concept ; dcterms:source ; skos:definition """**Axial Attention** is a simple generalization of self-attention that naturally aligns with the multiple dimensions of the tensors in both the encoding and the decoding settings. 
It was first proposed in [CCNet](https://paperswithcode.com/method/ccnet) [1] under the name criss-cross attention, which harvests the contextual information of all the pixels on its criss-cross path. By applying a further recurrent operation, each pixel can finally capture the full-image dependencies. Ho et al. [2] extend CCNet to process multi-dimensional data. The proposed structure of the layers allows for the vast majority of the context to be computed in parallel during decoding without introducing any independence assumptions. It serves as the basic building block for developing self-attention-based autoregressive models for high-dimensional data tensors, e.g., Axial Transformers. It has been applied in [AlphaFold](https://paperswithcode.com/method/alphafold) [3] for interpreting protein sequences.\r \r [1] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, Wenyu Liu. CCNet: Criss-Cross Attention for Semantic Segmentation. ICCV, 2019.\r \r [2] Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, Tim Salimans. Axial Attention in Multidimensional Transformers. arXiv:1912.12180.\r \r [3] Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Jul 15:1-1.""" ; skos:prefLabel "Axial Attention" . :BAGUA a skos:Concept ; dcterms:source ; skos:definition "**BAGUA** is a communication framework whose design goal is to provide a system abstraction that is both flexible and modular to support state-of-the-art system relaxation techniques of distributed training. The abstraction goes beyond parameter server and Allreduce paradigms, and provides a collection of MPI-style collective operations to facilitate communications with different precision and centralization strategies." ; skos:prefLabel "BAGUA" . :BAM a skos:Concept ; dcterms:source ; skos:altLabel "Bottleneck Attention Module" ; skos:definition """Park et al. 
proposed the bottleneck attention module (BAM), aiming\r to efficiently improve the representational capability of networks. \r It uses dilated convolution to enlarge the receptive field of the spatial attention sub-module, and builds a bottleneck structure, as suggested by ResNet, to save computational cost.\r \r For a given input feature map $X$, BAM infers the channel attention $s_c \\in \\mathbb{R}^C$ and spatial attention $s_s\\in \\mathbb{R}^{H\\times W}$ in two parallel streams, then sums the two attention maps after resizing both branch outputs to $\\mathbb{R}^{C\\times H \\times W}$. The channel attention branch, like an SE block, applies global average pooling to the feature map to aggregate global information, and then uses an MLP with channel dimensionality reduction. In order to utilize contextual information effectively, the spatial attention branch combines a bottleneck structure and dilated convolutions. Overall, BAM can be written as\r \\begin{align}\r s_c &= \\text{BN}(W_2(W_1\\text{GAP}(X)+b_1)+b_2)\r \\end{align}\r \r \\begin{align}\r s_s &= \\text{BN}(Conv_2^{1 \\times 1}(DC_2^{3\\times 3}(DC_1^{3 \\times 3}(Conv_1^{1 \\times 1}(X))))) \r \\end{align}\r \\begin{align}\r s &= \\sigma(\\text{Expand}(s_s)+\\text{Expand}(s_c)) \r \\end{align}\r \\begin{align}\r Y &= s X+X\r \\end{align}\r where $W_i$, $b_i$ denote weights and biases of fully connected layers respectively, $Conv_{1}^{1\\times 1}$ and $Conv_{2}^{1\\times 1}$ are convolution layers used for channel reduction. $DC_i^{3\\times 3}$ denotes a dilated convolution with $3\\times 3$ kernel, applied to utilize contextual information effectively. $\\text{Expand}$ expands the attention maps $s_s$ and $s_c$ to $\\mathbb{R}^{C\\times H\\times W}$.\r \r BAM can emphasize or suppress features in both spatial and channel dimensions, as well as improving the representational power. 
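As a minimal illustration (not the authors' implementation), the four equations above can be sketched in plain NumPy; the batch-normalization steps are omitted and all parameter names and shapes here are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, w):
    # pointwise convolution: x is (C_in, H, W), w is (C_out, C_in)
    return np.einsum('oc,chw->ohw', w, x)

def dilated_conv3x3(x, w, d):
    # 3x3 convolution with dilation d and zero padding (output keeps H, W)
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (d, d), (d, d)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(3):
        for j in range(3):
            out += np.einsum('oc,chw->ohw', w[:, :, i, j],
                             xp[:, i * d:i * d + h, j * d:j * d + wd])
    return out

def bam(x, p, d=2):
    # channel branch: s_c = W2(W1 GAP(X) + b1) + b2   (BN omitted in this sketch)
    g = x.mean(axis=(1, 2))                                # GAP -> (C,)
    s_c = p['W2'] @ (p['W1'] @ g + p['b1']) + p['b2']      # (C,)
    # spatial branch: 1x1 reduce -> two dilated 3x3 -> 1x1 to a single map
    t = conv1x1(x, p['c1'])
    t = dilated_conv3x3(t, p['d1'], d)
    t = dilated_conv3x3(t, p['d2'], d)
    s_s = conv1x1(t, p['c2'])[0]                           # (H, W)
    # combine: s = sigma(Expand(s_s) + Expand(s_c)); Y = s*X + X
    s = sigmoid(s_s[None, :, :] + s_c[:, None, None])
    return s * x + x
```

With channel reduction ratio $r$, the assumed parameter shapes are `W1`: $(C/r, C)$, `W2`: $(C, C/r)$, `c1`: $(C/r, C)$, `d1` and `d2`: $(C/r, C/r, 3, 3)$, and `c2`: $(1, C/r)$; the output has the same shape as the input, so the block can be dropped between any two layers.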
Dimensional reduction applied to both channel and spatial attention branches enables it to be integrated with any convolutional neural network with little extra computational cost. However, although dilated convolutions enlarge the receptive field effectively, BAM still fails to capture long-range contextual information and to encode cross-domain relationships.""" ; skos:prefLabel "BAM" . :BART a skos:Concept ; dcterms:source ; skos:definition "**BART** is a [denoising autoencoder](https://paperswithcode.com/method/denoising-autoencoder) for pretraining sequence-to-sequence models. It is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard [Transformer](https://paperswithcode.com/method/transformer)-based neural machine translation (seq2seq) architecture with a bidirectional encoder (like [BERT](https://paperswithcode.com/method/bert)) and a left-to-right decoder (like [GPT](https://paperswithcode.com/method/gpt)). This means the encoder's attention mask is fully visible, like BERT, and the decoder's attention mask is causal, like [GPT2](https://paperswithcode.com/method/gpt-2)." ; skos:prefLabel "BART" . :BASE a skos:Concept ; dcterms:source ; skos:altLabel "Balanced Selection" ; skos:definition "" ; skos:prefLabel "BASE" . :BASNet a skos:Concept ; dcterms:source ; skos:altLabel "Boundary-Aware Segmentation Network" ; skos:definition """**BASNet**, or **Boundary-Aware Segmentation Network**, is an image segmentation architecture that combines a predict-refine architecture with a hybrid loss for highly accurate image segmentation. The predict-refine architecture consists of a densely supervised encoder-decoder network and a residual \r refinement module, which are respectively used to predict and refine a segmentation probability map. 
The hybrid loss is a combination of the binary cross entropy, structural similarity and intersection-over-union losses, which guide the network to learn three-level (i.e., pixel-, patch- and map-level) hierarchy representations.""" ; skos:prefLabel "BASNet" . :BERT a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**BERT**, or Bidirectional Encoder Representations from Transformers, improves upon standard [Transformers](http://paperswithcode.com/method/transformer) by removing the unidirectionality constraint through a *masked language model* (MLM) pre-training objective. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, BERT uses a *next sentence prediction* task that jointly pre-trains text-pair representations. \r \r There are two steps in BERT: *pre-training* and *fine-tuning*. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they\r are initialized with the same pre-trained parameters.""" ; skos:prefLabel "BERT" . 
:BIDeN a skos:Concept ; dcterms:source ; skos:altLabel "Blind Image Decomposition Network" ; skos:definition """**BIDeN**, or **Blind Image Decomposition Network**, is a model for blind image decomposition, which requires separating a superimposed image into its constituent underlying images in a blind setting, that is, both the source components involved in mixing and the mixing mechanism are unknown. For example, rain may consist of multiple components, such as rain streaks, raindrops, snow, and haze. \r \r The Figure shows an example where $N = 4, L = 2, x = \\{a, b, c, d\\}$, and $I = \\{1, 3\\}$. $a, c$ are selected and then passed to the mixing function $f$, which outputs the mixed input image $z$, here $f\\left(a, c\\right)$. The generator consists of an encoder $E$ with three branches and multiple heads $H$. $\\bigotimes$ denotes the concatenation operation. The depth and receptive field of each branch differ, to capture multiple scales of features. Each specified head points to the corresponding source component, and the number of heads varies with the maximum number of source components $N$. All reconstructed images $\\left(a', c'\\right)$ and their corresponding real images $\\left(a, c\\right)$ are sent to an unconditional discriminator. The discriminator also predicts the source components of the input image $z$. The outputs from other heads $\\left(b', d'\\right)$ do not contribute to the optimization.""" ; skos:prefLabel "BIDeN" . :BIMAN a skos:Concept ; dcterms:source ; skos:definition "**BIMAN**, or **Bot Identification by commit Message, commit Association, and author Name**, is a technique to detect bots that commit code. 
It comprises three methods that consider independent aspects of the commits made by a particular author: 1) Commit Message: identify if commit messages are being generated from templates; 2) Commit Association: predict if an author is a bot using a random forest model, with features related to files and projects associated with the commits as predictors; and 3) Author Name: match the author's name and email to common bot patterns." ; skos:prefLabel "BIMAN" . :BLANC a skos:Concept ; dcterms:source ; skos:definition "**BLANC** is an automatic estimation approach for document summary quality. The goal is to measure the functional performance of a summary with an objective, reproducible, and fully automated method. BLANC achieves this by measuring the performance boost gained by a pre-trained language model with access to a document summary while carrying out its language understanding task on the document's text." ; skos:prefLabel "BLANC" . :BLIP a skos:Concept ; dcterms:source ; skos:altLabel "BLIP: Bootstrapping Language-Image Pre-training" ; skos:definition "Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. **BLIP** is a VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. It achieves state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). 
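The captioner-filter bootstrapping can be illustrated with a toy sketch. In BLIP the captioner and filter are finetuned vision-language models; the string-matching stand-ins below are purely illustrative:

```python
def bootstrap_captions(web_pairs, captioner, filter_fn):
    # The captioner proposes a synthetic caption for each web image;
    # the filter keeps only image-text pairs it judges to be matched.
    clean = []
    for image, web_caption in web_pairs:
        synthetic = captioner(image)
        for caption in (web_caption, synthetic):
            if filter_fn(image, caption):
                clean.append((image, caption))
    return clean

# Toy stand-ins: the "filter" accepts captions mentioning the image's label.
pairs = [("dog.jpg", "random noise"), ("cat.jpg", "a cat on a sofa")]
captioner = lambda image: "a photo of a " + image.split(".")[0]
filter_fn = lambda image, caption: image.split(".")[0] in caption
cleaned = bootstrap_captions(pairs, captioner, filter_fn)
```

The noisy web caption is dropped while the synthetic caption survives the filter, yielding a cleaner pre-training corpus.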
BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released at https://github.com/salesforce/BLIP." ; skos:prefLabel "BLIP" . :BLOOM a skos:Concept ; dcterms:source ; skos:definition """**BLOOM** is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of\r sources in 46 natural and 13 programming languages (59 in total).""" ; skos:prefLabel "BLOOM" . :BLOOMZ a skos:Concept ; dcterms:source ; skos:definition "**BLOOMZ** is a multitask prompted finetuning (MTF) variant of BLOOM." ; skos:prefLabel "BLOOMZ" . :BP-Transformer a skos:Concept ; dcterms:source ; skos:definition """The **BP-Transformer (BPT)** is a type of [Transformer](https://paperswithcode.com/method/transformer) that is motivated by the need to find a better balance between capability and computational complexity for self-attention. The architecture partitions the input sequence into different multi-scale spans via binary partitioning (BP). It incorporates an inductive bias of attending to context information from fine-grain to coarse-grain as the relative distance increases. The farther the context information is, the coarser its representation is.\r BPT can be regarded as a graph neural network whose nodes are the multi-scale spans. A token node can attend the smaller-scale span for the closer context and the larger-scale span for the longer distance context. The representations of nodes are updated with [Graph Self-Attention](https://paperswithcode.com/method/graph-self-attention).""" ; skos:prefLabel "BP-Transformer" . :BPE a skos:Concept ; dcterms:source ; skos:altLabel "Byte Pair Encoding" ; skos:definition """**Byte Pair Encoding**, or **BPE**, is a subword segmentation algorithm that encodes rare and unknown words as sequences of subword units. 
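The merge-learning step of BPE can be sketched in a few lines (a minimal illustration, not a reference implementation): count adjacent symbol pairs weighted by word frequency, merge the most frequent pair into a new subword unit, and repeat.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    # Each word is a tuple of symbols plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # merge the pair
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=2)
```

With this toy corpus the first merges build up the shared stem ("l"+"o", then "lo"+"w"), which is exactly how frequent substrings become single subword units.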
The intuition is that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations).\r \r [Lei Mao](https://leimao.github.io/blog/Byte-Pair-Encoding/) has a detailed blog post that explains how this works.""" ; skos:prefLabel "BPE" . :BRepNet a skos:Concept ; dcterms:source ; skos:definition "**BRepNet** is a neural network for CAD applications. It is designed to operate directly on B-rep data structures, avoiding the need to approximate the model as meshes or point clouds. BRepNet defines convolutional kernels with respect to oriented coedges in the data structure. In the neighborhood of each coedge, a small collection of faces, edges and coedges can be identified, and patterns in the feature vectors from these entities detected by specific learnable parameters." ; skos:prefLabel "BRepNet" . :BS-Net a skos:Concept ; dcterms:source ; skos:definition """**BS-Net** is an architecture for COVID-19 severity prediction based on clinical data from different modalities. The architecture comprises 1) a shared multi-task feature extraction backbone, 2) a lung segmentation branch, 3) an original registration mechanism that acts as a “multi-resolution feature alignment” block operating on the encoding backbone, and 4) a multi-regional classification part for the final six-valued score estimation. \r \r All these blocks act together in the final training thanks to a loss specifically created for this task. This loss also guarantees performance robustness, incorporating a differentiable version of the target discrete metric. The learning phase operates in a weakly-supervised fashion. 
This is due to the fact that difficulties and pitfalls in the visual interpretation of the disease signs on CXRs (spanning from subtle findings to heavy lung impairment), and the lack of detailed localization information, produce unavoidable inter-rater variability among radiologists in assigning scores.\r \r Specifically, the architectural details are:\r \r - The input image is processed with a convolutional backbone; the authors opt for a [ResNet](https://paperswithcode.com/method/resnet)-18.\r - Segmentation is performed by a nested version of [U-Net](https://paperswithcode.com/method/u-net) (U-Net++).\r - Alignment is estimated through the segmentation probability map produced by the U-Net++ decoder, which is achieved through a [spatial transformer network](https://paperswithcode.com/method/spatial-transformer) -- able to estimate the spatial transform matrix in order to center, rotate, and correctly zoom the lungs. After alignment at various scales, features are forwarded to a [ROIPool](https://paperswithcode.com/method/roi-pooling). \r - The alignment block is pre-trained on the synthetic alignment dataset in a weakly-supervised setting, using a Dice loss.\r - The scoring head uses [FPNs](https://paperswithcode.com/method/fpn) for the combination of multi-scale feature maps. The multiresolution feature aligner produces input feature maps that are well focused on the specific area of interest. Eventually, the output of the FPN layer flows through a series of convolutional blocks to produce the output map. The classification is performed by a final [Global Average Pooling](https://paperswithcode.com/method/global-average-pooling) layer and a [SoftMax](https://paperswithcode.com/method/softmax) activation.\r - The loss function used for training is a sparse categorical cross entropy (SCCE) with a (differentiable) mean absolute error contribution.""" ; skos:prefLabel "BS-Net" . 
:BTF a skos:Concept ; dcterms:source ; skos:altLabel "Back to the Feature" ; skos:definition "" ; skos:prefLabel "BTF" . :BTmPG a skos:Concept ; dcterms:source ; skos:definition """**BTmPG**, or **Back-Translation guided multi-round Paraphrase Generation**, is a multi-round paraphrase generation method that leverages back-translation to guide the paraphrase model during training and generates paraphrases in a multi-round process. The model regards paraphrase generation as a monolingual translation task. Given a paraphrase pair $\\left(S\\_{0}, P\\right)$, where $S\\_{0}$ is the original/source sentence and $P$ is the target paraphrase given in the dataset, in the first round of generation we send $S\\_{0}$ into a paraphrase model to generate a paraphrase $S\\_{1}$. In the second round of generation, we use $S\\_{1}$ as the input of the model to generate a new paraphrase $S\\_{2}$. And so forth: in the $i$-th round of generation, we send $S\\_{i-1}$ into the paraphrase model to generate $S\\_{i}$.""" ; skos:prefLabel "BTmPG" . :BYOL a skos:Concept ; dcterms:source ; skos:altLabel "Bootstrap Your Own Latent" ; skos:definition """BYOL (Bootstrap Your Own Latent) is an approach to self-supervised learning. BYOL's goal is to learn a representation $y\\_{\\theta}$ which can then be used for downstream tasks. BYOL uses two neural networks to learn: the online and target networks. The online network is defined by a set of weights $\\theta$ and comprises three stages: an encoder $f\\_{\\theta}$, a projector $g\\_{\\theta}$ and a predictor $q\\_{\\theta}$. The target network has the same architecture\r as the online network, but uses a different set of weights $\\xi$. 
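The interplay between the two networks can be sketched in miniature. This is a simplified illustration assuming the target weights track the online weights by an exponential moving average and the loss is a (scaled) negative cosine similarity between normalized vectors; the real networks are deep encoders, not the toy arrays below:

```python
import numpy as np

def ema_update(target_w, online_w, tau=0.99):
    # Target parameters track the online parameters:
    # xi <- tau * xi + (1 - tau) * theta
    return tau * target_w + (1.0 - tau) * online_w

def byol_loss(q_online, z_target):
    # Mean-squared error between L2-normalized vectors, equivalent to
    # 2 - 2 * cosine_similarity; the target side is treated as stop-gradient.
    q = q_online / np.linalg.norm(q_online)
    z = z_target / np.linalg.norm(z_target)
    return 2.0 - 2.0 * float(q @ z)

theta = np.array([1.0, 2.0])
xi = np.array([0.0, 0.0])
xi = ema_update(xi, theta, tau=0.9)

# Perfectly aligned prediction and target give zero loss.
loss_aligned = byol_loss(np.array([1.0, 0.0]), np.array([2.0, 0.0]))
```

Because the similarity is computed on normalized vectors, only the direction of the prediction matters, which is why collinear vectors yield zero loss.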
The target network provides the regression\r targets to train the online network, and its parameters $\\xi$ are an exponential moving average of the\r online parameters $\\theta$.\r \r Given the architecture diagram on the right, BYOL minimizes a similarity loss between $q\\_{\\theta}(z\\_{\\theta})$ and $sg(z'\\_{\\xi})$, where $\\theta$ are the trained weights, $\\xi$ is an exponential moving average of $\\theta$, and $sg$ means stop-gradient. At the end of training, everything but $f\\_{\\theta}$ is discarded, and $y\\_{\\theta}$ is used as the image representation.\r \r Source: [Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning](https://paperswithcode.com/paper/bootstrap-your-own-latent-a-new-approach-to-1)\r \r Image credit: [Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning](https://paperswithcode.com/paper/bootstrap-your-own-latent-a-new-approach-to-1)""" ; skos:prefLabel "BYOL" . :BalancedFeaturePyramid a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Balanced Feature Pyramid** is a feature pyramid module. It differs from approaches like [FPNs](https://paperswithcode.com/method/fpn) that integrate multi-level features using lateral connections. Instead, the BFP strengthens the multi-level features using the same deeply integrated balanced semantic features. The pipeline is shown in the Figure to the right. It consists of four steps: rescaling, integrating, refining and strengthening.\r \r Features at resolution level $l$ are denoted as $C\\_{l}$. The number of multi-level features is denoted as $L$. The indexes of the involved lowest and highest levels are denoted as $l\\_{min}$ and $l\\_{max}$. In the Figure, $C\\_{2}$ has the highest resolution. To integrate multi-level features and preserve their semantic hierarchy at the same time, we first resize the multi-level features {$C\\_{2}, C\\_{3}, C\\_{4}, C\\_{5}$} to an intermediate size, i.e., the same size as $C\\_{4}$, with interpolation and max-pooling respectively. 
Once the features are rescaled, the balanced semantic features are obtained by simple averaging as:\r \r $$ C = \\frac{1}{L}\\sum^{l\\_{max}}\\_{l=l\\_{min}}C\\_{l} $$\r \r The obtained features are then rescaled using the same but reverse procedure to strengthen the original features. Each resolution obtains equal information from others in this procedure. Note that this procedure does not contain any parameter. The authors observe improvement with this nonparametric method, proving the effectiveness of the information flow. \r \r The balanced semantic features can be further refined to be more discriminative. The authors found that both refinement with convolutions directly and the non-local module work well, but the\r non-local module works in a more stable way. Therefore, embedded Gaussian non-local attention is utilized as default. The refining step helps us enhance the integrated features and further improve the results.\r \r With this method, features from low-level to high-level are aggregated at the same time. The outputs\r {$P\\_{2}, P\\_{3}, P\\_{4}, P\\_{5}$} are used for object detection following the same pipeline in FPN.""" ; skos:prefLabel "Balanced Feature Pyramid" . :BalancedL1Loss a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Balanced L1 Loss** is a loss function used for the object detection task. Classification and localization problems are solved simultaneously under the guidance of a multi-task loss since\r [Fast R-CNN](https://paperswithcode.com/method/fast-r-cnn), defined as:\r \r $$ L\\_{p,u,t^{u},v} = L\\_{cls}\\left(p, u\\right) + \\lambda\\left[u \\geq 1\\right]L\\_{loc}\\left(t^{u}, v\\right) $$\r \r $L\\_{cls}$ and $L\\_{loc}$ are objective functions corresponding to recognition and localization respectively. Predictions and targets in $L\\_{cls}$ are denoted as $p$ and $u$. $t^{u}$ is the corresponding regression result for class $u$. $v$ is the regression target. 
$\\lambda$ is used for tuning the loss weight under multi-task learning. We call samples with a loss greater than or equal to 1.0 outliers. The other samples are called inliers.\r \r A natural solution for balancing the involved tasks is to tune their loss weights. However, owing to the unbounded regression targets, directly raising the weight of the localization loss will make the model more sensitive to outliers. These outliers, which can be regarded as hard samples, will produce excessively large gradients that are harmful to the training process. The inliers, which can be regarded as the easy samples, contribute little gradient to the overall gradients compared with the outliers. To be more specific, inliers contribute only 30% of the gradients on average per sample compared with outliers. Considering these issues, the authors introduced the balanced L1 loss, which is denoted as $L\\_{b}$.\r \r Balanced L1 loss is derived from the conventional smooth L1 loss, in which an inflection point is set to separate inliers from outliers, and the large gradients produced by outliers are clipped at a maximum value of 1.0, as shown by the dashed lines in the Figure to the right. The key idea of balanced L1 loss is promoting the crucial regression gradients, i.e. gradients from inliers (accurate samples), to rebalance\r the involved samples and tasks, thus achieving more balanced training within classification, overall localization and accurate localization. The localization loss $L\\_{loc}$ using the balanced L1 loss is defined as:\r \r $$ L\\_{loc} = \\sum\\_{i\\in{x,y,w,h}}L\\_{b}\\left(t^{u}\\_{i}-v\\_{i}\\right) $$\r \r The Figure to the right shows that the balanced L1 loss increases the gradients of inliers under the control of a factor denoted as $\\alpha$. A small $\\alpha$ increases more gradient for inliers, but the gradients of outliers are not influenced. 
Besides, an overall promotion magnification controlled by $\\gamma$ is also brought in for tuning the upper bound of regression errors, which helps the objective function better balance the involved tasks. The two factors that control different aspects are mutually enhanced to reach more balanced training. $b$ is used to ensure $L\\_{b}\\left(x = 1\\right)$ has the same value for both formulations in the equation below.\r \r By integrating the gradient formulation above, we can get the balanced L1 loss as:\r \r $$ L\\_{b}\\left(x\\right) = \\frac{\\alpha}{b}\\left(b|x| + 1\\right)\\ln\\left(b|x| + 1\\right) - \\alpha|x| \\text{ if } |x| < 1$$\r \r $$ L\\_{b}\\left(x\\right) = \\gamma|x| + C \\text{ otherwise } $$\r \r in which the parameters $\\gamma$, $\\alpha$, and $b$ are constrained by $\\alpha\\ln\\left(b + 1\\right) = \\gamma$. The default parameters are set as $\\alpha = 0.5$ and $\\gamma = 1.5$.""" ; skos:prefLabel "Balanced L1 Loss" . :BarlowTwins a skos:Concept ; dcterms:source ; skos:definition "**Barlow Twins** is a self-supervised learning method that applies redundancy-reduction — a principle first proposed in neuroscience — to self-supervised learning. The objective function measures the cross-correlation matrix between the embeddings of two identical networks fed with distorted versions of a batch of samples, and tries to make this matrix close to the identity. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. Barlow Twins does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly, it benefits from very high-dimensional output vectors." ; skos:prefLabel "Barlow Twins" . 
:BaseBoosting a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """In the setting of multi-target regression, base boosting permits us to incorporate prior knowledge into the learning mechanism of gradient boosting (or Newton boosting, etc.). Namely, from the vantage of statistics, base boosting is a way of building the following additive expansion in a set of elementary basis functions:\r \\begin{equation}\r h_{j}(X ; \\{ \\alpha_{j}, \\theta_{j} \\}) = X_{j} + \\sum_{k=1}^{K_{j}} \\alpha_{j,k} b(X ; \\theta_{j,k}),\r \\end{equation}\r where \r $X$ is an example from the domain $\\mathcal{X},$\r $\\{\\alpha_{j}, \\theta_{j}\\} = \\{\\alpha_{j,1},\\dots, \\alpha_{j,K_{j}},\\theta_{j,1},\\dots,\\theta_{j,K_{j}}\\}$ collects the expansion coefficients and parameter sets,\r $X_{j}$ is the image of $X$ under the $j$th coordinate function (a prediction from a user-specified model),\r $K_{j}$ is the number of basis functions in the linear sum,\r $b(X; \\theta_{j,k})$ is a real-valued function of the example $X,$ characterized by a parameter set $\\theta_{j,k}.$\r \r The aforementioned additive expansion differs from the \r [standard additive expansion](https://projecteuclid.org/download/pdf_1/euclid.aos/1013203451):\r \\begin{equation}\r h_{j}(X ; \\{ \\alpha_{j}, \\theta_{j}\\}) = \\alpha_{j, 0} + \\sum_{k=1}^{K_{j}} \\alpha_{j,k} b(X ; \\theta_{j,k}),\r \\end{equation}\r as it replaces the constant offset value $\\alpha_{j, 0}$ with a prediction from a user-specified model. 
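For squared error, the modified boosting loop can be sketched as follows. This is a toy illustration, not the authors' implementation: `fit_mean_stump` is a hypothetical depth-0 weak learner, and the base model is assumed to supply the starting prediction in place of the optimal constant:

```python
import numpy as np

def boost_from_base(X, y, base_predict, fit_weak, n_rounds=50, lr=0.5):
    # Prior knowledge: start the additive expansion from the user-specified
    # model's prediction instead of the best constant offset.
    pred = base_predict(X)
    for _ in range(n_rounds):
        residual = y - pred          # negative gradient of squared error
        weak = fit_weak(X, residual)
        pred = pred + lr * weak(X)
    return pred

# Toy weak learner: a depth-0 "stump" that predicts the mean residual.
def fit_mean_stump(X, r):
    m = float(np.mean(r))
    return lambda X, m=m: np.full(len(X), m)

X = np.arange(5.0)
y = 2.0 * X + 1.0                    # the base model captures the slope only
pred = boost_from_base(X, y, lambda X: 2.0 * X, fit_mean_stump)
```

The base model supplies the trend $2X$, and the boosting rounds only need to learn the remaining offset, illustrating how prior knowledge shrinks the job left to the weak learners.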
In essence, this modification permits us to incorporate prior knowledge into the for loop of gradient boosting, as the for loop proceeds to build the linear sum by computing residuals that depend upon predictions from the user-specified model instead of the optimal constant model: $\\mbox{argmin} \\sum_{i=1}^{m_{train}} \\ell_{j}(Y_{j}^{(i)}, c),$ where $m_{train}$ denotes the number of training examples, $\\ell_{j}$ denotes a single-target loss function, and $c \\in \\mathbb{R}$ denotes a real number, e.g., $\\mbox{argmin} \\sum_{i=1}^{m_{train}} (Y_{j}^{(i)} - c)^{2} = \\frac{\\sum_{i=1}^{m_{train}} Y_{j}^{(i)}}{m_{train}}.$""" ; skos:prefLabel "Base Boosting" . :BasicVSR a skos:Concept ; dcterms:source ; skos:definition "**BasicVSR** is a video super-resolution pipeline including optical flow and [residual blocks](https://paperswithcode.com/method/residual-connection). It adopts a typical bidirectional recurrent network. The upsampling module $U$ contains multiple [pixel-shuffle](https://paperswithcode.com/method/pixelshuffle) and convolution layers. In the Figure, red and blue colors represent the backward and forward propagations, respectively. The propagation branches contain only generic components. $S, W$, and $R$ refer to the flow estimation module, spatial warping module, and residual blocks, respectively." ; skos:prefLabel "BasicVSR" . :BatchChannelNormalization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Batch-Channel Normalization**, or **BCN**, uses batch knowledge to prevent channel-normalized models from getting too close to \"elimination singularities\". Elimination singularities correspond to the points on the training trajectory where neurons become consistently deactivated. They cause degenerate manifolds in the loss landscape which will slow down training and harm model performance." ; skos:prefLabel "BatchChannel Normalization" . 
:BatchFormer a skos:Concept ; dcterms:source ; skos:altLabel "Batch Transformer" ; skos:definition "**BatchFormer**, or **Batch Transformer**, learns to explore sample relationships via transformer networks." ; skos:prefLabel "BatchFormer" . :BatchNormalization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Batch Normalization** aims to reduce internal covariate shift, and in doing so to accelerate the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs. Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows for use of much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for [Dropout](https://paperswithcode.com/method/dropout).\r \r We apply a batch normalization layer as follows for a minibatch $\\mathcal{B}$:\r \r $$ \\mu\\_{\\mathcal{B}} = \\frac{1}{m}\\sum^{m}\\_{i=1}x\\_{i} $$\r \r $$ \\sigma^{2}\\_{\\mathcal{B}} = \\frac{1}{m}\\sum^{m}\\_{i=1}\\left(x\\_{i}-\\mu\\_{\\mathcal{B}}\\right)^{2} $$\r \r $$ \\hat{x}\\_{i} = \\frac{x\\_{i} - \\mu\\_{\\mathcal{B}}}{\\sqrt{\\sigma^{2}\\_{\\mathcal{B}}+\\epsilon}} $$\r \r $$ y\\_{i} = \\gamma\\hat{x}\\_{i} + \\beta = \\text{BN}\\_{\\gamma, \\beta}\\left(x\\_{i}\\right) $$\r \r where $\\gamma$ and $\\beta$ are learnable parameters.""" ; skos:prefLabel "Batch Normalization" . :BatchNuclear-normMaximization a skos:Concept ; dcterms:source ; skos:definition "**Batch Nuclear-norm Maximization** is an approach for aiding classification in label-insufficient situations. It involves maximizing the nuclear-norm of the batch output matrix. The nuclear-norm of a matrix is an upper bound of the Frobenius-norm of the matrix. Maximizing the nuclear-norm ensures a large Frobenius-norm of the batch matrix, which leads to increased discriminability. 
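A minimal sketch of the resulting training signal: treat the negative nuclear norm of the batch softmax-output matrix as a loss, so that minimizing it maximizes the nuclear norm. The toy logits below are illustrative only:

```python
import numpy as np

def bnm_loss(logits):
    # Softmax the batch logits into the (batch x classes) output matrix,
    # then return the negative nuclear norm (sum of singular values).
    z = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    p = np.exp(z)
    p = p / p.sum(axis=1, keepdims=True)
    return -np.linalg.norm(p, ord="nuc")

# Confident, diverse predictions (near one-hot on distinct classes) get a
# much lower loss than uniform, uninformative predictions.
confident = bnm_loss(10.0 * np.eye(4))
uniform = bnm_loss(np.zeros((4, 4)))
```

The uniform batch matrix is rank one (nuclear norm 1), while the near-identity batch matrix has nuclear norm close to the batch size, so the loss rewards both discriminability and diversity.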
The nuclear-norm of the batch matrix is also a convex approximation of the matrix rank, which reflects prediction diversity." ; skos:prefLabel "Batch Nuclear-norm Maximization" . :Batchboost a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Batchboost** is a variation on [MixUp](https://paperswithcode.com/method/mixup) that, instead of mixing just two images, mixes many images together." ; skos:prefLabel "Batchboost" . :BayesianREX a skos:Concept ; dcterms:source ; skos:altLabel "Bayesian Reward Extrapolation" ; skos:definition "**Bayesian Reward Extrapolation** is a Bayesian reward learning algorithm that scales to high-dimensional imitation learning problems by pre-training a low-dimensional feature encoding via self-supervised tasks and then leveraging preferences over demonstrations to perform fast Bayesian inference." ; skos:prefLabel "Bayesian REX" . :Beta-VAE a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Beta-VAE** is a type of variational autoencoder that seeks to discover disentangled latent factors. It modifies [VAEs](https://paperswithcode.com/method/vae) with an adjustable hyperparameter $\\beta$ that balances latent channel capacity and independence constraints with reconstruction accuracy. The idea is to maximize the probability of generating the real data while keeping the distance between the real and estimated distributions small, under a threshold $\\epsilon$. 
We can use the Karush-Kuhn-Tucker (KKT) conditions to write this as a single equation:\r \r $$ \\mathcal{F}\\left(\\theta, \\phi, \\beta; \\mathbf{x}, \\mathbf{z}\\right) = \\mathbb{E}\\_{q\\_{\\phi}\\left(\\mathbf{z}|\\mathbf{x}\\right)}\\left[\\log{p}\\_{\\theta}\\left(\\mathbf{x}\\mid\\mathbf{z}\\right)\\right] - \\beta\\left[D\\_{KL}\\left(q\\_{\\phi}\\left(\\mathbf{z}\\mid\\mathbf{x}\\right)||p\\left(\\mathbf{z}\\right)\\right) - \\epsilon\\right]$$\r \r where the KKT multiplier $\\beta$ is the regularization coefficient that constrains the capacity of the latent channel $\\mathbf{z}$ and puts implicit independence pressure on the learnt posterior due to the isotropic nature of the Gaussian prior $p\\left(\\mathbf{z}\\right)$.\r \r Writing this again using the complementary slackness assumption, we get the Beta-VAE formulation:\r \r $$ \\mathcal{F}\\left(\\theta, \\phi, \\beta; \\mathbf{x}, \\mathbf{z}\\right) \\geq \\mathcal{L}\\left(\\theta, \\phi, \\beta; \\mathbf{x}, \\mathbf{z}\\right) = \\mathbb{E}\\_{q\\_{\\phi}\\left(\\mathbf{z}|\\mathbf{x}\\right)}\\left[\\log{p}\\_{\\theta}\\left(\\mathbf{x}\\mid\\mathbf{z}\\right)\\right] - \\beta D\\_{KL}\\left(q\\_{\\phi}\\left(\\mathbf{z}\\mid\\mathbf{x}\\right)||p\\left(\\mathbf{z}\\right)\\right)$$""" ; skos:prefLabel "Beta-VAE" . :BezierAlign a skos:Concept ; dcterms:source ; skos:definition """**BezierAlign** is a feature sampling method for arbitrarily-shaped scene text recognition that exploits the parameterized nature of a compact Bezier curve bounding box. Unlike RoIAlign, the shape of the sampling grid of BezierAlign is not rectangular. Instead, each column of the arbitrarily-shaped grid is orthogonal to the Bezier curve boundary of the text. 
The sampling points are spaced at equidistant intervals in width and height, respectively, and are bilinearly interpolated with respect to the coordinates.\r \r Formally, given an input feature map and Bezier curve control points, we concurrently process all the output pixels of the rectangular output feature map with size $h\\_{out} \\times w\\_{out}$. Taking pixel $g\\_{i}$ with position $\\left(g\\_{iw}, g\\_{ih}\\right)$ (from the output feature map) as an example, we calculate $t$ by:\r \r $$\r t=\\frac{g\\_{iw}}{w\\_{out}}\r $$\r \r We then calculate the point $tp$ on the upper Bezier curve boundary and the point $bp$ on the lower Bezier curve boundary. Using $tp$ and $bp$, we can linearly index the sampling point $op$ by:\r \r $$\r op=bp \\cdot \\frac{g\\_{ih}}{h\\_{out}}+tp \\cdot\\left(1-\\frac{g\\_{ih}}{h\\_{out}}\\right)\r $$\r \r With the position of $op$, we can easily apply bilinear interpolation to calculate the result. Comparisons among previous sampling methods and BezierAlign are shown in the Figure.""" ; skos:prefLabel "BezierAlign" . :Bi-attention a skos:Concept ; dcterms:source ; skos:altLabel "Bilinear Attention" ; skos:definition "Bi-attention employs the attention-in-attention (AiA) mechanism to capture second-order statistical information: the outer point-wise channel attention vectors are computed from the output of the inner channel attention." ; skos:prefLabel "Bi-attention" . :Bi3D a skos:Concept ; dcterms:source ; skos:definition "**Bi3D** is a stereo depth estimation framework that estimates depth via a series of binary classifications. Rather than testing whether objects are at a particular depth *D*, as existing stereo methods do, it classifies them as being closer or farther than *D*. It takes the stereo pair and a disparity $d\\_{i}$ and produces a confidence map, which can be thresholded to yield the binary segmentation. 
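As a rough illustration only (this is not the paper's Equation 8), a quantized disparity can be read off a stack of such binary decisions by counting, assuming the thresholded confidences are monotone across disparity levels:

```python
import numpy as np

def disparity_from_binary(confidences):
    # confidences: (N, H, W) array, where confidences[i] is the per-pixel
    # confidence that the scene point is closer than disparity plane i.
    # Counting the "closer" votes gives a quantized disparity level in [0, N].
    closer = np.asarray(confidences) >= 0.5
    return closer.sum(axis=0)

# Toy 1x1 "image" with N = 4 binary tests: the first two say closer,
# the last two say farther, so the point sits at level 2.
conf = np.array([[[0.9]], [[0.8]], [[0.2]], [[0.1]]])
level = disparity_from_binary(conf)
```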
To estimate depth on $N + 1$ quantization levels we run this network $N$ times and maximize the probability in Equation 8 (see paper). To estimate continuous depth, whether full or selective, we run the [SegNet](https://paperswithcode.com/method/segnet) block of Bi3DNet for each disparity level and work directly on the confidence volume." ; skos:prefLabel "Bi3D" . :BiDet a skos:Concept ; dcterms:source ; skos:definition "**BiDet** is a binarized neural network learning method for efficient object detection. Conventional network binarization methods directly quantize the weights and activations in one-stage or two-stage detectors with constrained representational capacity, so that the information redundancy in the networks causes numerous false positives and degrades the performance significantly. On the contrary, BiDet fully utilizes the representational capacity of the binary neural networks for object detection by redundancy removal, through which the detection precision is enhanced with alleviated false positives. Specifically, the information bottleneck (IB) principle is generalized to object detection, where the amount of information in the high-level feature maps is constrained and the mutual information between the feature maps and object detection is maximized." ; skos:prefLabel "BiDet" . :BiFPN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **BiFPN**, or **Weighted Bi-directional Feature Pyramid Network**, is a type of feature pyramid network which allows easy and fast multi-scale feature fusion. It incorporates the multi-level feature fusion idea from [FPN](https://paperswithcode.com/method/fpn), [PANet](https://paperswithcode.com/method/panet) and [NAS-FPN](https://paperswithcode.com/method/nas-fpn) that enables information to flow in both the top-down and bottom-up directions, while using regular and efficient connections. It also utilizes a fast normalized fusion technique. 
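The fast normalized fusion can be sketched as follows, following the EfficientDet formulation: each input feature gets a learnable scalar weight, kept non-negative with a ReLU and normalized by the sum of all weights plus a small `eps`, avoiding the more expensive softmax-based fusion:

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    # O = sum_i (w_i / (eps + sum_j w_j)) * I_i, with w_i >= 0 via ReLU.
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)
    w = w / (w.sum() + eps)
    return sum(wi * f for wi, f in zip(w, features))

# Two same-shaped feature maps fused with equal learned weights.
f1, f2 = np.ones((2, 2)), 3.0 * np.ones((2, 2))
fused = fast_normalized_fusion([f1, f2], weights=[1.0, 1.0])
```

With equal weights the fusion reduces to (almost exactly) the mean of the inputs; unequal learned weights let the network emphasize the more informative resolution.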
Traditional approaches usually treat all features input to the FPN equally, even those with different resolutions. However, input features at different resolutions often have unequal contributions to the output features. Thus, the BiFPN adds an additional weight for each input feature, allowing the network to learn the importance of each. All regular convolutions are also replaced with less expensive depthwise separable convolutions.\r \r Compared with BiFPN, PANet adds an extra bottom-up path for information flow at the expense of more computational cost. BiFPN instead optimizes these cross-scale connections by removing nodes with a single input edge, adding an extra edge from the original input to the output node if they are on the same level, and treating each bidirectional path as one feature network layer (repeating it several times for more high-level feature fusion).""" ; skos:prefLabel "BiFPN" . :BiGAN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Bidirectional GAN" ; skos:definition """A **BiGAN**, or **Bidirectional GAN**, is a type of generative adversarial network where the generator not only maps latent samples to generated data, but also has an inverse mapping from data to the latent representation. The motivation is to make a type of GAN that can learn rich representations for use in applications like unsupervised learning.\r \r In addition to the generator $G$ from the standard [GAN](https://paperswithcode.com/method/gan) framework, BiGAN includes an encoder $E$ which maps data $\\mathbf{x}$ to latent representations $\\mathbf{z}$. 
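A toy sketch of the joint pairs the discriminator is trained on; the linear lambdas below are illustrative stand-ins for the real neural networks $E$ and $G$:

```python
import numpy as np

def bigan_pairs(x_batch, z_batch, encoder, generator):
    # "Real" pairs couple data with its encoding: (x, E(x)).
    # "Fake" pairs couple generated data with its latent input: (G(z), z).
    real_pairs = [(x, encoder(x)) for x in x_batch]
    fake_pairs = [(generator(z), z) for z in z_batch]
    return real_pairs, fake_pairs

encoder = lambda x: 0.5 * x      # toy stand-in for E
generator = lambda z: 2.0 * z    # toy stand-in for G
real, fake = bigan_pairs([np.array([4.0])], [np.array([1.0])],
                         encoder, generator)
```

The discriminator never sees data or latents in isolation; it always judges a (data, latent) tuple, which is what forces $E$ and $G$ to become approximate inverses.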
The BiGAN discriminator $D$ discriminates not only in data space ($\\mathbf{x}$ versus $G\\left(\\mathbf{z}\\right)$), but jointly in data and latent space (tuples $\\left(\\mathbf{x}, E\\left(\\mathbf{x}\\right)\\right)$ versus $\\left(G\\left(\\mathbf{z}\\right), \\mathbf{z}\\right)$), where the latent component is either an encoder output $E\\left(\\mathbf{x}\\right)$ or a generator input $\\mathbf{z}$.""" ; skos:prefLabel "BiGAN" . :BiGCN a skos:Concept ; dcterms:source ; skos:altLabel "Bi-Directional Graph Convolutional Network" ; skos:definition "" ; skos:prefLabel "BiGCN" . :BiGG a skos:Concept ; dcterms:source ; skos:definition "**BiGG** is an autoregressive model for generative modeling of sparse graphs. It utilizes sparsity to avoid generating the full adjacency matrix, and reduces the graph generation time complexity to $O((n + m)\\log n)$. Furthermore, during training this autoregressive model can be parallelized with $O(\\log n)$ synchronization stages, which makes it much more efficient than other autoregressive models that require $\\Omega(n)$ synchronization stages. The approach is based on three key elements: (1) an $O(\\log n)$ process for generating each edge using a binary tree data structure, inspired by R-MAT; (2) a tree-structured autoregressive model for generating the set of edges associated with each node; and (3) an autoregressive model defined over the sequence of nodes." ; skos:prefLabel "BiGG" . :BiGRU a skos:Concept ; skos:altLabel "Bidirectional GRU" ; skos:definition """A **Bidirectional GRU**, or **BiGRU**, is a sequence processing model that consists of two [GRUs](https://paperswithcode.com/method/gru): one taking the input in a forward direction, and the other in a backwards direction. It is a bidirectional recurrent neural network with only the input and forget gates.\r \r Image Source: *Rana R (2016). Gated Recurrent Unit (GRU) for Emotion Classification from Noisy Speech.*""" ; skos:prefLabel "BiGRU" .
:BiLSTM a skos:Concept ; skos:altLabel "Bidirectional LSTM" ; skos:definition """A **Bidirectional LSTM**, or **biLSTM**, is a sequence processing model that consists of two LSTMs: one taking the input in a forward direction, and the other in a backwards direction. BiLSTMs effectively increase the amount of information available to the network, improving the context available to the algorithm (e.g. knowing what words immediately follow *and* precede a word in a sentence).\r \r Image Source: Modelling Radiological Language with Bidirectional Long Short-Term Memory Networks, Cornegruta et al""" ; skos:prefLabel "BiLSTM" . :BiSeNetV2 a skos:Concept ; dcterms:source ; skos:definition "**BiSeNet V2** is a two-pathway architecture for real-time semantic segmentation. One pathway is designed to capture the spatial details with wide channels and shallow layers, called Detail Branch. In contrast, the other pathway is introduced to extract the categorical semantics with narrow channels and deep layers, called Semantic Branch. The Semantic Branch simply requires a large receptive field to capture semantic context, while the detail information can be supplied by the Detail Branch. Therefore, the Semantic Branch can be made very lightweight with fewer channels and a fast-downsampling strategy. Both types of feature representation are merged to construct a stronger and more comprehensive feature representation." ; skos:prefLabel "BiSeNet V2" . :Big-LittleModule a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Big-Little Modules** are blocks for image models that have two branches: each of which represents a separate block from a deep model and a less deep counterpart. They were proposed as part of the [BigLittle-Net](https://paperswithcode.com/method/big-little-net) architecture. The two branches are fused with a linear combination and unit weights. 
These two branches are known as Big-Branch (more layers and channels at low resolutions) and Little-Branch (fewer layers and channels at high resolution)." ; skos:prefLabel "Big-Little Module" . :Big-LittleNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Big-Little Net** is a convolutional neural network architecture for learning multi-scale feature representations. This is achieved by using a multi-branch network, which has different computational complexity at different branches with different resolutions. Through frequent merging of features from branches at distinct scales, the model obtains multi-scale features while using less computation.\r \r It consists of Big-Little Modules, which have two branches: each represents a separate block from a deep model and a less deep counterpart. The two branches are fused with a linear combination and unit weights. These two branches are known as Big-Branch (more layers and channels at low resolutions) and Little-Branch (fewer layers and channels at high resolution).""" ; skos:prefLabel "Big-Little Net" . :BigBiGAN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**BigBiGAN** is a type of [BiGAN](https://paperswithcode.com/method/bigan) with a [BigGAN](https://paperswithcode.com/method/biggan) image generator. The authors initially used [ResNet](https://paperswithcode.com/method/resnet) as a baseline for the encoder $\\mathcal{E}$ followed by a 4-layer MLP with skip connections, but they experimented with RevNets and found that these outperformed ResNets as network width increased, so they opted for this type of encoder for the final architecture." ; skos:prefLabel "BigBiGAN" . :BigBird a skos:Concept ; dcterms:source ; skos:definition """**BigBird** is a [Transformer](https://paperswithcode.com/method/transformer) with a sparse attention mechanism that reduces the quadratic dependency of self-attention to linear in the number of tokens.
BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. In particular, BigBird consists of three main parts:\r \r - A set of $g$ global tokens attending on all parts of the sequence.\r - All tokens attending to a set of $w$ local neighboring tokens.\r - All tokens attending to a set of $r$ random tokens.\r \r This leads to a high-performing attention mechanism scaling to much longer sequence lengths (8x).""" ; skos:prefLabel "BigBird" . :BigGAN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**BigGAN** is a type of generative adversarial network that was designed for scaling generation to high-resolution, high-fidelity images. It includes a number of incremental changes and innovations. The baseline and incremental changes are:\r \r - Using [SAGAN](https://paperswithcode.com/method/sagan) as a baseline with spectral normalization for G and D, and using [TTUR](https://paperswithcode.com/method/ttur).\r - Using a Hinge Loss [GAN](https://paperswithcode.com/method/gan) objective.\r - Using class-[conditional batch normalization](https://paperswithcode.com/method/conditional-batch-normalization) to provide class information to G (but with a linear projection rather than an MLP).\r - Using a [projection discriminator](https://paperswithcode.com/method/projection-discriminator) for D to provide class information to D.\r - Evaluating with an EWMA of G's weights, similar to ProGAN.\r \r The innovations are:\r \r - Increasing batch sizes, which has a big effect on the Inception Score of the model.\r - Increasing the width in each layer leads to a further Inception Score improvement.\r - Adding skip connections from the latent variable $z$ to further layers helps performance.\r - A new variant of [Orthogonal Regularization](https://paperswithcode.com/method/orthogonal-regularization).""" ; skos:prefLabel "BigGAN" .
:BigGAN-deep a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**BigGAN-deep** is a deeper version (4x) of [BigGAN](https://paperswithcode.com/method/biggan). The main difference is a slightly differently designed [residual block](https://paperswithcode.com/method/residual-block). Here the $z$ vector is concatenated with the conditional vector without splitting it into chunks. It is also based on residual blocks with bottlenecks. BigGAN-deep uses a different strategy than BigGAN aimed at preserving identity throughout the skip connections. In G, where the number of channels needs to be reduced, BigGAN-deep simply retains the first group of channels and drop the rest to produce the required number of channels. In D, where the number of channels should be increased, BigGAN-deep passes the input channels unperturbed, and concatenates them with the remaining channels produced by a 1 × 1 [convolution](https://paperswithcode.com/method/convolution). As far as the\r network configuration is concerned, the discriminator is an exact reflection of the generator. \r \r There are two blocks at each resolution (BigGAN uses one), and as a result BigGAN-deep is four times\r deeper than BigGAN. Despite their increased depth, the BigGAN-deep models have significantly\r fewer parameters mainly due to the bottleneck structure of their residual blocks.""" ; skos:prefLabel "BigGAN-deep" . :BilateralGrid a skos:Concept ; dcterms:source ; skos:definition """Bilateral grid is a new data structure that enables fast edge-aware image processing. It enables edge-aware image manipulations such as local tone mapping on high resolution images in real time.\r \r Source: [Chen et al.](https://people.csail.mit.edu/sparis/publi/2007/siggraph/Chen_07_Bilateral_Grid.pdf)\r \r Image source: [Chen et al.](https://people.csail.mit.edu/sparis/publi/2007/siggraph/Chen_07_Bilateral_Grid.pdf)""" ; skos:prefLabel "Bilateral Grid" . 
:BilateralGuidedAggregationLayer a skos:Concept ; dcterms:source ; skos:definition "**Bilateral Guided Aggregation Layer** is a feature fusion layer for semantic segmentation that aims to enhance mutual connections and fuse different types of feature representation. It was used in the [BiSeNet V2](https://paperswithcode.com/method/bisenet-v2) architecture. Specifically, within the BiSeNet implementation, the layer was used to employ the contextual information of the Semantic Branch to guide the feature response of the Detail Branch. With different scale guidance, different scale feature representations can be captured, which inherently encodes the multi-scale information." ; skos:prefLabel "Bilateral Guided Aggregation Layer" . :BinaryBERT a skos:Concept ; dcterms:source ; skos:definition "**BinaryBERT** is a [BERT](https://paperswithcode.com/method/bert) variant that applies quantization in the form of weight binarization. Specifically, ternary weight splitting is proposed, which initializes BinaryBERT by equivalently splitting from a half-sized ternary network. To obtain BinaryBERT, we first train a half-sized [ternary BERT](https://paperswithcode.com/method/ternarybert) model, and then apply a [ternary weight splitting](https://paperswithcode.com/method/ternary-weight-splitting) operator to obtain the latent full-precision and quantized weights as the initialization of the full-sized BinaryBERT. We then fine-tune BinaryBERT for further refinement." ; skos:prefLabel "BinaryBERT" . :BlendMask a skos:Concept ; dcterms:source ; skos:definition "**BlendMask** is an [instance segmentation framework](https://paperswithcode.com/methods/category/instance-segmentation-models) built on top of the [FCOS](https://paperswithcode.com/method/fcos) object detector. The bottom module uses either backbone or [FPN](https://paperswithcode.com/method/fpn) features to predict a set of bases.
A single [convolution](https://paperswithcode.com/methods/category/convolutions) layer is added on top of the detection towers to produce attention masks along with each bounding box prediction. For each predicted instance, the [blender](https://paperswithcode.com/method/blender) crops the bases with its bounding box and linearly combines them according to the learned attention maps. Note that the Bottom Module can take features either from ‘C’ or ‘P’ as the input." ; skos:prefLabel "BlendMask" . :BlendedDiffusion a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """Blended Diffusion enables zero-shot, local, text-guided editing of natural images.\r Given an input image $x$, an input mask $m$ and a target guiding text $t$, the method changes the masked area within the image to correspond to the guiding text, such that the unmasked area is left unchanged.""" ; skos:prefLabel "Blended Diffusion" . :Blender a skos:Concept ; dcterms:source ; skos:definition """**Blender** is a proposal-based instance mask generation module which incorporates rich instance-level information with accurate dense pixel features. A single [convolution](https://paperswithcode.com/method/convolution) layer is added on top of the detection towers to produce attention masks along with each bounding box prediction. For each predicted instance, the blender crops predicted bases with its bounding box and linearly combines them according to the learned attention maps.\r \r The inputs of the blender module are bottom-level bases $\\mathbf{B}$, the selected top-level attentions $A$ and bounding box proposals $P$.
First, the [RoIPool](https://paperswithcode.com/method/roi-pooling) of Mask R-CNN is used to crop bases with each proposal $\\mathbf{p}\\_{d}$, and the region is then resized to a fixed-size $R \\times R$ feature map $\\mathbf{r}\\_{d}$:\r \r $$\r \\mathbf{r}\\_{d}=\\operatorname{RoIPool}_{R \\times R}\\left(\\mathbf{B}, \\mathbf{p}\\_{d}\\right), \\quad \\forall d \\in\\{1 \\ldots D\\}\r $$\r \r More specifically, a sampling ratio of 1 is used for [RoIAlign](https://paperswithcode.com/method/roi-align), i.e. one bin for each sampling point. During training, ground truth boxes are used as the proposals. During inference, [FCOS](https://paperswithcode.com/method/fcos) prediction results are used.\r \r The attention size $M$ is smaller than $R$. We interpolate $\\mathbf{a}\\_{d}$ from $M \\times M$ to $R \\times R$, matching the shapes of $R=\\left\\{\\mathbf{r}\\_{d} \\mid d=1 \\ldots D\\right\\}$:\r \r $$\r \\mathbf{a}\\_{d}^{\\prime}=\\text { interpolate }\\_{M \\times M \\rightarrow R \\times R}\\left(\\mathbf{a}\\_{d}\\right), \\quad \\forall d \\in\\{1 \\ldots D\\}\r $$\r \r Then $\\mathbf{a}\\_{d}^{\\prime}$ is normalized with a softmax function along the $K$ dimension to make it a set of score maps $\\mathbf{s}\\_{d}$.\r \r $$\r \\mathbf{s}\\_{d}=\\operatorname{softmax}\\left(\\mathbf{a}\\_{d}^{\\prime}\\right), \\quad \\forall d \\in\\{1 \\ldots D\\}\r $$\r \r Then we apply element-wise product between each entity $\\mathbf{r}\\_{d}, \\mathbf{s}\\_{d}$ of the regions $R$ and scores $S$, and sum along the $K$ dimension to get our mask logit $\\mathbf{m}\\_{d}:$\r \r $$\r \\mathbf{m}\\_{d}=\\sum\\_{k=1}^{K} \\mathbf{s}\\_{d}^{k} \\circ \\mathbf{r}\\_{d}^{k}, \\quad \\forall d \\in\\{1 \\ldots D\\}\r $$\r \r where $k$ is the index of the basis. The mask blending process with $K=4$ is visualized in the Figure.""" ; skos:prefLabel "Blender" .
:BlinkCommunication a skos:Concept ; dcterms:source ; skos:definition "**Blink** is a communication library for inter-GPU parameter exchange that achieves near-optimal link utilization. To handle topology heterogeneity from hardware generations or partial allocations from cluster schedulers, Blink dynamically generates optimal communication primitives for a given topology. Blink probes the set of links available for a given job at runtime and builds a topology with appropriate link capacities. Given the topology, Blink achieves the optimal communication rate by packing spanning trees, that can utilize more links (Lovasz, 1976; Edmonds, 1973) when compared to rings. The authors use a multiplicative-weight update based approximation algorithm to quickly compute the maximal packing and extend the algorithm to further minimize the number of trees generated. Blink’s collectives extend across multiple machines effectively utilizing all available network interfaces." ; skos:prefLabel "Blink Communication" . :BlueRiverControls a skos:Concept ; dcterms:source ; skos:definition "**Blue River Controls** is a tool that allows users to train and test reinforcement learning algorithms on real-world hardware. It features a simple interface based on OpenAI Gym, that works directly on both simulation and hardware." ; skos:prefLabel "Blue River Controls" . :BoomLayer a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **Boom Layer** is a type of feedforward layer that is closely related to the feedforward layers used in Transformers. The layer takes a vector of the form $v \\in \\mathbb{R}^{H}$ and uses a matrix\r multiplication with a GeLU activation to produce a vector $u \\in \\mathbb{R}^{N\\times{H}}$. We then break $u$ into $N$ vectors and sum those together, producing $w \\in \\mathbb{R}^{H}$. 
This minimizes computation and removes an entire matrix of parameters compared to traditional down-projection layers.\r \r The Figure to the right shows the Boom Layer used in the context of [SHA-RNN](https://paperswithcode.com/method/sha-rnn) from the original paper.""" ; skos:prefLabel "Boom Layer" . :Boost-GNN a skos:Concept ; dcterms:source ; skos:definition "**Boost-GNN** is an architecture that trains GBDT and GNN jointly to get the best of both worlds: the GBDT model deals with heterogeneous features, while GNN accounts for the graph structure. The model benefits from end-to-end optimization by allowing new trees to fit the gradient updates of GNN." ; skos:prefLabel "Boost-GNN" . :Bort a skos:Concept ; dcterms:source ; skos:definition "**Bort** is a parametric architectural variant of the [BERT](https://paperswithcode.com/method/bert) architecture. It extracts an optimal subset of architectural parameters for the BERT architecture through a [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) approach; in particular, a fully polynomial-time approximation scheme (FPTAS). This optimal subset - “Bort” - is demonstrably smaller, having an effective size of $5.5\\%$ of the original BERT-large architecture, and $16\\%$ of the net size. Bort is also able to be pretrained in $288$ GPU hours, which is $1.2\\%$ of the time required to pretrain the highest-performing BERT parametric architecture variant, RoBERTa-large ([RoBERTa](https://paperswithcode.com/method/roberta)), and about $33\\%$" ; skos:prefLabel "Bort" . :BottleneckResidualBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "A **Bottleneck Residual Block** is a variant of the [residual block](https://paperswithcode.com/method/residual-block) that utilises 1x1 convolutions to create a bottleneck. The use of a bottleneck reduces the number of parameters and matrix multiplications.
The idea is to make residual blocks as thin as possible to increase depth and have fewer parameters. They were introduced as part of the [ResNet](https://paperswithcode.com/method/resnet) architecture, and are used as part of deeper ResNets such as ResNet-50 and ResNet-101." ; skos:prefLabel "Bottleneck Residual Block" . :BottleneckTransformer a skos:Concept ; dcterms:source ; skos:definition "The **Bottleneck Transformer (BoTNet)** is an image classification model that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation. By just replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a [ResNet](https://paperswithcode.com/method/resnet) and no other changes, the approach improves upon baselines significantly on instance segmentation and object detection while also reducing the parameters, with minimal overhead in latency." ; skos:prefLabel "Bottleneck Transformer" . :BottleneckTransformerBlock a skos:Concept ; dcterms:source ; skos:definition "A **Bottleneck Transformer Block** is a block used in [Bottleneck Transformers](https://www.paperswithcode.com/method/bottleneck-transformer) that replaces the spatial 3 × 3 [convolution](https://paperswithcode.com/method/convolution) layer in a [Residual Block](https://paperswithcode.com/method/residual-block) with Multi-Head Self-Attention (MHSA)." ; skos:prefLabel "Bottleneck Transformer Block" . :Bottom-upPathAugmentation a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Bottom-up Path Augmentation** is a feature extraction technique that seeks to shorten the information path and enhance a feature pyramid with accurate localization signals existing in low levels. This is based on the fact that high response to edges or instance parts is a strong indicator to accurately localize instances.
\r \r Each building block takes a higher-resolution feature map $N\\_{i}$ and a coarser map $P\\_{i+1}$ through lateral connection and generates the new feature map $N\\_{i+1}$. Each feature map $N\\_{i}$ first goes through a $3 \\times 3$ convolutional layer with stride $2$ to reduce the spatial size. Then each element of feature map $P\\_{i+1}$ and the down-sampled map are added through lateral connection. The fused feature map is then processed by another $3 \\times 3$ convolutional layer to generate $N\\_{i+1}$ for following sub-networks. This is an iterative process that terminates after approaching $P\\_{5}$. In these building blocks, the feature maps consistently use 256 channels. The feature grid for each proposal is then pooled from the new feature maps, i.e., {$N\\_{2}$, $N\\_{3}$, $N\\_{4}$, $N\\_{5}$}.""" ; skos:prefLabel "Bottom-up Path Augmentation" . :BoundaryNet a skos:Concept ; dcterms:source ; skos:definition "**BoundaryNet** is a resizing-free approach for layout annotation. The variable-sized user-selected region of interest is first processed by an attention-guided skip network. The network optimization is guided via Fast Marching distance maps to obtain a good quality initial boundary estimate and an associated feature representation. These outputs are processed by a Residual Graph [Convolution](https://paperswithcode.com/method/convolution) Network optimized using Hausdorff loss to obtain the final region boundary." ; skos:prefLabel "BoundaryNet" . :Branchattention a skos:Concept ; dcterms:source ; skos:definition "Branch attention can be seen as a dynamic branch-selection mechanism that decides which branch to pay attention to, used with a multi-branch structure." ; skos:prefLabel "Branch attention" . :Bridge-net a skos:Concept ; dcterms:source ; skos:definition "**Bridge-net** is an audio model block used in the [ClariNet](https://paperswithcode.com/method/clarinet) text-to-speech architecture.
Bridge-net maps frame-level hidden representations to the sample level through several [convolution](https://paperswithcode.com/method/convolution) blocks and [transposed convolution](https://paperswithcode.com/method/transposed-convolution) layers interleaved with softsign non-linearities." ; skos:prefLabel "Bridge-net" . :BytePS a skos:Concept ; skos:definition "**BytePS** is a distributed training method for deep neural networks. BytePS handles cases with a varying number of CPU machines and makes traditional all-reduce and PS two special cases of its framework. To further accelerate DNN training, BytePS proposes Summation Service and splits a DNN optimizer into two parts: gradient summation and parameter update. It keeps the CPU-friendly part, gradient summation, in CPUs, and moves parameter update, which is more computationally heavy, to GPUs." ; skos:prefLabel "BytePS" . :ByteScheduler a skos:Concept ; skos:definition "**ByteScheduler** is a generic communication scheduler for distributed DNN training acceleration. It is based on the analysis that partitioning and rearranging the tensor transmissions can result in optimal results in theory and good performance in real-world settings even with scheduling overhead." ; skos:prefLabel "ByteScheduler" . :CABiNet a skos:Concept ; skos:altLabel "Context Aggregated Bi-lateral Network for Semantic Segmentation" ; skos:definition "With the increasing demand for autonomous systems, pixelwise semantic segmentation for visual scene understanding needs to be not only accurate but also efficient for potential real-time applications. In this paper, we propose Context Aggregation Network, a dual branch convolutional neural network, with significantly lower computational costs as compared to the state-of-the-art, while maintaining a competitive prediction accuracy.
Building upon the existing dual branch architectures for high-speed semantic segmentation, we design a high resolution branch for effective spatial detailing and a context branch with light-weight versions of global aggregation and local distribution blocks, potent to capture both long-range and local contextual dependencies required for accurate semantic segmentation, with low computational overheads. We evaluate our method on two semantic segmentation datasets, namely Cityscapes dataset and UAVid dataset. For Cityscapes test set, our model achieves state-of-the-art results with mIOU of 75.9%, at 76 FPS on an NVIDIA RTX 2080Ti and 8 FPS on a Jetson Xavier NX. With regards to UAVid dataset, our proposed network achieves mIOU score of 63.5% with high execution speed (15 FPS)." ; skos:prefLabel "CABiNet" . :CAG a skos:Concept ; dcterms:source ; skos:altLabel "Class activation guide" ; skos:definition """Class activation guide is a module which uses weak localization information from the instrument activation maps to guide the verb and target recognition. \r \r Image source: [Nwoye et al.](https://arxiv.org/pdf/2007.05405v1.pdf)""" ; skos:prefLabel "CAG" . :CAM a skos:Concept ; dcterms:source ; skos:altLabel "Class-activation map" ; skos:definition """Class activation maps could be used to interpret the prediction decision made by the convolutional neural network (CNN).\r \r Image source: [Learning Deep Features for Discriminative Localization](https://paperswithcode.com/paper/learning-deep-features-for-discriminative)""" ; skos:prefLabel "CAM" . :CAMoE a skos:Concept ; dcterms:source ; skos:definition "**CAMoE** is a multi-stream Corpus Alignment network with single gate Mixture-of-Experts (MoE) for video-text retrieval. The CAMoE employs Mixture-of-Experts (MoE) to extract multi-perspective video representations, including action, entity, scene, etc., then align them with the corresponding part of the text. 
A [Dual Softmax Loss](https://paperswithcode.com/method/dual-softmax-loss) (DSL) is used to avoid the one-way optimum-match which occurs in previous contrastive methods. Introducing the intrinsic prior of each pair in a batch, DSL serves as a reviser to correct the similarity matrix and achieves the dual optimal match." ; skos:prefLabel "CAMoE" . :CANINE a skos:Concept ; dcterms:source ; skos:definition "**CANINE** is a pre-trained encoder for language understanding that operates directly on character sequences—without explicit tokenization or vocabulary—and a pre-training strategy with soft inductive biases in place of hard token boundaries. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep [transformer](https://paperswithcode.com/method/transformer) stack, which encodes context." ; skos:prefLabel "CANINE" . :CARAFE a skos:Concept ; dcterms:source ; skos:definition "**Content-Aware ReAssembly of FEatures (CARAFE)** is an operator for feature upsampling in convolutional neural networks. CARAFE has several appealing properties: (1) Large field of view. Unlike previous works (e.g. bilinear interpolation) that only exploit subpixel neighborhood, CARAFE can aggregate contextual information within a large receptive field. (2) Content-aware handling. Instead of using a fixed kernel for all samples (e.g. deconvolution), CARAFE enables instance-specific content-aware handling, which generates adaptive kernels on-the-fly. (3) Lightweight and fast to compute." ; skos:prefLabel "CARAFE" . :CARLA a skos:Concept ; dcterms:source ; skos:altLabel "CARLA: An Open Urban Driving Simulator" ; skos:definition """CARLA is an open-source simulator for autonomous driving research. CARLA has been developed from the ground up to support development, training, and validation of autonomous urban driving systems. 
In addition to open-source code and protocols, CARLA provides open digital assets (urban layouts, buildings, vehicles) that were created for this purpose and can be used freely. \r \r Source: [Dosovitskiy et al.](https://arxiv.org/pdf/1711.03938v1.pdf)\r \r Image source: [Dosovitskiy et al.](https://arxiv.org/pdf/1711.03938v1.pdf)""" ; skos:prefLabel "CARLA" . :CBAM a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Convolutional Block Attention Module" ; skos:definition """**Convolutional Block Attention Module (CBAM)** is an attention module for convolutional neural networks. Given an intermediate feature map, the module sequentially infers attention maps along two separate dimensions, channel and spatial, then the attention maps are multiplied to the input feature map for adaptive feature refinement.\r \r Given an intermediate feature map $\\mathbf{F} \\in \\mathbb{R}^{C×H×W}$ as input, CBAM sequentially infers a 1D channel attention map $\\mathbf{M}\\_{c} \\in \\mathbb{R}^{C×1×1}$ and a 2D spatial attention map $\\mathbf{M}\\_{s} \\in \\mathbb{R}^{1×H×W}$. The overall attention process can be summarized as:\r \r $$ \\mathbf{F}' = \\mathbf{M}\\_{c}\\left(\\mathbf{F}\\right) \\otimes \\mathbf{F} $$\r \r $$ \\mathbf{F}'' = \\mathbf{M}\\_{s}\\left(\\mathbf{F'}\\right) \\otimes \\mathbf{F'} $$\r \r During multiplication, the attention values are broadcasted (copied) accordingly: channel attention values are broadcasted along the spatial dimension, and vice versa. $\\mathbf{F}''$ is the final refined\r output.""" ; skos:prefLabel "CBAM" . :CBHG a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**CBHG** is a building block used in the [Tacotron](https://paperswithcode.com/method/tacotron) text-to-speech model. It consists of a bank of 1-D convolutional filters, followed by highway networks and a bidirectional gated recurrent unit ([BiGRU](https://paperswithcode.com/method/bigru)). 
\r \r The module is used to extract representations from sequences. The input sequence is first\r convolved with $K$ sets of 1-D convolutional filters, where the $k$-th set contains $C\\_{k}$ filters of width $k$ (i.e. $k = 1, 2, \\dots , K$). These filters explicitly model local and contextual information (akin to modeling unigrams, bigrams, up to K-grams). The [convolution](https://paperswithcode.com/method/convolution) outputs are stacked together and further max pooled along time to increase local invariances. A stride of 1 is used to preserve the original time resolution. The processed sequence is further passed to a few fixed-width 1-D convolutions, whose outputs are added with the original input sequence via residual connections. [Batch normalization](https://paperswithcode.com/method/batch-normalization) is used for all convolutional layers. The convolution outputs are fed into a multi-layer [highway network](https://paperswithcode.com/method/highway-network) to extract high-level features. Finally, a bidirectional [GRU](https://paperswithcode.com/method/gru) RNN is stacked on top to extract sequential features from both forward and backward context.""" ; skos:prefLabel "CBHG" . :CBNet a skos:Concept ; dcterms:source ; skos:altLabel "Composite Backbone Network" ; skos:definition """**CBNet** is a backbone architecture that consists of multiple identical backbones (specially called Assistant Backbones and Lead Backbone) and composite connections between neighbor backbones. From left to right, the output of each stage in an Assistant Backbone, namely higher-level\r features, flows to the parallel stage of the succeeding backbone as part of inputs through composite connections. Finally, the feature maps of the last backbone named Lead\r Backbone are used for object detection. 
The features extracted by CBNet for object detection fuse the high-level and low-level features of multiple backbones, hence improving the detection performance.""" ; skos:prefLabel "CBNet" . :CBoWWord2Vec a skos:Concept ; dcterms:source ; skos:altLabel "Continuous Bag-of-Words Word2Vec" ; skos:definition """**Continuous Bag-of-Words Word2Vec** is an architecture for creating word embeddings that uses $n$ future words as well as $n$ past words to create a word embedding. The objective function for CBOW is:\r \r $$ J\\_\\theta = \\frac{1}{T}\\sum^{T}\\_{t=1}\\log{p}\\left(w\\_{t}\\mid{w}\\_{t-n},\\ldots,w\\_{t-1}, w\\_{t+1},\\ldots,w\\_{t+n}\\right) $$\r \r In the CBOW model, the distributed representations of context are used to predict the word in the middle of the window. This contrasts with [Skip-gram Word2Vec](https://paperswithcode.com/method/skip-gram-word2vec) where the distributed representation of the input word is used to predict the context.""" ; skos:prefLabel "CBoW Word2Vec" . :CCAC a skos:Concept ; dcterms:source ; skos:altLabel "Confidence Calibration with an Auxiliary Class" ; skos:definition "**Confidence Calibration with an Auxiliary Class**, or **CCAC**, is a post-hoc confidence calibration method for DNN classifiers on OOD datasets. The key feature of CCAC is an auxiliary class in the calibration model which separates mis-classified samples from correctly classified ones, thus effectively mitigating cases where the target DNN is confidently wrong. It also reduces the number of free parameters to facilitate transfer to a new unseen dataset." ; skos:prefLabel "CCAC" . :CCNet a skos:Concept ; dcterms:source ; skos:altLabel "Criss-Cross Network" ; skos:definition """**Criss-Cross Network** (**CCNet**) aims to obtain full-image contextual information in an effective and efficient way. 
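A toy NumPy sketch of the criss-cross idea (illustrative only; the real module uses learned query/key/value projections rather than the raw features):

```python
import numpy as np

def criss_cross_attention(F):
    # F: (H, W, C). Each pixel attends only to pixels on its own row and
    # column (toy sketch: queries/keys/values are the features themselves).
    H, W, C = F.shape
    out = np.zeros_like(F)
    for i in range(H):
        for j in range(W):
            # The criss-cross path of (i, j): its full row plus its full column.
            path = np.concatenate([F[i, :, :], F[:, j, :]], axis=0)  # (H+W, C)
            scores = path @ F[i, j]            # similarity to the query pixel
            w = np.exp(scores - scores.max())  # softmax over the path
            w /= w.sum()
            out[i, j] = w @ path               # weighted sum over the path
    return out

F = np.random.default_rng(0).standard_normal((4, 5, 3))
y1 = criss_cross_attention(F)   # one pass: row/column context only
y2 = criss_cross_attention(y1)  # recurrence: full-image context
```

Applying the module twice lets information reach every position from every other position via one shared row or column.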
Concretely,\r for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture the full-image dependencies. **CCNet** has the following\r merits: **1)** GPU memory friendly. Compared with the [non-local block](https://paperswithcode.com/method/non-local-block), the proposed recurrent criss-cross attention module requires 11× less GPU memory usage. **2)** High computational efficiency. The recurrent criss-cross attention reduces the FLOPs of the non-local block by about 85%. **3)** State-of-the-art performance.""" ; skos:prefLabel "CCNet" . :CCT a skos:Concept ; dcterms:source ; skos:altLabel "Compact Convolutional Transformers" ; skos:definition "**Compact Convolutional Transformers** utilize sequence pooling and replace the patch embedding with a convolutional embedding, allowing for better inductive bias and making positional embeddings optional. CCT achieves better accuracy than ViT-Lite (smaller ViTs) and increases the flexibility of the input parameters." ; skos:prefLabel "CCT" . :CDCC-NET a skos:Concept ; dcterms:source ; skos:definition "CDCC-NET is a multi-task network that analyzes the detected counter region and predicts 9 outputs: eight float numbers referring to the corner positions (x0/w, y0/h, ... , x3/w, y3/h) and an array containing two float numbers regarding the probability of the counter being legible/operational or illegible/faulty." ; skos:prefLabel "CDCC-NET" . :CDEP a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Contextual Decomposition Explanation Penalization" ; skos:definition """**Contextual Decomposition Explanation Penalization (CDEP)** is a method which leverages existing explanation techniques for neural networks in order to prevent a model from learning\r unwanted relationships and ultimately improve predictive accuracy. 
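A minimal sketch of the penalized objective (illustrative NumPy; `lam`, the importance scores, and the mask are hypothetical user inputs, not the paper's API):

```python
import numpy as np

def cdep_loss(pred_loss, importances, unwanted_mask, lam=1.0):
    # Total objective = prediction loss + lambda * penalty on the importance
    # scores of features the user marked as unwanted (toy penalty: the
    # summed magnitude of those importances).
    penalty = np.abs(importances[unwanted_mask]).sum()
    return pred_loss + lam * penalty

importances = np.array([0.9, -0.4, 0.05, 0.3])   # per-feature explanation scores
unwanted = np.array([False, True, False, True])  # features that must not matter
loss = cdep_loss(0.25, importances, unwanted, lam=0.5)
```

Driving the penalty to zero forces the model to stop relying on the flagged features, which is the sense in which the explanation, not just the prediction, is trained.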
Given particular importance\r scores, CDEP works by allowing the user to directly penalize importances of certain features, or\r interactions. This forces the neural network to not only produce the correct prediction, but also the\r correct explanation for that prediction.""" ; skos:prefLabel "CDEP" . :CDIL-CNN a skos:Concept ; dcterms:source ; skos:altLabel "Circular Dilated Convolutional Neural Networks" ; skos:definition "" ; skos:prefLabel "CDIL-CNN" . :CELU a skos:Concept ; skos:altLabel "Continuously Differentiable Exponential Linear Units" ; skos:definition """Exponential Linear Units (ELUs) are a useful rectifier for constructing deep learning architectures, as they may speed up and otherwise improve learning by virtue of not having vanishing gradients and by having mean activations near zero. However, the ELU activation as parametrized in [1] is not continuously differentiable with respect to its input when the shape parameter alpha is not equal to 1. We present an alternative parametrization which is C1 continuous for all values of alpha, making the rectifier easier to reason about and making alpha easier to tune. This alternative parametrization has several other useful properties that the original parametrization of ELU does not: 1) its derivative with respect to x is bounded, 2) it contains both the linear transfer function and ReLU as special cases, and 3) it is scale-similar with respect to alpha.\r $$\\text{CELU}(x) = \\max(0,x) + \\min(0, \\alpha * (\\exp(x/\\alpha) - 1))$$""" ; skos:prefLabel "CELU" . :CGMM a skos:Concept ; dcterms:source ; skos:altLabel "Contextual Graph Markov Model" ; skos:definition """Contextual Graph Markov Model (CGMM) is an approach combining ideas from generative models and neural networks for the processing of graph data. It is founded on a constructive methodology to build a deep architecture comprising layers of probabilistic models that learn to encode the structured information in an incremental fashion. 
Context is diffused in an efficient and scalable way across the graph vertexes and edges. The resulting graph encoding is used in combination with discriminative models to address structure classification benchmarks.\r \r Description and image from: [Contextual Graph Markov Model: A Deep and Generative Approach to Graph Processing](https://arxiv.org/pdf/1805.10636.pdf)""" ; skos:prefLabel "CGMM" . :CGNN a skos:Concept ; dcterms:source ; skos:altLabel "Crystal Graph Neural Network" ; skos:definition "The full architecture of CGNN is presented at [CGNN's official site](https://tony-y.github.io/cgnn/architectures/)." ; skos:prefLabel "CGNN" . :CGRU a skos:Concept ; dcterms:source ; skos:altLabel "Convolutional GRU" ; skos:definition """A **Convolutional Gated Recurrent Unit** is a type of [GRU](https://paperswithcode.com/method/gru) that combines GRUs with the [convolution](https://paperswithcode.com/method/convolution) operation. The update rule for input $x\\_{t}$ and the previous output $h\\_{t-1}$ is given by the following:\r \r $$ r = \\sigma\\left(W\\_{r} \\star\\_{n}\\left[h\\_{t-1};x\\_{t}\\right] + b\\_{r}\\right) $$\r \r $$ u = \\sigma\\left(W\\_{u} \\star\\_{n}\\left[h\\_{t-1};x\\_{t}\\right] + b\\_{u} \\right) $$\r \r $$ c = \\rho\\left(W\\_{c} \\star\\_{n}\\left[x\\_{t}; r \\odot h\\_{t-1}\\right] + b\\_{c} \\right) $$\r \r $$ h\\_{t} = u \\odot h\\_{t-1} + \\left(1-u\\right) \\odot c $$\r \r In these equations $\\sigma$ and $\\rho$ are the elementwise sigmoid and [ReLU](https://paperswithcode.com/method/relu) functions respectively and the $\\star\\_{n}$ represents a convolution with a kernel of size $n \\times n$. Brackets are used to represent a feature concatenation.""" ; skos:prefLabel "CGRU" . 
:CHM a skos:Concept ; dcterms:source ; skos:altLabel "Convolutional Hough Matching" ; skos:definition "**Convolutional Hough Matching**, or **CHM**, is a geometric matching algorithm that distributes similarities of candidate matches over a geometric transformation space and evaluates them in a convolutional manner. It is cast into a trainable neural layer with a semi-isotropic high-dimensional kernel, which learns non-rigid matching with a small number of interpretable parameters." ; skos:prefLabel "CHM" . :CIDA a skos:Concept ; dcterms:source ; skos:altLabel "Continuously Indexed Domain Adaptation" ; skos:definition """**Continuously Indexed Domain Adaptation** combines traditional adversarial adaptation with a novel discriminator that models the encoding-conditioned domain index distribution.\r \r Image Source: [Wang et al.](https://arxiv.org/pdf/2007.01807v2.pdf)""" ; skos:prefLabel "CIDA" . :CInCFlow a skos:Concept ; dcterms:source ; skos:altLabel "Characterizable Invertible 3x3 Convolution" ; skos:definition "Characterizable Invertible $3\\times3$ Convolution" ; skos:prefLabel "CInC Flow" . :CKConv a skos:Concept ; dcterms:source ; skos:altLabel "Continuous Kernel Convolution" ; skos:definition "" ; skos:prefLabel "CKConv" . :CLIP a skos:Concept ; dcterms:source ; skos:altLabel "Contrastive Language-Image Pre-training" ; skos:definition """**Contrastive Language-Image Pre-training** (**CLIP**), consisting of a simplified version of ConVIRT trained from scratch, is an efficient method of image representation learning from natural language supervision. CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset’s classes. 
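The contrastive objective described below can be sketched in NumPy (an illustrative toy, not the released implementation):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Normalize, build the N x N cosine-similarity matrix, then apply a
    # symmetric cross entropy whose target class is the diagonal (true pairs).
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N)
    n = len(logits)

    def xent(l):  # mean cross entropy with the diagonal as the target
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    return 0.5 * (xent(logits) + xent(logits.T))  # image->text and text->image

rng = np.random.default_rng(0)
aligned = rng.standard_normal((8, 16))
loss_aligned = clip_loss(aligned, aligned)  # perfectly matched pairs
loss_shuffled = clip_loss(aligned, aligned[np.roll(np.arange(8), 1)])  # wrong pairs
```

With perfectly matched pairs the diagonal dominates and the loss is near zero; mismatched pairs drive it up.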
\r \r For pre-training, CLIP is trained to predict which of the $N \\times N$ possible (image, text) pairings across a batch actually occurred. CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the $N$ real pairs in the batch while minimizing the cosine similarity of the embeddings of the $N^2 - N$ incorrect pairings. A symmetric cross entropy loss is optimized over these similarity scores. \r \r Image credit: [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/pdf/2103.00020.pdf)""" ; skos:prefLabel "CLIP" . :CLIPort a skos:Concept ; dcterms:source ; skos:definition "CLIPort is a language-conditioned imitation-learning agent that combines the broad semantic understanding (what) of CLIP [1] with the spatial precision (where) of Transporter [2]." ; skos:prefLabel "CLIPort" . :CLRNet a skos:Concept ; dcterms:source ; skos:altLabel "Convolutional LSTM based Residual Network" ; skos:definition "" ; skos:prefLabel "CLRNet" . :CMCL a skos:Concept ; dcterms:source ; skos:altLabel "Crossmodal Contrastive Learning" ; skos:definition "**CMCL**, or **Crossmodal Contrastive Learning**, is a method for unifying visual and textual representations into the same semantic space based on a large-scale corpus of image collections, text corpus and image-text pairs. The CMCL aligns the visual representations and textual representations, and unifies them into the same semantic space based on image-text pairs. As shown in the Figure, to facilitate different levels of semantic alignment between vision and language, a series of text rewriting techniques are utilized to improve the diversity of cross-modal information. Specifically, for an image-text pair, various positive examples and hard negative examples can be obtained by rewriting the original caption at different levels. 
Moreover, to incorporate more background information from the single-modal data, text and image retrieval are also applied to augment each image-text pair with various related texts and images. The positive pairs, negative pairs, related images and texts are learned jointly by CMCL. In this way, the model can effectively unify different levels of visual and textual representations into the same semantic space, and incorporate more single-modal knowledge to enhance each other." ; skos:prefLabel "CMCL" . :CNNBiLSTM a skos:Concept ; dcterms:source ; skos:altLabel "CNN Bidirectional LSTM" ; skos:definition "A **CNN BiLSTM** is a hybrid bidirectional [LSTM](https://paperswithcode.com/method/lstm) and CNN architecture. In the original formulation applied to named entity recognition, it learns both character-level and word-level features. The CNN component is used to induce the character-level features. For each word the model employs a [convolution](https://paperswithcode.com/method/convolution) and a [max pooling](https://paperswithcode.com/method/max-pooling) layer to extract a new feature vector from the per-character feature vectors such as character embeddings and (optionally) character type." ; skos:prefLabel "CNN BiLSTM" . :COCO-FUNIT a skos:Concept ; dcterms:source ; skos:definition """**COCO-FUNIT** is a few-shot image translation model which computes the style embedding of the example images conditioned on the input image and a new module called the constant style bias. It builds on top of [FUNIT](https://arxiv.org/abs/1905.01723) by identifying the content loss problem and then addressing it with a novel content-conditioned style encoder architecture.\r \r The FUNIT method suffers from the content loss problem: the translation result is not well-aligned with the input image. While a direct theoretical analysis is likely elusive, we conduct an empirical study, aiming to identify the cause of the content loss problem. 
In their analysis, the authors show that the FUNIT style encoder produces very different style codes using different crops -- suggesting the style code contains other information about the style image such as the object pose.\r \r To make the style embedding more robust to small variations in the style image, a new style encoder architecture, the Content-Conditioned style encoder (COCO), is introduced. The most distinctive feature of this new encoder is the conditioning on the content image as illustrated in the top-right of the Figure. Unlike the style encoder in FUNIT, COCO takes both content and style image as input. With this content-conditioning scheme, a direct feedback path is created during learning to let the content image influence how the style code is computed. It also helps reduce the direct influence of the style image on the extracted style code.""" ; skos:prefLabel "COCO-FUNIT" . :COLA a skos:Concept ; dcterms:source ; skos:definition "**COLA** is a self-supervised pre-training approach for learning a general-purpose representation of audio. It is based on contrastive learning: it learns a representation which assigns high similarity to audio segments extracted from the same recording while assigning lower similarity to segments from different recordings." ; skos:prefLabel "COLA" . :CORAD a skos:Concept ; dcterms:source ; skos:altLabel "CORAD: Correlation-Aware Compression of Massive Time Series using Sparse Dictionary Coding" ; skos:definition "" ; skos:prefLabel "CORAD" . :CP-N3 a skos:Concept ; dcterms:source ; skos:altLabel "Canonical Tensor Decomposition with N3 Regularizer" ; skos:definition "Canonical Tensor Decomposition, trained with N3 regularizer" ; skos:prefLabel "CP-N3" . :CP-N3-RP a skos:Concept ; dcterms:source ; skos:altLabel "CP with N3 Regularizer and Relation Prediction" ; skos:definition "CP with N3 Regularizer and Relation Prediction" ; skos:prefLabel "CP-N3-RP" . 
:CPCv2 a skos:Concept ; dcterms:source ; skos:definition """**Contrastive Predictive Coding v2 (CPC v2)** is a self-supervised learning approach that builds upon the original [CPC](https://paperswithcode.com/method/contrastive-predictive-coding) with several improvements. These improvements include:\r \r - **Model capacity** - The third residual stack of [ResNet](https://paperswithcode.com/method/resnet)-101 (originally containing 23 blocks, 1024-dimensional feature maps, and 256-dimensional bottleneck layers) is converted to use 46 blocks, with 4096-dimensional feature maps and 512-dimensional bottleneck layers: ResNet-161.\r \r - **Layer Normalization** - The authors find that CPC with [batch normalization](https://paperswithcode.com/method/batch-normalization) harms downstream performance. They hypothesize this is due to batch normalization allowing large models to find a trivial solution to CPC: it introduces a dependency between patches (through the batch statistics) that can be exploited to bypass the constraints on the receptive field. They replace batch normalization with [layer normalization](https://paperswithcode.com/method/layer-normalization).\r \r - **Predicting lengths and directions** - Patches are predicted with contexts from both directions rather than just spatially underneath.\r \r - **Patch-based Augmentation** - Utilising "color dropping" which randomly drops two of the three color channels in each patch, as well as random horizontal flips.\r \r \r Consistent with prior results, this new architecture delivers better performance regardless of""" ; skos:prefLabel "CPC v2" . :CPM-2 a skos:Concept ; dcterms:source ; skos:definition "**CPM-2** is an 11-billion-parameter pre-trained language model based on a standard Transformer architecture consisting of a bidirectional encoder and a unidirectional decoder. The model is pre-trained on WuDaoCorpus which contains 2.3TB cleaned Chinese data as well as 300GB cleaned English data. 
The pre-training process of CPM-2 can be divided into three stages: Chinese pre-training, bilingual pre-training, and MoE pre-training. Multi-stage training with knowledge inheritance can significantly reduce the computation cost." ; skos:prefLabel "CPM-2" . :CPN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Contour Proposal Network" ; skos:definition "The Contour Proposal Network (CPN) detects possibly overlapping objects in an image while simultaneously fitting pixel-precise closed object contours. The CPN can incorporate state of the art object detection architectures as backbone networks into a fast single-stage instance segmentation model that can be trained end-to-end." ; skos:prefLabel "CPN" . :CPN3 a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "CP with N3 Regularizer" ; skos:definition "CP with N3 Regularizer" ; skos:prefLabel "CP N3" . :CPVT a skos:Concept ; dcterms:source ; skos:altLabel "Conditional Position Encoding Vision Transformer" ; skos:definition "**CPVT**, or **Conditional Position Encoding Vision Transformer**, is a type of [vision transformer](https://paperswithcode.com/methods/category/vision-transformer) which utilizes [conditional positional encoding](https://paperswithcode.com/method/conditional-positional-encoding). Other than the new encodings, it follows the same architecture of [ViT](https://paperswithcode.com/method/vision-transformer) and [DeiT](https://paperswithcode.com/method/deit)." ; skos:prefLabel "CPVT" . :CPconv a skos:Concept ; dcterms:source ; skos:altLabel "Center-pivot convolution" ; skos:definition "" ; skos:prefLabel "CP conv" . :CR-NET a skos:Concept ; dcterms:source ; skos:definition "CR-NET is a YOLO-based model proposed for license plate character detection and recognition" ; skos:prefLabel "CR-NET" . 
:CRF a skos:Concept ; skos:altLabel "Conditional Random Field" ; skos:definition """**Conditional Random Fields** or **CRFs** are a type of probabilistic graphical model that take neighboring sample context into account for tasks like classification. Prediction is modeled as a graphical model, which implements dependencies between the predictions. Graph choice depends on the application: for example, linear chain CRFs are popular in natural language processing, whereas in image-based tasks the graph would connect to neighboring locations in an image to enforce that they have similar predictions.\r \r Image Credit: [Charles Sutton and Andrew McCallum, An Introduction to Conditional Random Fields](https://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf)""" ; skos:prefLabel "CRF" . :CRF-RNN a skos:Concept ; dcterms:source ; skos:definition "**CRF-RNN** is a formulation of a [CRF](https://paperswithcode.com/method/crf) as a Recurrent Neural Network. Specifically it formulates mean-field approximate inference for the Conditional Random Fields with Gaussian pairwise potentials as Recurrent Neural Networks." ; skos:prefLabel "CRF-RNN" . :CRISS a skos:Concept ; dcterms:source ; skos:definition "**CRISS**, or **Cross-lingual Retrieval for Iterative Self-Supervised Training**, is a self-supervised learning method for multilingual sequence generation. CRISS is developed based on the finding that the encoder outputs of a multilingual denoising autoencoder can be used as a language-agnostic representation to retrieve parallel sentence pairs, and that training the model on these retrieved sentence pairs can further improve its sentence retrieval and translation capabilities in an iterative manner. Using only unlabeled data from many different languages, CRISS iteratively mines for parallel sentences across languages, trains a new better multilingual model using these mined sentence pairs, mines again for better parallel sentences, and repeats." ; skos:prefLabel "CRISS" . 
:CRN a skos:Concept ; dcterms:source ; skos:altLabel "Conditional Relation Network" ; skos:definition "**Conditional Relation Network**, or **CRN**, is a building block to construct more sophisticated structures for representation and reasoning over video. CRN takes as input an array of tensorial objects and a conditioning feature, and computes an array of encoded output objects. Model building becomes a simple exercise of replication, rearrangement and stacking of these reusable units for diverse modalities and contextual information. This design thus supports high-order relational and multi-step reasoning." ; skos:prefLabel "CRN" . :CReLU a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**CReLU**, or **Concatenated Rectified Linear Units**, is a type of activation function which preserves both positive and negative phase information while enforcing non-saturated non-linearity. It is computed by concatenating the layer output $h$ as:\r \r $$ \\left[\\text{ReLU}\\left(h\\right), \\text{ReLU}\\left(-h\\right)\\right] $$""" ; skos:prefLabel "CReLU" . :CS-GAN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**CS-GAN** is a type of generative adversarial network that uses a form of deep compressed sensing, and [latent optimisation](https://paperswithcode.com/method/latent-optimisation), to improve the quality of generated samples." ; skos:prefLabel "CS-GAN" . :CSGLD a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Contour Stochastic Gradient Langevin Dynamics" ; skos:definition "Simulations of multi-modal distributions can be very costly and often lead to unreliable predictions. To accelerate the computations, we propose to sample from a flattened distribution and estimate the importance weights between the original distribution and the flattened distribution to ensure the correctness of the distribution." ; skos:prefLabel "CSGLD" . 
:CSL a skos:Concept ; dcterms:source ; skos:altLabel "Circular Smooth Label" ; skos:definition "**Circular Smooth Label** (CSL) is a classification-based rotation detection technique for arbitrary-oriented object detection. It is used for circularly distributed angle classification and addresses the periodicity of the angle and increases the error tolerance to adjacent angles." ; skos:prefLabel "CSL" . :CSPDarknet53 a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**CSPDarknet53** is a convolutional neural network and backbone for object detection that uses [DarkNet-53](https://paperswithcode.com/method/darknet-53). It employs a CSPNet strategy to partition the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network. \r \r This CNN is used as the backbone for [YOLOv4](https://paperswithcode.com/method/yolov4).""" ; skos:prefLabel "CSPDarknet53" . :CSPDenseNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**CSPDenseNet** is a convolutional neural network and object detection backbone where we apply the Cross Stage Partial Network (CSPNet) approach to [DenseNet](https://paperswithcode.com/method/densenet). The CSPNet partitions the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network." ; skos:prefLabel "CSPDenseNet" . :CSPDenseNet-Elastic a skos:Concept ; dcterms:source ; skos:definition "**CSPDenseNet-Elastic** is a convolutional neural network and object detection backbone where we apply the Cross Stage Partial Network (CSPNet) approach to [DenseNet-Elastic](https://paperswithcode.com/method/densenet-elastic). The CSPNet partitions the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. 
The use of a split and merge strategy allows for more gradient flow through the network." ; skos:prefLabel "CSPDenseNet-Elastic" . :CSPPeleeNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**CSPPeleeNet** is a convolutional neural network and object detection backbone where we apply the Cross Stage Partial Network (CSPNet) approach to [PeleeNet](https://paperswithcode.com/method/peleenet). The CSPNet partitions the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network." ; skos:prefLabel "CSPPeleeNet" . :CSPResNeXt a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**CSPResNeXt** is a convolutional neural network where we apply the Cross Stage Partial Network (CSPNet) approach to [ResNeXt](https://paperswithcode.com/method/resnext). The CSPNet partitions the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network." ; skos:prefLabel "CSPResNeXt" . :CSPResNeXtBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**CSPResNeXt Block** is an extended [ResNext Block](https://paperswithcode.com/method/resnext-block) where we partition the feature map of the base layer into two parts and then merge them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network." ; skos:prefLabel "CSPResNeXt Block" . :CT-Layer a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Commute Times Layer" ; skos:definition """**TL;DR: CT-Layer is a GNN Layer which is able to rewire a graph in an inductive and parameter-free way according to the commute times distance (or effective resistance). 
We address this by learning a differentiable way to compute the CT-embedding of the graph.**\r \r ### Summary\r \r **CT-Layer** learns the *Commute Times distance* between nodes (i.e. the *effective resistance distance*) in a **differentiable** way, instead of the common spectral version, and in a **parameter-free** manner, which is not the case for the heat kernel. This approach allows solving it as an optimization problem inside a GNN, yielding a new layer that learns how to rewire a given graph in an optimal and **inductive** way. \r \r In addition, **CT-Layer** is able to learn *Commute Times embeddings* and then calculate them for any graph in an inductive way. The Commute Times embedding is also related to the *eigenvalues* and *eigenvectors* of the Laplacian of the graph, because the CT embedding is just the eigenvectors, scaled. Therefore, CT-Layer is also able to learn how to calculate the spectrum of the Laplacian in a differentiable way; accordingly, this embedding must satisfy orthogonality and normality.\r \r Finally, recent connections have been found between the commute times distance and **curvature** (which is non-differentiable), establishing equivalences between them. Therefore, **CT-Layer** can also be seen as the differentiable version of curvature rewiring.\r \r **What follows is a quick overview of the layer; see the paper for a detailed explanation.**\r \r ### Spectral CT-Embedding downsides\r The CT-embedding $\\mathbf{Z}$ is computed spectrally in the literature (until the proposal of this method) or approximated using the heat kernel (very dependent on the hyperparameter $t$). 
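For reference, the spectral computation just mentioned can be written directly in NumPy on a toy path graph (illustrative code, not the paper's implementation; it checks the classical identity that the commute time equals vol(G) times the effective resistance):

```python
import numpy as np

# Spectral commute-times embedding for the path graph 0-1-2 (toy example).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
D = np.diag(A.sum(axis=1))
L = D - A
vol = A.sum()               # vol(G) = sum of degrees = 2|E| = 4

lam, F = np.linalg.eigh(L)  # eigenvalues ascending; lam[0] is the trivial 0
nz = lam > 1e-9             # drop the zero eigenvalue
Z = np.sqrt(vol) * F[:, nz] / np.sqrt(lam[nz])  # rows are the embeddings z_u

def commute_time(u, v):
    # Commute time = squared Euclidean distance between CT embeddings.
    return np.sum((Z[u] - Z[v]) ** 2)
```

On this path graph the effective resistance between adjacent nodes is 1 and between the endpoints is 2, so the commute times are 4 and 8 respectively.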
This fact does not allow us to propose differentiable methods using that measure:\r $$\r \\mathbf{Z}=\\sqrt{vol(G)}\\mathbf{\\Lambda}^{-\\frac{1}{2}}\\mathbf{F}^T \\textrm{ given } \\mathbf{L}=\\mathbf{F}\\mathbf{\\Lambda}\\mathbf{F}^T\r $$\r \r The CT-distance is then given by the Euclidean distance between the embeddings, $CT_{uv} = ||\\mathbf{z_u}-\\mathbf{z_v}||^2$. The spectral form is: \r \r $$\r \\frac{CT_{uv}}{vol(G)} = \\sum_{i=2}^n \\frac{1}{\\lambda_i} (\\mathbf{f}_i(u)-\\mathbf{f}_i(v))^2 \r $$\r where the $\\mathbf{f}_i$ are the eigenvectors of the graph Laplacian. \r \r These embeddings and distances give us desirable properties of the graph, such as an understanding of its structure and an embedding, based on the spectrum, which minimizes Dirichlet energies. However, **the spectral computation is not differentiable**.\r \r ### CT-Layer as an optimization problem: Differentiable, learnable and inductive CT-Layer\r Given that $\\mathbf{Z}$ minimizes Dirichlet energies subject to being orthogonal and normalized, we can formulate this problem as constraining neighboring nodes to have similar embeddings s.t. 
$\\mathbf{Z}^T\\mathbf{Z}=\\mathbf{I}$.\r \r $$\r \\mathbf{Z} = \\arg\\min_{\\mathbf{Z}^T\\mathbf{Z}=\\mathbf{I}} \\frac{\\sum\\_{u,v} ||\\mathbf{z_u}-\\mathbf{z_v}||^2\\mathbf{A}\\_{uv}}{\\sum\\_{u,v} \\mathbf{Z}^2\\_{uv} d_u}=\\frac{Tr[\\mathbf{Z}^T\\mathbf{L}\\mathbf{Z}]}{Tr[\\mathbf{Z}^T\\mathbf{D}\\mathbf{Z}]}\r $$\r \r With the above elements we have a definition of **CT-Layer**, our rewiring layer: \r Given the matrix $\\mathbf{X}\\_{n\\times F}$ encoding the features of the nodes after any message passing (MP) layer, $\\mathbf{Z}\\_{n\\times O(n)}=\\tanh(\\textrm{MLP}(\\mathbf{X}))$ learns the association $\\mathbf{X}\\rightarrow \\mathbf{Z}$ while $\\mathbf{Z}$ is optimized according to the loss \r $$\r L\\_{CT} = \\frac{Tr[\\mathbf{Z}^T\\mathbf{L}\\mathbf{Z}]}{Tr[\\mathbf{Z}^T\\mathbf{D}\\mathbf{Z}]} + \\left\\|\\frac{\\mathbf{Z}^T\\mathbf{Z}}{\\|\\mathbf{Z}^T\\mathbf{Z}\\|\\_F} - \\mathbf{I}\\_n\\right\\|\\_F\r $$\r This results in the following **resistance diffusion** $\\mathbf{T}^{CT} = \\mathbf{R}(\\mathbf{S})\\odot \\mathbf{A}$ (Hadamard product between the resistance distance and the adjacency) which provides as input to the subsequent MP layer a learnt convolution matrix.\r \r As explained before, $\\mathbf{Z}$ is the **commute times embedding matrix** and the pairwise Euclidean distances of those learned embeddings are the **commute times distances** or resistance distances. **Therefore, once this layer is trained, it will be able to calculate the commute times embedding for a new graph, and rewire that new and unseen graph in a principled way based on the commute times distance.**\r \r ## Preservation of Structure\r Does this rewiring preserve the original structure? Let $G' = \\textrm{Sparsify}(G, q)$ be a sampling algorithm of graph $G = (V, E)$, where edges $e \\in E$ are sampled with probability $q\\propto R_e$ (**proportional to the effective resistance, i.e. 
commute times**).\r Then, for $n = |V|$ sufficiently large and $1/\\sqrt{n}< \\epsilon\\le 1$, we need $O(n\\log n/\\epsilon^2)$ samples to satisfy:\r \r $$\r \\forall \\mathbf{x}\\in\\mathbb{R}^n:\\; (1-\\epsilon)\\mathbf{x}^T\\mathbf{L}\\_G\\mathbf{x}\\le\\mathbf{x}^T\\mathbf{L}\\_{G'}\\mathbf{x}\\le (1+\\epsilon)\\mathbf{x}^T\\mathbf{L}\\_G\\mathbf{x}\r $$\r \r The intuition behind this is that Dirichlet energies in $G'$ are bounded within $(1\\pm \\epsilon)$ of the Dirichlet energies of the original graph $G$.""" ; skos:prefLabel "CT-Layer" . :CT3D a skos:Concept ; dcterms:source ; skos:definition """**CT3D** is a two-stage 3D object detection framework that leverages a high-quality region proposal network and a Channel-wise [Transformer](https://paperswithcode.com/method/transformer) architecture. The proposed CT3D simultaneously performs proposal-aware embedding and channel-wise context aggregation for the point features within each proposal. Specifically, CT3D uses a proposal's keypoints for spatial contextual modelling and learns attention propagation in the encoding module, mapping the proposal to point embeddings. Next, a new channel-wise decoding module enriches the query-key interaction via channel-wise re-weighting to effectively merge multi-level contexts, which contributes to more accurate object predictions. \r \r In CT3D, the raw points are first fed into the [RPN](https://paperswithcode.com/method/rpn) for generating 3D proposals. Then the raw points along with the corresponding proposals are processed by the channel-wise Transformer composed of the proposal-to-point encoding module and the channel-wise decoding module. Specifically, the proposal-to-point encoding module is designed to modulate each point feature with global proposal-aware context information. 
After that, the encoded point features are transformed into an effective proposal feature representation by the\r channel-wise decoding module for confidence prediction and box regression.""" ; skos:prefLabel "CT3D" . :CTAB-GAN a skos:Concept ; dcterms:source ; skos:definition "**CTAB-GAN** is a model for conditional tabular data generation. The generator and discriminator utilize the [DCGAN](https://paperswithcode.com/method/dcgan) architecture. An [auxiliary classifier](https://paperswithcode.com/method/auxiliary-classifier) is also used with an MLP architecture." ; skos:prefLabel "CTAB-GAN" . :CTAL a skos:Concept ; dcterms:source ; skos:definition "**CTAL** is a pre-training framework for strong audio-and-language representations with a [Transformer](https://paperswithcode.com/method/transformer), which aims to learn the intra-modality and inter-modality connections between audio and language through two proxy tasks on a large number of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. The pre-trained model is a Transformer for Audio and Language, i.e., CTAL, which consists of two modules: a language stream encoding module, which takes words as input elements, and a text-referred audio stream encoder module, which accepts both frame-level Mel-spectrograms and token-level output embeddings from the language stream." ; skos:prefLabel "CTAL" . :CTCLoss a skos:Concept ; rdfs:seeAlso ; skos:altLabel "Connectionist Temporal Classification Loss" ; skos:definition "A **Connectionist Temporal Classification Loss**, or **CTC Loss**, is designed for tasks where we need alignment between sequences, but where that alignment is difficult - e.g. aligning each character to its location in an audio file. It calculates a loss between a continuous (unsegmented) time series and a target sequence. 
It does this by summing over the probability of possible alignments of input to target, producing a loss value which is differentiable with respect to each input node. The alignment of input to target is assumed to be “many-to-one”, which limits the length of the target sequence such that it must be $\\leq$ the input length." ; skos:prefLabel "CTC Loss" . :CTRL a skos:Concept ; dcterms:source ; skos:definition """**CTRL** is a conditional [transformer](https://paperswithcode.com/method/transformer) language model, trained\r to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw\r text, preserving the advantages of unsupervised learning while providing more\r explicit control over text generation. These codes also allow CTRL to predict\r which parts of the training data are most likely given a sequence.""" ; skos:prefLabel "CTRL" . :CTracker a skos:Concept ; dcterms:source ; skos:altLabel "Chained-Tracker" ; skos:definition """**Chained-Tracker**, or **CTracker**, is an online model for multiple-object tracking. It chains paired bounding box regression results estimated from overlapping nodes, of which each node covers two adjacent frames. The paired regression is made attentive by object-attention (brought by a detection module) and identity-attention (ensured by an ID verification module).\r \r The joint attention module guides the paired boxes regression branch to focus on informative spatial regions with two other branches. One is the object classification branch, which predicts the confidence scores for the first box in the detected box pairs, and such scores are used to guide the regression branch to focus on the foreground regions. The other one is the ID verification branch, whose prediction facilitates the regression branch to focus on regions corresponding to the same target. 
Finally, the bounding box pairs are filtered according to the classification confidence. Then, the generated box pairs belonging to the adjacent frame pairs could be associated using simple methods like IoU (Intersection over Union) matching according to their boxes in the common frame. In this way, the tracking process could be achieved by chaining all the adjacent frame pairs (i.e. chain nodes) sequentially.""" ; skos:prefLabel "CTracker" . :CV-MIM a skos:Concept ; dcterms:source ; skos:altLabel "Contrastive Cross-View Mutual Information Maximization" ; skos:definition "**CV-MIM**, or **Contrastive Cross-View Mutual Information Maximization**, is a representation learning method to disentangle pose-dependent as well as view-dependent factors from 2D human poses. The method trains a network using cross-view mutual information maximization, which maximizes mutual information of the same pose performed from different viewpoints in a contrastive learning manner. It further utilizes two regularization terms to ensure disentanglement and smoothness of the learned representations." ; skos:prefLabel "CV-MIM" . :CVRL a skos:Concept ; dcterms:source ; skos:altLabel "Contrastive Video Representation Learning" ; skos:definition """**Contrastive Video Representation Learning**, or **CVRL**, is a self-supervised contrastive learning framework for learning spatiotemporal visual representations from unlabeled videos. Representations are learned using a contrastive loss, where two clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. Data augmentations are designed involving spatial and temporal cues. Concretely, a [temporally consistent spatial augmentation](https://paperswithcode.com/method/temporally-consistent-spatial-augmentation#) method is used to impose strong spatial augmentations on each frame of the video while maintaining the temporal consistency across frames. 
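
The temporally consistent idea can be sketched in a few lines of plain Python (a minimal illustration with a hypothetical helper name, covering cropping and flipping only; a real pipeline would apply further augmentations such as color jitter, equally consistently across frames):

```python
import random

def temporally_consistent_augment(frames, crop_h, crop_w, rng=random):
    """Apply the SAME random crop and horizontal flip to every frame of a clip,
    so the spatial augmentation is identical across time."""
    h, w = len(frames[0]), len(frames[0][0])
    top = rng.randint(0, h - crop_h)    # sampled once per clip, not per frame
    left = rng.randint(0, w - crop_w)
    flip = rng.random() < 0.5
    clip = []
    for frame in frames:
        crop = [row[left:left + crop_w] for row in frame[top:top + crop_h]]
        if flip:
            crop = [row[::-1] for row in crop]
        clip.append(crop)
    return clip
```

Because the crop offsets and the flip decision are sampled once per clip, every frame receives the identical spatial transform.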
A sampling-based temporal augmentation method is also used to avoid overly enforcing invariance on clips that are distant in time. \r \r End-to-end, from a raw video, we first sample a temporal interval from a monotonically decreasing distribution. The temporal interval represents the number of frames between the start points of two clips, and we sample two clips from a video according to this interval. Afterwards we apply a [temporally consistent spatial augmentation](https://paperswithcode.com/method/temporally-consistent-spatial-augmentation) to each of the clips and feed them into a 3D backbone with an MLP head. The contrastive loss is used to train the network to attract the clips from the same video and repel the clips from different videos in the embedding space.""" ; skos:prefLabel "CVRL" . :CW-ERM a skos:Concept ; dcterms:source ; skos:altLabel "Closed-loop Weighted Empirical Risk Minimization" ; skos:definition "A closed-loop evaluation procedure is first used in a simulator to identify training data samples that are important for practical driving performance, and these samples are then used to help debias the policy network." ; skos:prefLabel "CW-ERM" . :CaiT a skos:Concept ; dcterms:source ; skos:altLabel "Class-Attention in Image Transformers" ; skos:definition "**CaiT**, or **Class-Attention in Image Transformers**, is a type of [vision transformer](https://paperswithcode.com/methods/category/vision-transformer) with several design alterations upon the original [ViT](https://paperswithcode.com/method/vision-transformer). First, a new layer scaling approach called [LayerScale](https://paperswithcode.com/method/layerscale) is used, adding a learnable diagonal matrix on the output of each residual block, initialized close to (but not at) 0, which improves the training dynamics. Secondly, [class-attention layers](https://paperswithcode.com/method/ca) are introduced to the architecture. 
This creates an architecture where the transformer layers involving [self-attention](https://paperswithcode.com/method/scaled) between patches are explicitly separated from class-attention layers -- which are devoted to extracting the content of the processed patches into a single vector so that it can be fed to a linear classifier." ; skos:prefLabel "CaiT" . :CanvasMethod a skos:Concept ; dcterms:source ; skos:definition "**Canvas Method** is a method for inference attacks on object detection models. It draws a predicted bounding box distribution on an empty canvas for an attack model input. The canvas is initially set to an image of 300$\\times$300 pixels in size, where every pixel has a value of zero and the boxes drawn on the canvas have the same center as the predicted boxes and the same intensity as the prediction scores." ; skos:prefLabel "Canvas Method" . :CapsNet a skos:Concept ; dcterms:source ; skos:altLabel "Capsule Network" ; skos:definition "**Capsule Network** is a type of artificial neural network that can be used to better model hierarchical relationships. The approach is an attempt to more closely mimic biological neural organization." ; skos:prefLabel "CapsNet" . :CapsuleNetwork a skos:Concept ; dcterms:source ; skos:definition """A capsule is an activation vector that performs complex internal computations on its inputs. The length of the activation vector signifies the probability that a feature is present, while the state of the recognized entity is encoded in the direction the vector points. Traditional CNNs instead use max pooling to achieve invariance of neuron activity: a minor change in the input leaves the output neurons unchanged.""" ; skos:prefLabel "Capsule Network" . 
:CascadeCornerPooling a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Cascade Corner Pooling** is a pooling layer for object detection that builds upon the [corner pooling](https://paperswithcode.com/method/corner-pooling) operation. Corners often lie outside the objects, where local appearance features are lacking. [CornerNet](https://paperswithcode.com/method/cornernet) uses corner pooling to address this issue, where we find the maximum values on the boundary directions so as to determine corners. However, this makes corners sensitive to the edges. To address this problem, we need to let corners see the visual patterns of objects. Cascade corner pooling first looks along a boundary to find a boundary maximum value, then looks inside along the location of the boundary maximum value to find an internal maximum value, and finally adds the two maximum values together. By doing this, the corners obtain both the boundary information and the visual patterns of objects." ; skos:prefLabel "Cascade Corner Pooling" . :CascadeMaskR-CNN a skos:Concept ; dcterms:source ; skos:definition """**Cascade Mask R-CNN** extends [Cascade R-CNN](https://paperswithcode.com/method/cascade-r-cnn) to instance segmentation, by adding a\r mask head to the cascade.\r \r In the [Mask R-CNN](https://paperswithcode.com/method/mask-r-cnn), the segmentation branch is inserted in parallel to the detection branch. However, the Cascade [R-CNN](https://paperswithcode.com/method/r-cnn) has multiple detection branches. This raises the questions of 1) where to add the segmentation branch and 2) how many segmentation branches to add. The authors consider three strategies for mask prediction in the Cascade R-CNN. The first two strategies address the first question, adding a single mask prediction head at either the first or last stage of the Cascade R-CNN. 
Since the instances used to train the segmentation branch are the positives of the detection branch, their number varies in these two strategies. Placing the segmentation head later on the cascade leads to more examples. However, because segmentation is a pixel-wise operation, a large number of highly overlapping instances is not necessarily as helpful as for object detection, which is a patch-based operation. The third strategy addresses the second question, adding a segmentation branch to each\r cascade stage. This maximizes the diversity of samples used to learn the mask prediction task. \r \r At inference time, all three strategies predict the segmentation masks on the patches produced by the final object detection stage, irrespective of the cascade stage on which the segmentation mask is implemented and how many segmentation branches there are.""" ; skos:prefLabel "Cascade Mask R-CNN" . :CascadePSP a skos:Concept ; dcterms:source ; skos:definition "**CascadePSP** is a general segmentation refinement model that refines any given segmentation from low to high resolution. The model takes as input an initial mask that can be an output of any algorithm to provide a rough object location. Then the CascadePSP will output a refined mask. The model is designed in a cascade fashion that generates refined segmentation in a coarse-to-fine manner. Coarse outputs from the early levels predict object structure which will be used as input to the latter levels to refine boundary details." ; skos:prefLabel "CascadePSP" . :CascadeR-CNN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Cascade R-CNN** is an object detection architecture that seeks to address problems with degrading performance with increased IoU thresholds (due to overfitting during training and inference-time mismatch between IoUs for which detector is optimal and the inputs). 
It is a multi-stage extension of the [R-CNN](https://paperswithcode.com/method/r-cnn), where detector stages deeper into the cascade are sequentially more selective against close false positives. The cascade of R-CNN stages is trained sequentially, using the output of one stage to train the next. This is motivated by the observation that the output IoU of a regressor is almost invariably better than the input IoU. \r \r Cascade R-CNN does not aim to mine hard negatives. Instead, by adjusting bounding boxes, each stage aims to find a good set of close false positives for training the next stage. When operating in this manner, a sequence of detectors adapted to increasingly higher IoUs can beat the overfitting problem, and thus be effectively trained. At inference, the same cascade procedure is applied. The progressively improved hypotheses are better matched to the increasing detector quality at each stage.""" ; skos:prefLabel "Cascade R-CNN" . :CategoricalModularity a skos:Concept ; dcterms:source ; skos:definition """**Categorical Modularity** is a low-resource intrinsic metric for evaluating word\r embedding quality, based on graph modularity.""" ; skos:prefLabel "Categorical Modularity" . :CausalConvolution a skos:Concept ; dcterms:source ; skos:definition "**Causal convolutions** are a type of [convolution](https://paperswithcode.com/method/convolution) used for temporal data which ensures the model cannot violate the ordering in which we model the data: the prediction $p(x_{t+1} | x_{1}, \\ldots, x_{t})$ emitted by the model at timestep $t$ cannot depend on any of the future timesteps $x_{t+1}, x_{t+2}, \\ldots, x_{T}$. For images, the equivalent of a causal convolution is a [masked convolution](https://paperswithcode.com/method/masked-convolution) which can be implemented by constructing a mask tensor and doing an element-wise multiplication of this mask with the convolution kernel before applying it. 
For 1-D data such as audio one can more easily implement this by shifting the output of a normal convolution by a few timesteps." ; skos:prefLabel "Causal Convolution" . :CausalInference a skos:Concept ; skos:definition "Causal inference is the process of drawing a conclusion about a causal connection based on the conditions of the occurrence of an effect. The main difference between causal inference and inference of association is that the former analyzes the response of the effect variable when the cause is changed." ; skos:prefLabel "Causal Inference" . :CayleyNet a skos:Concept ; dcterms:source ; skos:definition """The core ingredient of **CayleyNet** is a new class of parametric rational complex functions (Cayley polynomials) allowing to efficiently compute spectral filters on graphs that specialize on frequency bands of interest. The model generates rich spectral filters that are localized in space, scales linearly with the size of the input data for sparsely-connected graphs, and can handle different constructions of Laplacian operators.\r \r Description adapted from: [CayleyNets: Graph Convolutional Neural Networks with Complex Rational Spectral Filters](https://arxiv.org/pdf/1705.07664.pdf)""" ; skos:prefLabel "CayleyNet" . :CeiT a skos:Concept ; dcterms:source ; skos:altLabel "Convolution-enhanced image Transformer" ; skos:definition "**Convolution-enhanced image Transformer** (**CeiT**) combines the advantages of CNNs in extracting low-level features, strengthening locality, and the advantages of Transformers in establishing long-range dependencies. 
Three modifications are made to the original Transformer: 1) instead of the straightforward tokenization from raw input images, we design an **Image-to-Tokens** (**I2T**) module that extracts patches from generated low-level features; 2) the feed-forward network in each encoder block is replaced with a **Locally-enhanced Feed-Forward** (**LeFF**) layer that promotes the correlation among neighbouring tokens in the spatial dimension; 3) a **Layer-wise Class token Attention** (**LCA**) is attached at the top of the Transformer that utilizes the multi-level representations." ; skos:prefLabel "CeiT" . :CenterMask a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**CenterMask** is an anchor-free instance segmentation method that adds a novel [spatial attention-guided mask](https://paperswithcode.com/method/spatial-attention-guided-mask) (SAG-Mask) branch to the anchor-free one-stage object detector [FCOS](https://paperswithcode.com/method/fcos), in the same vein as [Mask R-CNN](https://paperswithcode.com/method/mask-r-cnn). Plugged into the FCOS object detector, the SAG-Mask branch predicts a segmentation mask on each detected box with the spatial attention map that helps to focus on informative pixels and suppress noise." ; skos:prefLabel "CenterMask" . :CenterNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**CenterNet** is a one-stage object detector that detects each object as a triplet, rather than a pair, of keypoints. It utilizes two customized modules named [cascade corner pooling](https://paperswithcode.com/method/cascade-corner-pooling) and [center pooling](https://paperswithcode.com/method/center-pooling), which play the roles of enriching information collected by both top-left and bottom-right corners and providing more recognizable information at the central regions, respectively. 
The intuition is that, if a predicted bounding box has a high IoU with the ground-truth box, then the probability that the center keypoint in its central region is predicted as the same class is high, and vice versa. Thus, during inference, after a proposal is generated as a pair of corner keypoints, we determine if the proposal is indeed an object by checking if there is a center keypoint of the same class falling within its central region." ; skos:prefLabel "CenterNet" . :CenterPoint a skos:Concept ; dcterms:source ; skos:definition "**CenterPoint** is a two-stage 3D detector that finds centers of objects and their properties using a keypoint detector and regresses to other attributes, including 3D size, 3D orientation and velocity. In a second stage, it refines these estimates using additional point features on the object. CenterPoint uses a standard Lidar-based backbone network, i.e., VoxelNet or PointPillars, to build a representation of the input point-cloud. CenterPoint predicts the relative offset (velocity) of objects between consecutive frames, which are then linked up greedily -- so in CenterPoint, 3D object tracking simplifies to greedy closest-point matching." ; skos:prefLabel "CenterPoint" . :CenterPooling a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Center Pooling** is a pooling technique for object detection that aims to capture richer and more recognizable visual patterns. The geometric centers of objects do not necessarily convey very recognizable visual patterns (e.g., the human head contains strong visual patterns, but the center keypoint is often in the middle of the human body). \r \r The detailed process of center pooling is as follows: the backbone outputs a feature map, and to determine if a pixel in the feature map is a center keypoint, we need to find the maximum value in both its horizontal and vertical directions and add them together. 
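
This per-pixel row/column reduction can be sketched in NumPy (a minimal single-channel sketch; the official implementation realizes the same computation with cascaded directional max operations on GPU):

```python
import numpy as np

def center_pooling(fmap):
    """For each location of a (H, W) feature map, add the maximum of its row
    (horizontal direction) to the maximum of its column (vertical direction)."""
    row_max = fmap.max(axis=1, keepdims=True)  # (H, 1): horizontal maxima
    col_max = fmap.max(axis=0, keepdims=True)  # (1, W): vertical maxima
    return row_max + col_max                   # broadcasts back to (H, W)
```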
By doing this, center pooling improves the detection of center keypoints.""" ; skos:prefLabel "Center Pooling" . :CenterTrack a skos:Concept ; dcterms:source ; skos:altLabel "Track objects as points" ; skos:definition "Our tracker, CenterTrack, applies a detection model to a pair of images and detections from the prior frame. Given this minimal input, CenterTrack localizes objects and predicts their associations with the previous frame. That's it. CenterTrack is simple, online (no peeking into the future), and real-time." ; skos:prefLabel "CenterTrack" . :CentripetalNet a skos:Concept ; dcterms:source ; skos:definition "**CentripetalNet** is a keypoint-based detector which uses centripetal shift to pair corner keypoints from the same instance. CentripetalNet predicts the position and the centripetal shift of the corner points and matches corners whose shifted results are aligned." ; skos:prefLabel "CentripetalNet" . a skos:Concept ; dcterms:source ; skos:definition "Channel & spatial attention combines the advantages of channel attention and spatial attention. It adaptively selects both important objects and regions." ; skos:prefLabel "Channel & Spatial attention" . :Channel-wiseCrossAttention a skos:Concept ; dcterms:source ; skos:definition """**Channel-wise Cross Attention** is a module for semantic segmentation used in the [UCTransNet](https://paperswithcode.com/method/uctransnet) architecture. It is used to fuse features of inconsistent semantics between the Channel [Transformer](https://paperswithcode.com/method/transformer) and [U-Net](https://paperswithcode.com/method/u-net) decoder. It guides the channel and information filtration of the Transformer features and eliminates the ambiguity with the decoder features.\r \r Mathematically, we take the $i$-th level Transformer output $\\mathbf{O\\_{i}} \\in \\mathbb{R}^{C×H×W}$ and the $i$-th level decoder feature map $\\mathbf{D\\_{i}} \\in \\mathbb{R}^{C×H×W}$ as the inputs of Channel-wise Cross Attention. 
Spatial squeeze is performed by a [global average pooling](https://paperswithcode.com/method/global-average-pooling) (GAP) layer, producing vector $\\mathcal{G}\\left(\\mathbf{X}\\right) \\in \\mathbb{R}^{C×1×1}$ with its $k$th channel $\\mathcal{G}\\left(\\mathbf{X}\\right)^{k} = \\frac{1}{H×W}\\sum^{H}\\_{i=1}\\sum^{W}\\_{j=1}\\mathbf{X}^{k}\\left(i, j\\right)$. We use this operation to embed the global spatial information and then generate the attention mask:\r \r $$ \\mathbf{M}\\_{i} = \\mathbf{L}\\_{1} \\cdot \\mathcal{G}\\left(\\mathbf{O\\_{i}}\\right) + \\mathbf{L}\\_{2} \\cdot \\mathcal{G}\\left(\\mathbf{D}\\_{i}\\right) $$\r \r where $\\mathbf{L}\\_{1} \\in \\mathbb{R}^{C×C}$ and $\\mathbf{L}\\_{2} \\in \\mathbb{R}^{C×C}$ are the weights of two Linear layers, and $\\delta\\left(\\cdot\\right)$ denotes the [ReLU](https://paperswithcode.com/method/relu) operator. This operation in the equation above encodes the channel-wise dependencies. Following [ECA-Net](https://paperswithcode.com/method/eca-net), which empirically showed avoiding dimensionality reduction is important for learning channel attention, the authors use a single [Linear layer](https://paperswithcode.com/method/linear-layer) and sigmoid function to build the channel attention map. The resultant vector is used to recalibrate or excite $\\mathbf{O\\_{i}}$ to $\\mathbf{\\bar{O}\\_{i}} = \\sigma\\left(\\mathbf{M\\_{i}}\\right) \\cdot \\mathbf{O\\_{i}}$, where the activation $\\sigma\\left(\\mathbf{M\\_{i}}\\right)$ indicates the importance of each channel. Finally, the masked $\\mathbf{\\bar{O}}\\_{i}$ is concatenated with the up-sampled features of the $i$-th level decoder.""" ; skos:prefLabel "Channel-wise Cross Attention" . :Channel-wiseCrossFusionTransformer a skos:Concept ; dcterms:source ; skos:definition "**Channel-wise Cross Fusion Transformer** is a module used in the [UCTransNet](https://paperswithcode.com/method/uctransnet) architecture for semantic segmentation. 
It fuses the multi-scale encoder features with the advantage of the long dependency modeling in the [Transformer](https://paperswithcode.com/method/transformer). The [CCT](https://paperswithcode.com/method/cct) module consists of three steps: multi-scale feature embedding, multi-head [channel-wise cross attention](https://paperswithcode.com/method/channel-wise-cross-attention) and Multi-Layer Perceptron (MLP)." ; skos:prefLabel "Channel-wise Cross Fusion Transformer" . :Channel-wiseSoftAttention a skos:Concept ; skos:definition """**Channel-wise Soft Attention** is an attention mechanism in computer vision that assigns "soft" attention weights to each channel $c$. In soft channel-wise attention, the alignment weights are learned and placed "softly" over each channel. This contrasts with hard attention, which selects only one channel to attend to at a time.\r \r Image: [Xu et al](http://proceedings.mlr.press/v37/xuc15.pdf)""" ; skos:prefLabel "Channel-wise Soft Attention" . :ChannelAttentionModule a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **Channel Attention Module** is a module for channel-based attention in convolutional neural networks. We produce a channel attention map by exploiting the inter-channel relationship of features. As each channel of a feature map is considered as a feature detector, channel attention focuses on ‘what’ is meaningful given an input image. To compute the channel attention efficiently, we squeeze the spatial dimension of the input feature map. \r \r We first aggregate spatial information of a feature map by using both average-pooling and max-pooling operations, generating two different spatial context descriptors: $\\mathbf{F}^{c}\\_{avg}$ and $\\mathbf{F}^{c}\\_{max}$, which denote average-pooled features and max-pooled features respectively. 
\r \r Both descriptors are then forwarded to a shared network to produce our channel attention map $\\mathbf{M}\\_{c} \\in \\mathbb{R}^{C\\times{1}\\times{1}}$. Here $C$ is the number of channels. The shared network is composed of multi-layer perceptron (MLP) with one hidden layer. To reduce parameter overhead, the hidden activation size is set to $\\mathbb{R}^{C/r×1×1}$, where $r$ is the reduction ratio. After the shared network is applied to each descriptor, we merge the output feature vectors using element-wise summation. In short, the channel attention is computed as:\r \r $$ \\mathbf{M\\_{c}}\\left(\\mathbf{F}\\right) = \\sigma\\left(\\text{MLP}\\left(\\text{AvgPool}\\left(\\mathbf{F}\\right)\\right)+\\text{MLP}\\left(\\text{MaxPool}\\left(\\mathbf{F}\\right)\\right)\\right) $$\r \r $$ \\mathbf{M\\_{c}}\\left(\\mathbf{F}\\right) = \\sigma\\left(\\mathbf{W\\_{1}}\\left(\\mathbf{W\\_{0}}\\left(\\mathbf{F}^{c}\\_{avg}\\right)\\right) +\\mathbf{W\\_{1}}\\left(\\mathbf{W\\_{0}}\\left(\\mathbf{F}^{c}\\_{max}\\right)\\right)\\right) $$\r \r where $\\sigma$ denotes the sigmoid function, $\\mathbf{W}\\_{0} \\in \\mathbb{R}^{C/r\\times{C}}$, and $\\mathbf{W}\\_{1} \\in \\mathbb{R}^{C\\times{C/r}}$. Note that the MLP weights, $\\mathbf{W}\\_{0}$ and $\\mathbf{W}\\_{1}$, are shared for both inputs and the [ReLU](https://paperswithcode.com/method/relu) activation function is followed by $\\mathbf{W}\\_{0}$.\r \r Note that the channel attention module with just [average pooling](https://paperswithcode.com/method/average-pooling) is the same as the [Squeeze-and-Excitation Module](https://paperswithcode.com/method/squeeze-and-excitation-block).""" ; skos:prefLabel "Channel Attention Module" . :ChannelShuffle a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Channel Shuffle** is an operation to help information flow across feature channels in convolutional neural networks. 
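
In NumPy, the whole operation reduces to a reshape–transpose–reshape over the channel dimension (a minimal sketch for a single feature map of shape $(g \cdot n, H, W)$; a deep learning framework would apply the same index permutation to a batched tensor):

```python
import numpy as np

def channel_shuffle(x, groups):
    """Shuffle channels of a (C, H, W) feature map: view C = g * n as (g, n),
    swap the two axes, and flatten back so channels from different groups interleave."""
    c, h, w = x.shape
    n = c // groups
    return x.reshape(groups, n, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)
```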
It was used as part of the [ShuffleNet](https://paperswithcode.com/method/shufflenet) architecture. \r \r If we allow a group [convolution](https://paperswithcode.com/method/convolution) to obtain input data from different groups, the input and output channels will be fully related. Specifically, for the feature map generated from the previous group layer, we can first divide the channels in each group into several subgroups, then feed each group in the next layer with different subgroups. \r \r The above can be efficiently and elegantly implemented by a channel shuffle operation: suppose a convolutional layer with $g$ groups whose output has $g \\times n$ channels; we first reshape the output channel dimension into $\\left(g, n\\right)$, then transpose and flatten it back as the input of the next layer. Channel shuffle is also differentiable, which means it can be embedded into network structures for end-to-end training.""" ; skos:prefLabel "Channel Shuffle" . :ChannelSqueezeandSpatialExcitation a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Channel Squeeze and Spatial Excitation (sSE)" ; skos:definition "Inspired by the widely known [spatial squeeze and channel excitation (SE)](https://paperswithcode.com/method/squeeze-and-excitation-block) block, the sSE block performs channel squeeze and spatial excitation, to recalibrate the feature maps spatially and achieve more fine-grained image segmentation." ; skos:prefLabel "Channel Squeeze and Spatial Excitation" . :Channelattention a skos:Concept ; dcterms:source ; skos:altLabel "squeeze-and-excitation networks" ; skos:definition """SENet pioneered channel attention. The core of SENet is a squeeze-and-excitation (SE) block which is used to collect global information, capture channel-wise relationships and improve representation ability.\r SE blocks are divided into two parts, a squeeze module and an excitation module. 
Global spatial information is collected in the squeeze module by global average pooling. The excitation module captures channel-wise relationships and outputs an attention vector by using fully-connected layers and non-linear layers (ReLU and sigmoid). Then, each channel of the input feature is scaled by multiplying the corresponding element in the attention vector. Overall, a squeeze-and-excitation block $F_\\text{se}$ (with parameter $\\theta$) which takes $X$ as input and outputs $Y$ can be formulated \r as:\r \\begin{align}\r s = F_\\text{se}(X, \\theta) & = \\sigma (W_{2} \\delta (W_{1}\\text{GAP}(X)))\r \\end{align}\r \\begin{align}\r Y = sX\r \\end{align}""" ; skos:prefLabel "Channel attention" . :CharacterBERT a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "CharacterBERT is a variant of [BERT](https://paperswithcode.com/method/bert) that **drops the wordpiece system** and **replaces it with a CharacterCNN module** just like the one [ELMo](https://paperswithcode.com/method/elmo) uses to produce its first layer representation. This allows CharacterBERT to represent any input token without splitting it into wordpieces. Moreover, this frees BERT from the burden of a domain-specific wordpiece vocabulary which may not be suited to your domain of interest (e.g. medical domain). Finally, it allows the model to be more robust to noisy inputs." ; skos:prefLabel "CharacterBERT" . :CharacteristicFunctions a skos:Concept ; dcterms:source ; skos:altLabel "Characteristic Function Estimation for Discrete Probability Distributions" ; skos:definition "" ; skos:prefLabel "Characteristic Functions" . :Charformer a skos:Concept ; dcterms:source ; skos:definition "**Charformer** is a type of [Transformer](https://paperswithcode.com/methods/category/transformers) model that learns a subword tokenization end-to-end as part of the model. 
Specifically it uses [GBST](https://paperswithcode.com/method/gradient-based-subword-tokenization) that automatically learns latent subword representations from characters in a data-driven fashion. Following GBST, the soft subword sequence is passed through [Transformer](https://paperswithcode.com/method/transformer) layers." ; skos:prefLabel "Charformer" . :CheXNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**CheXNet** is a 121-layer [DenseNet](https://paperswithcode.com/method/densenet) trained on ChestX-ray14 for pneumonia detection." ; skos:prefLabel "CheXNet" . :ChebNet a skos:Concept ; dcterms:source ; skos:definition """ChebNet involves a formulation of CNNs in the context of spectral graph theory, which provides the necessary mathematical background and efficient numerical schemes to design fast localized convolutional filters on graphs.\r \r Description from: [Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering](https://arxiv.org/pdf/1606.09375.pdf)""" ; skos:prefLabel "ChebNet" . :Child-Tuning a skos:Concept ; dcterms:source ; skos:definition "**Child-Tuning** is a fine-tuning technique that updates a subset of parameters (called child network) of large pretrained models via strategically masking out the gradients of the non-child network during the backward process. It decreases the hypothesis space of the model via a task-specific mask applied to the full gradients, helping to effectively adapt the large-scale pretrained model to various tasks and meanwhile aiming to maintain its original generalization ability." ; skos:prefLabel "Child-Tuning" . :Chimera a skos:Concept ; dcterms:source ; skos:definition """**Chimera** is a pipeline model parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. The key idea of Chimera is to combine two pipelines in different directions (down and up pipelines). 
\r \r Denote $N$ as the number of micro-batches executed by each worker within a training iteration, and $D$ the number of pipeline stages (depth), and $P$ the number of workers.\r \r The Figure shows an example with four pipeline stages (i.e. $D=4$). Here we assume there are $D$ micro-batches executed by each worker within a training iteration, namely $N=D$, which is the minimum to keep all the stages active. \r \r In the down pipeline, stage$\\_{0}$∼stage$\\_{3}$ are mapped to $P\\_{0}∼P\\_{3}$ linearly, while in the up pipeline the stages are mapped in a completely opposite order. The $N$ (assuming an even number) micro-batches are equally partitioned among the two pipelines. Each pipeline schedules $N/2$ micro-batches using the 1F1B strategy, as shown in the left part of the Figure. Then, by merging these two pipelines together, we obtain the pipeline schedule of Chimera. Given an even number of stages $D$ (which can be easily satisfied in practice), it is guaranteed that there is no conflict (i.e., at most one micro-batch occupies a given time slot on each worker) during merging.""" ; skos:prefLabel "Chimera" . :Chinchilla a skos:Concept ; dcterms:source ; skos:definition "Chinchilla is a 70B-parameter model trained as a compute-optimal model with 1.4 trillion tokens. Findings suggest that these types of models are trained optimally by equally scaling both model size and training tokens. It uses the same compute budget as Gopher but with 4x more training data. Chinchilla and Gopher are trained for the same number of FLOPs. It is trained on [MassiveText](/dataset/massivetext) using a slightly modified SentencePiece tokenizer. More architectural details in the paper." ; skos:prefLabel "Chinchilla" .
:ChinesePre-trainedUnbalancedTransformer a skos:Concept ; dcterms:source ; skos:definition "**CPT**, or **Chinese Pre-trained Unbalanced Transformer**, is a pre-trained unbalanced [Transformer](https://paperswithcode.com/method/transformer) for Chinese natural language understanding (NLU) and natural language generation (NLG) tasks. CPT consists of three parts: a shared encoder, an understanding decoder, and a generation decoder. The two specific decoders with a shared encoder are pre-trained with masked language modeling (MLM) and denoising auto-encoding (DAE) tasks, respectively. With the partially shared architecture and multi-task pre-training, CPT can (1) learn specific knowledge of both NLU and NLG tasks with two decoders and (2) be fine-tuned flexibly to fully exploit the potential of the model." ; skos:prefLabel "Chinese Pre-trained Unbalanced Transformer" . :ClariNet a skos:Concept ; dcterms:source ; skos:definition "**ClariNet** is an end-to-end text-to-speech architecture. Unlike previous TTS systems which use text-to-spectrogram models with a separate waveform [synthesizer](https://paperswithcode.com/method/synthesizer) (vocoder), ClariNet is a text-to-wave architecture that is fully convolutional and can be trained from scratch. In ClariNet, the [WaveNet](https://paperswithcode.com/method/wavenet) module is conditioned on the hidden states instead of the mel-spectrogram. The architecture is otherwise based on [Deep Voice 3](https://paperswithcode.com/method/deep-voice-3)." ; skos:prefLabel "ClariNet" .
:Class-MLP a skos:Concept ; dcterms:source ; skos:definition "**Class-MLP** is an alternative to [average pooling](https://paperswithcode.com/method/average-pooling), which is an adaptation of the class-attention token introduced in [CaiT](https://paperswithcode.com/method/cait). In CaiT, this consists of two layers that have the same structure as the [transformer](https://paperswithcode.com/method/transformer), but in which only the class token is updated based on the frozen patch embeddings. In Class-MLP, the same approach is used, but after aggregating the patches with a [linear layer](https://paperswithcode.com/method/linear-layer), we replace the [attention-based interaction](https://paperswithcode.com/method/scaled) between the class and patch embeddings by simple linear layers, still keeping the patch embeddings frozen. This increases the performance, at the expense of adding some parameters and computational cost. This pooling variant is referred to as “class-MLP”, since the purpose of these few layers is to replace average pooling." ; skos:prefLabel "Class-MLP" . :ClassActivationGuidedAttentionMechanism a skos:Concept ; dcterms:source ; skos:altLabel "Class Activation Guided Attention Mechanism (CAGAM)" ; skos:definition "CAGAM is a form of spatial attention mechanism that propagates attention from a known to an unknown context features thereby enhancing the unknown context for relevant pattern discovery. Usually the known context feature is a class activation map ([CAM](https://paperswithcode.com/method/cam))." ; skos:prefLabel "Class Activation Guided Attention Mechanism" . 
:ClassAttention a skos:Concept ; dcterms:source ; skos:definition """A **Class Attention** layer, or **CA Layer**, is an [attention mechanism](https://paperswithcode.com/methods/category/attention-mechanisms-1) for [vision transformers](https://paperswithcode.com/methods/category/vision-transformer) used in [CaiT](https://paperswithcode.com/method/cait) that aims to extract information from a set of processed patches. It is identical to a [self-attention layer](https://paperswithcode.com/method/scaled), except that it relies on the attention between (i) the class embedding $x_{\\text {class }}$ (initialized at CLS in the first CA) and (ii) itself plus the set of frozen patch embeddings $x_{\\text {patches }} .$ \r \r Considering a network with $h$ heads and $p$ patches, and denoting by $d$ the embedding size, the multi-head class-attention is parameterized with several projection matrices, $W_{q}, W_{k}, W_{v}, W_{o} \\in \\mathbf{R}^{d \\times d}$, and the corresponding biases $b_{q}, b_{k}, b_{v}, b_{o} \\in \\mathbf{R}^{d} .$ With this notation, the computation of the CA residual block proceeds as follows. We first augment the patch embeddings (in matrix form) as $z=\\left[x_{\\text {class }}, x_{\\text {patches }}\\right]$. We then perform the projections:\r \r $$Q=W\\_{q} x\\_{\\text {class }}+b\\_{q}$$\r \r $$K=W\\_{k} z+b\\_{k}$$\r \r $$V=W\\_{v} z+b\\_{v}$$\r \r The class-attention weights are given by\r \r $$\r A=\\operatorname{Softmax}\\left(Q . K^{T} / \\sqrt{d / h}\\right)\r $$\r \r where $Q . K^{T} \\in \\mathbf{R}^{h \\times 1 \\times p}$. This attention is involved in the weighted sum $A \\times V$ to produce the residual output vector\r \r $$\r \\operatorname{out}\\_{\\mathrm{CA}}=W\\_{o} A V+b\\_{o}\r $$\r \r which is in turn added to $x\\_{\\text {class }}$ for subsequent processing.""" ; skos:prefLabel "Class Attention" . 
:ClassSR a skos:Concept ; dcterms:source ; skos:definition "**ClassSR** is a framework to accelerate super-resolution (SR) networks on large images (2K-8K). ClassSR combines classification and SR in a unified framework. In particular, it first uses a Class-Module to classify the sub-images into different classes according to restoration difficulties, then applies an SR-Module to perform SR for different classes. The Class-Module is a conventional classification network, while the SR-Module is a network container that consists of the to-be-accelerated SR network and its simplified versions." ; skos:prefLabel "ClassSR" . :ClipBERT a skos:Concept ; dcterms:source ; skos:definition """**ClipBERT** is a framework for end-to-end learning for video-and-language tasks, by employing sparse sampling, where only a single or a few sparsely sampled short clips from a video are used at each training step. Two aspects distinguish ClipBERT from previous work. \r \r First, in contrast to densely extracting video features (adopted by most existing methods), ClipBERT sparsely samples only a single or a few short clips from the full-length videos at each training step. The hypothesis is that visual features from sparse clips already capture key visual and semantic information in the video, as consecutive clips usually contain similar semantics from a continuous scene. Thus, a handful of clips are sufficient for training, instead of using the full video. Then, predictions from multiple densely-sampled clips are aggregated to obtain the final video-level prediction during inference, which is less computationally demanding. \r \r The second differentiating aspect concerns the initialization of model weights (i.e., transfer through pre-training).
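The sparse-sampling-plus-aggregation recipe above can be sketched in a few lines of pure Python (`predict_clip` stands in for the actual ClipBERT model and is hypothetical):

```python
import random

# Sparsely sample n_clips short clips (as frame-index lists) from a video.
def sample_clips(num_frames, clip_len, n_clips, rng=random):
    starts = rng.sample(range(num_frames - clip_len + 1), n_clips)
    return [list(range(s, s + clip_len)) for s in sorted(starts)]

# Aggregate per-clip predictions (here: mean of the logits) into a single
# video-level prediction, as done at inference time.
def video_prediction(clips, predict_clip):
    preds = [predict_clip(c) for c in clips]
    return [sum(col) / len(preds) for col in zip(*preds)]
```

At training time only one or a few sampled clips contribute to each step; at inference, predictions over more clips are aggregated.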
The authors use 2D architectures (e.g., [ResNet](https://paperswithcode.com/method/resnet)-50) instead of 3D features as the visual backbone for video encoding, allowing them to harness the power of image-text pretraining for video-text understanding along with the advantages of low memory cost and runtime efficiency.""" ; skos:prefLabel "ClipBERT" . :ClippedDoubleQ-learning a skos:Concept ; dcterms:source ; skos:definition """**Clipped Double Q-learning** is a variant on [Double Q-learning](https://paperswithcode.com/method/double-q-learning) that upper-bounds the less biased Q estimate $Q\\_{\\theta\\_{2}}$ by the biased estimate $Q\\_{\\theta\\_{1}}$. This is equivalent to taking the minimum of the two estimates, resulting in the following target update:\r \r $$ y\\_{1} = r + \\gamma\\min\\_{i=1,2}Q\\_{\\theta'\\_{i}}\\left(s', \\pi\\_{\\phi\\_{1}}\\left(s'\\right)\\right) $$\r \r The motivation for this extension is that vanilla double [Q-learning](https://paperswithcode.com/method/q-learning) is sometimes ineffective if the target and current networks are too similar, e.g. with a slow-changing policy in an actor-critic framework.""" ; skos:prefLabel "Clipped Double Q-learning" . :Cluster-GCN a skos:Concept ; dcterms:source ; skos:definition """Cluster-GCN is a novel GCN algorithm that is suitable for SGD-based training by exploiting the graph clustering structure. Cluster-GCN works as the following: at each step, it samples a block of nodes that associate with a dense subgraph identified by a graph clustering algorithm, and restricts the neighborhood search within this subgraph. This simple but effective strategy leads to significantly improved memory and computational efficiency while being able to achieve comparable test accuracy with previous algorithms.\r \r Description and image from: [Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks](https://arxiv.org/pdf/1905.07953.pdf)""" ; skos:prefLabel "Cluster-GCN" . 
:ClusterFit a skos:Concept ; dcterms:source ; skos:definition "**ClusterFit** is a self-supervision approach for learning image representations. Given a dataset, we (a) cluster its features extracted from a pre-trained network using k-means and (b) re-train a new network from scratch on this dataset using cluster assignments as pseudo-labels." ; skos:prefLabel "ClusterFit" . :Co-Correcting a skos:Concept ; dcterms:source ; skos:definition "**Co-Correcting** is a noise-tolerant deep learning framework for medical image classification based on mutual learning and annotation correction. It consists of three modules: the dual-network architecture, the curriculum learning module, and the label correction module." ; skos:prefLabel "Co-Correcting" . :CoBERL a skos:Concept ; dcterms:source ; skos:altLabel "Contrastive BERT" ; skos:definition """**Contrastive BERT** is a reinforcement learning agent that combines a new contrastive loss and a hybrid [LSTM](https://paperswithcode.com/method/lstm)-[transformer](https://paperswithcode.com/method/transformer) architecture to tackle the challenge of improving data efficiency for RL. It uses bidirectional masked prediction in combination with a generalization of recent contrastive methods to learn better representations for transformers in RL, without the need of hand engineered data augmentations.\r \r For the architecture, a residual network is used to encode observations into embeddings $Y\\_{t}$. $Y_{t}$ is fed through a causally masked [GTrXL transformer](https://www.paperswithcode.com/method/gtrxl), which computes the predicted masked inputs $X\\_{t}$ and passes those together with $Y\\_{t}$ to a learnt gate. The output of the gate is passed through a single [LSTM](https://www.paperswithcode.com/method/lstm) layer to produce the values that we use for computing the RL loss. A contrastive loss is computed using predicted masked inputs $X_{t}$ and $Y_{t}$ as targets. 
For this, we do not use the causal mask of the Transformer.""" ; skos:prefLabel "CoBERL" . :CoLU a skos:Concept ; dcterms:source ; skos:altLabel "Collapsing Linear Unit" ; skos:definition """CoLU is an activation function similar to Swish and Mish in properties. It is defined as:\r $$f(x)=\\frac{x}{1-xe^{-(x+e^x)}}$$\r It is smooth, continuously differentiable, unbounded above, bounded below, non-saturating, and non-monotonic. In experiments comparing CoLU with other activation functions, CoLU is observed to usually perform better on deeper neural networks.""" ; skos:prefLabel "CoLU" . :CoOp a skos:Concept ; dcterms:source ; skos:altLabel "Context Optimization" ; skos:definition "**CoOp**, or **Context Optimization**, is an automated prompt engineering method that avoids manual prompt tuning by modeling context words with continuous vectors that are end-to-end learned from data. The context could be shared among all classes or designed to be class-specific. During training, we simply minimize the prediction error using the cross-entropy loss with respect to the learnable context vectors, while keeping the pre-trained parameters fixed. The gradients can be back-propagated all the way through the text encoder, distilling the rich knowledge encoded in the parameters for learning task-relevant context." ; skos:prefLabel "CoOp" . :CoTPrompting a skos:Concept ; dcterms:source ; skos:altLabel "Chain-of-thought prompting" ; skos:definition "Chain-of-thought prompts contain a series of intermediate reasoning steps, and they are shown to significantly improve the ability of large language models to perform certain tasks that involve complex reasoning (e.g., arithmetic, commonsense reasoning, symbolic reasoning, etc.)" ; skos:prefLabel "CoT Prompting" .
:CoVA a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Context-aware Visual Attention-based (CoVA) webpage object detection pipeline" ; skos:definition """Context-Aware Visual Attention-based end-to-end pipeline for Webpage Object Detection (_CoVA_) aims to learn function _f_ to predict labels _y = [$y_1, y_2, ..., y_N$]_ for a webpage containing _N_ elements. The input to CoVA consists of:\r 1. a screenshot of a webpage,\r 2. list of bounding boxes _[x, y, w, h]_ of the web elements, and\r 3. neighborhood information for each element obtained from the DOM tree.\r \r This information is processed in four stages:\r 1. the graph representation extraction for the webpage,\r 2. the Representation Network (_RN_),\r 3. the Graph Attention Network (_GAT_), and\r 4. a fully connected (_FC_) layer.\r \r The graph representation extraction computes for every web element _i_ its set of _K_ neighboring web elements _$N_i$_. The _RN_ consists of a Convolutional Neural Net (_CNN_) and a positional encoder aimed to learn a visual representation _$v_i$_ for each web element _i ∈ {1, ..., N}_. The _GAT_ combines the visual representation _$v_i$_ of the web element _i_ to be classified and those of its neighbors, i.e., _$v_k$ ∀k ∈ $N_i$_ to compute the contextual representation _$c_i$_ for web element _i_. Finally, the visual and contextual representations of the web element are concatenated and passed through the _FC_ layer to obtain the classification output.""" ; skos:prefLabel "CoVA" . :CoVR a skos:Concept ; dcterms:source ; skos:altLabel "Composed Video Retrieval" ; skos:definition "The composed video retrieval (CoVR) task is a new task, where the goal is to find a video that matches both a query image and a query text. The query image represents a visual concept that the user is interested in, and the query text specifies how the concept should be modified or refined. 
For example, given an image of a fountain and the text _during show at night_, the CoVR task is to retrieve a video that shows the fountain at night with a show." ; skos:prefLabel "CoVR" . :CoVe a skos:Concept ; dcterms:source ; skos:altLabel "Contextual Word Vectors" ; skos:definition """**CoVe**, or **Contextualized Word Vectors**, uses a deep [LSTM](https://paperswithcode.com/method/lstm) encoder from an attentional sequence-to-sequence model trained for machine translation to contextualize word vectors. $\\text{CoVe}$ word embeddings are therefore a function of the entire input sequence. These word embeddings can then be used in downstream tasks by concatenating them with $\\text{GloVe}$ embeddings:\r \r $$ v = \\left[\\text{GloVe}\\left(x\\right), \\text{CoVe}\\left(x\\right)\\right]$$\r \r and then feeding these in as features for the task-specific models.""" ; skos:prefLabel "CoVe" . :CoaT a skos:Concept ; dcterms:source ; skos:altLabel "Co-Scale Conv-attentional Image Transformer" ; skos:definition "**Co-Scale Conv-Attentional Image Transformer** (CoaT) is a [Transformer](https://paperswithcode.com/method/transformer)-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other. Second, the conv-attentional mechanism is designed by realizing a relative position embedding formulation in the factorized attention module with an efficient [convolution](https://paperswithcode.com/method/convolution)-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities." ; skos:prefLabel "CoaT" . :CodeBERT a skos:Concept ; dcterms:source ; skos:definition "**CodeBERT** is a bimodal pre-trained model for programming language (PL) and natural language (NL). 
CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. CodeBERT is developed with a [Transformer](https://paperswithcode.com/method/transformer)-based neural architecture, and is trained with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables the utilization of both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators." ; skos:prefLabel "CodeBERT" . :CodeGen a skos:Concept ; dcterms:source ; skos:definition "**CodeGen** is an autoregressive transformers with next-token prediction language modeling as the learning objective trained on a natural language corpus and programming language data curated from GitHub." ; skos:prefLabel "CodeGen" . :CodeSLAM a skos:Concept ; dcterms:source ; skos:definition "CodeSLAM represents the 3D geometry of a scene using the latent space of a variational autoencoder. The depth thus becomes a function of the RGB image and the unknown code, $D = G_\\theta(I,c)$. During training time, the weights of the network $G_\\theta$ are learnt by training the generator and encoder using a standard autoencoding task. At test time the code $c$ and the pose of the images is found by optimizing the reprojection error over multiple images." ; skos:prefLabel "CodeSLAM" . :CodeT5 a skos:Concept ; dcterms:source ; skos:definition "**CodeT5** is a [Transformer](https://paperswithcode.com/methods/category/transformers)-based model for code understanding and generation based on the [T5 architecture](https://paperswithcode.com/method/t5). It utilizes an identifier-aware pre-training objective that considers the crucial token type information (identifiers) from code. 
Specifically, the denoising [Seq2Seq](https://paperswithcode.com/method/seq2seq) objective of T5 is extended with two identifier tagging and prediction tasks to enable the model to better leverage the token type information from programming languages, which are the identifiers assigned by developers. To improve the natural language-programming language alignment, a bimodal dual learning objective is used for a bidirectional conversion between natural language and programming language." ; skos:prefLabel "CodeT5" . :CollaborativeDistillation a skos:Concept ; dcterms:source ; skos:definition "**Collaborative Distillation** is a new knowledge distillation method for encoder-decoder-based neural style transfer that reduces the number of convolutional filters. The main idea is underpinned by a finding that the encoder-decoder pairs construct an exclusive collaborative relationship, which is regarded as a new kind of knowledge for style transfer models." ; skos:prefLabel "Collaborative Distillation" . :ColorJitter a skos:Concept ; skos:altLabel "Color Jitter" ; skos:definition """**ColorJitter** is a type of image data augmentation where we randomly change the brightness, contrast and saturation of an image.\r \r Image Credit: [Apache MXNet](https://mxnet.apache.org/versions/1.5.0/tutorials/gluon/data_augmentation.html)""" ; skos:prefLabel "ColorJitter" . :Colorization a skos:Concept ; dcterms:source ; skos:definition "**Colorization** is a self-supervision approach that relies on colorization as the pretext task in order to learn image representations." ; skos:prefLabel "Colorization" . :ColorizationTransformer a skos:Concept ; dcterms:source ; skos:definition """**Colorization Transformer** is a probabilistic [colorization](https://paperswithcode.com/method/colorization) model composed only of [axial self-attention blocks](https://paperswithcode.com/method/axial).
The main advantages of these blocks are the ability to capture a global receptive field with only two layers and $\\mathcal{O}(D\\sqrt{D})$ instead of $\\mathcal{O}(D^{2})$ complexity. In order to enable colorization of high-resolution grayscale images, the task is decomposed into three simpler sequential subtasks: coarse low resolution autoregressive colorization, parallel color and spatial super-resolution.\r \r For coarse low resolution colorization, a conditional variant of [Axial Transformer](https://paperswithcode.com/method/axial) is applied. The authors leverage the semi-parallel sampling mechanism of Axial Transformers. Finally, fast parallel deterministic upsampling models are employed to super-resolve the coarsely colorized image into the final high resolution output.""" ; skos:prefLabel "Colorization Transformer" . :ComiRec a skos:Concept ; dcterms:source ; skos:definition "**ComiRec** is a multi-interest framework for sequential recommendation. The multi-interest module captures multiple interests from user behavior sequences, which can be exploited for retrieving candidate items from the large-scale item pool. These items are then fed into an aggregation module to obtain the overall recommendation. The aggregation module leverages a controllable factor to balance the recommendation accuracy and diversity." ; skos:prefLabel "ComiRec" . :CompactGlobalDescriptor a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "A **Compact Global Descriptor** is an image model block for modelling interactions between positions across different dimensions (e.g., channels, frames). This descriptor enables subsequent convolutions to access the informative global features. It is a form of attention." ; skos:prefLabel "Compact Global Descriptor" . :ComplEx-N3 a skos:Concept ; dcterms:source ; skos:altLabel "ComplEx with N3 Regularizer" ; skos:definition "ComplEx model trained with a nuclear norm regularizer" ; skos:prefLabel "ComplEx-N3" .
:ComplEx-N3-RP a skos:Concept ; dcterms:source ; skos:altLabel "ComplEx with N3 Regularizer and Relation Prediction Objective" ; skos:definition "ComplEx model trained with a nuclear norm regularizer; A relation prediction objective is added on top of the commonly used 1vsAll objective." ; skos:prefLabel "ComplEx-N3-RP" . :CompositeFields a skos:Concept ; dcterms:source ; skos:definition "Represent and associate with a composite of primitive fields." ; skos:prefLabel "Composite Fields" . :CompressedMemory a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Compressed Memory** is a secondary FIFO memory component proposed as part of the [Compressive Transformer](https://paperswithcode.com/method/compressive-transformer) model. The Compressive [Transformer](https://paperswithcode.com/method/transformer) keeps a fine-grained memory of past activations, which are then compressed into coarser compressed memories. \r \r For choices of compression functions $f\\_{c}$ the authors consider (1) max/mean pooling, where the kernel and stride is set to the compression rate $c$; (2) 1D [convolution](https://paperswithcode.com/method/convolution) also with kernel & stride set to $c$; (3) dilated convolutions; (4) *most-used* where the memories are sorted by their average attention (usage) and the most-used are preserved.""" ; skos:prefLabel "Compressed Memory" . :CompressiveTransformer a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """The **Compressive Transformer** is an extension to the [Transformer](https://paperswithcode.com/method/transformer) which maps past hidden activations (memories) to a smaller set of compressed representations (compressed memories). The Compressive Transformer uses the same attention mechanism over its set of memories and compressed memories, learning to query both its short-term granular memory and longer-term coarse memory. 
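A toy sketch of this dual-memory bookkeeping in pure Python, on scalar "activations" (mean pooling at a hypothetical compression rate `c` is one of the compression functions described above; sizes are illustrative):

```python
# One step of dual-memory bookkeeping: new hidden states enter the
# short-term memory (FIFO); states evicted from it are mean-pooled at
# compression rate c into the compressed memory, whose own oldest
# entries are dropped in turn.
def update_memories(memory, comp_memory, new_states, mem_size, comp_size, c=2):
    memory = memory + list(new_states)
    overflow = max(0, len(memory) - mem_size)
    evicted, memory = memory[:overflow], memory[overflow:]
    compressed = [sum(evicted[i:i + c]) / len(evicted[i:i + c])
                  for i in range(0, len(evicted), c)]
    comp_memory = (comp_memory + compressed)[-comp_size:]
    return memory, comp_memory
```

In the actual model the entries are per-layer hidden-state vectors and the compression function can be learned, but the eviction and compression flow is the same.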
It builds on the ideas of [Transformer-XL](https://paperswithcode.com/method/transformer-xl), which maintains a memory of past activations at each layer to preserve a longer history of context. The Transformer-XL discards past activations when they become sufficiently old (controlled by the size of the memory). The key principle of the Compressive Transformer is to compress these old memories, instead of discarding them, and store them in an additional [compressed memory](https://paperswithcode.com/method/compressed-memory).\r \r At each time step $t$, we discard the oldest compressed memories (FIFO) and then the oldest $n$ states from ordinary memory are compressed and shifted to the new slot in compressed memory. During training, the compressive memory component is optimized separately from the main language model (separate training loop).""" ; skos:prefLabel "Compressive Transformer" . :ComputationRedistribution a skos:Concept ; dcterms:source ; skos:definition "**Computation Redistribution** is a [neural architecture search](https://paperswithcode.com/task/architecture-search) method for [face detection](https://paperswithcode.com/task/face-detection), which reallocates the computation between the backbone, neck and head of the model based on a predefined search methodology. Directly utilising the backbone of a classification network for scale-specific face detection can be sub-optimal. Therefore, [network structure search](https://paperswithcode.com/method/regnety) is used to reallocate the computation on the backbone, neck and head, under a wide range of FLOP regimes. The search method is applied to [RetinaNet](https://paperswithcode.com/method/retinanet), with [ResNet](https://paperswithcode.com/method/resnet) as backbone, [Path Aggregation Feature Pyramid Network](https://paperswithcode.com/method/pafpn) (PAFPN) as the neck and stacked 3 × 3 [convolutional layers](https://paperswithcode.com/method/convolution) for the head.
While the general structure is simple, the total number of possible networks in the search space is unwieldy. In the first step, the authors explore the reallocation of the computation within the backbone parts (i.e. stem, C2, C3, C4, and C5), while fixing the neck and head components. Based on the optimised backbone computation distribution found in this step, they further explore the reallocation of the computation across the backbone, neck and head." ; skos:prefLabel "Computation Redistribution" . :ConViT a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**ConViT** is a type of [vision transformer](https://paperswithcode.com/method/vision-transformer) that uses a gated positional self-attention module ([GPSA](https://paperswithcode.com/method/gpsa)), a form of positional self-attention which can be equipped with a “soft” convolutional inductive bias. The GPSA layers are initialized to mimic the locality of convolutional layers, then each attention head is given the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information." ; skos:prefLabel "ConViT" . :ConcatenatedSkipConnection a skos:Concept ; rdfs:seeAlso ; skos:definition "A **Concatenated Skip Connection** is a type of skip connection that seeks to reuse features by concatenating them to new layers, allowing more information to be retained from previous layers of the network. This contrasts with, say, residual connections, where element-wise summation is used instead to incorporate information from previous layers. This type of skip connection is prominently used in DenseNets (and also Inception networks), which the Figure to the right illustrates." ; skos:prefLabel "Concatenated Skip Connection" .
:ConcatenationAffinity a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Concatenation Affinity** is a type of affinity or self-similarity function between two points $\\mathbb{x\\_{i}}$ and $\\mathbb{x\\_{j}}$ that uses a concatenation function:\r \r $$ f\\left(\\mathbb{x\\_{i}}, \\mathbb{x\\_{j}}\\right) = \\text{ReLU}\\left(\\mathbb{w}^{T}\\_{f}\\left[\\theta\\left(\\mathbb{x}\\_{i}\\right), \\phi\\left(\\mathbb{x}\\_{j}\\right)\\right]\\right)$$\r \r Here $\\left[·, ·\\right]$ denotes concatenation and $\\mathbb{w}\\_{f}$ is a weight vector that projects the concatenated vector to a scalar.""" ; skos:prefLabel "Concatenation Affinity" . :ConcreteDropout a skos:Concept ; dcterms:source ; skos:definition "**Concrete Dropout** is a [dropout](https://paperswithcode.com/method/dropout) variant that replaces dropout's discrete Bernoulli masks with a continuous relaxation based on the Concrete distribution, allowing the dropout probability of each layer to be learned by gradient descent rather than tuned by grid search." ; skos:prefLabel "Concrete Dropout" . a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Concurrent Spatial and Channel Squeeze & Excitation (scSE)" ; skos:definition "Combines the channel attention of the widely known [spatial squeeze and channel excitation (SE)](https://paperswithcode.com/method/squeeze-and-excitation-block) block and the spatial attention of the [channel squeeze and spatial excitation (sSE)](https://paperswithcode.com/method/channel-squeeze-and-spatial-excitation#) block to build a spatial and channel attention mechanism for image segmentation tasks." ; skos:prefLabel "Concurrent Spatial and Channel Squeeze & Excitation" . :CondConv a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**CondConv**, or **Conditionally Parameterized Convolutions**, are a type of [convolution](https://paperswithcode.com/method/convolution) which learn specialized convolutional kernels for each example. In particular, we parameterize the convolutional kernels in a CondConv layer as a linear combination of $n$ experts $(\\alpha_1 W_1 + \\ldots + \\alpha_n W_n) * x$, where $\\alpha_1, \\ldots, \\alpha_n$ are functions of the input learned through gradient descent.
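A toy 1D sketch of this in pure Python (the sigmoid routing from the globally averaged input mirrors the example-dependent mixing weights, though the shapes and the single-weight routing function here are simplified assumptions):

```python
import math

# Toy conditionally parameterized 1D convolution (valid padding, one channel).
# experts: list of n kernel vectors W_1..W_n; routing_w: n scalar weights of a
# hypothetical linear routing function applied to the globally averaged input.
def condconv1d(x, experts, routing_w):
    gap = sum(x) / len(x)  # global average pooling of the input example
    # Example-dependent mixing weights alpha_i = sigmoid(routing_w_i * GAP(x)).
    alphas = [1.0 / (1.0 + math.exp(-w * gap)) for w in routing_w]
    k = len(experts[0])
    # Combine the expert kernels once per example: alpha_1 W_1 + ... + alpha_n W_n.
    combined = [sum(a * W[j] for a, W in zip(alphas, experts)) for j in range(k)]
    # Then run a single convolution with the combined kernel.
    return [sum(combined[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]
```

The point of the formulation is visible here: the experts are mixed once per example, and only one convolution is executed.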
To efficiently increase the capacity of a CondConv layer, developers can increase the number of experts. This can be more computationally efficient than increasing the size of the convolutional kernel itself, because the convolutional kernel is applied at many different positions within the input, while the experts are combined only once per input." ; skos:prefLabel "CondConv" . :CondInst a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Conditional Convolutions for Instance Segmentation" ; skos:definition "CondInst is a simple yet effective instance segmentation framework. It eliminates ROI cropping and feature alignment with the instance-aware mask heads. As a result, CondInst can solve instance segmentation with fully convolutional networks. CondInst is able to produce high-resolution instance masks without longer computational time. Extensive experiments show that CondInst can achieve even better performance and inference speed than [Mask R-CNN](https://paperswithcode.com/method/mask-r-cnn). It can be a strong alternative to previous ROI-based instance segmentation methods. Code is at https://github.com/aim-uofa/AdelaiDet." ; skos:prefLabel "CondInst" . :ConditionalBatchNormalization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Conditional Batch Normalization (CBN)** is a class-conditional variant of [batch normalization](https://paperswithcode.com/method/batch-normalization). The key idea is to predict the $\\gamma$ and $\\beta$ of the batch normalization from an embedding - e.g. a language embedding in VQA. CBN enables the linguistic embedding to manipulate entire feature maps by scaling them up or down, negating them, or shutting them off. 
CBN has also been used in [GANs](https://paperswithcode.com/methods/category/generative-adversarial-networks) to allow class information to affect the batch normalization parameters.\r \r Consider a single convolutional layer with batch normalization module $\\text{BN}\\left(F\\_{i,c,h,w}|\\gamma\\_{c}, \\beta\\_{c}\\right)$ for which pretrained scalars $\\gamma\\_{c}$ and $\\beta\\_{c}$ are available. We would like to directly predict these affine scaling parameters from, e.g., a language embedding $\\mathbf{e\\_{q}}$. When starting the training procedure, these parameters must be close to the pretrained values to recover the original [ResNet](https://paperswithcode.com/method/resnet) model as a poor initialization could significantly deteriorate performance. Unfortunately, it is difficult to initialize a network to output the pretrained $\\gamma$ and $\\beta$. For these reasons, the authors propose to predict a change $\\delta\\beta\\_{c}$ and $\\delta\\gamma\\_{c}$ on the frozen original scalars, for which it is straightforward to initialize a neural network to produce an output with zero-mean and small variance.\r \r The authors use a one-hidden-layer MLP to predict these deltas from a question embedding $\\mathbf{e\\_{q}}$ for all feature maps within the layer:\r \r $$\\Delta\\beta = \\text{MLP}\\left(\\mathbf{e\\_{q}}\\right)$$\r \r $$\\Delta\\gamma = \\text{MLP}\\left(\\mathbf{e\\_{q}}\\right)$$\r \r So, given a feature map with $C$ channels, these MLPs output a vector of size $C$. We then add these predictions to the $\\beta$ and $\\gamma$ parameters:\r \r $$ \\hat{\\beta}\\_{c} = \\beta\\_{c} + \\Delta\\beta\\_{c} $$\r \r $$ \\hat{\\gamma}\\_{c} = \\gamma\\_{c} + \\Delta\\gamma\\_{c} $$\r \r Finally, these updated $\\hat{β}$ and $\\hat{\\gamma}$ are used as parameters for the batch normalization: $\\text{BN}\\left(F\\_{i,c,h,w}|\\hat{\\gamma\\_{c}}, \\hat{\\beta\\_{c}}\\right)$. 
The authors freeze all ResNet parameters, including $\\gamma$ and $\\beta$, during training. A ResNet consists of\r four stages of computation, each subdivided into several residual blocks. In each block, the authors apply CBN to the three convolutional layers.""" ; skos:prefLabel "Conditional Batch Normalization" . :ConditionalDBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Conditional DBlock** is a residual-based block used in the discriminator of the [GAN-TTS](https://paperswithcode.com/method/gan-tts) architecture. They are similar to the [GBlocks](https://paperswithcode.com/method/gblock) used in the generator, but without [batch normalization](https://paperswithcode.com/method/batch-normalization). Unlike the [DBlock](https://paperswithcode.com/method/dblock), the Conditional DBlock adds the embedding of the linguistic features after the first [convolution](https://paperswithcode.com/method/convolution)." ; skos:prefLabel "Conditional DBlock" . :ConditionalInstanceNormalization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Conditional Instance Normalization** is a normalization technique where all convolutional weights of a style transfer network are shared across many styles. The goal of the procedure is to transform\r a layer’s activations $x$ into a normalized activation $z$ specific to painting style $s$. Building off\r [instance normalization](https://paperswithcode.com/method/instance-normalization), we augment the $\\gamma$ and $\\beta$ parameters so that they’re $N \\times C$ matrices, where $N$ is the number of styles being modeled and $C$ is the number of output feature maps. 
Conditioning on a style is achieved as follows:\r \r $$ z = \\gamma\\_{s}\\left(\\frac{x - \\mu}{\\sigma}\\right) + \\beta\\_{s}$$\r \r where $\\mu$ and $\\sigma$ are $x$’s mean and standard deviation taken across spatial axes and $\\gamma\\_{s}$ and $\\beta\\_{s}$ are obtained by selecting the row corresponding to $s$ in the $\\gamma$ and $\\beta$ matrices. One added benefit of this approach is that one can stylize a single image into $N$ painting styles with a single feed forward pass of the network with a batch size of $N$.""" ; skos:prefLabel "Conditional Instance Normalization" . :ConditionalPositionalEncoding a skos:Concept ; dcterms:source ; skos:definition """**Conditional Positional Encoding**, or **CPE**, is a type of positional encoding for [vision transformers](https://paperswithcode.com/methods/category/vision-transformer). Unlike previous fixed or learnable positional encodings, which are predefined and independent of input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE aims to generalize to input sequences that are longer than what the model has ever seen during training. CPE can also keep the desired translation-invariance in the image classification task. CPE can be implemented with a [Position\r Encoding Generator](https://paperswithcode.com/method/positional-encoding-generator) (PEG) and incorporated into the current [Transformer framework](https://paperswithcode.com/methods/category/transformers).""" ; skos:prefLabel "Conditional Positional Encoding" . :Conffusion a skos:Concept ; dcterms:source ; skos:altLabel "Confidence Intervals for Diffusion Models" ; skos:definition "Given a corrupted input image, Con\\textit{ffusion} repurposes a pretrained diffusion model to generate lower and upper bounds around each reconstructed pixel. The true pixel value is guaranteed to fall within these bounds with probability $p$." ; skos:prefLabel "Conffusion" . 
:Content-ConditionedStyleEncoder a skos:Concept ; dcterms:source ; skos:definition """The **Content-Conditioned Style Encoder**, or **COCO**, is a style encoder used for image-to-image translation in the [COCO-FUNIT](https://paperswithcode.com/method/coco-funit#) architecture. Unlike the style encoder in [FUNIT](https://arxiv.org/abs/1905.01723), COCO takes both content and style images as input. With this content conditioning scheme, we create a direct feedback path during learning to let the content image influence how the style code is computed. It also helps reduce the direct influence of the style image on the extracted style code.\r \r The bottom part of the Figure details the architecture. First, the content image is fed into an encoder $E\\_{S, C}$ to compute a spatial feature map. This content feature map is then mean-pooled and mapped to a vector $\\zeta\\_{c}$. Similarly, the style image is fed into encoder $E\\_{S, S}$ to compute a spatial feature map. The style feature map is then mean-pooled and concatenated with an input-independent bias vector: the constant style bias (CSB). Note that while the regular bias in deep networks is added to the activations, in CSB, the bias is concatenated with the activations. The CSB provides a fixed input to the style encoder, which helps compute a style code that is less sensitive to the variations in the style image.\r \r The concatenation of the style vector and the CSB is mapped to a vector $\\zeta\\_{s}$ via a fully connected layer. We then perform an element-wise product operation on $\\zeta\\_{c}$ and $\\zeta\\_{s}$, which yields the final style code. The style code is then mapped to produce the [AdaIN](https://paperswithcode.com/method/adaptive-instance-normalization) parameters for generating the translation. Through this element-wise product operation, the resulting style code is heavily influenced by the content image. 
One way to look at this mechanism is that it produces a customized style code for the input content image.\r \r The COCO is used as a drop-in replacement for the style encoder in FUNIT. Let $\\phi$ denote the COCO mapping. The translation output is then computed via\r \r $$\r z\\_{c}=E\\_{c}\\left(x\\_{c}\\right), z\\_{s}=\\phi\\left(E\\_{s, s}\\left(x\\_{s}\\right), E\\_{s, c}\\left(x\\_{c}\\right)\\right), \\overline{\\mathbf{x}}=F\\left(z\\_{c}, z\\_{s}\\right)\r $$\r \r The style code extracted by the COCO is more robust to variations in the style image. Note that we set $E\\_{S, C} \\equiv E\\_{C}$ to keep the number of parameters in our model similar to that in FUNIT.""" ; skos:prefLabel "Content-Conditioned Style Encoder" . :Content-basedAttention a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Content-based attention** is an attention mechanism based on cosine similarity:\r \r $$f\\_{att}\\left(\\textbf{h}\\_{i}, \\textbf{s}\\_{j}\\right) = \\cos\\left[\\textbf{h}\\_{i};\\textbf{s}\\_{j}\\right] $$\r \r It was utilised in [Neural Turing Machines](https://paperswithcode.com/method/neural-turing-machine) as part of the Addressing Mechanism.\r \r We produce a normalized attention weighting by taking a [softmax](https://paperswithcode.com/method/softmax) over these attention alignment scores.""" ; skos:prefLabel "Content-based Attention" . :ContextEnhancementModule a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Context Enhancement Module (CEM)** is a feature extraction module used in object detection (specifically, [ThunderNet](https://paperswithcode.com/method/thundernet)) which aims to enlarge the receptive field. The key idea of CEM is to aggregate multi-scale local context information and global context information to generate more discriminative features. In CEM, the feature maps from three scales are merged: $C\\_{4}$, $C\\_{5}$ and $C\\_{glb}$. 
$C\\_{glb}$ is the global context feature vector by applying a [global average pooling](https://paperswithcode.com/method/global-average-pooling) on $C\\_{5}$. We then apply a 1 × 1 [convolution](https://paperswithcode.com/method/convolution) on each feature map to squeeze the number of channels to $\\alpha \\times p \\times p = 245$.\r \r Afterwards, $C\\_{5}$ is upsampled by 2× and $C\\_{glb}$ is broadcast so that the spatial dimensions of the three feature maps are\r equal. At last, the three generated feature maps are aggregated. By leveraging both local and global context, CEM effectively enlarges the receptive field and refines the representation ability of the thin feature map. Compared with prior [FPN](https://paperswithcode.com/method/fpn) structures, CEM involves only two 1×1 convolutions and a fc layer.""" ; skos:prefLabel "Context Enhancement Module" . :ContextualResidualAggregation a skos:Concept ; dcterms:source ; skos:definition "**Contextual Residual Aggregation**, or **CRA**, is a module for image inpainting. It can produce high-frequency residuals for missing contents by weighted aggregating residuals from contextual patches, thus only requiring a low-resolution prediction from the network. Specifically, it involves a neural network to predict a low-resolution inpainted result and up-sample it to yield a large blurry image. Then we produce the high-frequency residuals for in-hole patches by aggregating weighted high-frequency residuals from contextual patches. Finally, we add the aggregated residuals to the large blurry image to obtain a sharp result." ; skos:prefLabel "Contextual Residual Aggregation" . :ContextualizedTopicModels a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """Contextualized Topic Models are based on the Neural-ProdLDA variational autoencoding approach by Srivastava and Sutton (2017). 
\r \r This approach trains an encoding neural network to map pre-trained contextualized word embeddings (e.g., [BERT](https://paperswithcode.com/method/bert)) to latent representations. Those latent representations are sampled variationally from a Gaussian distribution $N(\\mu, \\sigma^2)$ and passed to a decoder network that has to reconstruct the document bag-of-word representation.""" ; skos:prefLabel "Contextualized Topic Models" . :ContractiveAutoencoder a skos:Concept ; skos:definition "A **Contractive Autoencoder** is an autoencoder that adds a penalty term to the classical reconstruction cost function. This penalty term corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. This penalty term results in a localized space contraction which in turn yields robust features on the activation layer. The penalty helps to carve a representation that better captures the local directions of variation dictated by the data, corresponding to a lower-dimensional non-linear manifold, while being more invariant to the vast majority of directions orthogonal to the manifold." ; skos:prefLabel "Contractive Autoencoder" . :ContrastiveLearning a skos:Concept ; skos:definition "" ; skos:prefLabel "Contrastive Learning" . :ContrastiveMultiviewCoding a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Contrastive Multiview Coding (CMC)** is a self-supervised learning approach, based on [CPC](https://paperswithcode.com/method/contrastive-predictive-coding), that learns representations that capture information shared between multiple sensory views. The core idea is to set an anchor view, then sample positive and negative data points from the other view, and maximise agreement between positive pairs when learning from two views. Contrastive learning is used to build the embedding." ; skos:prefLabel "Contrastive Multiview Coding" . 
:ContrastivePredictiveCoding a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Contrastive Predictive Coding (CPC)** learns self-supervised representations by predicting the future in latent space by using powerful autoregressive models. The model uses a probabilistic contrastive loss which induces the latent space to capture information that is maximally useful\r to predict future samples.\r \r First, a non-linear encoder $g\\_{enc}$ maps the input sequence of observations $x\\_{t}$ to a sequence of latent representations $z\\_{t} = g\\_{enc}\\left(x\\_{t}\\right)$, potentially with a lower temporal resolution. Next, an autoregressive model $g\\_{ar}$ summarizes all $z\\_{\\leq t}$ in the latent space and produces a context latent representation $c\\_{t} = g\\_{ar}\\left(z\\_{\\leq t}\\right)$.\r \r A density ratio is modelled which preserves the mutual information between $x\\_{t+k}$ and $c\\_{t}$ as follows:\r \r $$ f\\_{k}\\left(x\\_{t+k}, c\\_{t}\\right) \\propto \\frac{p\\left(x\\_{t+k}|c\\_{t}\\right)}{p\\left(x\\_{t+k}\\right)} $$\r \r where $\\propto$ stands for ’proportional to’ (i.e. up to a multiplicative constant). Note that the density ratio $f$ can be unnormalized (does not have to integrate to 1). The authors use a simple log-bilinear model:\r \r $$ f\\_{k}\\left(x\\_{t+k}, c\\_{t}\\right) = \\exp\\left(z^{T}\\_{t+k}W\\_{k}c\\_{t}\\right) $$\r \r Any type of encoder and autoregressive model can be used. For example, the authors opt for strided convolutional layers with residual blocks and GRUs.\r \r The encoder and autoregressive model are trained to minimize an [InfoNCE](https://paperswithcode.com/method/infonce) loss (see components).""" ; skos:prefLabel "Contrastive Predictive Coding" . 
:ControlVAE a skos:Concept ; dcterms:source ; skos:definition "**ControlVAE** is a [variational autoencoder](https://paperswithcode.com/method/vae) (VAE) framework that combines the automatic control theory with the basic VAE to stabilize the KL-divergence of VAE models to a specified value. It leverages a non-linear PI controller, a variant of the proportional-integral-derivative (PID) control, to dynamically tune the weight of the KL-divergence term in the evidence lower bound (ELBO) using the output KL-divergence as feedback. This allows for control of the KL-divergence to a desired value (set point), which is effective in avoiding posterior collapse and learning disentangled representations." ; skos:prefLabel "ControlVAE" . :ConvBERT a skos:Concept ; dcterms:source ; skos:definition "**ConvBERT** is a modification on the [BERT](https://paperswithcode.com/method/bert) architecture which uses a [span-based dynamic convolution](https://paperswithcode.com/method/span-based-dynamic-convolution) to replace self-attention heads to directly model local dependencies. Specifically a new [mixed attention module](https://paperswithcode.com/method/mixed-attention-block) replaces the [self-attention modules](https://paperswithcode.com/method/scaled) in BERT, which leverages the advantages of [convolution](https://paperswithcode.com/method/convolution) to better capture local dependency. Additionally, a new span-based dynamic convolution operation is used to utilize multiple input tokens to dynamically generate the convolution kernel. Lastly, ConvBERT also incorporates some new model designs including the bottleneck attention and grouped linear operator for the feed-forward module (reducing the number of parameters)." ; skos:prefLabel "ConvBERT" . 
:ConvLSTM a skos:Concept ; dcterms:source ; skos:definition """**ConvLSTM** is a type of recurrent neural network for spatio-temporal prediction that has convolutional structures in both the input-to-state and state-to-state transitions. The ConvLSTM determines the future state of a certain cell in the grid by the inputs and past states of its local neighbors. This can easily be achieved by using a [convolution](https://paperswithcode.com/method/convolution) operator in the state-to-state and input-to-state transitions (see Figure). The key equations of ConvLSTM are shown below, where $∗$ denotes the convolution operator and $\\odot$ the Hadamard product:\r \r $$ i\\_{t} = \\sigma\\left(W\\_{xi} ∗ X\\_{t} + W\\_{hi} ∗ \\mathcal{H}\\_{t−1} + W\\_{ci} \\odot \\mathcal{C}\\_{t−1} + b\\_{i}\\right) $$\r \r $$ f\\_{t} = \\sigma\\left(W\\_{xf} ∗ X\\_{t} + W\\_{hf} ∗ \\mathcal{H}\\_{t−1} + W\\_{cf} \\odot \\mathcal{C}\\_{t−1} + b\\_{f}\\right) $$\r \r $$ \\mathcal{C}\\_{t} = f\\_{t} \\odot \\mathcal{C}\\_{t−1} + i\\_{t} \\odot \\text{tanh}\\left(W\\_{xc} ∗ X\\_{t} + W\\_{hc} ∗ \\mathcal{H}\\_{t−1} + b\\_{c}\\right) $$\r \r $$ o\\_{t} = \\sigma\\left(W\\_{xo} ∗ X\\_{t} + W\\_{ho} ∗ \\mathcal{H}\\_{t−1} + W\\_{co} \\odot \\mathcal{C}\\_{t} + b\\_{o}\\right) $$\r \r $$ \\mathcal{H}\\_{t} = o\\_{t} \\odot \\text{tanh}\\left(\\mathcal{C}\\_{t}\\right) $$\r \r If we view the states as the hidden representations of moving objects, a ConvLSTM with a larger transitional kernel should be able to capture faster motions while one with a smaller kernel can capture slower motions. \r \r To ensure that the states have the same number of rows and same number of columns as the inputs, padding is needed before applying the convolution operation. Here, padding of the hidden states on the boundary points can be viewed as using the state of the outside world for calculation. 
Usually, before the first input comes, we initialize all the states of the [LSTM](https://paperswithcode.com/method/lstm) to zero which corresponds to "total ignorance" of the future.""" ; skos:prefLabel "ConvLSTM" . :ConvMLP a skos:Concept ; dcterms:source ; skos:definition "**ConvMLP** is a hierarchical convolutional MLP for visual recognition, which consists of a stage-wise, co-design of [convolution](https://paperswithcode.com/method/convolution) layers, and MLPs. The Conv Stage consists of $C$ convolutional blocks with $1\\times 1$ and $3\\times 3$ kernel sizes. It is repeated $M$ times before a down convolution is utilized to express a level $L$. The MLP-Conv Stage consists of Channelwise MLPs, with skip layers, and a [depthwise convolution](https://paperswithcode.com/method/depthwise-convolution). This is repeated $M$ times before a down convolution is utilized to express a level $\\mathcal{L}$." ; skos:prefLabel "ConvMLP" . :ConvNeXt a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "ConvNeXt" . :ConvTasNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Convolutional time-domain audio separation network" ; skos:definition "Combines learned time-frequency representation with a masker architecture based on 1D [dilated convolution](https://paperswithcode.com/method/dilated-convolution)." ; skos:prefLabel "ConvTasNet" . 
:Convolution a skos:Concept ; skos:definition """A **convolution** is a type of matrix operation, consisting of a kernel, a small matrix of weights, that slides over input data performing element-wise multiplication with the part of the input it is on, then summing the results into an output.\r \r Intuitively, a convolution allows for weight sharing (reducing the number of effective parameters) and translation invariance (allowing the same feature to be detected in different parts of the input space).\r \r Image Source: [https://arxiv.org/pdf/1603.07285.pdf](https://arxiv.org/pdf/1603.07285.pdf)""" ; skos:prefLabel "Convolution" . :CoordConv a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **CoordConv** layer is a simple extension to the standard convolutional layer. It has the same functional signature as a convolutional layer, but accomplishes the mapping by first concatenating extra channels to the incoming representation. These channels contain hard-coded coordinates, the most basic version of which is one channel for the $i$ coordinate and one for the $j$ coordinate.\r \r The CoordConv layer keeps the properties of few parameters and efficient computation from convolutions, but allows the network to learn to keep or to discard translation invariance as is needed for the task being learned. This is useful for coordinate transform based tasks where regular convolutions can fail.""" ; skos:prefLabel "CoordConv" . :Coordinateattention a skos:Concept ; dcterms:source ; skos:definition """Hou et al. proposed coordinate attention,\r a novel attention mechanism which\r embeds positional information into channel attention,\r so that the network can focus on large important regions \r at little computational cost.\r \r The coordinate attention mechanism has two consecutive steps, coordinate information embedding and coordinate attention generation. First, two spatial extents of pooling kernels encode each channel horizontally and vertically. 
In the second step, a shared $1\\times 1$ convolutional transformation function is applied to the concatenated outputs of the two pooling layers. Then coordinate attention splits the resulting tensor into two separate tensors to yield attention vectors with the same number of channels for the horizontal and vertical coordinates of the input $X$. This can be written as \r \\begin{align}\r z^h &= \\text{GAP}^h(X) \r \\end{align}\r \\begin{align}\r z^w &= \\text{GAP}^w(X)\r \\end{align}\r \\begin{align}\r f &= \\delta(\\text{BN}(\\text{Conv}_1^{1\\times 1}([z^h;z^w])))\r \\end{align}\r \\begin{align}\r f^h, f^w &= \\text{Split}(f)\r \\end{align}\r \\begin{align}\r s^h &= \\sigma(\\text{Conv}_h^{1\\times 1}(f^h))\r \\end{align}\r \\begin{align}\r s^w &= \\sigma(\\text{Conv}_w^{1\\times 1}(f^w))\r \\end{align}\r \\begin{align}\r Y &= X s^h s^w\r \\end{align}\r where $\\text{GAP}^h$ and $\\text{GAP}^w$ denote pooling functions for vertical and horizontal coordinates, and $s^h \\in \\mathbb{R}^{C\\times 1\\times W}$ and $s^w \\in \\mathbb{R}^{C\\times H\\times 1}$ represent corresponding attention weights. \r \r Using coordinate attention, the network can accurately obtain the position of a targeted object.\r This approach has a larger receptive field than BAM and CBAM.\r Like an SE block, it also models cross-channel relationships, effectively enhancing the expressive power of the learned features.\r Due to its lightweight design and flexibility, \r it can be easily used in classical building blocks of mobile networks.""" ; skos:prefLabel "Coordinate attention" . :Copy-Paste a skos:Concept ; dcterms:source ; skos:altLabel "simple Copy-Paste" ; skos:definition "" ; skos:prefLabel "Copy-Paste" . :Coresets a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "Coresets" . 
:CornerNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**CornerNet** is an object detection model that detects an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single [convolution](https://paperswithcode.com/method/convolution)al neural network. By detecting objects as paired keypoints, we eliminate the need for designing a set of anchor boxes commonly used in prior single-stage detectors. It also utilises [corner pooling](https://paperswithcode.com/method/corner-pooling), a new type of pooling layer that helps the network better localize corners." ; skos:prefLabel "CornerNet" . :CornerNet-Saccade a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**CornerNet-Saccade** is an extension of [CornerNet](https://paperswithcode.com/method/cornernet) with an attention mechanism similar to saccades in human vision. It starts with a downsized full image and generates an attention map, which is then zoomed in on and processed further by the model. This differs from the original CornerNet in that it is applied fully convolutionally across multiple scales." ; skos:prefLabel "CornerNet-Saccade" . :CornerNet-Squeeze a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**CornerNet-Squeeze** is an object detector that extends [CornerNet](https://paperswithcode.com/method/cornernet) with a new compact hourglass architecture that makes use of fire modules with depthwise separable convolutions." ; skos:prefLabel "CornerNet-Squeeze" . :CornerNet-SqueezeHourglass a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**CornerNet-Squeeze Hourglass** is a convolutional neural network and object detection backbone used in the [CornerNet-Squeeze](https://paperswithcode.com/method/cornernet-squeeze) object detector. 
It uses a modified [hourglass module](https://paperswithcode.com/method/hourglass-module) that makes use of a [fire module](https://paperswithcode.com/method/fire-module): containing 1x1 convolutions and depthwise convolutions." ; skos:prefLabel "CornerNet-Squeeze Hourglass" . :CornerNet-SqueezeHourglassModule a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**CornerNet-Squeeze Hourglass Module** is an image model block used in [CornerNet](https://paperswithcode.com/method/cornernet)-Lite that is based on an [hourglass module](https://paperswithcode.com/method/hourglass-module), but uses modified fire modules instead of residual blocks. Other than replacing the residual blocks, further modifications include: reducing the maximum feature map resolution of the hourglass modules by adding one more downsampling layer before the hourglass modules, removing one downsampling layer in each hourglass module, replacing the 3 × 3 filters with 1 x 1 filters in the prediction modules of CornerNet, and finally replacing the nearest neighbor upsampling in the hourglass network with transpose [convolution](https://paperswithcode.com/method/convolution) with a 4 × 4 kernel." ; skos:prefLabel "CornerNet-Squeeze Hourglass Module" . :CornerPooling a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Corner Pooling** is a pooling technique for object detection that seeks to better localize corners by encoding explicit prior knowledge. Suppose we want to determine if a pixel at location $\\left(i, j\\right)$ is a top-left corner. Let $f\\_{t}$ and $f\\_{l}$ be the feature maps that are the inputs to the top-left corner pooling layer, and let $f\\_{t\\_{ij}}$ and $f\\_{l\\_{ij}}$ be the vectors at location $\\left(i, j\\right)$ in $f\\_{t}$ and $f\\_{l}$ respectively. 
With $H \\times W$ feature maps, the corner pooling layer first max-pools all feature vectors between $\\left(i, j\\right)$ and $\\left(i, H\\right)$ in $f\\_{t}$ to a feature vector $t\\_{ij}$ , and max-pools all feature vectors between $\\left(i, j\\right)$ and $\\left(W, j\\right)$ in $f\\_{l}$ to a feature vector $l\\_{ij}$. Finally, it adds $t\\_{ij}$ and $l\\_{ij}$ together." ; skos:prefLabel "Corner Pooling" . :CosLU a skos:Concept ; dcterms:source ; skos:altLabel "Cosine Linear Unit" ; skos:definition """The **Cosine Linear Unit**, or **CosLU**, is a type of activation function that has trainable parameters and uses the cosine function.\r \r $$CosLU(x) = (x + \\alpha \\cos(\\beta x))\\sigma(x)$$""" ; skos:prefLabel "CosLU" . :CosineAnnealing a skos:Concept ; dcterms:source ; skos:definition """**Cosine Annealing** is a type of learning rate schedule that has the effect of starting with a large learning rate that is relatively rapidly decreased to a minimum value before being increased rapidly again. 
The resetting of the learning rate acts like a simulated restart of the learning process, and the re-use of good weights as the starting point of the restart is referred to as a "warm restart", in contrast to a "cold restart", where a new set of small random numbers may be used as a starting point.\r \r $$\\eta\\_{t} = \\eta\\_{min}^{i} + \\frac{1}{2}\\left(\\eta\\_{max}^{i}-\\eta\\_{min}^{i}\\right)\\left(1+\\cos\\left(\\frac{T\\_{cur}}{T\\_{i}}\\pi\\right)\\right)\r $$\r \r where $\\eta\\_{min}^{i}$ and $\\eta\\_{max}^{i}$ are ranges for the learning rate, and $T\\_{cur}$ accounts for how many epochs have been performed since the last restart.\r \r Text Source: [Jason Brownlee](https://machinelearningmastery.com/snapshot-ensemble-deep-learning-neural-network/)\r \r Image Source: [Gao Huang](https://www.researchgate.net/figure/Training-loss-of-100-layer-DenseNet-on-CIFAR10-using-standard-learning-rate-blue-and-M_fig2_315765130)""" ; skos:prefLabel "Cosine Annealing" . :CosineNormalization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """Multi-layer neural networks traditionally use dot products between the output vector of the previous layer and the incoming weight vector as the input to the activation function. The result of the dot product is unbounded. To bound the dot product and decrease the variance, **Cosine Normalization** uses cosine similarity or centered cosine similarity (the Pearson correlation coefficient) instead of dot products in neural networks. \r \r Using cosine normalization, the output of a hidden unit is computed by:\r \r $$o = f(net_{norm})= f(\\cos \\theta) = f(\\frac{\\vec{w} \\cdot \\vec{x}} {\\left|\\vec{w}\\right| \\left|\\vec{x}\\right|})$$\r \r where $net_{norm}$ is the normalized pre-activation, $\\vec{w}$ is the incoming weight vector and $\\vec{x}$ is the input vector, ($\\cdot$) indicates the dot product, and $f$ is a nonlinear activation function. 
Cosine normalization bounds the pre-activation between -1 and 1.""" ; skos:prefLabel "Cosine Normalization" . :CosinePowerAnnealing a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "Interpolation between [exponential decay](https://paperswithcode.com/method/exponential-decay) and [cosine annealing](https://paperswithcode.com/method/cosine-annealing)." ; skos:prefLabel "Cosine Power Annealing" . :Counterfactuals a skos:Concept ; dcterms:source ; skos:altLabel "Counterfactual Explanations" ; skos:definition "" ; skos:prefLabel "Counterfactuals" . :Cross-AttentionModule a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "The **Cross-Attention** module is an attention module used in [CrossViT](https://paperswithcode.com/method/crossvit) for fusion of multi-scale features. The CLS token of the large branch serves as a query token to interact with the patch tokens from the small branch through attention. $f\\left(·\\right)$ and $g\\left(·\\right)$ are projections to align dimensions. The small branch follows the same procedure but swaps CLS and patch tokens from the other branch." ; skos:prefLabel "Cross-Attention Module" . 
:Cross-CovarianceAttention a skos:Concept ; dcterms:source ; skos:definition """**Cross-Covariance Attention**, or **XCA**, is an [attention mechanism](https://paperswithcode.com/methods/category/attention-mechanisms-1) which operates along the feature dimension instead of the token dimension as in [conventional transformers](https://paperswithcode.com/methods/category/transformers).\r \r Using the definitions of queries, keys and values from conventional attention, the cross-covariance attention function is defined as:\r \r $$\r \\text { XC-Attention }(Q, K, V)=V \\mathcal{A}_{\\mathrm{XC}}(K, Q), \\quad \\mathcal{A}\\_{\\mathrm{XC}}(K, Q)=\\operatorname{Softmax}\\left(\\hat{K}^{\\top} \\hat{Q} / \\tau\\right)\r $$\r \r where each output token embedding is a convex combination of the $d\\_{v}$ features of its corresponding token embedding in $V$. The attention weights $\\mathcal{A}$ are computed based on the cross-covariance matrix.""" ; skos:prefLabel "Cross-Covariance Attention" . :Cross-ScaleNon-LocalAttention a skos:Concept ; dcterms:source ; skos:definition "**Cross-Scale Non-Local Attention**, or **CS-NL**, is a non-local attention module for image super-resolution deep networks. It learns to mine long-range dependencies between LR features and larger-scale HR patches within the same feature map. Specifically, suppose we are conducting an s-scale super-resolution with the module: given a feature map $X$ of spatial size $(W, H)$, we first bilinearly downsample it to $Y$ with scale $s$, and match the $p\\times p$ patches in $X$ with the downsampled $p \\times p$ candidates in $Y$ to obtain the [softmax](https://paperswithcode.com/method/softmax) matching score. Finally, we conduct deconvolution on the score by weighted addition of the patches of size $\\left(sp, sp\\right)$ extracted from $X$. The obtained $Z$ of size $(sW, sH)$ is $s$ times super-resolved relative to $X$." ; skos:prefLabel "Cross-Scale Non-Local Attention" . 
:Cross-ViewTraining a skos:Concept ; dcterms:source ; skos:definition """**Cross View Training**, or **CVT**, is a semi-supervised algorithm for training distributed word representations that makes use of unlabeled and labeled examples. \r \r CVT adds $k$ auxiliary prediction modules to the model, a Bi-[LSTM](https://paperswithcode.com/method/lstm) encoder, which are used when learning on unlabeled examples. A prediction module is usually a small neural network (e.g., a hidden layer followed by a [softmax](https://paperswithcode.com/method/softmax) layer). Each one takes as input an intermediate representation $h^j(x_i)$ produced by the model (e.g., the outputs of one of the LSTMs in a Bi-LSTM model). It outputs a distribution over labels $p\\_{j}^{\\theta}\\left(y\\mid{x\\_{i}}\\right)$.\r \r Each $h^j$ is chosen such that it only uses a part of the input $x_i$; the particular choice can depend on the task and model architecture. The auxiliary prediction modules are only used during training; the test-time predictions come from the primary prediction module that produces $p_\\theta$.""" ; skos:prefLabel "Cross-View Training" . :Cross-encoderReranking a skos:Concept ; dcterms:source ; skos:definition "Cross-encoder Reranking" ; skos:prefLabel "Cross-encoder Reranking" . :Cross-resolutionfeatures a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "Cross-resolution features" . :CrossTransformers a skos:Concept ; dcterms:source ; skos:definition "CrossTransformers is a Transformer-based neural network architecture which can take a small number of labeled images and an unlabeled query, find coarse spatial correspondence between the query and the labeled images, and then infer class membership by computing distances between spatially-corresponding features." ; skos:prefLabel "CrossTransformers" . 
:CrossViT a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**CrossViT** is a type of [vision transformer](https://paperswithcode.com/method/vision-transformer) that uses a dual-branch architecture to extract multi-scale feature representations for image classification. The architecture combines image patches (i.e. tokens in a [transformer](https://paperswithcode.com/method/transformer)) of different sizes to produce stronger visual features for image classification. It processes small and large patch tokens with two separate branches of different computational complexities, and these tokens are fused together multiple times to complement each other.\r \r Fusion is achieved by an efficient [cross-attention module](https://paperswithcode.com/method/cross-attention-module), in which each transformer branch creates a non-patch token as an agent to exchange information with the other branch by attention. This allows for linear-time generation of the attention map in fusion instead of quadratic time otherwise.""" ; skos:prefLabel "CrossViT" . :Crossbow a skos:Concept ; dcterms:source ; skos:definition "**Crossbow** is a single-server multi-GPU system for training deep learning models that enables users to freely choose their preferred batch size—however small—while scaling to multiple GPUs. Crossbow uses many parallel model replicas and avoids reduced statistical efficiency through a new synchronous training method. SMA, a synchronous variant of model averaging, is used, in which replicas independently explore the solution space with gradient descent, but adjust their search synchronously based on the trajectory of a globally-consistent average model." ; skos:prefLabel "Crossbow" . :CuBERT a skos:Concept ; dcterms:source ; skos:definition "**CuBERT**, or **Code Understanding BERT**, is a [BERT](https://paperswithcode.com/method/bert) based model for code understanding. 
In order to achieve this, the authors curate a massive corpus of Python programs collected from GitHub. GitHub projects are known to contain a large amount of duplicate code. To avoid biasing the model to such duplicated code, the authors perform deduplication using the method of [Allamanis (2018)](https://arxiv.org/abs/1812.06469). The resulting corpus has 7.4 million files with a total of 9.3 billion tokens (16 million unique)." ; skos:prefLabel "CuBERT" . :CubeRE a skos:Concept ; dcterms:source ; skos:definition "**CubeRE** first encodes each input sentence using a language model encoder to obtain the contextualized sequence representation. It then captures the interaction between each possible head and tail entity as a pair representation for predicting the entity-relation label scores. To reduce the computational cost, each sentence is pruned to retain only words that have higher entity scores. Finally, the model captures the interaction between each possible relation triplet and qualifier to predict the qualifier label scores and decodes the outputs." ; skos:prefLabel "CubeRE" . :CurricularFace a skos:Concept ; dcterms:source ; skos:definition "**CurricularFace**, or **Adaptive Curriculum Learning**, is a method for face recognition that embeds the idea of curriculum learning into the loss function to achieve a new training scheme. This training scheme mainly addresses easy samples in the early training stage and hard ones in the later stage. Specifically, CurricularFace adaptively adjusts the relative importance of easy and hard samples during different training stages." ; skos:prefLabel "CurricularFace" . :CurvVAE a skos:Concept ; dcterms:source ; skos:altLabel "Curvature Regularized Variational Auto-Encoder" ; skos:definition "" ; skos:prefLabel "CurvVAE" . :CutBlur a skos:Concept ; dcterms:source ; skos:definition "**CutBlur** is a data augmentation method that is specifically designed for low-level vision tasks. 
It cuts a low-resolution patch and pastes it to the corresponding high-resolution image region and vice versa. The key intuition of CutBlur is to enable a model to learn not only \"how\" but also \"where\" to super-resolve an image. By doing so, the model can understand \"how much\" instead of blindly learning to apply super-resolution to every given pixel." ; skos:prefLabel "CutBlur" . :CutMix a skos:Concept ; dcterms:source ; skos:definition "**CutMix** is an image data augmentation strategy. Instead of simply removing pixels as in [Cutout](https://paperswithcode.com/method/cutout), we replace the removed regions with a patch from another image. The ground truth labels are also mixed proportionally to the number of pixels of combined images. The added patches further enhance localization ability by requiring the model to identify the object from a partial view." ; skos:prefLabel "CutMix" . :Cutout a skos:Concept ; dcterms:source ; skos:definition """**Cutout** is an image augmentation and regularization technique that randomly masks out square regions of the input during training, and can be used to improve the robustness and overall performance of convolutional neural networks. The main motivation for cutout comes from the problem of object occlusion, which is commonly encountered in many computer vision tasks, such as object recognition,\r tracking, or human pose estimation. By generating new images which simulate occluded examples, we not only better prepare the model for encounters with occlusions in the real world, but the model also learns to take more of the image context into consideration when making decisions.""" ; skos:prefLabel "Cutout" . :CvT a skos:Concept ; dcterms:source ; skos:altLabel "Convolutional Vision Transformer" ; skos:definition """The **Convolutional vision Transformer (CvT)** is an architecture which incorporates convolutions into the [Transformer](https://paperswithcode.com/method/transformer). 
The CvT design introduces convolutions to two core sections of the ViT architecture.\r \r First, the Transformers are partitioned into multiple stages that form a hierarchical structure of Transformers. The beginning of each stage consists of a convolutional token embedding that performs an overlapping [convolution](https://paperswithcode.com/method/convolution) operation with stride on a 2D-reshaped token map (i.e., reshaping flattened token sequences back to the spatial grid), followed by [layer normalization](https://paperswithcode.com/method/layer-normalization). This allows the model to not only capture local information, but also progressively decrease the sequence length while simultaneously increasing the dimension of token features across stages, achieving spatial downsampling while concurrently increasing the number of feature maps, as is performed in CNNs. \r \r Second, the linear projection prior to every self-attention block in the Transformer module is replaced with a proposed convolutional projection, which employs an s × s depth-wise separable convolution operation on a 2D-reshaped token map. This allows the model to further capture local spatial context and reduce semantic ambiguity in the attention mechanism. It also permits management of computational complexity, as the stride of convolution can be used to subsample the key and value matrices to improve efficiency by 4× or more, with minimal degradation of performance.""" ; skos:prefLabel "CvT" . :Cycle-CenterNet a skos:Concept ; dcterms:source ; skos:definition "**Cycle-CenterNet** is a table structure parsing approach built on [CenterNet](https://paperswithcode.com/method/centernet) that uses a cycle-pairing module to simultaneously detect and group tabular cells into structured tables. It also utilizes a pairing loss which enables the grouping of discrete cells into the structured tables." ; skos:prefLabel "Cycle-CenterNet" . 
:CycleConsistencyLoss a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Cycle Consistency Loss** is a type of loss used for generative adversarial networks that perform unpaired image-to-image translation. It was introduced with the [CycleGAN](https://paperswithcode.com/method/cyclegan) architecture. For two domains $X$ and $Y$, we want to learn a mapping $G : X \\rightarrow Y$ and $F: Y \\rightarrow X$. We want to enforce the intuition that these mappings should be reverses of each other and that both mappings should be bijections. Cycle Consistency Loss encourages $F\\left(G\\left(x\\right)\\right) \\approx x$ and $G\\left(F\\left(y\\right)\\right) \\approx y$. It reduces the space of possible mapping functions by enforcing forward and backwards consistency:\r \r $$ \\mathcal{L}\\_{cyc}\\left(G, F\\right) = \\mathbb{E}\\_{x \\sim p\\_{data}\\left(x\\right)}\\left[||F\\left(G\\left(x\\right)\\right) - x||\\_{1}\\right] + \\mathbb{E}\\_{y \\sim p\\_{data}\\left(y\\right)}\\left[||G\\left(F\\left(y\\right)\\right) - y||\\_{1}\\right] $$""" ; skos:prefLabel "Cycle Consistency Loss" . :CycleGAN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**CycleGAN**, or **Cycle-Consistent GAN**, is a type of generative adversarial network for unpaired image-to-image translation. For two domains $X$ and $Y$, CycleGAN learns a mapping $G : X \\rightarrow Y$ and $F: Y \\rightarrow X$. The novelty lies in trying to enforce the intuition that these mappings should be reverses of each other and that both mappings should be bijections. This is achieved through a [cycle consistency loss](https://paperswithcode.com/method/cycle-consistency-loss) that encourages $F\\left(G\\left(x\\right)\\right) \\approx x$ and $G\\left(F\\left(y\\right)\\right) \\approx y$. 
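As an illustrative sketch (the mappings and names here are toy stand-ins, not the CycleGAN networks), the cycle consistency term can be computed as:

```python
import numpy as np

# Toy sketch of the cycle consistency term: G and F stand in for the two
# learned mappings, and the loss is the mean L1 reconstruction error in both
# directions, mirroring ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1.
def cycle_consistency_loss(G, F, x_batch, y_batch):
    forward = np.mean(np.abs(F(G(x_batch)) - x_batch))   # X -> Y -> X
    backward = np.mean(np.abs(G(F(y_batch)) - y_batch))  # Y -> X -> Y
    return forward + backward

# Toy mappings: G doubles and F halves, so each cycle reconstructs exactly
# and the loss is zero; imperfect inverses give a positive loss.
G = lambda v: 2.0 * v
F = lambda v: 0.5 * v
rng = np.random.default_rng(0)
x = rng.random((8, 3))
y = rng.random((8, 3))
```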
Combining this loss with the adversarial losses on $X$ and $Y$ yields the full objective for unpaired image-to-image translation.\r \r For the mapping $G : X \\rightarrow Y$ and its discriminator $D\\_{Y}$ we have the objective:\r \r $$ \\mathcal{L}\\_{GAN}\\left(G, D\\_{Y}, X, Y\\right) = \\mathbb{E}\\_{y \\sim p\\_{data}\\left(y\\right)}\\left[\\log D\\_{Y}\\left(y\\right)\\right] + \\mathbb{E}\\_{x \\sim p\\_{data}\\left(x\\right)}\\left[\\log\\left(1 − D\\_{Y}\\left(G\\left(x\\right)\\right)\\right)\\right] $$\r \r where $G$ tries to generate images $G\\left(x\\right)$ that look similar to images from domain $Y$, while $D\\_{Y}$ tries to discriminate between translated samples $G\\left(x\\right)$ and real samples $y$. A similar loss is postulated for the mapping $F: Y \\rightarrow X$ and its discriminator $D\\_{X}$.\r \r The Cycle Consistency Loss reduces the space of possible mapping functions by enforcing forward and backwards consistency:\r \r $$ \\mathcal{L}\\_{cyc}\\left(G, F\\right) = \\mathbb{E}\\_{x \\sim p\\_{data}\\left(x\\right)}\\left[||F\\left(G\\left(x\\right)\\right) - x||\\_{1}\\right] + \\mathbb{E}\\_{y \\sim p\\_{data}\\left(y\\right)}\\left[||G\\left(F\\left(y\\right)\\right) - y||\\_{1}\\right] $$\r \r The full objective is:\r \r $$ \\mathcal{L}\\_{GAN}\\left(G, F, D\\_{X}, D\\_{Y}\\right) = \\mathcal{L}\\_{GAN}\\left(G, D\\_{Y}, X, Y\\right) + \\mathcal{L}\\_{GAN}\\left(F, D\\_{X}, Y, X\\right) + \\lambda\\mathcal{L}\\_{cyc}\\left(G, F\\right) $$\r \r where we aim to solve:\r \r $$ G^{\\*}, F^{\\*} = \\arg \\min\\_{G, F} \\max\\_{D\\_{X}, D\\_{Y}} \\mathcal{L}\\_{GAN}\\left(G, F, D\\_{X}, D\\_{Y}\\right) $$\r \r For the original architecture the authors use:\r \r - two stride-2 convolutions, several residual blocks, and two fractionally strided convolutions with stride $\\frac{1}{2}$.\r - [instance normalization](https://paperswithcode.com/method/instance-normalization)\r - PatchGANs for the discriminator\r - Least Squares Loss for the 
[GAN](https://paperswithcode.com/method/gan) objectives.""" ; skos:prefLabel "CycleGAN" . :CyclicalLearningRatePolicy a skos:Concept ; skos:definition """A **Cyclical Learning Rate Policy** combines a linear learning rate decay with warm restarts.\r \r Image: [ESPNetv2](https://paperswithcode.com/method/espnetv2)""" ; skos:prefLabel "Cyclical Learning Rate Policy" . :D4PG a skos:Concept ; dcterms:source ; skos:altLabel "Distributed Distributional DDPG" ; skos:definition "**D4PG**, or **Distributed Distributional DDPG**, is a policy gradient algorithm that extends [DDPG](https://paperswithcode.com/method/ddpg). The improvements include distributional updates to the DDPG algorithm, combined with the use of multiple distributed workers all writing into the same replay table. Among the simpler changes, the biggest performance gain came from the use of $N$-step returns. The authors found that the use of [prioritized experience replay](https://paperswithcode.com/method/prioritized-experience-replay) was less crucial to the overall D4PG algorithm, especially on harder problems." ; skos:prefLabel "D4PG" . :DABMD a skos:Concept ; skos:altLabel "Distributed Any-Batch Mirror Descent" ; skos:definition "**Distributed Any-Batch Mirror Descent** (DABMD) is based on distributed Mirror Descent but uses a fixed per-round computing time to limit the waiting by fast nodes to receive information updates from slow nodes. DABMD is characterized by varying minibatch sizes across nodes. It is applicable to a broader range of problems compared with existing distributed online optimization methods such as those based on dual averaging, and it accommodates time-varying network topology." ; skos:prefLabel "DABMD" . 
:DAC a skos:Concept ; dcterms:source ; skos:altLabel "Dynamic Algorithm Configuration" ; skos:definition """Dynamic algorithm configuration (DAC) is capable of generalizing over prior optimization approaches, as well as handling optimization of hyperparameters that need to be adjusted over multiple time-steps.\r \r Image Source: [Biedenkapp et al.](http://ecai2020.eu/papers/1237_paper.pdf)""" ; skos:prefLabel "DAC" . :DAEL a skos:Concept ; dcterms:source ; skos:altLabel "Domain Adaptive Ensemble Learning" ; skos:definition "**Domain Adaptive Ensemble Learning**, or **DAEL**, is an architecture for domain adaptation. The model is composed of a CNN feature extractor shared across domains and multiple classifier heads each trained to specialize in a particular source domain. Each such classifier is an expert to its own domain and a non-expert to others. DAEL aims to learn these experts collaboratively so that when forming an ensemble, they can leverage complementary information from each other to be more effective for an unseen target domain. To this end, each source domain is used in turn as a pseudo-target-domain with its own expert providing supervisory signal to the ensemble of non-experts learned from the other sources. For unlabeled target data under the UDA setting where real expert does not exist, DAEL uses pseudo-label to supervise the ensemble learning." ; skos:prefLabel "DAEL" . :DAFNe a skos:Concept ; dcterms:source ; skos:definition "**DAFNe** is a dense one-stage anchor-free deep model for oriented object detection. It is a deep neural network that performs predictions on a dense grid over the input image, being architecturally simpler in design, as well as easier to optimize than its two-stage counterparts. Furthermore, it reduces the prediction complexity by refraining from employing bounding box anchors. This enables a tighter fit to oriented objects, leading to a better separation of bounding boxes especially in case of dense object distributions. 
Moreover, it introduces an orientation-aware generalization of the center-ness function to arbitrary quadrilaterals that takes into account the object's orientation and that, accordingly, accurately down-weights low-quality predictions." ; skos:prefLabel "DAFNe" . :DAGNN a skos:Concept ; dcterms:source ; skos:altLabel "Directed Acyclic Graph Neural Network" ; skos:definition "A GNN for DAGs (directed acyclic graphs), which injects their topological order as an inductive bias via asynchronous message passing." ; skos:prefLabel "DAGNN" . :DALL·E2 a skos:Concept ; dcterms:source ; skos:definition "**DALL·E 2** is a generative text-to-image model made up of two main components: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding." ; skos:prefLabel "DALL·E 2" . :DAMO-YOLO a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "DAMO-YOLO" . :DANCE a skos:Concept ; dcterms:source ; skos:altLabel "Domain Adaptive Neighborhood Clustering via Entropy Optimization" ; skos:definition "**Domain Adaptive Neighborhood Clustering via Entropy Optimization (DANCE)** is a self-supervised clustering method that harnesses the cluster structure of the target domain using self-supervision. This is done with a neighborhood clustering technique that self-supervises feature learning in the target. At the same time, useful source features and class boundaries are preserved and adapted with a partial domain alignment loss that the authors refer to as entropy separation loss. This loss allows the model to either match each target example with the source, or reject it as unknown." ; skos:prefLabel "DANCE" . 
:DANet a skos:Concept ; dcterms:source ; skos:altLabel "Dual Attention Network" ; skos:definition """In the field of scene segmentation,\r encoder-decoder structures cannot make use of the global relationships \r between objects, whereas RNN-based structures \r heavily rely on the output of the long-term memorization.\r To address the above problems, \r Fu et al. proposed a novel framework, \r the dual attention network (DANet), \r for natural scene image segmentation. \r Unlike CBAM and BAM, it adopts a self-attention mechanism \r instead of simply stacking convolutions to compute the spatial attention map,\r which enables the network to capture global information directly. \r \r DANet uses in parallel a position attention module and a channel attention module to capture feature dependencies in spatial and channel domains. Given the input feature map $X$, convolution layers are applied first in the position attention module to obtain new feature maps. Then the position attention module selectively aggregates the features at each position using a weighted sum of features at all positions, where the weights are determined by feature similarity between corresponding pairs of positions. The channel attention module has a similar form except for dimensional reduction to model cross-channel relations. Finally the outputs from the two branches are fused to obtain final feature representations. For simplicity, we reshape the feature map $X$ to $C\\times (H \\times W)$ whereupon the overall process can be written as \r \\begin{align}\r Q,\\quad K,\\quad V &= W_qX,\\quad W_kX,\\quad W_vX\r \\end{align}\r \\begin{align}\r Y^\\text{pos} &= X+ V\\text{Softmax}(Q^TK)\r \\end{align}\r \\begin{align}\r Y^\\text{chn} &= X+ \\text{Softmax}(XX^T)X \r \\end{align}\r \\begin{align}\r Y &= Y^\\text{pos} + Y^\\text{chn}\r \\end{align}\r where $W_q$, $W_k$, $W_v \\in \\mathbb{R}^{C\\times C}$ are used to generate new feature maps. 
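The four equations above can be transcribed directly in NumPy as a sketch (the softmax normalization axis and the use of plain weight matrices instead of learned convolutions are our simplifying assumptions):

```python
import numpy as np

# Numerically stable softmax over a chosen axis.
def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Sketch of the DANet dual-attention equations for X of shape (C, N) with
# N = H * W. Position attention builds an (N, N) map from Q and K; channel
# attention builds a (C, C) map directly from X. Axis choices are assumptions
# made so each output position/channel is a convex combination.
def danet_attention(X, W_q, W_k, W_v):
    Q, K, V = W_q @ X, W_k @ X, W_v @ X
    Y_pos = X + V @ softmax(Q.T @ K, axis=0)   # position branch, (C, N)
    Y_chn = X + softmax(X @ X.T, axis=-1) @ X  # channel branch, (C, N)
    return Y_pos + Y_chn

C, N = 4, 6  # channels, spatial positions (H * W)
rng = np.random.default_rng(0)
X = rng.standard_normal((C, N))
W_q, W_k, W_v = (rng.standard_normal((C, C)) for _ in range(3))
Y = danet_attention(X, W_q, W_k, W_v)  # fused output, shape (C, N)
```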
\r \r The position attention module enables\r DANet to capture long-range contextual information\r and adaptively integrate similar features at any scale\r from a global viewpoint,\r while the channel attention module is responsible for \r enhancing useful channels \r as well as suppressing noise. \r Taking spatial and channel \r relationships into consideration explicitly\r improves the feature representation for scene segmentation.\r However, it is computationally costly, especially for large input feature maps.""" ; skos:prefLabel "DANet" . :DAPO a skos:Concept ; dcterms:source ; skos:altLabel "Dialogue-Adaptive Pre-training Objective" ; skos:definition "**Dialogue-Adaptive Pre-training Objective (DAPO)** is a pre-training objective for dialogue adaptation, which is designed to measure qualities of dialogues from multiple important aspects, like Readability, Consistency and Fluency which have already been focused on by general LM pre-training objectives, and those also significant for assessing dialogues but ignored by general LM pre-training objectives, like Diversity and Specificity." ; skos:prefLabel "DAPO" . :DARTS a skos:Concept ; dcterms:source ; skos:altLabel "Differentiable Architecture Search" ; skos:definition "**Differentiable Architecture Search** (**DART**) is a method for efficient architecture search. The search space is made continuous so that the architecture can be optimized with respect to its validation set performance through gradient descent." ; skos:prefLabel "DARTS" . :DARTSMax-W a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Differentiable Architecture Search Max-W" ; skos:definition """Like [DARTS](https://paperswithcode.com/method/darts), except subtract the max weight gradients.\r \r Max-W Weighting:\r \\begin{equation}\r output_i = (1 - max(w) + w_i) * op_i(input_i)\r \\label{eqn:max_w}\r \\end{equation}""" ; skos:prefLabel "DARTS Max-W" . 
:DASPP a skos:Concept ; dcterms:source ; skos:altLabel "Deeper Atrous Spatial Pyramid Pooling" ; skos:definition "DASPP is a deeper version of the [ASPP](https://paperswithcode.com/method/aspp) module (the latter from [DeepLabv3](https://paperswithcode.com/method/deeplabv3)) that adds a standard 3 × 3 [convolution](https://paperswithcode.com/method/convolution) after the 3 × 3 dilated convolutions to refine the features, and also fuses the input and the output of the DASPP module via a short [residual connection](https://paperswithcode.com/method/residual-connection). Also, the number of convolution filters of ASPP is reduced from 255 to 96 to improve computational performance." ; skos:prefLabel "DASPP" . :DAU-ConvNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Displaced Aggregation Units" ; skos:definition """**Displaced Aggregation Unit** replaces the classic [convolution](https://paperswithcode.com/method/convolution) layer in ConvNets with learnable positions of units. This introduces an explicit structure of hierarchical compositions and results in several benefits:\r \r * fully adjustable and **learnable receptive fields** through spatially-adjustable filter units\r * **reduced parameters** for spatial coverage\r * **efficient inference**\r * **decoupling** of the parameters from the receptive field sizes\r \r More information can be found [here.](https://www.vicos.si/Research/DeepCompositionalNet)""" ; skos:prefLabel "DAU-ConvNet" . :DBGAN a skos:Concept ; rdfs:seeAlso ; skos:altLabel "Distribution-induced Bidirectional Generative Adversarial Network for Graph Representation Learning" ; skos:definition """DBGAN is a method for graph representation learning. 
Instead of the widely used normal distribution assumption, the prior distribution of the latent representation in DBGAN is estimated in a structure-aware way, which implicitly bridges the graph and feature spaces by prototype learning.\r \r Source: [Distribution-induced Bidirectional Generative Adversarial Network for Graph Representation Learning](https://arxiv.org/abs/1912.01899)""" ; skos:prefLabel "DBGAN" . :DBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**DBlock** is a residual based block used in the discriminator of the [GAN-TTS](https://paperswithcode.com/method/gan-tts) architecture. They are similar to the [GBlocks](https://paperswithcode.com/method/gblock) used in the generator, but without batch normalisation." ; skos:prefLabel "DBlock" . :DCGAN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Deep Convolutional GAN" ; skos:definition """**DCGAN**, or **Deep Convolutional GAN**, is a generative adversarial network architecture. It follows several guidelines, in particular:\r \r - Replacing any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).\r - Using batchnorm in both the generator and the discriminator.\r - Removing fully connected hidden layers for deeper architectures.\r - Using [ReLU](https://paperswithcode.com/method/relu) activation in the generator for all layers except for the output, which uses tanh.\r - Using LeakyReLU activation in the discriminator for all layers.""" ; skos:prefLabel "DCGAN" . :DCLS a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Dilated convolution with learnable spacings" ; skos:definition """Dilated convolution with learnable spacings (DCLS) is a type of convolution that allows the spacings between the non-zero elements of the kernel to be learned during training. 
This makes it possible to increase the receptive field of the convolution without increasing the number of parameters, which can improve the performance of the network on tasks that require long-range dependencies.\r \r A dilated convolution is a type of convolution that allows the kernel to skip over some of the input features. This is done by inserting zeros between the non-zero elements of the kernel. The effect of this is to increase the receptive field of the convolution without increasing the number of parameters.\r \r DCLS takes this idea one step further by allowing the spacings between the non-zero elements of the kernel to be learned during training. This means that the network can learn to skip over different input features depending on the task at hand. This can be particularly helpful for tasks that require long-range dependencies, such as image segmentation and object detection.\r \r DCLS has been shown to be effective for a variety of tasks, including image classification, object detection, and semantic segmentation. It is a promising new technique that has the potential to improve the performance of convolutional neural networks on a variety of tasks.""" ; skos:prefLabel "DCLS" . :DCN-V2 a skos:Concept ; dcterms:source ; skos:definition "**DCN-V2** is an architecture for learning-to-rank that improves upon the original [DCN](http://paperswithcode.com/method/dcn) model. It first learns explicit feature interactions of the inputs (typically the embedding layer) through cross layers, and then combines with a deep network to learn complementary implicit interactions. The core of DCN-V2 is the cross layers, which inherit the simple structure of the cross network from DCN but are significantly more expressive at learning explicit and bounded-degree cross features." ; skos:prefLabel "DCN-V2" . 
:DCNN a skos:Concept ; dcterms:source ; skos:altLabel "Diffusion-Convolutional Neural Networks" ; skos:definition """Diffusion-convolutional neural networks (DCNN) is a model for graph-structured data. Through the introduction of a diffusion-convolution operation, diffusion-based representations can be learned from graph structured data and used as an effective basis for node classification.\r \r Description and image from: [Diffusion-Convolutional Neural Networks](https://arxiv.org/pdf/1511.02136.pdf)""" ; skos:prefLabel "DCNN" . :DD-PPO a skos:Concept ; dcterms:source ; skos:altLabel "Decentralized Distributed Proximal Policy Optimization" ; skos:definition """**Decentralized Distributed Proximal Policy Optimization (DD-PPO)** is a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever `stale'), making it conceptually simple and easy to implement. \r \r Proximal Policy Optimization, or [PPO](https://paperswithcode.com/method/ppo), is a policy gradient method for reinforcement learning. The motivation was to have an algorithm with the data efficiency and reliable performance of [TRPO](https://paperswithcode.com/method/trpo), while using only first-order optimization. \r \r Let $r\\_{t}\\left(\\theta\\right)$ denote the probability ratio $r\\_{t}\\left(\\theta\\right) = \\frac{\\pi\\_{\\theta}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}{\\pi\\_{\\theta\\_{old}}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}$, so $r\\left(\\theta\\_{old}\\right) = 1$. 
TRPO maximizes a “surrogate” objective:\r \r $$ L^{v}\\left({\\theta}\\right) = \\hat{\\mathbb{E}}\\_{t}\\left[\\frac{\\pi\\_{\\theta}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}{\\pi\\_{\\theta\\_{old}}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}\\hat{A}\\_{t}\\right] = \\hat{\\mathbb{E}}\\_{t}\\left[r\\_{t}\\left(\\theta\\right)\\hat{A}\\_{t}\\right] $$\r \r As a general abstraction, DD-PPO implements the following:\r at step $k$, worker $n$ has a copy of the parameters, $\\theta^k_n$, calculates the gradient, $\\delta \\theta^k_n$, and updates $\\theta$ via \r \r $$ \\theta^{k+1}\\_n = \\text{ParamUpdate}\\Big(\\theta^{k}\\_n, \\text{AllReduce}\\big(\\delta \\theta^k\\_1, \\ldots, \\delta \\theta^k\\_N\\big)\\Big) = \\text{ParamUpdate}\\Big(\\theta^{k}\\_n, \\frac{1}{N} \\sum_{i=1}^{N} { \\delta \\theta^k_i} \\Big) $$\r \r where $\\text{ParamUpdate}$ is any first-order optimization technique (e.g. gradient descent) and $\\text{AllReduce}$ performs a reduction (e.g. mean) over all copies of a variable and returns the result to all workers.\r Distributed DataParallel scales very well (near-linear scaling up to 32,000 GPUs), and is reasonably simple to implement (all workers synchronously running identical code).""" ; skos:prefLabel "DD-PPO" . :DDPG a skos:Concept ; dcterms:source ; skos:altLabel "Deep Deterministic Policy Gradient" ; skos:definition "**DDPG**, or **Deep Deterministic Policy Gradient**, is an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. It combines the actor-critic approach with insights from [DQNs](https://paperswithcode.com/method/dqn): in particular, the insights that 1) the network is trained off-policy with samples from a replay buffer to minimize correlations between samples, and 2) the network is trained with a target Q network to give consistent targets during temporal difference backups. 
DDPG makes use of the same ideas along with [batch normalization](https://paperswithcode.com/method/batch-normalization)." ; skos:prefLabel "DDPG" . :DDParser a skos:Concept ; dcterms:source ; skos:altLabel "Baidu Dependency Parser" ; skos:definition """**DDParser**, or **Baidu Dependency Parser**, is a Chinese dependency parser trained on a large-scale manually labeled dataset called Baidu Chinese Treebank (DuCTB).\r \r For inputs, for the $i$-th word, its input vector $e_{i}$ is the concatenation of the word embedding and the character-level representation:\r \r $$\r e\\_{i}=e\\_{i}^{word} \\oplus \\operatorname{CharLSTM}\\left(w\\_{i}\\right)\r $$\r \r where $\\operatorname{CharLSTM}\\left(w_{i}\\right)$ denotes the output vectors after feeding the character sequence into a [BiLSTM](https://paperswithcode.com/method/bilstm) layer. The experimental results on the DuCTB dataset show that replacing POS tag embeddings with $\\operatorname{CharLSTM}\\left(w_{i}\\right)$ leads to an improvement.\r \r For the BiLSTM encoder, three BiLSTM layers are employed over the input vectors for context encoding. Denote by $r\\_{i}$ the output vector of the top-layer BiLSTM for $w\\_{i}$.\r \r The dependency parser of [Dozat and Manning](https://arxiv.org/abs/1611.01734) is used. Dimension-reducing MLPs are applied to each recurrent output vector $r\\_{i}$ before applying the biaffine transformation. Applying smaller MLPs to the recurrent output states before the biaffine classifier has the advantage of stripping away information not relevant to the current decision. Then biaffine attention is used in both the dependency arc classifier and the relation classifier. 
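The biaffine arc scorer described here can be sketched as follows (a minimal NumPy illustration; the shapes and the helper name `biaffine_arc_scores` are assumptions for exposition, following Dozat and Manning's formulation, not the DDParser code):

```python
import numpy as np

def biaffine_arc_scores(h_dep, h_head, U):
    # h_dep, h_head: (n, d) MLP outputs for the dependent and head roles.
    # U: (d + 1, d) biaffine weight; a column of ones is appended to the
    # dependent side so the weight can also express a head-only bias.
    n = h_dep.shape[0]
    h_dep_aug = np.concatenate([h_dep, np.ones((n, 1))], axis=1)  # (n, d+1)
    return h_dep_aug @ U @ h_head.T                               # (n, n) arc scores
```

Entry (i, j) of the returned matrix scores word j as the head of word i; the relation classifier uses the same bilinear form with a separate weight per relation label.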
The computations of all symbols in the Figure are shown below:\r \r $$\r h_{i}^{d-arc}=MLP^{d-arc}\\left(r_{i}\\right)\r $$\r $$\r h_{i}^{h-arc}=MLP^{h-arc}\\left(r_{i}\\right)\r $$\r $$\r h_{i}^{d-rel}=MLP^{d-rel}\\left(r_{i}\\right)\r $$\r $$\r h_{i}^{h-rel}=MLP^{h-rel}\\left(r_{i}\\right)\r $$\r $$\r S^{arc}=\\left(H^{d-arc} \\oplus I\\right) U^{arc} H^{h-arc}\r $$\r $$\r S^{rel}=\\left(H^{d-rel} \\oplus I\\right) U^{rel}\\left(\\left(H^{h-rel}\\right)^{T} \\oplus I\\right)^{T}\r $$\r \r For the decoder, the first-order Eisner algorithm is used to ensure that the output is a projective tree. Based on the dependency tree built by the biaffine parser, we get a word sequence through the in-order traversal of the tree. The output is a projective tree only if the word sequence is in order.""" ; skos:prefLabel "DDParser" . :DDQL a skos:Concept ; dcterms:source ; skos:altLabel "Double Deep Q-Learning" ; skos:definition "" ; skos:prefLabel "DDQL" . :DDSP a skos:Concept ; dcterms:source ; skos:altLabel "Differentiable Digital Signal Processing" ; skos:definition "" ; skos:prefLabel "DDSP" . :DE-GAN a skos:Concept ; skos:altLabel "DE-GAN: A Conditional Generative Adversarial Network for Document Enhancement" ; skos:definition """Documents often exhibit various forms of degradation, which make them hard to read and substantially deteriorate the performance of an OCR system. In this paper, we propose an effective end-to-end framework named Document Enhancement Generative Adversarial Networks (DE-GAN) that uses conditional GANs (cGANs) to restore severely degraded document images. To the best of our knowledge, this practice has not been studied within the context of generative adversarial deep networks. We demonstrate that, in different tasks (document clean-up, binarization, deblurring and watermark removal), DE-GAN can produce an enhanced version of the degraded document with high quality. 
In addition, our approach provides consistent improvements compared to state-of-the-art methods over the widely used DIBCO 2013, DIBCO 2017 and H-DIBCO 2018 datasets, proving its ability to restore a degraded document image to its ideal condition. The obtained results on a wide variety of degradations reveal the flexibility of the proposed model to be exploited in other document enhancement problems.""" ; skos:prefLabel "DE-GAN" . :DECA a skos:Concept ; dcterms:source ; skos:altLabel "Detailed Expression Capture and Animation" ; skos:definition "**Detailed Expression Capture and Animation**, or **DECA**, is a model for 3D face reconstruction that is trained to robustly produce a UV displacement map from a low-dimensional latent representation that consists of person-specific detail parameters and generic expression parameters, while a regressor is trained to predict detail, shape, albedo, expression, pose and illumination parameters from a single image. A detail-consistency loss is used to disentangle person-specific details and expression-dependent wrinkles. This disentanglement allows us to synthesize realistic person-specific wrinkles by controlling expression parameters while keeping person-specific details unchanged." ; skos:prefLabel "DECA" . :DELG a skos:Concept ; dcterms:source ; skos:definition """**DELG** is a convolutional neural network for image retrieval that combines generalized mean pooling for global features and attentive selection for local features. The entire network can be learned end-to-end by carefully balancing the gradient flow between two heads – requiring only image-level labels. 
This allows for efficient inference by extracting an image’s global feature, detected keypoints and local descriptors within a single model.\r \r The model is enabled by leveraging hierarchical image representations that arise in [CNNs](https://paperswithcode.com/methods/category/convolutional-neural-networks), which are coupled to [generalized mean pooling](https://paperswithcode.com/method/generalized-mean-pooling) and attentive local feature detection. Secondly, a convolutional autoencoder module is adopted that can successfully learn low-dimensional local descriptors. This can be readily integrated into the unified model, and avoids the need for post-processing learning steps, such as [PCA](https://paperswithcode.com/method/pca), that are commonly used. Finally, a procedure is used that enables end-to-end training of the proposed model using only image-level supervision. This requires carefully controlling the gradient flow between the global and local network heads during backpropagation, to avoid disrupting the desired representations.""" ; skos:prefLabel "DELG" . :DELU a skos:Concept ; dcterms:source ; skos:definition """The **DELU** is a type of activation function with trainable parameters that combines linear and exponential functions for positive inputs and uses the **[SiLU](https://paperswithcode.com/method/silu)** for non-positive inputs.\r \r $$DELU(x) = SiLU(x), x \\leqslant 0$$\r $$DELU(x) = (n + 0.5)x + |e^{-x} - 1|, x > 0$$""" ; skos:prefLabel "DELU" . :DEQ a skos:Concept ; dcterms:source ; skos:altLabel "Deep Equilibrium Models" ; skos:definition "A new kind of implicit model, where the output of the network is defined as the solution to an \"infinite-level\" fixed point equation. Thanks to this, the gradient of the output can be computed without storing intermediate activations, and therefore with a significantly reduced memory footprint." ; skos:prefLabel "DEQ" . 
:DEXTR a skos:Concept ; dcterms:source ; skos:altLabel "Deep Extreme Cut" ; skos:definition """**DEXTR**, or **Deep Extreme Cut**, obtains an object segmentation from its four extreme points: the left-most, right-most, top, and bottom pixels. The annotated extreme points are given as a guiding signal to the input of the network. To this end, we create a [heatmap](https://paperswithcode.com/method/heatmap) with activations in the regions of extreme points. We center a 2D Gaussian around each of the points, in order to create a single heatmap. The heatmap is concatenated with the RGB channels of the input image, to form a 4-channel input for the CNN. In order to focus on the object of interest, the input is cropped by the bounding box, formed from the extreme point annotations. To include context in the resulting crop, we relax the tight bounding box by several pixels. After the pre-processing step that comes exclusively from the extreme clicks, the input consists of an RGB crop including an object, plus its extreme points. \r \r [ResNet](https://paperswithcode.com/method/resnet)-101 is chosen as the backbone of the architecture. We remove the fully connected layers as well as the [max pooling](https://paperswithcode.com/method/max-pooling) layers in the last two stages to preserve acceptable output resolution for dense prediction, and we introduce atrous convolutions in the last two stages to maintain the same receptive field. After the last ResNet-101 stage, we introduce a pyramid scene parsing module to aggregate global context to the final feature map. The output of the CNN is a probability map representing whether a pixel belongs to the object that we want to segment or not. The CNN is trained to minimize the standard cross entropy loss, which takes into account that different classes occur with different frequency in a dataset.""" ; skos:prefLabel "DEXTR" . 
:DExTra a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**DExTra**, or **Deep and Light-weight Expand-reduce Transformation**, is a light-weight expand-reduce transformation that enables learning wider representations efficiently.\r \r DExTra maps a $d\\_{m}$-dimensional input vector into a high dimensional space (expansion) and then\r reduces it down to a $d\\_{o}$-dimensional output vector (reduction) using $N$ layers of group transformations. During these expansion and reduction phases, DExTra uses group linear transformations because they learn local representations by deriving the output from a specific part of the input and are more efficient than linear transformations. To learn global representations, DExTra shares information between different groups in the group linear transformation using feature shuffling.\r \r Formally, the DExTra transformation is controlled by five configuration parameters: (1) depth $N$, (2)\r width multiplier $m\\_{w}$, (3) input dimension $d\\_{m}$, (4) output dimension $d\\_{o}$, and (5) maximum groups $g\\_{max}$ in a group linear transformation. In the expansion phase, DExTra projects the $d\\_{m}$-dimensional input to a high-dimensional space, $d\\_{max} = m\\_{w}d\\_{m}$, linearly using $\\text{ceil}\\left(\\frac{N}{2}\\right)$ layers. In the reduction phase, DExTra projects the $d\\_{max}$-dimensional vector to a $d\\_{o}$-dimensional space using the remaining $N -\\text{ceil}\\left(\\frac{N}{2}\\right)$ layers. 
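The group linear transformation used at each layer can be sketched as follows (a minimal NumPy illustration; the helper name `group_linear` and the shapes are assumptions for exposition, not the authors' code):

```python
import numpy as np

def group_linear(x, weights, biases):
    # x: (d_in,) input vector; weights/biases: one (d_out_g, d_in_g) matrix and
    # (d_out_g,) bias per group. The input is split into g groups, each group is
    # transformed independently, and the group outputs are concatenated.
    g = len(weights)
    chunks = np.split(x, g)
    outs = [w @ c + b for c, w, b in zip(chunks, weights, biases)]
    return np.concatenate(outs)
```

With one group this reduces to an ordinary linear layer; feature shuffling between layers then mixes information across groups so that global representations can still be learned.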
Mathematically, we define the output $Y$ at each layer $l$ as:\r \r $$ \\mathbf{Y}^{l} = \\mathcal{F}\\left(\\mathbf{X}, \\mathbf{W}^{l}, \\mathbf{b}^{l}, g^{l}\\right) \\text{ if } l=1 $$\r $$ \\mathbf{Y}^{l} = \\mathcal{F}\\left(\\mathcal{H}\\left(\\mathbf{X}, \\mathbf{Y}^{l-1}\\right), \\mathbf{W}^{l}, \\mathbf{b}^{l}, g^{l}\\right) \\text{ otherwise } $$\r \r where the number of groups at each layer $l$ is computed as:\r \r $$ g^{l} = \\text{min}\\left(2^{l-1}, g\\_{max}\\right), 1 \\leq l \\leq \\text{ceil}\\left(N/2\\right) $$\r $$ g^{l} = g^{N-l}, \\text{ otherwise}$$\r \r In the above equations, $\\mathcal{F}$ is a group linear transformation function. The function $\\mathcal{F}$ takes the input $\\left(\\mathbf{X} \\text{ or } \\mathcal{H}\\left(\\mathbf{X}, \\mathbf{Y}^{l-1}\\right) \\right)$, splits it into $g^{l}$ groups, and then applies a linear transformation with learnable parameters $\\mathbf{W}^{l}$ and bias $\\mathbf{b}^{l}$ to each group independently. The outputs of each group are then concatenated to produce the final output $\\mathbf{Y}^{l}$. The function $\\mathcal{H}$ first shuffles the output of each group in $\\mathbf{Y}^{l−1}$ and then combines it with the input $\\mathbf{X}$ using an input mixer connection.\r \r In the authors' experiments, they use $g\\_{max} = \\text{ceil}\\left(\\frac{d\\_{m}}{32}\\right)$ so that each group has at least 32 input elements. Note that (i) group linear transformations reduce to linear transformations when $g^{l} = 1$, and (ii) DExTra is equivalent to a multi-layer perceptron when $g\\_{max} = 1$.""" ; skos:prefLabel "DExTra" . :DFA a skos:Concept ; dcterms:source ; skos:altLabel "Direct Feedback Alignment" ; skos:definition "" ; skos:prefLabel "DFA" . :DFDNet a skos:Concept ; dcterms:source ; skos:definition "**DFDNet**, or **Deep Face Dictionary Network**, is a deep network for face restoration that uses dictionaries of facial components to guide the restoration process of degraded observations. 
Given a LQ image $I\\_{d}$, the DFDNet selects the dictionary features that have the most similar structure to the input. Specifically, we re-normalize the whole dictionaries via component AdaIN (termed CAdaIN) based on the input component to eliminate the distribution or style diversity. The selected dictionary features are then utilized to guide the restoration process via dictionary feature transformation." ; skos:prefLabel "DFDNet" . :DG-Net a skos:Concept ; dcterms:source ; skos:altLabel "Discriminative and Generative Network" ; skos:definition "" ; skos:prefLabel "DG-Net" . :DGCNN a skos:Concept ; dcterms:source ; skos:altLabel "Deep Graph Convolutional Neural Network" ; skos:definition """DGCNN involves neural networks that read the graphs directly and learn a classification function. There are two main challenges: 1) how to extract useful features characterizing the rich information encoded in a graph for classification purposes, and 2) how to sequentially read a graph in a meaningful and consistent order. To address the first challenge, we design a localized graph convolution model and show its connection with two graph kernels. To address the second challenge, we design a novel SortPooling layer which sorts graph vertices in a consistent order so that traditional neural networks can be trained on the graphs.\r \r Description and image from: [An End-to-End Deep Learning Architecture for Graph Classification](https://muhanzhang.github.io/papers/AAAI_2018_DGCNN.pdf)""" ; skos:prefLabel "DGCNN" . :DGI a skos:Concept ; dcterms:source ; skos:altLabel "Deep Graph Infomax" ; skos:definition """Deep Graph Infomax (DGI) is a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs—both derived using established graph convolutional network architectures. 
The learnt patch representations summarize subgraphs centered around nodes of interest, and can thus be reused for downstream node-wise learning tasks. In contrast to most prior approaches to unsupervised learning with GCNs, DGI does not rely on random walk objectives, and is readily applicable to both transductive and inductive learning setups.\r \r Description and image from: [DEEP GRAPH INFOMAX](https://arxiv.org/pdf/1809.10341.pdf)""" ; skos:prefLabel "DGI" . :DGRF a skos:Concept ; dcterms:source ; skos:altLabel "Difference of Gaussian Random Forest" ; skos:definition "" ; skos:prefLabel "DGRF" . :DIME a skos:Concept ; dcterms:source ; skos:altLabel "Distance to Modelled Embedding" ; skos:definition "**DIME**, or **Distance to Modelled Embedding**, is a method for detecting out-of-distribution examples during prediction time. Given a trained neural network, the training data drawn from some high-dimensional distribution in data space $X$ is transformed into the model’s intermediate feature vector space $\\mathbb{R}^{p}$. The training set embedding is linearly approximated as a hyperplane. When we then receive new observations it is difficult to assess if observations are out-of-distribution directly in data space, so we transform them into the same intermediate feature space. Finally, the Distance-to-Modelled-Embedding (DIME) can be used to assess whether new observations fit into the expected embedding covariance structure." ; skos:prefLabel "DIME" . :DINO a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "self-DIstillation with NO labels" ; skos:definition """**DINO** (self-distillation with no labels) is a self-supervised learning method that directly predicts the output of a teacher network - built with a momentum encoder - using a standard cross-entropy loss. 
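The training loop just described can be sketched as follows (a minimal NumPy illustration, not the authors' implementation; the helper names and the temperature values are assumptions based on common defaults):

```python
import numpy as np

def softmax(x, temp):
    # temperature-scaled softmax over the feature dimension
    z = (x - x.max(axis=-1, keepdims=True)) / temp
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_out, teacher_out, center, ts=0.1, tt=0.04):
    # the teacher output is centered (batch mean) and sharpened with a lower
    # temperature; cross-entropy is taken between teacher and student distributions
    t = softmax(teacher_out - center, tt)
    s = softmax(student_out, ts)
    return -(t * np.log(s + 1e-12)).sum(axis=-1).mean()

def ema_update(teacher_w, student_w, m=0.996):
    # the teacher follows an exponential moving average of the student parameters
    return m * teacher_w + (1.0 - m) * student_w
```

Gradients would flow only through the student branch (the stop-gradient on the teacher); the teacher is updated by the EMA rule rather than by backpropagation.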
\r \r In the example to the right, DINO is illustrated in the case of a single pair of views $\\left(x\\_{1}, x\\_{2}\\right)$ for simplicity.\r The model passes two different random transformations of an input image to the student and teacher networks. Both networks have the same architecture but different parameters.\r The output of the teacher network is centered with a mean computed over the batch. Each network outputs a $K$-dimensional feature normalized with a temperature [softmax](https://paperswithcode.com/method/softmax) over the feature dimension.\r Their similarity is then measured with a cross-entropy loss.\r A stop-gradient (sg) operator is applied to the teacher to propagate gradients only through the student.\r The teacher parameters are updated with an exponential moving average (EMA) of the student parameters.""" ; skos:prefLabel "DINO" . :DIoU-NMS a skos:Concept ; dcterms:source ; skos:definition """**DIoU-NMS** is a type of non-maximum suppression where we use Distance IoU rather than regular IoU, in which the overlap area and the distance between two central points of bounding boxes are simultaneously considered when suppressing redundant boxes.\r \r In original NMS, the IoU metric is used to suppress the redundant detection boxes, where the overlap area is the unique factor, often yielding false suppression for the cases with occlusion. With DIoU-NMS, we not only consider the overlap area but also the central point distance between two boxes.""" ; skos:prefLabel "DIoU-NMS" . :DLA a skos:Concept ; dcterms:source ; skos:altLabel "Deep Layer Aggregation" ; skos:definition """**DLA**, or **Deep Layer Aggregation**, iteratively and hierarchically merges the feature hierarchy across layers in neural networks to make networks with better accuracy and fewer parameters. \r \r In iterative deep aggregation (IDA), aggregation begins at the shallowest, smallest scale and then iteratively merges deeper, larger scales. 
In this way, shallow features are refined as they are propagated through different stages of aggregation.\r \r In hierarchical deep aggregation (HDA), blocks and stages in a tree are merged to preserve and combine feature channels. With HDA, shallower and deeper layers are combined to learn richer combinations that span more of the feature hierarchy. While IDA effectively combines stages, it is insufficient for fusing the many blocks of a network, as it is still only sequential.""" ; skos:prefLabel "DLA" . :DMA a skos:Concept ; dcterms:source ; skos:altLabel "Dual Multimodal Attention" ; skos:definition "In the image inpainting task, the mechanism extracts complementary features from the word embedding in two paths by reciprocal attention, comparing the descriptive text with complementary image areas." ; skos:prefLabel "DMA" . :DMAGE a skos:Concept ; skos:altLabel "Unsupervised Deep Manifold Attributed Graph Embedding" ; skos:definition "Unsupervised attributed graph representation learning is challenging since both structural and feature information are required to be represented in the latent space. Existing methods concentrate on learning latent representations via reconstruction tasks, but cannot directly optimize the representation and are prone to oversmoothing, thus limiting the applications on downstream tasks. To alleviate these issues, we propose a novel graph embedding framework named Deep Manifold Attributed Graph Embedding (DMAGE). A node-to-node geodesic similarity is proposed to compute the inter-node similarity between the data space and the latent space, and Bregman divergence is then used as the loss function to minimize the difference between them. We then design a new network structure with fewer aggregations to alleviate the oversmoothing problem and incorporate graph structure augmentation to improve the representation's stability. 
Our proposed DMAGE surpasses state-of-the-art methods by a significant margin on three downstream tasks: unsupervised visualization, node clustering, and link prediction across four popular datasets." ; skos:prefLabel "DMAGE" . :DMVFN a skos:Concept ; dcterms:source ; skos:altLabel "A Dynamic Multi-Scale Voxel Flow Network" ; skos:definition "" ; skos:prefLabel "DMVFN" . :DNAS a skos:Concept ; dcterms:source ; skos:altLabel "Differentiable Neural Architecture Search" ; skos:definition """**DNAS**, or **Differentiable Neural Architecture Search**, uses gradient-based methods to optimize ConvNet architectures, avoiding enumerating and training individual architectures separately as in previous methods. DNAS allows us to explore a layer-wise search space where we can choose a different block for each layer of the network. DNAS represents the search space by a super net whose operators execute stochastically. It relaxes the problem of finding the optimal architecture to that of finding a distribution that yields the optimal architecture. By using the [Gumbel Softmax](https://paperswithcode.com/method/gumbel-softmax) technique, it is possible to directly train the architecture distribution using gradient-based optimization such as [SGD](https://paperswithcode.com/method/sgd).\r \r The loss used to train the stochastic super net consists of both the cross-entropy loss that leads to better accuracy and the latency loss that penalizes the network's latency on a target device. To estimate the latency of an architecture, the latency of each operator in the search space is measured and a lookup table model is used to compute the overall latency by adding up the latency of each operator. Using this model allows for estimation of the latency of architectures in an enormous search space. More importantly, it makes the latency differentiable with respect to layer-wise block choices.""" ; skos:prefLabel "DNAS" . 
:DNN2LR a skos:Concept ; dcterms:source ; skos:definition "**DNN2LR** is an automatic feature crossing method to find feature interactions in a deep neural network, and use them as cross features in logistic regression. In general, DNN2LR consists of two steps: (1) generating a compact and accurate candidate set of cross feature fields; (2) searching in the candidate set for the final cross feature fields." ; skos:prefLabel "DNN2LR" . :DOLG a skos:Concept ; dcterms:source ; skos:altLabel "Deep Orthogonal Fusion of Local and Global Features" ; skos:definition """Image Retrieval is a fundamental task of obtaining images similar to the query one from a database. A common image retrieval practice is to first retrieve candidate images via similarity search using global image features and then re-rank the candidates by leveraging their local features. Previous learning-based studies mainly focus on either global or local image representation learning to tackle the retrieval task. In this paper, we abandon the two-stage paradigm and seek to design an effective single-stage solution by integrating local and global information inside images into compact image representations. Specifically, we propose a Deep Orthogonal Local and Global (DOLG) information fusion framework for end-to-end image retrieval. It first attentively extracts representative local information with multi-atrous convolutions and self-attention. Components orthogonal to the global image representation are then extracted from the local information. At last, the orthogonal components are concatenated with the global representation as a complement, and then aggregation is performed to generate the final representation. The whole framework is end-to-end differentiable and can be trained with image-level labels. 
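The orthogonal fusion step can be sketched as follows (a minimal NumPy illustration under assumed shapes; the function name `orthogonal_fusion` and the mean aggregation are simplifications of the paper's pooling-based aggregation):

```python
import numpy as np

def orthogonal_fusion(local_feats, global_feat):
    # local_feats: (N, C) local descriptors; global_feat: (C,) global descriptor.
    # Project each local feature onto the global one and keep only the
    # component orthogonal to it, then concatenate with the global feature.
    g = global_feat
    g_norm_sq = np.dot(g, g)
    proj = (local_feats @ g / g_norm_sq)[:, None] * g[None, :]  # projection onto g
    orth = local_feats - proj                                   # orthogonal component
    aggregated = orth.mean(axis=0)                              # simplified aggregation
    return np.concatenate([aggregated, g])
```

By construction, each local component left after the subtraction is orthogonal to the global descriptor, so the concatenated vector carries complementary rather than redundant information.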
Extensive experimental results validate the effectiveness of our solution and show that our model achieves state-of-the-art image retrieval performance on the Revisited Oxford and Paris datasets.""" ; skos:prefLabel "DOLG" . :DPG a skos:Concept ; skos:altLabel "Deterministic Policy Gradient" ; skos:definition "**Deterministic Policy Gradient**, or **DPG**, is a policy gradient method for reinforcement learning. Instead of the policy function $\\pi\\left(.\\mid{s}\\right)$ being modeled as a probability distribution, DPG considers and calculates gradients for a deterministic policy $a = \\mu\\_{\\theta}\\left(s\\right)$." ; skos:prefLabel "DPG" . :DPN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Dual Path Network" ; skos:definition """A **Dual Path Network (DPN)** is a convolutional neural network which presents a new topology of connection paths internally. The intuition is that [ResNets](https://paperswithcode.com/method/resnet) enable feature re-usage while [DenseNets](https://paperswithcode.com/method/densenet) enable new feature exploration, and both are important for learning good representations. To enjoy the benefits from both path topologies, Dual Path Networks share common features while maintaining the flexibility to explore new features through dual path architectures. \r \r We formulate such a dual path architecture as follows:\r \r $$x^{k} = \\sum\\limits\\_{t=1}^{k-1} f\\_t^{k}(h^t) \\text{,} $$\r \r $$\r y^{k} = \\sum\\limits\\_{t=1}^{k-1} v\\_t(h^t) = y^{k-1} + \\phi^{k-1}(y^{k-1}) \\text{,}\r $$\r \r $$\r r^{k} = x^{k} + y^{k} \\text{,}\r $$\r \r $$\r h^k = g^k \\left( r^{k} \\right) \\text{,}\r $$\r \r where $x^{k}$ and $y^{k}$ denote the extracted information at the $k$-th step from the individual paths, and $v_t(\\cdot)$ is a feature learning function, as is $f_t^k(\\cdot)$. The first equation refers to the densely connected path that enables exploring new features. 
The second equation refers to the residual path that enables common features re-usage. The third equation defines the dual path that integrates them and feeds them to the last transformation function in the last equation.""" ; skos:prefLabel "DPN" . :DPNBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **Dual Path Network** block is an image model block used in convolutional neural networks. The idea of this module is to enable sharing of common features while maintaining the flexibility to explore new features through dual path architectures. In this sense it combines the benefits of [ResNets](https://paperswithcode.com/method/resnet) and [DenseNets](https://paperswithcode.com/method/densenet). It was proposed as part of the [DPN](https://paperswithcode.com/method/dpn) CNN architecture.\r \r We formulate such a dual path architecture as follows:\r \r $$x^{k} = \\sum\\limits\\_{t=1}^{k-1} f\\_t^{k}(h^t) \\text{,} $$\r \r $$\r y^{k} = \\sum\\limits\\_{t=1}^{k-1} v\\_t(h^t) = y^{k-1} + \\phi^{k-1}(y^{k-1}) \\text{,}\r $$\r \r $$\r r^{k} = x^{k} + y^{k} \\text{,}\r $$\r \r $$\r h^k = g^k \\left( r^{k} \\right) \\text{,}\r $$\r \r where $x^{k}$ and $y^{k}$ denote the extracted information at the $k$-th step from the individual paths, and $v_t(\\cdot)$ is a feature learning function, as is $f_t^k(\\cdot)$. The first equation refers to the densely connected path that enables exploring new features. The second equation refers to the residual path that enables common features re-usage. The third equation defines the dual path that integrates them and feeds them to the last transformation function in the last equation.""" ; skos:prefLabel "DPN Block" . 
:DPT a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Dense Prediction Transformer" ; skos:definition """**Dense Prediction Transformers** (DPT) are a type of [vision transformer](https://paperswithcode.com/method/vision-transformer) for dense prediction tasks.\r \r The input image is transformed into tokens (orange) either by extracting non-overlapping patches followed by a linear projection of their flattened representation (DPT-Base and DPT-Large) or by applying a [ResNet](https://paperswithcode.com/method/resnet)-50 feature extractor (DPT-Hybrid). The image embedding is augmented with a positional embedding and a patch-independent readout token (red) is added. The tokens are passed through multiple [transformer](https://paperswithcode.com/method/transformer) stages. The tokens are reassembled from different stages into an image-like representation at multiple resolutions (green). Fusion modules (purple) progressively fuse and upsample the representations to generate a fine-grained prediction.""" ; skos:prefLabel "DPT" . :DQN a skos:Concept ; dcterms:source ; skos:altLabel "Deep Q-Network" ; skos:definition """A **DQN**, or Deep Q-Network, approximates the action-value function in a [Q-Learning](https://paperswithcode.com/method/q-learning) framework with a neural network. In the Atari games case, it takes in several frames of the game as input and outputs a value for each action. \r \r It is usually used in conjunction with [Experience Replay](https://paperswithcode.com/method/experience-replay), for storing the episode steps in memory for off-policy learning, where samples are drawn from the replay memory at random. Additionally, the Q-Network is usually optimized towards a frozen target network that is periodically updated with the latest weights every $k$ steps (where $k$ is a hyperparameter). The latter makes training more stable by preventing short-term oscillations from a moving target. 
The former tackles autocorrelation that would occur from on-line learning, and having a replay memory makes the problem more like a supervised learning problem.\r \r Image Source: [here](https://www.researchgate.net/publication/319643003_Autonomous_Quadrotor_Landing_using_Deep_Reinforcement_Learning)""" ; skos:prefLabel "DQN" . :DROID-SLAM a skos:Concept ; dcterms:source ; skos:definition "**DROID-SLAM** is a deep learning based SLAM system. It consists of recurrent iterative updates of camera pose and pixelwise depth through a Dense Bundle Adjustment layer. This layer leverages geometric constraints, improves accuracy and robustness, and enables a monocular system to handle stereo or RGB-D input without retraining. It builds a dense 3D map of the environment while simultaneously localizing the camera within the map." ; skos:prefLabel "DROID-SLAM" . :DRPNN a skos:Concept ; dcterms:source ; skos:altLabel "Deep Residual Pansharpening Neural Network" ; skos:definition "In the field of fusing multi-spectral and panchromatic images (Pan-sharpening), the impressive effectiveness of deep neural networks has been recently employed to overcome the drawbacks of traditional linear models and boost the fusing accuracy. However, to the best of our knowledge, existing research works are mainly based on simple and flat networks with relatively shallow architectures, which severely limits their performance. In this paper, the concept of residual learning has been introduced to form a very deep convolutional neural network to make full use of the high non-linearity of deep learning models. By both quantitative and visual assessments on a large number of high-quality multi-spectral images from various sources, it has been shown that our proposed model is superior to all mainstream algorithms included in the comparison, and achieved the highest spatial-spectral unified accuracy." ; skos:prefLabel "DRPNN" . 
:DSAMloss a skos:Concept ; dcterms:source ; skos:altLabel "Distance Shrinking with Angular Marginalizing Loss" ; skos:definition "" ; skos:prefLabel "DSAM loss" . :DSGN a skos:Concept ; dcterms:source ; skos:altLabel "Deep Stereo Geometry Network" ; skos:definition """**Deep Stereo Geometry Network** is a 3D object detection pipeline that relies on space transformation from 2D features to an effective 3D structure, called 3D geometric volume (3DGV). The whole neural network consists of four components. (a) A 2D image\r feature extractor to capture both pixel- and high-level features. (b) Constructing the plane-sweep volume and 3D geometric volume. (c) Depth estimation on the plane-sweep volume. (d) 3D object detection on the 3D geometric volume.""" ; skos:prefLabel "DSGN" . :DSPT a skos:Concept ; dcterms:source ; skos:altLabel "double-stage parameter tuning" ; skos:definition "Parameter tuning method for neural network models with adaptive activation functions." ; skos:prefLabel "DSPT" . :DSelect-k a skos:Concept ; dcterms:source ; skos:definition """**DSelect-k** is a continuously differentiable and sparse gate for Mixture-of-experts (MoE), based on a novel binary encoding formulation. Given a user-specified parameter $k$, the gate selects at most $k$ out of the $n$ experts. The gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. This explicit control over sparsity leads to a cardinality-constrained optimization problem, which is computationally challenging. To circumvent this challenge, the authors use an unconstrained reformulation that is equivalent to the original problem. The reformulated problem uses a binary encoding scheme to implicitly enforce the cardinality constraint. 
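As an illustration, the binary-encoding idea for a single selector over n = 2^m experts can be sketched in a few lines (a hedged numpy sketch using the paper's cubic smooth-step with width parameter gamma; the function names here are illustrative, not from the paper):

```python
import numpy as np

def smooth_step(t, gamma=1.0):
    # Cubic smooth-step: exactly 0 below -gamma/2, exactly 1 above gamma/2,
    # and a smooth interpolation in between (so it can saturate, unlike a sigmoid).
    inner = (-2.0 / gamma**3) * t**3 + (1.5 / gamma) * t + 0.5
    return np.where(t <= -gamma / 2, 0.0, np.where(t >= gamma / 2, 1.0, inner))

def selector_weights(z, gamma=1.0):
    # z holds m learnable logits choosing among n = 2**m experts.
    # Expert i with binary code b gets weight prod_j s_j^(b_j) * (1 - s_j)^(1 - b_j),
    # so the weights always form a probability distribution over the experts.
    m = len(z)
    s = smooth_step(np.asarray(z, dtype=float), gamma)
    weights = []
    for i in range(2 ** m):
        w = 1.0
        for j in range(m):
            w *= s[j] if (i >> j) & 1 else 1.0 - s[j]
        weights.append(w)
    return np.array(weights)
```

Once every logit leaves the interval (-gamma/2, gamma/2), each s_j is exactly 0 or 1 and the weight vector becomes one-hot, which is how the gate ends up sparse while staying differentiable during training; DSelect-k combines k such selectors to pick up to k experts.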
By carefully smoothing the binary encoding variables, the reformulated problem can be effectively optimized using first-order methods such as [SGD](https://paperswithcode.com/method/sgd).\r \r The motivation for this method is that existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods.""" ; skos:prefLabel "DSelect-k" . :DTW a skos:Concept ; rdfs:seeAlso ; skos:altLabel "Dynamic Time Warping" ; skos:definition """Dynamic Time Warping (DTW) [1] is one of the well-known distance measures between a pair of time series. The main idea of DTW is to compute the distance from the matching of similar elements between the time series. It uses the dynamic programming technique to find the optimal temporal matching between elements of two time series.\r \r For instance, similarities in walking could be detected using DTW, even if one person was walking faster than the other, or if there were accelerations and decelerations during the course of an observation. DTW has been applied to temporal sequences of video, audio, and graphics data — indeed, any data that can be turned into a linear sequence can be analyzed with DTW. A well known application has been automatic speech recognition, to cope with different speaking speeds. Other applications include speaker recognition and online signature recognition. It can also be used in partial shape matching applications.\r \r In general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) with certain restrictions and rules:\r \r 1. Every index from the first sequence must be matched with one or more indices from the other sequence, and vice versa\r 2. The first index from the first sequence must be matched with the first index from the other sequence (but it does not have to be its only match)\r 3. 
The last index from the first sequence must be matched with the last index from the other sequence (but it does not have to be its only match)\r 4. The mapping of the indices from the first sequence to indices from the other sequence must be monotonically increasing, and vice versa, i.e. if j>i are indices from the first sequence, then there must not be two indices l>k in the other sequence, such that index i is matched with index l and index j is matched with index k, and vice versa.\r \r [1] Sakoe, Hiroaki, and Seibi Chiba. "Dynamic programming algorithm optimization for spoken word recognition." IEEE transactions on acoustics, speech, and signal processing 26, no. 1 (1978): 43-49.""" ; skos:prefLabel "DTW" . :DU-GAN a skos:Concept ; dcterms:source ; skos:definition "**DU-GAN** is a [generative adversarial network](https://www.paperswithcode.com/methods/category/generative-adversarial-networks) for LDCT denoising in medical imaging. The generator produces denoised LDCT images, and two independent branches with [U-Net](https://paperswithcode.com/method/u-net) based discriminators operate in the image and gradient domains. The U-Net based discriminator provides both global structure and local per-pixel feedback to the generator. Furthermore, the image discriminator encourages the generator to produce photo-realistic CT images while the gradient discriminator is utilized to better preserve edges and alleviate streak artifacts caused by photon starvation." ; skos:prefLabel "DU-GAN" . :DV3AttentionBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**DV3 Attention Block** is an attention-based module used in the [Deep Voice 3](https://paperswithcode.com/method/deep-voice-3) architecture. It uses a [dot-product attention](https://paperswithcode.com/method/dot-product-attention) mechanism. A query vector (the hidden states of the decoder) and the per-timestep key vectors from the encoder are used to compute attention weights. 
This then outputs a context vector computed as the weighted average of the value vectors." ; skos:prefLabel "DV3 Attention Block" . :DV3ConvolutionBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**DV3 Convolution Block** is a convolutional block used for the [Deep Voice 3](https://paperswithcode.com/method/deep-voice-3) text-to-speech architecture. It consists of a 1-D [convolution](https://paperswithcode.com/method/convolution) with a gated linear unit and a [residual connection](https://paperswithcode.com/method/residual-connection). In the Figure, $c$ denotes the dimensionality of the input. The convolution output of size $2 \\cdot c$ is split into equal-sized portions: the gate vector and the input vector. A scaling factor $\\sqrt{0.5}$ is used to ensure that we preserve the input variance early in training. The gated linear unit provides a linear path for the gradient flow, which alleviates the vanishing gradient issue for stacked convolution blocks while retaining non-linearity. To introduce speaker-dependent control, a speaker-dependent embedding is added as a bias to the convolution filter output, after a softsign function. The authors use the softsign nonlinearity because it limits the range of the output while also avoiding the saturation problem that exponential-based nonlinearities sometimes exhibit. Convolution filter weights are initialized so that activations are zero-mean and unit-variance throughout the entire network." ; skos:prefLabel "DV3 Convolution Block" . :DVD-GAN a skos:Concept ; dcterms:source ; skos:definition """**DVD-GAN** is a generative adversarial network for video generation built upon the [BigGAN](https://paperswithcode.com/method/biggan) architecture.\r \r DVD-GAN uses two discriminators: a Spatial Discriminator $\\mathcal{D}\\_{S}$ and a\r Temporal Discriminator $\\mathcal{D}\\_{T}$. 
$\\mathcal{D}\\_{S}$ critiques single frame content and structure by randomly sampling $k$ full-resolution frames and judging them individually. The temporal discriminator $\\mathcal{D}\\_{T}$ must provide $G$ with the learning signal to generate movement (not evaluated by $\\mathcal{D}\\_{S}$).\r \r The input to $G$ consists of a Gaussian latent noise $z \\sim N\\left(0, I\\right)$ and a learned linear embedding $e\\left(y\\right)$ of the desired class $y$. Both inputs are 120-dimensional vectors. $G$ starts by computing an affine transformation of $\\left[z; e\\left(y\\right)\\right]$ to a $\\left[4, 4, ch\\_{0}\\right]$-shaped tensor. $\\left[z; e\\left(y\\right)\\right]$ is used as the input to all class-[conditional Batch Normalization](https://paperswithcode.com/method/conditional-batch-normalization) layers\r throughout $G$. This is then treated as the input (at each frame we would like to generate) to a Convolutional [GRU](https://paperswithcode.com/method/gru).\r \r This RNN is unrolled once per frame. The output of this RNN is processed by two residual blocks. The time dimension is combined with the batch dimension here, so each frame proceeds through the blocks independently. The output of these blocks has width and height dimensions which\r are doubled (we skip upsampling in the first block). This is repeated a number of times, with the\r output of one RNN + residual group fed as the input to the next group, until the output tensors have\r the desired spatial dimensions. \r \r The spatial discriminator $\\mathcal{D}\\_{S}$ functions almost identically to BigGAN’s discriminator. A score is calculated for each of the uniformly sampled $k$ frames (default $k = 8$) and the $\\mathcal{D}\\_{S}$ output is the sum over per-frame scores. The temporal discriminator $\\mathcal{D}\\_{T}$ has a similar architecture, but pre-processes the real or generated video with a $2 \\times 2$ average-pooling downsampling function $\\phi$. 
Furthermore, the first two residual blocks of $\\mathcal{D}\\_{T}$ are 3-D, where every [convolution](https://paperswithcode.com/method/convolution) is replaced with a 3-D convolution with a kernel size of $3 \\times 3 \\times 3$. The rest of the architecture follows BigGAN.""" ; skos:prefLabel "DVD-GAN" . :DVD-GANDBlock a skos:Concept ; dcterms:source ; skos:definition "**DVD-GAN DBlock** is a residual block for the discriminator used in the [DVD-GAN](https://paperswithcode.com/method/dvd-gan) architecture for video generation. Unlike regular [residual blocks](https://paperswithcode.com/method/residual-block), [3D convolutions](https://paperswithcode.com/method/3d-convolution) are employed due to the application to multiple frames in a video." ; skos:prefLabel "DVD-GAN DBlock" . :DVD-GANGBlock a skos:Concept ; dcterms:source ; skos:definition "**DVD-GAN GBlock** is a [residual block](https://paperswithcode.com/method/residual-block) for the generator used in the [DVD-GAN](https://paperswithcode.com/method/dvd-gan) architecture for video generation." ; skos:prefLabel "DVD-GAN GBlock" . :Darknet-19 a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Darknet-19** is a convolutional neural network that is used as the backbone of [YOLOv2](https://paperswithcode.com/method/yolov2). Similar to the [VGG](https://paperswithcode.com/method/vgg) models it mostly uses $3 \\times 3$ filters and doubles the number of channels after every pooling step. Following the work on Network in Network (NIN) it uses [global average pooling](https://paperswithcode.com/method/global-average-pooling) to make predictions as well as $1 \\times 1$ filters to compress the feature representation between $3 \\times 3$ convolutions. [Batch Normalization](https://paperswithcode.com/method/batch-normalization) is used to stabilize training, speed up convergence, and regularize the model." ; skos:prefLabel "Darknet-19" . 
:Darknet-53 a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Darknet-53** is a convolutional neural network that acts as a backbone for the [YOLOv3](https://paperswithcode.com/method/yolov3) object detection approach. The improvements upon its predecessor [Darknet-19](https://paperswithcode.com/method/darknet-19) include the use of residual connections, as well as more layers." ; skos:prefLabel "Darknet-53" . :DeBERTa a skos:Concept ; dcterms:source ; skos:definition "**DeBERTa** is a [Transformer](https://paperswithcode.com/methods/category/transformers)-based neural language model that aims to improve the [BERT](https://paperswithcode.com/method/bert) and [RoBERTa](https://paperswithcode.com/method/roberta) models with two techniques: a [disentangled attention mechanism](https://paperswithcode.com/method/disentangled-attention-mechanism) and an enhanced mask decoder. In the disentangled attention mechanism, each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. The enhanced mask decoder is used to replace the output [softmax](https://paperswithcode.com/method/softmax) layer to predict the masked tokens for model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve the model’s generalization on downstream tasks." ; skos:prefLabel "DeBERTa" . :DeCLUTR a skos:Concept ; dcterms:source ; skos:definition "**DeCLUTR** is an approach for learning universal sentence embeddings that utilizes a self-supervised objective that does not require labelled training data. The objective learns universal sentence embeddings by training an encoder to minimize the distance between the embeddings of textual segments randomly sampled from nearby in the same document." ; skos:prefLabel "DeCLUTR" . 
:DeLighT a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**DeLighT** is a [transformer](https://paperswithcode.com/method/transformer) architecture that delivers parameter efficiency improvements by (1) within each Transformer block using [DExTra](https://paperswithcode.com/method/dextra), a deep and light-weight transformation, allowing for the use of [single-headed attention](https://paperswithcode.com/method/single-headed-attention) and bottleneck FFN layers and (2) across blocks using block-wise scaling, that allows for shallower and narrower [DeLighT blocks](https://paperswithcode.com/method/delight-block) near the input and wider and deeper DeLighT blocks near the output." ; skos:prefLabel "DeLighT" . :DeLighTBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "A **DeLighT Block** is a block used in the [DeLighT](https://paperswithcode.com/method/delight) [transformer](https://paperswithcode.com/method/transformer) architecture. It uses a [DExTra](https://paperswithcode.com/method/dextra) transformation to reduce the dimensionality of the vectors entered into the attention layer, where a [single-headed attention](https://paperswithcode.com/method/single-headed-attention) module is used. Since the DeLighT block learns wider representations of the input across different layers using DExTra, it enables the authors to replace [multi-head attention](https://paperswithcode.com/method/multi-head-attention) with single-head attention. This is then followed by a light-weight FFN which, rather than expanding the dimension (as in normal Transformers which widen to a dimension 4x the size), imposes a bottleneck and squeezes the dimensions. Again, the reason for this is that the DExTra transformation has already incorporated wider representations so we can squeeze instead at this layer." ; skos:prefLabel "DeLighT Block" . 
:DeactivableSkipConnection a skos:Concept ; dcterms:source ; skos:definition """A **Deactivable Skip Connection** is a type of skip connection which, instead of concatenating the encoder features\r (red) and decoder features (blue), as with [standard skip connections](https://paperswithcode.com/methods/category/skip-connections), it instead fuses the encoder features with part of the decoder features (light blue), to be able to deactivate this operation when needed.""" ; skos:prefLabel "Deactivable Skip Connection" . :DecorrelatedBatchNormalization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Decorrelated Batch Normalization (DBN)** \r is a normalization technique which not just centers and scales activations but whitens them. ZCA whitening instead of [PCA](https://paperswithcode.com/method/pca) whitening is employed since PCA whitening causes a problem called *stochastic axis swapping*, which is detrimental to learning.""" ; skos:prefLabel "Decorrelated Batch Normalization" . :DeeBERT a skos:Concept ; dcterms:source ; skos:definition "**DeeBERT** is a method for accelerating [BERT](https://paperswithcode.com/method/bert) inference. It inserts extra classification layers (which are referred to as off-ramps) between each [transformer](https://paperswithcode.com/method/transformer) layer of BERT. All transformer layers and off-ramps are jointly fine-tuned on a given downstream dataset. At inference time, after a sample goes through a transformer layer, it is passed to the following off-ramp. If the off-ramp is confident of the prediction, the result is returned; otherwise, the sample is sent to the next transformer layer." ; skos:prefLabel "DeeBERT" . :Deep-CAPTCHA a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "Deep-CAPTCHA" . 
:Deep-MAC a skos:Concept ; dcterms:source ; skos:definition """**Deep-MAC**, or **Deep Mask-heads Above CenterNet**, is a type of anchor-free instance segmentation model based on [CenterNet](https://paperswithcode.com/method/centernet). The motivation for this new architecture is that boxes are much cheaper to annotate than masks, so the authors address the “partially supervised” instance segmentation problem, where all classes have bounding box annotations but only a subset of classes have mask annotations. \r \r For predicting bounding boxes, CenterNet outputs 3 tensors: (1) a class-specific [heatmap](https://paperswithcode.com/method/heatmap) which indicates the probability of the center of a bounding box being present at each location, (2) a class-agnostic 2-channel tensor indicating the height and width of the bounding box at each center pixel, and (3) since the output feature map is typically smaller than the image (stride 4 or 8), CenterNet also predicts an x and y direction offset to recover this discretization error at each center pixel.\r \r For Deep-MAC, in parallel to the box-related prediction heads, we add a fourth pixel embedding branch $P$. For each bounding box\r $b$, we crop a region $P\\_{b}$ from $P$ corresponding to $b$ via [ROIAlign](https://paperswithcode.com/method/roi-align) which results in a 32 × 32 tensor. We then feed each $P\\_{b}$ to a mask-head. The final prediction at the end is a class-agnostic 32 × 32 tensor which we pass through a sigmoid to get per-pixel probabilities. We train this mask-head via a per-pixel cross-entropy loss averaged over all pixels and instances. During post-processing, the predicted mask is re-aligned according to the predicted box and resized to the resolution of the image. \r \r In addition to this 32 × 32 cropped feature map, we add two inputs for improved stability of some mask-heads: (1) Instance embedding: an additional head is added to the backbone that predicts a per-pixel embedding. 
For each bounding box $b$ we extract its embedding from the center pixel. This embedding is tiled to a size of 32 × 32 and concatenated to the pixel embedding crop. This helps condition the mask-head on a particular instance and disambiguate it from others. (2) Coordinate Embedding: Inspired by [CoordConv](https://paperswithcode.com/method/coordconv), the authors add a 32 × 32 × 2 tensor holding normalized $\\left(x, y\\right)$ coordinates relative to the bounding box $b$.""" ; skos:prefLabel "Deep-MAC" . :DeepBeliefNetwork a skos:Concept ; skos:definition """A **Deep Belief Network (DBN)** is a multi-layer generative graphical model. DBNs have bi-directional connections ([RBM](https://paperswithcode.com/method/restricted-boltzmann-machine)-type connections) on the top layer while the bottom layers only have top-down connections. They are trained using layerwise pre-training. Pre-training occurs by training the network component by component bottom up: treating the first two layers as an RBM and training, then treating the second layer and third layer as another RBM and training for those parameters.\r \r Source: [Origins of Deep Learning](https://arxiv.org/pdf/1702.07800.pdf)\r \r Image Source: [Wikipedia](https://en.wikipedia.org/wiki/Deep_belief_network)""" ; skos:prefLabel "Deep Belief Network" . :DeepBoltzmannMachine a skos:Concept ; skos:definition """A **Deep Boltzmann Machine (DBM)** is a three-layer generative model. It is similar to a [Deep Belief Network](https://paperswithcode.com/method/deep-belief-network), but instead allows bidirectional connections in the bottom layers. 
Its energy function is an extension of the energy function of the RBM:\r \r $$ E\\left(v, h\\right) = -\\sum\\_{i}v\\_{i}b\\_{i} - \\sum^{N}\\_{n=1}\\sum\\_{k}h\\_{n,k}b\\_{n,k}-\\sum\\_{i, k}v\\_{i}w\\_{ik}h\\_{k} - \\sum^{N-1}\\_{n=1}\\sum\\_{k,l}h\\_{n,k}w\\_{n, k, l}h\\_{n+1, l}$$\r \r for a DBM with $N$ hidden layers.\r \r Source: [On the Origin of Deep Learning](https://arxiv.org/pdf/1702.07800.pdf)""" ; skos:prefLabel "Deep Boltzmann Machine" . :DeepCluster a skos:Concept ; dcterms:source ; skos:definition """**DeepCluster** is a self-supervision approach for learning image representations. DeepCluster iteratively groups the features with a standard clustering algorithm, k-means, and uses the subsequent assignments as supervision to update\r the weights of the network.""" ; skos:prefLabel "DeepCluster" . :DeepDrug a skos:Concept ; dcterms:source ; skos:definition "**DeepDrug** is a deep learning framework that uses graph convolutional networks to learn graphical representations of drugs and proteins, such as molecular fingerprints and residual structures, in order to boost prediction accuracy." ; skos:prefLabel "DeepDrug" . :DeepEnsembles a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "Deep Ensembles" . :DeepIR a skos:Concept ; dcterms:source ; skos:definition "**DeepIR**, or **Deep InfraRed image processing**, is a thermal image processing framework for recovering high quality images from a very small set of images captured with camera motion. Enhancement is achieved by noting that camera motion, which is usually a hindrance, can be exploited to separate a sequence of images into the scene-dependent radiant flux, and a slowly changing scene-independent non-uniformity. DeepIR combines the physics of microbolometer sensors with the powerful regularization capabilities of neural network-based representations. 
DeepIR relies on the key observation that jittering a camera, while unwanted in the visible domain, is highly desirable in the thermal domain as it allows an accurate separation of the sensor-specific non-uniformities from the scene’s radiant flux." ; skos:prefLabel "DeepIR" . :DeepLSTMReader a skos:Concept ; dcterms:source ; skos:definition """The **Deep LSTM Reader** is a neural network for reading comprehension. We feed documents one word at a time into a Deep [LSTM](https://paperswithcode.com/method/lstm) encoder; after a delimiter we then also feed the query into the encoder. The model therefore processes each document query pair as a single long sequence. Given the embedded document and query the network predicts which token in the document answers the query.\r \r The model consists of a Deep LSTM cell with skip connections from each input $x\\left(t\\right)$ to every hidden layer, and from every hidden layer to the output $y\\left(t\\right)$:\r \r $$x'\\left(t, k\\right) = x\\left(t\\right)||y'\\left(t, k - 1\\right) \\text{, } y\\left(t\\right) = y'\\left(t, 1\\right)|| \\dots ||y'\\left(t, K\\right) $$\r \r $$ i\\left(t, k\\right) = \\sigma\\left(W\\_{kxi}x'\\left(t, k\\right) + W\\_{khi}h\\left(t - 1, k\\right) + W\\_{kci}c\\left(t - 1, k\\right) + b\\_{ki}\\right) $$\r \r $$ f\\left(t, k\\right) = \\sigma\\left(W\\_{kxf}x\\left(t\\right) + W\\_{khf}h\\left(t - 1, k\\right) + W\\_{kcf}c\\left(t - 1, k\\right) + b\\_{kf}\\right) $$\r \r $$ c\\left(t, k\\right) = f\\left(t, k\\right)c\\left(t - 1, k\\right) + i\\left(t, k\\right)\\text{tanh}\\left(W\\_{kxc}x'\\left(t, k\\right) + W\\_{khc}h\\left(t - 1, k\\right) + b\\_{kc}\\right) $$\r \r $$ o\\left(t, k\\right) = \\sigma\\left(W\\_{kxo}x'\\left(t, k\\right) + W\\_{kho}h\\left(t - 1, k\\right) + W\\_{kco}c\\left(t, k\\right) + b\\_{ko}\\right) $$\r \r $$ h\\left(t, k\\right) = o\\left(t, k\\right)\\text{tanh}\\left(c\\left(t, k\\right)\\right) $$\r \r $$ y'\\left(t, k\\right) = W\\_{ky}h\\left(t, k\\right) + b\\_{ky} $$\r \r where || indicates 
vector concatenation, $h\\left(t, k\\right)$ is the hidden state for layer $k$ at time $t$, and $i$, $f$, $o$ are the input, forget, and output gates respectively. Thus our Deep LSTM Reader is defined by $g^{\\text{LSTM}}\\left(d, q\\right) = y\\left(|d|+|q|\\right)$ with input $x\\left(t\\right)$ the concatenation of $d$ and $q$ separated by the delimiter |||.""" ; skos:prefLabel "Deep LSTM Reader" . :DeepLab a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**DeepLab** is a semantic segmentation architecture. First, the input image goes through the network with the use of dilated convolutions. Then the output from the network is bilinearly interpolated and goes through the fully connected [CRF](https://paperswithcode.com/method/crf) to fine-tune the result and obtain the final predictions." ; skos:prefLabel "DeepLab" . :DeepLabv2 a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**DeepLabv2** is an architecture for semantic segmentation that builds on [DeepLab](https://paperswithcode.com/method/deeplab) with an atrous [spatial pyramid pooling](https://paperswithcode.com/method/spatial-pyramid-pooling) scheme. Here we have parallel dilated convolutions with different rates applied in the input feature map, which are then fused together. As objects of the same class can have different sizes in the image, [ASPP](https://paperswithcode.com/method/aspp) helps to account for different object sizes." ; skos:prefLabel "DeepLabv2" . :DeepLabv3 a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**DeepLabv3** is a semantic segmentation architecture that improves upon [DeepLabv2](https://paperswithcode.com/method/deeplabv2) with several modifications. To handle the problem of segmenting objects at multiple scales, modules are designed which employ atrous [convolution](https://paperswithcode.com/method/convolution) in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. 
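The effect of applying the same small kernel at multiple atrous rates can be illustrated with a minimal 1-D sketch (plain numpy for illustration, not the DeepLabv3 implementation; the name dilated_conv1d is made up here):

```python
import numpy as np

def dilated_conv1d(x, w, rate):
    # Atrous (dilated) convolution: kernel taps are spaced `rate` samples apart,
    # so the receptive field grows with the rate while the parameter count stays fixed.
    k = len(w)
    span = (k - 1) * rate + 1  # effective receptive field of the dilated kernel
    out = np.zeros(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(w[j] * x[i + j * rate] for j in range(k))
    return out

signal = np.arange(10.0)
kernel = np.array([1.0, 1.0, 1.0])
print(dilated_conv1d(signal, kernel, rate=1))  # ordinary convolution, width-3 windows
print(dilated_conv1d(signal, kernel, rate=3))  # same 3 taps now cover a width-7 window
```

An ASPP-style module simply runs several such convolutions with different rates in parallel on the same feature map and fuses their outputs.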
Furthermore, the Atrous [Spatial Pyramid Pooling](https://paperswithcode.com/method/spatial-pyramid-pooling) module from DeepLabv2 is augmented with image-level features encoding global context to further boost performance. \r \r The changes to the ASPP module are that the authors apply [global average pooling](https://paperswithcode.com/method/global-average-pooling) on the last feature map of the model, feed the resulting image-level features to a 1 × 1 convolution with 256 filters (and [batch normalization](https://paperswithcode.com/method/batch-normalization)), and then bilinearly upsample the feature to the desired spatial dimension. In the\r end, the improved [ASPP](https://paperswithcode.com/method/aspp) consists of (a) one 1×1 convolution and three 3 × 3 convolutions with rates = (6, 12, 18) when output stride = 16 (all with 256 filters and batch normalization), and (b) the image-level features.\r \r Another interesting difference is that DenseCRF post-processing from DeepLabv2 is no longer needed.""" ; skos:prefLabel "DeepLabv3" . :DeepMask a skos:Concept ; dcterms:source ; skos:definition """**DeepMask** is an object proposal algorithm based on a convolutional neural network. Given an input image patch, DeepMask generates a class-agnostic mask and an associated score which estimates the likelihood of the patch fully containing a centered object (without any notion of an object category). The core of the model is a ConvNet which jointly predicts the mask and the object score. A large part of the network is shared between those two tasks: only the last few network\r layers are specialized for separately outputting a mask and score prediction.""" ; skos:prefLabel "DeepMask" . :DeepSIM a skos:Concept ; dcterms:source ; skos:definition "**DeepSIM** is a generative model for conditional image manipulation based on a single image. The network learns to map from a primitive representation of the image to the image itself. 
At manipulation time, the generator allows for making complex image changes by modifying the primitive input representation and mapping it through the network. The choice of primitive representation has an impact on the ease and expressiveness of the manipulations and can be automatic (e.g. edges), manual, or hybrid, such as edges on top of segmentations." ; skos:prefLabel "DeepSIM" . :DeepViT a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**DeepViT** is a type of [vision transformer](https://paperswithcode.com/method/vision-transformer) that replaces the self-attention layer within the [transformer](https://paperswithcode.com/method/transformer) block with a [Re-attention module](https://paperswithcode.com/method/re-attention-module) to address the issue of attention collapse and enables training deeper ViTs." ; skos:prefLabel "DeepViT" . :DeepVoice3 a skos:Concept ; dcterms:source ; skos:definition """**Deep Voice 3 (DV3)** is a fully-convolutional attention-based neural text-to-speech system. The Deep Voice 3 architecture consists of three components:\r \r - Encoder: A fully-convolutional encoder, which converts textual features to an internal\r learned representation.\r \r - Decoder: A fully-convolutional causal decoder, which decodes the learned representation\r with a multi-hop convolutional attention mechanism into a low-dimensional audio representation (mel-scale spectrograms) in an autoregressive manner.\r \r - Converter: A fully-convolutional post-processing network, which predicts final vocoder\r parameters (depending on the vocoder choice) from the decoder hidden states. Unlike the\r decoder, the converter is non-causal and can thus depend on future context information.\r \r The overall objective function to be optimized is a linear combination of the losses from the decoder and the converter. The authors separate decoder and converter and apply multi-task training, because it makes attention learning easier in practice. 
To be specific, the loss for mel-spectrogram prediction guides training of the attention mechanism, because the attention is trained with the gradients from mel-spectrogram prediction besides vocoder parameter prediction.""" ; skos:prefLabel "Deep Voice 3" . :DeepWalk a skos:Concept ; dcterms:source ; skos:definition """**DeepWalk** learns embeddings (social representations) of a graph's vertices, by modeling a stream of short random walks. Social representations are latent features of the vertices that capture neighborhood similarity and community membership. These latent representations encode social relations in a continuous vector space with a relatively small number of dimensions. It generalizes neural language models to process a special language composed of a set of randomly-generated walks. \r \r The goal is to learn a latent representation, not only a probability distribution of node co-occurrences, and so a mapping function $\\Phi \\colon v \\in V \\mapsto \\mathbb{R}^{|V|\\times d}$ is introduced.\r This mapping $\\Phi$ represents the latent social representation associated with each vertex $v$ in the graph. In practice, $\\Phi$ is represented by a $|V| \\times d$ matrix of free parameters.""" ; skos:prefLabel "DeepWalk" . :Deflation a skos:Concept ; dcterms:source ; skos:definition "**Deflation** is a video-to-image operation to transform a video network into a network that can ingest a single image. In the two types of video networks considered in the original paper, this deflation corresponds to the following operations: for [3D convolutional based networks](https://paperswithcode.com/method/3d-convolution), summing the 3D spatio-temporal filters over the temporal dimension to obtain 2D filters; for TSM networks, turning off the channel shifting, which results in a standard [residual architecture](https://paperswithcode.com/method/resnet) (ResNet50) for images." ; skos:prefLabel "Deflation" . 
:DeformableAttentionModule a skos:Concept ; dcterms:source ; skos:definition """**Deformable Attention Module** is an attention module used in the [Deformable DETR](https://paperswithcode.com/method/deformable-detr) architecture, which seeks to overcome one issue with base [Transformer attention](https://paperswithcode.com/method/scaled), namely that it looks over all possible spatial locations. Inspired by [deformable convolution](https://paperswithcode.com/method/deformable-convolution), the deformable attention module only attends to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps. By assigning only a small fixed number of keys for each query, the issues of convergence and feature spatial resolution can be mitigated.\r \r Given an input feature map $x \\in \\mathbb{R}^{C \\times H \\times W}$, let $q$ index a query element with content feature $\\mathbf{z}\\_{q}$ and a 2-d reference point $\\mathbf{p}\\_{q}$; the deformable attention feature is calculated by:\r \r $$ \\text{DeformAttn}\\left(\\mathbf{z}\\_{q}, \\mathbf{p}\\_{q}, \\mathbf{x}\\right)=\\sum\\_{m=1}^{M} \\mathbf{W}\\_{m}\\left[\\sum\\_{k=1}^{K} A\\_{m q k} \\cdot \\mathbf{W}\\_{m}^{\\prime} \\mathbf{x}\\left(\\mathbf{p}\\_{q}+\\Delta \\mathbf{p}\\_{m q k}\\right)\\right]\r $$\r \r where $m$ indexes the attention head, $k$ indexes the sampled keys, and $K$ is the total sampled key number ($K \\ll H W$). $\\Delta p_{m q k}$ and $A_{m q k}$ denote the sampling offset and attention weight of the $k^{\\text {th }}$ sampling point in the $m^{\\text {th }}$ attention head, respectively. The scalar attention weight $A_{m q k}$ lies in the range $[0,1]$, normalized by $\\sum_{k=1}^{K} A_{m q k}=1$. $\\Delta \\mathbf{p}_{m q k} \\in \\mathbb{R}^{2}$ are 2-d real numbers with unconstrained range. As $p\\_{q}+\\Delta p\\_{m q k}$ is fractional, bilinear interpolation is applied as in Dai et al. 
(2017) in computing $\\mathbf{x}\\left(\\mathbf{p}\\_{q}+\\Delta \\mathbf{p}\\_{m q k}\\right)$. Both $\\Delta \\mathbf{p}\\_{m q k}$ and $A\\_{m q k}$ are obtained via linear projection over the query feature $z\\_{q}$. In implementation, the query feature $z\\_{q}$ is fed to a linear projection operator of $3 M K$ channels, where the first $2 M K$ channels encode the sampling offsets $\\Delta p\\_{m q k}$, and the remaining $M K$ channels are fed to a softmax operator to obtain the attention weights $A\\_{m q k}$.""" ; skos:prefLabel "Deformable Attention Module" . :DeformableConvNets a skos:Concept ; dcterms:source ; skos:altLabel "Deformable Convolutional Networks" ; skos:definition """Deformable ConvNets do not learn an affine transformation. They divide convolution into two steps: first sampling features on a regular grid $ \\mathcal{R} $ from the input feature map, then aggregating the sampled features by weighted summation using a convolution kernel. The process can be written as:\r \\begin{align}\r Y(p_{0}) &= \\sum_{p_i \\in \\mathcal{R}} w(p_{i}) X(p_{0} + p_{i})\r \\end{align}\r \\begin{align}\r \\mathcal{R} &= \\{(-1,-1), (-1, 0), \\dots, (1, 1)\\}\r \\end{align}\r The deformable convolution augments the sampling process by introducing a group of learnable offsets $\\Delta p_{i}$ which can be generated by a lightweight CNN. Using the offsets $\\Delta p_{i}$, the deformable convolution can be formulated as:\r \\begin{align}\r Y(p_{0}) &= \\sum_{p_i \\in \\mathcal{R}} w(p_{i}) X(p_{0} + p_{i} + \\Delta p_{i}). \r \\end{align}\r Through the above method, adaptive sampling is achieved.\r However, $\\Delta p_{i}$ is a floating point value unsuited to grid sampling. \r To address this problem, bilinear interpolation is used. Deformable RoI pooling is also used, which greatly improves object detection. 
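As a rough NumPy sketch (illustrative only, not the reference implementation; the function name and clipping behaviour are assumptions), the deformable sampling in the equations above can be written for a single output position of a 3x3 kernel on a single-channel map:

```python
import numpy as np

def deform_conv_at(X, w, p0, offsets):
    # One output position of a 3x3 deformable convolution on a
    # single-channel map X: Y(p0) = sum_i w(p_i) X(p0 + p_i + dp_i),
    # using bilinear interpolation for the fractional sample locations.
    H, W = X.shape
    out = 0.0
    for i, di in enumerate((-1, 0, 1)):
        for j, dj in enumerate((-1, 0, 1)):
            # regular grid tap plus its learned offset, clipped to the map
            py = float(np.clip(p0[0] + di + offsets[i, j, 0], 0, H - 1))
            px = float(np.clip(p0[1] + dj + offsets[i, j, 1], 0, W - 1))
            y0, x0 = int(np.floor(py)), int(np.floor(px))
            y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
            wy, wx = py - y0, px - x0
            val = ((1 - wy) * (1 - wx) * X[y0, x0]
                   + (1 - wy) * wx * X[y0, x1]
                   + wy * (1 - wx) * X[y1, x0]
                   + wy * wx * X[y1, x1])
            out += w[i, j] * val
    return out
```

With all offsets set to zero this reduces exactly to the regular grid convolution $Y(p_{0})$ above; in a real network the offsets would be predicted per position by a lightweight CNN.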
\r \r Deformable ConvNets adaptively select the important regions and enlarge the valid receptive field of convolutional neural networks; this is important in object detection and semantic segmentation tasks.""" ; skos:prefLabel "Deformable ConvNets" . :DeformableConvolution a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Deformable convolutions** add 2D offsets to the regular grid sampling locations in the standard [convolution](https://paperswithcode.com/method/convolution). They enable free-form deformation of the sampling grid. The offsets are learned from the preceding feature maps, via additional convolutional layers. Thus, the deformation is conditioned on the input features in a local, dense, and adaptive manner." ; skos:prefLabel "Deformable Convolution" . :DeformableDETR a skos:Concept ; dcterms:source ; skos:definition """**Deformable DETR** is an object detection method that aims to mitigate the slow convergence and high complexity issues of [DETR](https://www.paperswithcode.com/method/detr). It combines the best of the sparse spatial sampling of [deformable convolution](https://paperswithcode.com/method/deformable-convolution) and the relation modeling capability of [Transformers](https://paperswithcode.com/methods/category/transformers). Specifically, it introduces a deformable attention module, which attends to a small set of sampling locations as a pre-filter for prominent key elements out of all the feature map pixels. The module can be naturally extended to aggregating multi-scale features, without the help of [FPN](https://paperswithcode.com/method/fpn).""" ; skos:prefLabel "Deformable DETR" . :DeformableKernel a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **Deformable Kernel** (DK) is a type of convolutional operator for deformation modeling. DKs learn free-form offsets on kernel coordinates to deform the original kernel space towards a specific data modality, rather than recomposing data. 
This can directly adapt the effective receptive field (ERF) while leaving the receptive field untouched. They can be used as a drop-in replacement for rigid kernels. \r \r As shown in the Figure, for each input patch, a local DK first generates a group of kernel offsets $\\{\\Delta \\mathcal{k}\\}$ from the input feature patch using the light-weight generator $\\mathcal{G}$ (a 3$\\times$3 [convolution](https://paperswithcode.com/method/convolution) with a rigid kernel). Given the original kernel weights $\\mathcal{W}$ and the offset group $\\{\\Delta \\mathcal{k}\\}$, DK samples a new set of kernel weights $\\mathcal{W}'$ using a bilinear sampler $\\mathcal{B}$. Finally, DK convolves the input feature map and the sampled kernels to complete the whole computation.""" ; skos:prefLabel "Deformable Kernel" . :DeformablePosition-SensitiveRoIPooling a skos:Concept ; dcterms:source ; skos:definition "**Deformable Position-Sensitive RoI Pooling** is similar to PS RoI Pooling but it adds an offset to each bin position in the regular bin partition. Offset learning follows the “fully convolutional” spirit. In the top branch, a convolutional layer generates the full spatial resolution offset fields. For each RoI (also for each class), PS RoI pooling is applied on such fields to obtain normalized offsets, which are then transformed to the real offsets, in the same way as in deformable RoI pooling." ; skos:prefLabel "Deformable Position-Sensitive RoI Pooling" . :DeformableRoIPooling a skos:Concept ; dcterms:source ; skos:definition "**Deformable RoI Pooling** adds an offset to each bin position in the regular bin partition of the RoI Pooling. Similarly, the offsets are learned from the preceding feature maps and the RoIs, enabling adaptive part localization for objects with different shapes." ; skos:prefLabel "Deformable RoI Pooling" . 
:DeiT a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Data-efficient Image Transformer" ; skos:definition "A **Data-Efficient Image Transformer** is a type of [Vision Transformer](https://paperswithcode.com/method/vision-transformer) for image classification tasks. The model is trained using a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention." ; skos:prefLabel "DeiT" . :DeltaConv a skos:Concept ; dcterms:source ; skos:definition "Anisotropic convolution is a central building block of CNNs but challenging to transfer to surfaces. DeltaConv learns combinations and compositions of operators from vector calculus, which are a natural fit for curved surfaces. The result is a simple and robust anisotropic convolution operator for point clouds with state-of-the-art results." ; skos:prefLabel "DeltaConv" . :Demon a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Decaying Momentum**, or **Demon**, is a stochastic optimizer motivated by decaying the total contribution of a gradient to all future updates; it achieves this by decaying the momentum parameter. A particular gradient term $g\\_{t}$ contributes a total of $\\eta\\sum\\_{i}\\beta^{i}$ of its "energy" to all future gradient updates, and this results in the geometric sum, $\\sum^{\\infty}\\_{i=1}\\beta^{i} = \\beta\\sum^{\\infty}\\_{i=0}\\beta^{i} = \\frac{\\beta}{\\left(1-\\beta\\right)}$. Decaying this sum results in the Demon algorithm. Let $\\beta\\_{init}$ be the initial $\\beta$; then at the current step $t$ of $T$ total steps, the decay routine is given by solving the equation below for $\\beta\\_{t}$:\r \r $$ \\frac{\\beta\\_{t}}{\\left(1-\\beta\\_{t}\\right)} = \\left(1-t/T\\right)\\beta\\_{init}/\\left(1-\\beta\\_{init}\\right)$$\r \r where $\\left(1-t/T\\right)$ is the proportion of iterations remaining. 
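Solving that equation for $\beta\_{t}$ gives a closed form, which can be sketched in a few lines of Python (a minimal sketch; the function name is illustrative):

```python
def demon_beta(t, T, beta_init=0.9):
    # Closed-form solution of
    #   beta_t / (1 - beta_t) = (1 - t/T) * beta_init / (1 - beta_init),
    # which decays beta from beta_init at step 0 down to 0 at step T.
    frac = 1.0 - t / T
    return beta_init * frac / ((1.0 - beta_init) + beta_init * frac)
```

The returned $\beta\_{t}$ can then be plugged into any momentum-based update in place of a fixed momentum coefficient.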
Note that Demon typically requires no hyperparameter tuning as it is usually decayed to $0$ or a small negative value at time $T$. Improved performance is observed by delaying the start of the decay. Demon can be applied to any gradient descent algorithm with a momentum parameter.""" ; skos:prefLabel "Demon" . :DemonADAM a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Demon Adam** is a stochastic optimizer where the [Demon](https://paperswithcode.com/method/demon) momentum rule is applied to the [Adam](https://paperswithcode.com/method/adam) optimizer.\r \r $$ \\beta\\_{t} = \\beta\\_{init}\\cdot\\frac{\\left(1-\\frac{t}{T}\\right)}{\\left(1-\\beta\\_{init}\\right) + \\beta\\_{init}\\left(1-\\frac{t}{T}\\right)} $$\r \r $$ m\\_{t, i} = g\\_{t, i} + \\beta\\_{t}m\\_{t-1, i} $$\r \r $$ v\\_{t+1} = \\beta\\_{2}v\\_{t} + \\left(1-\\beta\\_{2}\\right)g^{2}\\_{t} $$\r \r $$ \\theta_{t} = \\theta_{t-1} - \\eta\\frac{\\hat{m}\\_{t}}{\\sqrt{\\hat{v}\\_{t}} + \\epsilon} $$""" ; skos:prefLabel "Demon ADAM" . :DemonCM a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Demon CM**, or **SGD with Momentum and Demon**, is the [Demon](https://paperswithcode.com/method/demon) momentum rule applied to [SGD with momentum](https://paperswithcode.com/method/sgd-with-momentum).\r \r $$ \\beta\\_{t} = \\beta\\_{init}\\cdot\\frac{\\left(1-\\frac{t}{T}\\right)}{\\left(1-\\beta\\_{init}\\right) + \\beta\\_{init}\\left(1-\\frac{t}{T}\\right)} $$\r \r $$ \\theta\\_{t+1} = \\theta\\_{t} - \\eta{g}\\_{t} + \\beta\\_{t}v\\_{t} $$\r \r $$ v\\_{t+1} = \\beta\\_{t}{v\\_{t}} - \\eta{g\\_{t}} $$""" ; skos:prefLabel "Demon CM" . :DenoisedSmoothing a skos:Concept ; dcterms:source ; skos:definition "**Denoised Smoothing** is a method for obtaining a provably robust classifier from a fixed pretrained one, without any additional training or fine-tuning of the latter. 
The basic idea is to prepend a custom-trained denoiser before the pretrained classifier, and then apply randomized smoothing. Randomized smoothing is a certified defense that converts any given classifier $f$ into a new smoothed classifier $g$ that is characterized by a non-linear Lipschitz property. When queried at a point $x$, the smoothed classifier $g$ outputs the class that is most likely to be returned by $f$ under isotropic Gaussian perturbations of its inputs. Unfortunately, randomized smoothing requires that the underlying classifier $f$ is robust to relatively large random Gaussian perturbations of the input, which is not the case for off-the-shelf pretrained models. By applying our custom-trained denoiser to the classifier $f$, we can effectively make $f$ robust to such Gaussian perturbations, thereby making it “suitable” for randomized smoothing." ; skos:prefLabel "Denoised Smoothing" . :DenoisingAutoencoder a skos:Concept ; skos:definition """A **Denoising Autoencoder** is a modification of the [autoencoder](https://paperswithcode.com/method/autoencoder) to prevent the network from learning the identity function. Specifically, if the autoencoder is too big, then it can just learn the data, so the output equals the input, and does not perform any useful representation learning or dimensionality reduction. Denoising autoencoders solve this problem by corrupting the input data on purpose, adding noise or masking some of the input values.\r \r Image Credit: [Kumar et al](https://www.semanticscholar.org/paper/Static-hand-gesture-recognition-using-stacked-Kumar-Nandi/5191ddf3f0841c89ba9ee592a2f6c33e4a40d4bf)""" ; skos:prefLabel "Denoising Autoencoder" . :DenoisingScoreMatching a skos:Concept ; dcterms:source ; skos:definition "Training a denoiser on a class of signals yields a powerful prior over those signals, which can then be used to sample new examples of the signal class." ; skos:prefLabel "Denoising Score Matching" . 
:DenseBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "A **Dense Block** is a module used in convolutional neural networks that connects *all layers* (with matching feature-map sizes) directly with each other. It was originally proposed as part of the [DenseNet](https://paperswithcode.com/method/densenet) architecture. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature-maps to all subsequent layers. In contrast to [ResNets](https://paperswithcode.com/method/resnet), we never combine features through summation before they are passed into a layer; instead, we combine features by concatenating them. Hence, the $\\ell^{th}$ layer has $\\ell$ inputs, consisting of the feature-maps of all preceding convolutional blocks. Its own feature-maps are passed on to all $L-\\ell$ subsequent layers. This introduces $\\frac{L(L+1)}{2}$ connections in an $L$-layer network, instead of just $L$, as in traditional architectures: \"dense connectivity\"." ; skos:prefLabel "Dense Block" . :DenseConnections a skos:Concept ; skos:definition """**Dense Connections**, or **Fully Connected Connections**, are a type of layer in a deep neural network that use a linear operation where every input is connected to every output by a weight. This means there are $n\\_{\\text{inputs}}*n\\_{\\text{outputs}}$ parameters, which can lead to a lot of parameters for a sizeable network.\r \r $$h\\_{l} = g\\left(\\textbf{W}^{T}h\\_{l-1}\\right)$$\r \r where $g$ is an activation function.\r \r Image Source: Deep Learning by Goodfellow, Bengio and Courville""" ; skos:prefLabel "Dense Connections" . :DenseContrastiveLearning a skos:Concept ; dcterms:source ; skos:definition "**Dense Contrastive Learning** is a self-supervised learning method for dense prediction tasks. 
It implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. In contrast to the regular contrastive loss, which is computed between the single feature vectors output by the global projection head (at the level of global features), the dense contrastive loss is computed between the dense feature vectors output by the dense projection head (at the level of local features)." ; skos:prefLabel "Dense Contrastive Learning" . :DenseNAS a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**DenseNAS** is a [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method that utilises a densely connected search space. The search space is represented as a dense super network, which is built upon designed routing blocks. In the super network, routing blocks are densely connected and we search for the best path between them to derive the final architecture. A chained cost estimation algorithm is used to approximate the model cost during the search." ; skos:prefLabel "DenseNAS" . :DenseNAS-A a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**DenseNAS-A** is a mobile convolutional neural network discovered through the [DenseNAS](https://paperswithcode.com/method/densenas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method. The basic building block is the MBConv, or inverted bottleneck residual, from the MobileNet architectures." ; skos:prefLabel "DenseNAS-A" . :DenseNAS-B a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**DenseNAS-B** is a mobile convolutional neural network discovered through the [DenseNAS](https://paperswithcode.com/method/densenas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method. 
The basic building block is the MBConv, or inverted bottleneck residual, from the [MobileNet](https://paperswithcode.com/method/mobilenetv2) architectures." ; skos:prefLabel "DenseNAS-B" . :DenseNAS-C a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**DenseNAS-C** is a mobile convolutional neural network discovered through the [DenseNAS](https://paperswithcode.com/method/densenas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method. The basic building block is the MBConv, or inverted bottleneck residual, from the [MobileNet](https://paperswithcode.com/method/mobilenetv2) architectures." ; skos:prefLabel "DenseNAS-C" . :DenseNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "A **DenseNet** is a type of convolutional neural network that utilises [dense connections](https://paperswithcode.com/method/dense-connections) between layers, through [Dense Blocks](http://www.paperswithcode.com/method/dense-block), where we connect *all layers* (with matching feature-map sizes) directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature-maps to all subsequent layers." ; skos:prefLabel "DenseNet" . :DenseNet-Elastic a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**DenseNet-Elastic** is a convolutional neural network that is a modification of a [DenseNet](https://paperswithcode.com/method/densenet) with elastic blocks (extra upsampling and downsampling)." ; skos:prefLabel "DenseNet-Elastic" . 
:DenseSynthesizedAttention a skos:Concept ; dcterms:source ; skos:definition """**Dense Synthesized Attention**, introduced with the [Synthesizer](https://paperswithcode.com/method/synthesizer) architecture, is a type of synthetic attention mechanism that replaces the notion of [query-key-values](https://paperswithcode.com/method/scaled) in the self-attention module and directly synthesizes the alignment matrix instead. Dense attention is conditioned on each input token. The method accepts an input $X \\in \\mathbb{R}^{l\\text{ x }d}$ and produces an output of $Y \\in \\mathbb{R}^{l\\text{ x }d}$. Here $l$ refers to the sequence length and $d$ refers to the dimensionality of the model. We first adopt $F\\left(.\\right)$, a parameterized function, for projecting input $X\\_{i}$ from $d$ dimensions to $l$ dimensions.\r \r $$B\\_{i} = F\\left(X\\_{i}\\right)$$\r \r where $F\\left(.\\right)$ is a parameterized function that maps $\\mathbb{R}^{d}$ to $\\mathbb{R}^{l}$ and $i$ is the $i$-th token of $X$. Intuitively, this can be interpreted as learning a token-wise projection to the sequence length $l$. Essentially, with this model, each token predicts weights for each token in the input sequence. In practice, a simple two-layer feed-forward network with [ReLU](https://paperswithcode.com/method/relu) activations is adopted for $F\\left(.\\right)$:\r \r $$ F\\left(X\\right) = W\\left(\\sigma\\_{R}\\left(W(X) + b\\right)\\right) + b$$\r \r where $\\sigma\\_{R}$ is the ReLU activation function. Hence, $B$ is now in $\\mathbb{R}^{l\\text{ x }l}$. Given $B$, we now compute:\r \r $$ Y = \\text{Softmax}\\left(B\\right)G\\left(X\\right) $$\r \r where $G\\left(.\\right)$ is another parameterized function of $X$ that is analogous to $V$ (value) in the standard [Transformer](https://paperswithcode.com/method/transformer) model. 
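A minimal NumPy sketch of this computation (shapes and the names W1, b1, W2, b2, W_v are assumptions for illustration, not the reference implementation):

```python
import numpy as np

def dense_synthesizer(X, W1, b1, W2, b2, W_v):
    # X: (l, d) token features. The two-layer ReLU MLP plays the role of
    # F(.), projecting each token from d dims to l dims, so the resulting
    # (l, l) matrix B replaces the usual query-key alignment matrix.
    H = np.maximum(X @ W1 + b1, 0.0)            # ReLU hidden layer
    B = H @ W2 + b2                             # synthesized alignment, (l, l)
    B = B - B.max(axis=-1, keepdims=True)       # numerically stable softmax
    A = np.exp(B)
    A = A / A.sum(axis=-1, keepdims=True)       # row-wise Softmax(B)
    return A @ (X @ W_v)                        # Y = Softmax(B) G(X)
```

Here X @ W_v stands in for $G\left(X\right)$, a simple linear value projection.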
This approach eliminates the [dot product](https://paperswithcode.com/method/scaled) altogether by replacing $QK^{T}$ in standard Transformers with the synthesizing function $F\\left(.\\right)$.""" ; skos:prefLabel "Dense Synthesized Attention" . :Depth-wisePlaneSweeping a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "Depth-wise Plane Sweeping" . :DepthwiseConvolution a skos:Concept ; skos:definition """**Depthwise Convolution** is a type of convolution where we apply a single convolutional filter for each input channel. In the regular 2D [convolution](https://paperswithcode.com/method/convolution) performed over multiple input channels, the filter is as deep as the input and lets us freely mix channels to generate each element in the output. In contrast, depthwise convolutions keep each channel separate. To summarize the steps:\r \r 1. Split the input and filter into channels.\r 2. Convolve each input channel with its respective filter.\r 3. Stack the convolved outputs together.\r \r Image Credit: [Chi-Feng Wang](https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)""" ; skos:prefLabel "Depthwise Convolution" . :DepthwiseDilatedSeparableConvolution a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "A **Depthwise Dilated Separable Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) that combines [depthwise separability](https://paperswithcode.com/method/depthwise-separable-convolution) with the use of [dilated convolutions](https://paperswithcode.com/method/dilated-convolution)." ; skos:prefLabel "Depthwise Dilated Separable Convolution" . :DepthwiseFireModule a skos:Concept ; dcterms:source ; skos:definition "A **Depthwise Fire Module** is a modification of a [Fire Module](https://paperswithcode.com/method/fire-module) with depthwise separable convolutions to improve the inference time performance. 
It is used in the [CornerNet](https://paperswithcode.com/method/cornernet)-Lite architecture for object detection." ; skos:prefLabel "Depthwise Fire Module" . :DepthwiseSeparableConvolution a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """While [standard convolution](https://paperswithcode.com/method/convolution) performs the channelwise and spatial-wise computation in one step, **Depthwise Separable Convolution** splits the computation into two steps: [depthwise convolution](https://paperswithcode.com/method/depthwise-convolution) applies a single convolutional filter per input channel and [pointwise convolution](https://paperswithcode.com/method/pointwise-convolution) is used to create a linear combination of the output of the depthwise convolution. The comparison of standard convolution and depthwise separable convolution is shown to the right.\r \r Credit: [Depthwise Convolution Is All You Need for Learning Multiple Visual Domains](https://paperswithcode.com/paper/depthwise-convolution-is-all-you-need-for)""" ; skos:prefLabel "Depthwise Separable Convolution" . :DetNAS a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**DetNAS** is a [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) algorithm for the design of better backbones for object detection. It is based on the technique of one-shot supernet, which contains all possible networks in the search space. The supernet is trained under the typical detector training schedule: ImageNet pre-training and detection fine-tuning. Then, the architecture search is performed on the trained supernet, using the detection task as the guidance. DetNAS uses evolutionary search as opposed to RL-based methods or gradient-based methods." ; skos:prefLabel "DetNAS" . 
:DetNASNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**DetNASNet** is a convolutional neural network designed to be an object detection backbone and discovered through [DetNAS](https://paperswithcode.com/method/detnas) architecture search. It uses [ShuffleNet V2](https://paperswithcode.com/method/shufflenet-v2) blocks as its basic building block." ; skos:prefLabel "DetNASNet" . :DetNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**DetNet** is a backbone convolutional neural network for object detection. Different from traditional pre-trained models for ImageNet classification, DetNet maintains the spatial resolution of the features even though extra stages are included. DetNet attempts to stay efficient by employing a low complexity dilated bottleneck structure." ; skos:prefLabel "DetNet" . :Detr a skos:Concept ; dcterms:source ; skos:altLabel "Detection Transformer" ; skos:definition """**Detr**, or **Detection Transformer**, is a set-based object detector using a [Transformer](https://paperswithcode.com/method/transformer) on top of a convolutional backbone. It uses a conventional CNN backbone to learn a 2D representation of an input image. The model flattens it and supplements it with a positional encoding before passing it into a transformer encoder. A transformer decoder then takes as input a small fixed number of learned positional embeddings, which we call object queries, and additionally attends to the encoder output. We pass each output embedding of the decoder to a shared feed forward network (FFN) that predicts either a detection (class and bounding box) or a “no object” class.""" ; skos:prefLabel "Detr" . :DiCENet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**DiCENet** is a convolutional neural network architecture that utilizes dimension-wise convolutions (and dimension-wise fusion). 
The dimension-wise convolutions apply light-weight convolutional filtering across each dimension of the input tensor while dimension-wise fusion efficiently combines these dimension-wise representations, allowing the [DiCE Unit](https://paperswithcode.com/method/dice-unit) in the network to efficiently encode spatial and channel-wise information contained in the input tensor." ; skos:prefLabel "DiCENet" . :DiCEUnit a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **DiCE Unit** is an image model block that is built using dimension-wise convolutions and dimension-wise fusion. The dimension-wise convolutions apply light-weight convolutional filtering across each dimension of the input tensor while dimension-wise fusion efficiently combines these dimension-wise representations, allowing the DiCE unit to efficiently encode spatial and channel-wise information contained in the input tensor. \r \r Standard convolutions encode spatial and channel-wise information simultaneously, but they are computationally expensive. To improve the efficiency of standard convolutions, separable [convolutions](https://paperswithcode.com/method/convolution) are introduced, where spatial and channelwise information are encoded separately using depth-wise and point-wise convolutions, respectively. Though this factorization is effective, it puts a significant computational load on point-wise convolutions and makes them a computational bottleneck.\r \r DiCE Units utilize a dimension-wise convolution to encode depth-wise, width-wise, and height-wise information independently. The dimension-wise convolutions encode local information from different dimensions of the input tensor, but do not capture global information. 
One approach is a [pointwise convolution](https://paperswithcode.com/method/pointwise-convolution), but it is computationally expensive, so instead dimension-wise fusion factorizes the point-wise convolution in two steps: (1) local fusion and (2) global fusion.""" ; skos:prefLabel "DiCE Unit" . :DiceLoss a skos:Concept ; dcterms:source ; skos:definition """\\begin{equation}\r DiceLoss\\left( y, \\overline{p} \\right) = 1 - \\dfrac{\\left( 2y\\overline{p} + 1 \\right)} {\\left( y+\\overline{p} + 1 \\right)}\r \\end{equation}""" ; skos:prefLabel "Dice Loss" . :DiffAugment a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Differentiable Augmentation (DiffAugment)** is a set of differentiable image transformations used to augment data during [GAN](https://paperswithcode.com/method/gan) training. The transformations are applied to the real and generated images. It enables the gradients to be propagated through the augmentation back to the generator, regularizes the discriminator without manipulating the target distribution, and maintains the balance of training dynamics. Three choices of transformation are preferred by the authors in their experiments: Translation, [CutOut](https://paperswithcode.com/method/cutout), and Color.""" ; skos:prefLabel "DiffAugment" . :DiffPool a skos:Concept ; dcterms:source ; skos:definition """DiffPool is a differentiable graph pooling module that can generate hierarchical representations of graphs and can be combined with various graph neural network architectures in an end-to-end fashion. DiffPool learns a differentiable soft cluster assignment for nodes at each layer of a deep GNN, mapping nodes to a set of clusters, which then form the coarsened input for the next GNN layer.\r \r Description and image from: [Hierarchical Graph Representation Learning with Differentiable Pooling](https://arxiv.org/pdf/1806.08804.pdf)""" ; skos:prefLabel "DiffPool" . 
:DifferNet a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "DifferNet" . :DifferentiableHyperparameterSearch a skos:Concept ; dcterms:source ; skos:definition "Differentiable simultaneous optimization of hyperparameters and neural network architecture. Also a [Neural Architecture Search](https://paperswithcode.com/method/neural-architecture-search) (NAS) method." ; skos:prefLabel "Differentiable Hyperparameter Search" . :DifferentiableNAS a skos:Concept ; dcterms:source ; skos:altLabel "Differentiable Neural Architecture Search" ; skos:definition "" ; skos:prefLabel "Differentiable NAS" . :DifferentialDiffusion a skos:Concept ; dcterms:source ; skos:definition "**Differential Diffusion** is an enhancement of image-to-image diffusion models that adds the ability to control the amount of change applied to each image fragment via a change map." ; skos:prefLabel "Differential Diffusion" . :Differentialattentionforvisualquestionanswering a skos:Concept ; skos:definition "In this paper we aim to answer questions based on images when provided with a dataset of question-answer pairs for a number of images during training. A number of methods have focused on solving this problem by using image based attention. This is done by focusing on a specific part of the image while answering the question. Humans also do so when solving this problem. However, the regions that the previous systems focus on are not correlated with the regions that humans focus on. The accuracy is limited due to this drawback. In this paper, we propose to solve this problem by using an exemplar based method. We obtain one or more supporting and opposing exemplars to obtain a differential attention region. This differential attention is closer to human attention than other image based attention methods. It also helps in obtaining improved accuracy when answering questions. The method is evaluated on challenging benchmark datasets. 
We perform better than other image-based attention methods and are competitive with other state-of-the-art methods that focus on both image and questions." ; skos:prefLabel "Differential attention for visual question answering" . :Diffusion a skos:Concept ; dcterms:source ; skos:definition """Diffusion models generate samples by gradually removing noise from a signal, and their training objective can be expressed as a reweighted variational lower-bound (https://arxiv.org/abs/2006.11239).""" ; skos:prefLabel "Diffusion" . :DilatedBottleneckBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Dilated Bottleneck Block** is an image model block used in the [DetNet](https://paperswithcode.com/method/detnet) convolutional neural network architecture. It employs a bottleneck structure with dilated convolutions to efficiently enlarge the receptive field." ; skos:prefLabel "Dilated Bottleneck Block" . :DilatedBottleneckwithProjectionBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Dilated Bottleneck with Projection Block** is an image model block used in the [DetNet](https://paperswithcode.com/method/detnet) convolutional neural network architecture. It employs a bottleneck structure with dilated convolutions to efficiently enlarge the receptive field. It uses a [1x1 convolution](https://paperswithcode.com/method/1x1-convolution) to ensure the spatial size stays fixed." ; skos:prefLabel "Dilated Bottleneck with Projection Block" . :DilatedCausalConvolution a skos:Concept ; dcterms:source ; skos:definition "A **Dilated Causal Convolution** is a [causal convolution](https://paperswithcode.com/method/causal-convolution) where the filter is applied over an area larger than its length by skipping input values with a certain step. A dilated causal [convolution](https://paperswithcode.com/method/convolution) effectively allows the network to have very large receptive fields with just a few layers." 
; skos:prefLabel "Dilated Causal Convolution" . :DilatedConvolution a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Dilated Convolutions** are a type of [convolution](https://paperswithcode.com/method/convolution) that “inflate” the kernel by inserting holes between the kernel elements. An additional parameter $l$ (dilation rate) indicates how much the kernel is widened. There are usually $l-1$ spaces inserted between kernel elements. \r \r Note that the concept has existed in past literature under different names, for instance the *algorithme à trous*, an algorithm for wavelet decomposition (Holschneider et al., 1987; Shensa, 1992).""" ; skos:prefLabel "Dilated Convolution" . :DilatedSlidingWindowAttention a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Dilated Sliding Window Attention** is an attention pattern for attention-based models. It was proposed as part of the [Longformer](https://paperswithcode.com/method/longformer) architecture. It is motivated by the fact that non-sparse attention in the original [Transformer](https://paperswithcode.com/method/transformer) formulation has a [self-attention component](https://paperswithcode.com/method/scaled) with $O\\left(n^{2}\\right)$ time and memory complexity, where $n$ is the input sequence length, and thus is not efficient to scale to long inputs. \r \r Compared to a [Sliding Window Attention](https://paperswithcode.com/method/sliding-window-attention) pattern, we can further increase the receptive field without increasing computation by making the sliding window "dilated". This is analogous to [dilated CNNs](https://paperswithcode.com/method/dilated-convolution), where the window has gaps of size dilation $d$. Assuming a fixed $d$ and $w$ for all layers, the receptive field is $l × d × w$, which can reach tens of thousands of tokens even for small values of $d$.""" ; skos:prefLabel "Dilated Sliding Window Attention" . 
:DimConv a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Dimension-wise Convolution" ; skos:definition """A **Dimension-wise Convolution**, or **DimConv**, is a type of [convolution](https://paperswithcode.com/method/convolution) that can encode depth-wise, width-wise, and height-wise information independently. To achieve this, DimConv extends depthwise convolutions to all dimensions of the input tensor $X \\in \\mathbb{R}^{D\\times{H}\\times{W}}$, where $W$, $H$, and $D$ correspond to the width, height, and depth of $X$. DimConv has three branches, one branch per dimension. These branches apply $D$ depth-wise convolutional kernels $k\\_{D} \\in \\mathbb{R}^{1\\times{n}\\times{n}}$ along depth, $W$ width-wise convolutional kernels $k\\_{W} \\in \\mathbb{R}^{n\\times{n}\\times{1}}$ along width, and $H$ height-wise convolutional kernels $k\\_{H} \\in \\mathbb{R}^{n\\times{1}\\times{n}}$ along height\r to produce outputs $Y\\_{D}$, $Y\\_{W}$, and $Y\\_{H} \\in \\mathbb{R}^{D\\times{H}\\times{W}}$ that\r encode information from all dimensions of the input tensor. The outputs of these independent branches are concatenated along the depth dimension, such that the first spatial plane of $Y\\_{D}$, $Y\\_{W}$, and $Y\\_{H}$ are put together and so on, to produce the output $Y\\_{Dim} = ${$Y\\_{D}$, $Y\\_{W}$, $Y\\_{H}$} $\\in \\mathbb{R}^{3D\\times{H}\\times{W}}$.""" ; skos:prefLabel "DimConv" . :DimFuse a skos:Concept ; dcterms:source ; skos:altLabel "Dimension-wise Fusion" ; skos:definition "**Dimension-wise Fusion** is an image model block that attempts to capture global information by combining features globally. It is an alternative to point-wise [convolution](https://paperswithcode.com/method/convolution). 
A point-wise convolutional layer applies $D$ point-wise kernels $\\mathbf{k}\\_p \\in \\mathbb{R}^{3D \\times 1 \\times 1}$ and performs $3D^2HW$ operations to combine dimension-wise representations of $\\mathbf{Y_{Dim}} \\in \\mathbb{R}^{3D \\times H \\times W}$ and produce an output $\\mathbf{Y} \\in \\mathbb{R}^{D \\times H \\times W}$. This is computationally expensive. Dimension-wise fusion is an alternative that can allow us to combine representations of $\\mathbf{Y\\_{Dim}}$ efficiently. As illustrated in the Figure to the right, it factorizes the point-wise convolution in two steps: (1) local fusion and (2) global fusion." ; skos:prefLabel "DimFuse" . :DirectionalSparseFiltering a skos:Concept ; dcterms:source ; skos:altLabel "Directional Sparse Filtering" ; skos:definition "" ; skos:prefLabel "Directional Sparse Filtering" . :DiscreteCosineTransform a skos:Concept ; skos:definition """**Discrete Cosine Transform (DCT)** is an orthogonal transformation method that decomposes an\r image into its spatial frequency spectrum. It expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies. It is widely used in compression tasks, e.g. image compression, where high-frequency components can be discarded. It is a type of Fourier-related transform, similar to discrete Fourier transforms (DFTs), but only using real numbers.\r \r Image Credit: [Wikipedia](https://en.wikipedia.org/wiki/Discrete_cosine_transform#/media/File:Example_dft_dct.svg)""" ; skos:prefLabel "Discrete Cosine Transform" . :DiscriminativeAdversarialSearch a skos:Concept ; dcterms:source ; skos:definition "**Discriminative Adversarial Search**, or **DAS**, is a sequence decoding approach which aims to alleviate the effects of exposure bias and to optimize on the data distribution itself rather than for external metrics. 
Inspired by generative adversarial networks (GANs), wherein a discriminator is used to improve the generator, DAS differs from GANs in that the generator parameters are not updated at training time and the discriminator is only used to drive sequence generation at inference time." ; skos:prefLabel "Discriminative Adversarial Search" . :DiscriminativeFine-Tuning a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Discriminative Fine-Tuning** is a fine-tuning strategy that is used for [ULMFiT](https://paperswithcode.com/method/ulmfit) type models. Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with different learning rates. For context, the regular stochastic gradient descent ([SGD](https://paperswithcode.com/method/sgd)) update of a model’s parameters $\\theta$ at time step $t$ looks like the following (Ruder, 2016):\r \r $$ \\theta\\_{t} = \\theta\\_{t-1} − \\eta\\cdot\\nabla\\_{\\theta}J\\left(\\theta\\right)$$\r \r where $\\eta$ is the learning rate and $\\nabla\\_{\\theta}J\\left(\\theta\\right)$ is the gradient with regard to the model’s objective function. For discriminative fine-tuning, we split the parameters $\\theta$ into {$\\theta\\_{1}, \\ldots, \\theta\\_{L}$} where $\\theta\\_{l}$ contains the parameters of the model at the $l$-th layer and $L$ is the number of layers of the model. Similarly, we obtain {$\\eta\\_{1}, \\ldots, \\eta\\_{L}$}, where $\\eta\\_{l}$ is the learning rate of the $l$-th layer. 
The SGD update with discriminative fine-tuning is then:\r \r $$ \\theta\\_{t}^{l} = \\theta\\_{t-1}^{l} - \\eta^{l}\\cdot\\nabla\\_{\\theta^{l}}J\\left(\\theta\\right) $$\r \r The authors find that empirically it worked well to first choose the learning rate $\\eta^{L}$ of the last layer by fine-tuning only the last layer and using $\\eta^{l-1}=\\eta^{l}/2.6$ as the learning rate for lower layers.""" ; skos:prefLabel "Discriminative Fine-Tuning" . :DiscriminativeRegularization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Discriminative Regularization** is a regularization technique for [variational autoencoders](https://paperswithcode.com/methods/category/likelihood-based-generative-models) that uses representations from discriminative classifiers to augment the [VAE](https://paperswithcode.com/method/vae) objective function (the lower bound) corresponding to a generative model. Specifically, it encourages the model’s reconstructions to be close to the data example in a representation space defined by the hidden layers of highly-discriminative, neural network based classifiers." ; skos:prefLabel "Discriminative Regularization" . :DisentangledAttentionMechanism a skos:Concept ; dcterms:source ; skos:definition "**Disentangled Attention Mechanism** is an attention mechanism used in the [DeBERTa](https://paperswithcode.com/method/deberta) architecture. Unlike [BERT](https://paperswithcode.com/method/bert) where each word in the input layer is represented using a vector which is the sum of its word (content) embedding and position embedding, each word in DeBERTa is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices based on their contents and relative positions, respectively. This is motivated by the observation that the attention weight of a word pair depends not only on their contents but also on their relative positions. 
For example, the dependency between the words “deep” and “learning” is much stronger when they occur next to each other than when they occur in different sentences." ; skos:prefLabel "Disentangled Attention Mechanism" . :DisentangledAttributionCurves a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Disentangled Attribution Curves (DAC)** provide interpretations of tree ensemble methods in the form of (multivariate) feature importance curves. For a given variable, or group of variables, [DAC](https://paperswithcode.com/method/dac) plots the importance of the variable(s) as their values change.\r \r The Figure to the right shows an example: a decision tree which performs binary classification using two features (representing the XOR function). In this problem, knowing the value of one of the features without knowledge of the other feature yields no information - the classifier still has a 50% chance of predicting either class. As a result, DAC produces curves which assign 0 importance to either feature on its own. Knowing both features yields perfect information about the classifier, and thus the DAC curve for both features together correctly shows that the interaction of the features produces the model’s predictions.""" ; skos:prefLabel "Disentangled Attribution Curves" . :DispR-CNN a skos:Concept ; dcterms:source ; skos:definition "**Disp R-CNN** is a 3D object detection system for stereo images. It utilizes an instance disparity estimation network (iDispNet) that predicts disparity only for pixels on objects of interest and learns a category-specific shape prior for more accurate disparity estimation. To address the scarcity of disparity annotations in training, a statistical shape model is used to generate dense disparity pseudo-ground-truth without the need for LiDAR point clouds." ; skos:prefLabel "Disp R-CNN" . 
:DistDGL a skos:Concept ; dcterms:source ; skos:definition "**DistDGL** is a system for training GNNs in a mini-batch fashion on a cluster of machines. It is based on the Deep Graph Library (DGL), a popular GNN development framework. DistDGL distributes the graph and its associated data (initial features and embeddings) across the machines and uses this distribution to derive a computational decomposition by following an owner-compute rule. DistDGL follows a synchronous training approach and allows ego-networks forming the mini-batches to include non-local nodes. To minimize the overheads associated with distributed computations, DistDGL uses a high-quality and light-weight mincut graph partitioning algorithm along with multiple balancing constraints. This allows it to reduce communication overheads and statically balance the computations. It further reduces the communication by replicating halo nodes and by using sparse embedding updates. The combination of these design choices allows DistDGL to train high-quality models while achieving high parallel efficiency and memory scalability." ; skos:prefLabel "DistDGL" . :DistanceNet a skos:Concept ; dcterms:source ; skos:definition "**DistanceNet** is a learning algorithm for multi-source domain adaptation that uses various distance measures, or a mixture of these distance measures, as an additional loss function to be minimized jointly with the task's loss function, so as to achieve better unsupervised domain adaptation." ; skos:prefLabel "DistanceNet" . :DistilBERT a skos:Concept ; dcterms:source ; skos:definition "**DistilBERT** is a small, fast, cheap and light [Transformer](https://paperswithcode.com/method/transformer) model based on the [BERT](https://paperswithcode.com/method/bert) architecture. Knowledge distillation is performed during the pre-training phase to reduce the size of a BERT model by 40%. 
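The soft-target part of distillation can be illustrated with a minimal sketch (the logits and the helper names below are made up for illustration; this is not the DistilBERT training code): the student is trained on the teacher's temperature-softened output distribution via cross-entropy.

```python
# Illustrative sketch of a soft-target distillation loss: temperature-scaled
# cross-entropy between the teacher's and the student's output distributions.
# The logits are hypothetical; real models produce them over a vocabulary.
import math

def softmax(logits, temperature):
    # Temperature-scaled softmax (higher temperature -> softer distribution).
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    t = softmax(teacher_logits, temperature)   # teacher's soft targets
    s = softmax(student_logits, temperature)   # student's distribution
    # Cross-entropy of the student under the teacher's soft targets.
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

teacher = [2.0, 0.5, -1.0]
# A student that matches the teacher incurs a lower loss than one that does not.
print(distillation_loss(teacher, [2.0, 0.5, -1.0]) < distillation_loss(teacher, [-1.0, 0.5, 2.0]))
```

The loss is minimized when the student reproduces the teacher's full distribution, not just its top prediction, which is what lets the smaller model absorb the teacher's learned behaviour.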
To leverage the inductive biases learned by larger models during pre-training, the authors introduce a triple loss combining language modeling, distillation and cosine-distance losses." ; skos:prefLabel "DistilBERT" . :DistributedShampoo a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A scalable second order optimization algorithm for deep learning.\r \r Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, that involve second derivatives and/or second order statistics of the data, are far less prevalent despite strong theoretical properties, due to their prohibitive computation, memory and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad), that along with several critical algorithmic and numerical improvements, provides significant convergence and wall-clock time improvements compared to conventional first-order methods on state-of-the-art deep models. Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units. We demonstrate superior performance compared to state-of-the-art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.""" ; skos:prefLabel "Distributed Shampoo" . :DistributionalGeneralization a skos:Concept ; dcterms:source ; skos:definition "**Distributional Generalization** is a type of generalization that roughly states that outputs of a classifier at train and test time are close as distributions, as opposed to close in just their average error. 
This behavior is not captured by classical generalization, which would only consider the average error and not the distribution of errors over the input domain." ; skos:prefLabel "Distributional Generalization" . :Dorylus a skos:Concept ; dcterms:source ; skos:definition """**Dorylus** is a distributed system for training graph neural networks which uses cheap CPU servers and Lambda threads. It scales to\r large billion-edge graphs with low-cost cloud resources.""" ; skos:prefLabel "Dorylus" . :Dot-ProductAttention a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Dot-Product Attention** is an attention mechanism where the alignment score function is calculated as: \r \r $$f_{att}\\left(\\textbf{h}_{i}, \\textbf{s}\\_{j}\\right) = h\\_{i}^{T}s\\_{j}$$\r \r It is equivalent to [multiplicative attention](https://paperswithcode.com/method/multiplicative-attention) (without a trainable weight matrix, assuming this is instead an identity matrix). Here $\\textbf{h}$ refers to the hidden states for the encoder, and $\\textbf{s}$ is the hidden states for the decoder. The function above is thus a type of alignment score function. \r \r Within a neural network, once we have the alignment scores, we calculate the final scores/weights using a [softmax](https://paperswithcode.com/method/softmax) function of these alignment scores (ensuring it sums to 1).""" ; skos:prefLabel "Dot-Product Attention" . :DouZero a skos:Concept ; dcterms:source ; skos:definition "**DouZero** is an AI system for the card game DouDizhu that enhances traditional Monte-Carlo methods with deep neural networks, action encoding, and parallel actors. The [Q-network](https://paperswithcode.com/method/dqn) of DouZero consists of an [LSTM](https://paperswithcode.com/method/lstm) to encode historical actions and six layers of [MLP](https://paperswithcode.com/method/feedforward-network) with hidden dimension of 512. 
The network predicts a value for a given state-action pair based on the concatenated representation of action and state." ; skos:prefLabel "DouZero" . :DoubleDQN a skos:Concept ; dcterms:source ; skos:definition """A **Double Deep Q-Network**, or **Double DQN**, utilises [Double Q-learning](https://paperswithcode.com/method/double-q-learning) to reduce overestimation by decomposing the max operation in the target into action selection and action evaluation. We evaluate the greedy policy according to the online network, but we use the target network to estimate its value. The update is the same as for [DQN](https://paperswithcode.com/method/dqn), but replacing the target $Y^{DQN}\\_{t}$ with:\r \r $$ Y^{DoubleDQN}\\_{t} = R\\_{t+1}+\\gamma{Q}\\left(S\\_{t+1}, \\arg\\max\\_{a}Q\\left(S\\_{t+1}, a; \\theta\\_{t}\\right);\\theta\\_{t}^{-}\\right) $$\r \r Compared to the original formulation of Double [Q-Learning](https://paperswithcode.com/method/q-learning), in Double DQN the weights of the second network $\\theta^{'}\\_{t}$ are replaced with the weights of the target network $\\theta\\_{t}^{-}$ for the evaluation of the current greedy policy.""" ; skos:prefLabel "Double DQN" . :DoubleQ-learning a skos:Concept ; dcterms:source ; skos:definition """**Double Q-learning** is an off-policy reinforcement learning algorithm that utilises double estimation to counteract overestimation problems with traditional Q-learning. \r \r The max operator in standard [Q-learning](https://paperswithcode.com/method/q-learning) and [DQN](https://paperswithcode.com/method/dqn) uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. 
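This bias is easy to see numerically. A quick sketch (illustrative only, with hypothetical actions whose true value is all zero and Gaussian estimation noise):

```python
# With true action values of zero everywhere, any estimation noise makes the
# max over the noisy estimates positive on average: max(E[Q]) = 0, but
# E[max(Q_estimate)] > 0. This is the overestimation bias of the max operator.
import random

random.seed(0)
N_ACTIONS = 10
N_TRIALS = 10_000

maxima = []
for _ in range(N_TRIALS):
    # Noisy estimates of action values whose true value is 0 for every action.
    estimates = [random.gauss(0.0, 1.0) for _ in range(N_ACTIONS)]
    maxima.append(max(estimates))

bias = sum(maxima) / N_TRIALS
print(round(bias, 2))  # well above the true maximum value of 0
```

Averaged over many trials, the max of the noisy estimates sits well above the true value of zero, which is exactly the overestimation that Double Q-learning is designed to remove.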
To prevent this, we can decouple the selection from the evaluation, which is the idea behind Double Q-learning:\r \r $$ Y^{Q}\\_{t} = R\\_{t+1} + \\gamma{Q}\\left(S\\_{t+1}, \\arg\\max\\_{a}Q\\left(S\\_{t+1}, a; \\mathbb{\\theta}\\_{t}\\right);\\mathbb{\\theta}\\_{t}\\right) $$\r \r The Double Q-learning error can then be written as:\r \r $$ Y^{DoubleQ}\\_{t} = R\\_{t+1} + \\gamma{Q}\\left(S\\_{t+1}, \\arg\\max\\_{a}Q\\left(S\\_{t+1}, a; \\mathbb{\\theta}\\_{t}\\right);\\mathbb{\\theta}^{'}\\_{t}\\right) $$\r \r Here the selection of the action in the $\\arg\\max$ is still due to the online weights $\\theta\\_{t}$. But we use a second set of weights $\\mathbb{\\theta}^{'}\\_{t}$ to fairly evaluate the value of this policy.\r \r Source: [Deep Reinforcement Learning with Double Q-learning](https://paperswithcode.com/paper/deep-reinforcement-learning-with-double-q)""" ; skos:prefLabel "Double Q-learning" . :DraftingNetwork a skos:Concept ; dcterms:source ; skos:definition """**Drafting Network** is a style transfer module designed to transfer global style patterns at low resolution, since global patterns can be transferred more easily at low resolution due to the larger receptive field and fewer local details. To achieve single style transfer, earlier work trained an encoder-decoder module, where only the content image is used as input. To better combine the style feature and the content feature, the Drafting Network adopts the [AdaIN module](https://paperswithcode.com/method/adaptive-instance-normalization).\r \r The architecture of Drafting Network is shown in the Figure, which includes an encoder, several AdaIN modules and a decoder. (1) The encoder is a pre-trained [VGG](https://paperswithcode.com/method/vgg)-19 network, which is fixed during training. Given $\\bar{x}\\_{c}$ and $\\bar{x}\\_{s}$, the VGG encoder extracts features at multiple granularities at 2_1, 3_1 and 4_1 layers. 
(2) Then, we apply feature modulation between the content and style feature using AdaIN modules after 2_1, 3_1 and 4_1 layers, respectively. (3) Finally, at each granularity of the decoder, the corresponding feature from the AdaIN module is merged via a [skip-connection](https://paperswithcode.com/methods/category/skip-connections). Here, skip-connections after AdaIN modules in both low and high levels are leveraged to help preserve content structure, especially for low-resolution images.""" ; skos:prefLabel "Drafting Network" . :Dreamix a skos:Concept ; dcterms:source ; skos:altLabel "Dreamix: video diffusion models are general video editors" ; skos:definition "" ; skos:prefLabel "Dreamix" . :DropAttack a skos:Concept ; dcterms:source ; skos:definition "**DropAttack** is an adversarial training method that adds intentionally worst-case adversarial perturbations to both the input and hidden layers in different dimensions and minimizes the adversarial risks generated by each layer." ; skos:prefLabel "DropAttack" . :DropBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**DropBlock** is a structured form of [dropout](https://paperswithcode.com/method/dropout) directed at regularizing convolutional networks. In DropBlock, units in a contiguous region of a feature map are dropped together. As DropBlock discards features in a correlated area, the networks must look elsewhere for evidence to fit the data." ; skos:prefLabel "DropBlock" . :DropPath a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """Just as [dropout](https://paperswithcode.com/method/dropout) prevents co-adaptation of activations, **DropPath** prevents co-adaptation of parallel paths in networks such as [FractalNets](https://paperswithcode.com/method/fractalnet) by randomly dropping operands of the join layers. 
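A minimal sketch of dropping join operands (the helper name is hypothetical, not FractalNet's implementation; scalars stand in for path activations, and at least one operand always survives):

```python
# Illustrative sketch of per-join path dropping: each input path to a join
# is dropped independently with a fixed probability, at least one input is
# guaranteed to survive, and the join averages the survivors.
import random

def drop_path_join(path_outputs, drop_prob, rng):
    # Decide independently for each incoming path whether it survives.
    kept = [x for x in path_outputs if rng.random() >= drop_prob]
    if not kept:  # guarantee the join still has at least one operand
        kept = [rng.choice(path_outputs)]
    # The join averages the surviving paths (elementwise mean in a real
    # network; plain scalars here for simplicity).
    return sum(kept) / len(kept)

rng = random.Random(0)
# Two parallel paths feeding a join, each producing a scalar activation.
print(drop_path_join([0.5, 1.5], drop_prob=0.15, rng=rng))  # both survive here: 1.0
```

Because any subset of paths may be missing at a given join, each path must remain predictive on its own rather than relying on its siblings.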
This\r discourages the network from using one input path as an anchor and another as a corrective term (a\r configuration that, if not prevented, is prone to overfitting). Two sampling strategies are:\r \r - **Local**: a join drops each input with fixed probability, but we make sure at least one survives.\r - **Global**: a single path is selected for the entire network. We restrict this path to be a single\r column, thereby promoting individual columns as independently strong predictors.""" ; skos:prefLabel "DropPath" . :DropPathway a skos:Concept ; dcterms:source ; skos:definition """**DropPathway** randomly drops an audio pathway during training as a regularization technique for audiovisual recognition models. Specifically, at each training iteration, we drop the audio pathway altogether with probability $P\\_{d}$. This way, we slow down the learning of the audio pathway and make its learning dynamics more compatible with its visual counterpart. When dropping the audio pathway, we sum zero tensors with the visual pathways.\r \r Note that DropPathway is different from simply setting different learning rates for the audio/visual pathways in that it 1) ensures the audio pathway has fewer parameter updates, 2) prevents the visual pathway from 'shortcutting' training by memorizing audio information, and 3) provides extra regularization as different audio clips are dropped in each epoch.""" ; skos:prefLabel "DropPathway" . :DualCL a skos:Concept ; dcterms:source ; skos:altLabel "Dual Contrastive Learning" ; skos:definition "Contrastive learning has achieved remarkable success in representation learning via self-supervision in unsupervised settings. However, effectively adapting contrastive learning to supervised learning tasks remains a challenge in practice. In this work, we introduce a dual contrastive learning (DualCL) framework that simultaneously learns the features of input samples and the parameters of classifiers in the same space. 
Specifically, DualCL regards the parameters of the classifiers as augmented samples associated with different labels and then exploits contrastive learning between the input samples and the augmented samples. Empirical studies on five benchmark text classification datasets and their low-resource version demonstrate the improvement in classification accuracy and confirm the capability of learning discriminative representations of DualCL." ; skos:prefLabel "DualCL" . :DualGCN a skos:Concept ; skos:altLabel "Dual Graph Convolutional Networks" ; skos:definition """A dual graph convolutional neural network jointly considers the two essential assumptions of semi-supervised learning: (1) local consistency and (2) global consistency. Accordingly, two convolutional neural networks are devised to embed the local-consistency-based and global-consistency-based knowledge, respectively.\r \r Description and image from: [Dual Graph Convolutional Networks for Graph-Based Semi-Supervised Classification](https://persagen.com/files/misc/zhuang2018dual.pdf)""" ; skos:prefLabel "DualGCN" . :DualSoftmaxLoss a skos:Concept ; dcterms:source ; skos:definition """**Dual Softmax Loss** is a loss function based on symmetric cross-entropy loss used in the [CAMoE](https://paperswithcode.com/method/camoe) video-text retrieval model. The similarity of every text and video with all other videos and texts is calculated; it should be maximal for the ground-truth pair. For DSL, a prior is introduced to revise the similarity score. Multiplying the prior with the original similarity matrix imposes an efficient constraint and can help to filter out single-side match pairs. As a result, DSL highlights pairs with both high Text-to-Video and Video-to-Text probability, producing a more convincing result.""" ; skos:prefLabel "Dual Softmax Loss" . 
:DuelingNetwork a skos:Concept ; dcterms:source ; skos:definition """A **Dueling Network** is a type of Q-Network that has two streams to separately estimate (scalar) state-value and the advantages for each action. Both streams share a common convolutional feature learning module. The two streams are combined via a special aggregating layer to produce an\r estimate of the state-action value function Q as shown in the figure to the right.\r \r The last module uses the following mapping:\r \r $$ Q\\left(s, a, \\theta, \\alpha, \\beta\\right) =V\\left(s, \\theta, \\beta\\right) + \\left(A\\left(s, a, \\theta, \\alpha\\right) - \\frac{1}{|\\mathcal{A}|}\\sum\\_{a'}A\\left(s, a'; \\theta, \\alpha\\right)\\right) $$\r \r This formulation is chosen for identifiability so that the advantage function has zero advantage for the chosen action, but instead of a maximum we use an average operator to increase the stability of the optimization.""" ; skos:prefLabel "Dueling Network" . :DutchEligibilityTrace a skos:Concept ; skos:definition """A **Dutch Eligibility Trace** is a type of [eligibility trace](https://paperswithcode.com/method/eligibility-trace) where the trace increments grow less quickly than the accumulative eligibility trace (helping avoid large variance updates). For the memory vector $\\textbf{e}\\_{t} \\in \\mathbb{R}^{b} \\geq \\textbf{0}$:\r \r $$\\mathbf{e\\_{0}} = \\textbf{0}$$\r \r $$\\textbf{e}\\_{t} = \\gamma\\lambda\\textbf{e}\\_{t-1} + \\left(1-\\alpha\\gamma\\lambda\\textbf{e}\\_{t-1}^{T}\\phi\\_{t}\\right)\\phi\\_{t}$$""" ; skos:prefLabel "Dutch Eligibility Trace" . :DyGED a skos:Concept ; dcterms:source ; skos:altLabel "Dynamic Graph Event Detection" ; skos:definition "" ; skos:prefLabel "DyGED" . :DynaBERT a skos:Concept ; dcterms:source ; skos:definition """**DynaBERT** is a [BERT](https://paperswithcode.com/method/bert)-variant which can flexibly adjust the size and latency by selecting adaptive width and depth. 
The training process of DynaBERT includes first training a width-adaptive BERT and then allowing both adaptive width and depth, by distilling knowledge from the full-sized model to small sub-networks. Network rewiring is also used to keep the more important attention heads and neurons shared by more sub-networks. \r \r A two-stage procedure is used to train DynaBERT. First, knowledge distillation (dashed lines) is used to transfer knowledge from a fixed teacher model to student sub-networks with adaptive width in DynaBERTW. Then, knowledge distillation (dashed lines) is used to transfer knowledge from a trained DynaBERTW to student sub-networks with adaptive width and depth in DynaBERT.""" ; skos:prefLabel "DynaBERT" . :DynamicConv a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Dynamic Convolution" ; skos:definition """**DynamicConv** is a type of [convolution](https://paperswithcode.com/method/convolution) for sequential modelling with kernels that vary over time as a learned function of the individual time steps. It builds upon [LightConv](https://paperswithcode.com/method/lightconv) and takes the same form but uses a time-step dependent kernel:\r \r $$ \\text{DynamicConv}\\left(X, i, c\\right) = \\text{LightConv}\\left(X, f\\left(X\\_{i}\\right)\\_{h,:}, i, c\\right) $$""" ; skos:prefLabel "DynamicConv" . :DynamicConvolution a skos:Concept ; dcterms:source ; skos:definition """The extremely low computational cost of lightweight CNNs constrains the depth and width of the networks, further decreasing their representational power. To address the above problem, Chen et al. proposed dynamic convolution (in parallel with CondConv), a novel operator design that increases representational power with negligible additional computational cost and does not change the width or depth of the network.\r \r Dynamic convolution uses $K$ parallel convolution kernels of the same size and input/output dimensions instead of one kernel per layer. 
Like SE blocks, it adopts a squeeze-and-excitation mechanism to generate the attention weights for the different convolution kernels. These kernels are then aggregated dynamically by weighted summation and applied to the input feature map $X$:\r \\begin{align}\r s & = \\text{softmax} (W_{2} \\delta (W_{1}\\text{GAP}(X)))\r \\end{align}\r \\begin{align}\r \\text{DyConv} &= \\sum_{k=1}^{K} s_k \\text{Conv}_k \r \\end{align}\r \\begin{align}\r Y &= \\text{DyConv}(X)\r \\end{align}\r Here the convolutions are combined by summation of weights and biases of convolutional kernels. \r \r Compared to applying convolution to the feature map, the computational cost of squeeze-and-excitation and weighted summation is extremely low. Dynamic convolution thus provides an efficient operation to improve representational power and can be easily used as a replacement for any convolution.""" ; skos:prefLabel "Dynamic Convolution" . :DynamicKeypointHead a skos:Concept ; dcterms:source ; skos:definition """**Dynamic Keypoint Head** is an output head for pose estimation that is conditioned on each instance (person) and encodes the instance concept in the dynamically-generated weights of its filters. It is used in the [FCPose](https://paperswithcode.com/method/fcpose) architecture.\r \r The Figure shows the core idea. $F$ denotes a level of feature maps. "Rel. Coord." means the relative coordinates, denoting the relative offsets from the locations of $F$ to the location where the filters are generated. Refer to the text for details. $f\\_{\\theta\\_{i}}$ is the dynamically-generated keypoint head for the $i$-th person instance. Note that each person instance has its own keypoint head.""" ; skos:prefLabel "Dynamic Keypoint Head" . 
:DynamicMemoryNetwork a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **Dynamic Memory Network** is a neural network architecture which processes input sequences and questions, forms episodic memories, and generates relevant answers. Questions trigger an iterative attention process which allows the model to condition its attention on the inputs and the result of previous iterations. These results are then reasoned over in a hierarchical recurrent sequence model to generate answers. \r \r The DMN consists of a number of modules:\r \r - Input Module: The input module encodes raw text inputs from the task into distributed vector representations. The input takes forms like a sentence, a long story, a movie review and so on.\r - Question Module: The question module encodes the question of the task into a distributed\r vector representation. For question answering, the question may be a sentence such as "Where did the author first fly?". The representation is fed into the episodic memory module, and forms the basis, or initial state, upon which the episodic memory module iterates.\r - Episodic Memory Module: Given a collection of input representations, the episodic memory module chooses which parts of the inputs to focus on through the attention mechanism. It then produces a ”memory” vector representation taking into account the question as well as the previous memory. Each iteration provides the module with newly relevant information about the input. In other words,\r the module has the ability to retrieve new information, in the form of input representations, which were thought to be irrelevant in previous iterations.\r - Answer Module: The answer module generates an answer from the final memory vector of the memory module.""" ; skos:prefLabel "Dynamic Memory Network" . 
:DynamicR-CNN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Dynamic R-CNN** is an object detection method that adjusts the label assignment criteria (IoU threshold) and the shape of regression loss function (parameters of Smooth L1 Loss) automatically based on the statistics of proposals during training. The motivation is that in previous two-stage object detectors, there is an inconsistency problem between the fixed network settings and the dynamic training procedure. For example, the fixed label assignment strategy and regression loss function cannot fit the distribution change of proposals and thus are harmful to training high quality detectors.\r \r It consists of two components: Dynamic Label Assignment and Dynamic Smooth L1 Loss, which are designed for the classification and regression branches, respectively. \r \r For Dynamic Label Assignment, we want our model to be discriminative for high IoU proposals, so we gradually adjust the IoU threshold for positive/negative samples based on the proposals distribution in the training procedure. Specifically, we set the threshold as the IoU of the proposal at a certain percentage since it can reflect the quality of the overall distribution. \r \r For Dynamic Smooth L1 Loss, we want to change the shape of the regression loss function to adaptively fit the distribution change of error and ensure the contribution of high quality samples to training. This is achieved by adjusting the $\\beta$ in Smooth L1 Loss based on the error distribution of the regression loss function, in which $\\beta$ actually controls the magnitude of the gradient of small errors.""" ; skos:prefLabel "Dynamic R-CNN" . 
:DynamicSmoothL1Loss a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Dynamic SmoothL1 Loss (DSL)** is a loss function in object detection where we change the shape of the loss function to gradually focus on high quality samples:\r \r $$\\text{DSL}\\left(x, \\beta\\_{now}\\right) = 0.5|{x}|^{2}/\\beta\\_{now}, \\text{ if } |x| < \\beta\\_{now}\\text{,} $$ \r $$\\text{DSL}\\left(x, \\beta\\_{now}\\right) = |{x}| - 0.5\\beta\\_{now}\\text{, otherwise} $$ \r \r DSL will change the value of $\\beta\\_{now}$ according to the statistics of regression errors which can reflect the localization accuracy. It was introduced as part of the [Dynamic R-CNN](https://paperswithcode.com/method/dynamic-r-cnn) model.""" ; skos:prefLabel "Dynamic SmoothL1 Loss" . :E-Branchformer a skos:Concept ; dcterms:source ; skos:definition "E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition" ; skos:prefLabel "E-Branchformer" . :E-MBConv a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "E-MBConv" . :E-swish a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "E-swish" . :E2EAdaptiveDistTraining a skos:Concept ; dcterms:source ; skos:altLabel "End-to-end Adaptive Distributed Training" ; skos:definition """Distributed training has become a pervasive and effective approach for training a large neural network\r (NN) model by processing massive data. However, it is very challenging to satisfy requirements\r from various NN models, diverse computing resources, and their dynamic changes during a training\r job. In this study, we design our distributed training framework in a systematic end-to-end view to\r provide the built-in adaptive ability for different scenarios, especially for industrial applications and\r production environments, by fully considering resource allocation, model partition, task placement,\r and distributed execution. 
Based on the unified distributed graph and the unified cluster object,\r our adaptive framework is equipped with a global cost model and a global planner, which can\r enable arbitrary parallelism, resource-aware placement, multi-mode execution, fault-tolerant, and\r elastic distributed training. The experiments demonstrate that our framework can satisfy various\r requirements from the diversity of applications and the heterogeneity of resources with highly\r competitive performance.""" ; skos:prefLabel "E2EAdaptiveDistTraining" . :EBM a skos:Concept ; dcterms:source ; skos:altLabel "energy-based model" ; skos:definition "" ; skos:prefLabel "EBM" . :ECA-Net a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "An **ECA-Net** is a type of convolutional neural network that utilises an [Efficient Channel Attention](https://paperswithcode.com/method/efficient-channel-attention) module." ; skos:prefLabel "ECA-Net" . :ECANet a skos:Concept ; dcterms:source ; skos:altLabel "efficient channel attention" ; skos:definition """An ECA block has similar formulation to an SE block including a squeeze module for aggregating global spatial information and an efficient excitation module for modeling cross-channel interaction. Instead of indirect correspondence, an ECA block only considers direct interaction between each channel and its k-nearest neighbors to control model complexity. Overall, the formulation of an ECA block is:\r \\begin{align}\r s = F_\\text{eca}(X, \\theta) & = \\sigma (\\text{Conv1D}(\\text{GAP}(X))) \r \\end{align}\r \\begin{align}\r Y & = s X\r \\end{align}\r where $\\text{Conv1D}(\\cdot)$ denotes 1D convolution with a kernel of shape $k$ across the channel domain, to model local cross-channel interaction. 
The parameter $k$ decides the coverage of interaction, and in ECA the kernel size $k$ is adaptively determined from the channel dimensionality $C$ instead of by manual tuning via cross-validation:\r \\begin{equation}\r k = \\psi(C) = \\left | \\frac{\\log_2(C)}{\\gamma}+\\frac{b}{\\gamma}\\right |_\\text{odd}\r \\end{equation}\r \r where $\\gamma$ and $b$ are hyperparameters. $|x|_\\text{odd}$ indicates the nearest odd number to $x$. \r \r Compared to SENet, ECANet has an improved excitation module, and provides an efficient and effective block which can readily be incorporated into various CNNs.""" ; skos:prefLabel "ECANet" . :ED-GNN a skos:Concept ; dcterms:source ; skos:altLabel "Medical Entity Disambiguation using Graph Neural Networks" ; skos:definition "" ; skos:prefLabel "ED-GNN" . :EDLPS a skos:Concept ; skos:altLabel "Encoder-Decoder model with local and pairwise loss along with shared encoder and discriminator network (EDLPS)" ; skos:definition """In this paper, we propose a novel method for obtaining sentence-level embeddings. While the problem of obtaining word-level embeddings is very well studied, sentence-level embeddings are obtained here by a simple method in the context of solving the paraphrase generation task. If we use a sequential encoder-decoder model for generating paraphrases, we would like the generated paraphrase to be semantically close to the original sentence. One way to ensure this is by adding constraints for true paraphrase embeddings to be close and unrelated paraphrase candidate sentence embeddings to be far. This is ensured by using a sequential pair-wise discriminator that shares weights with the encoder. This discriminator is trained with a suitable loss function. Our loss function penalizes paraphrase sentence embedding distances from being too large. This loss is used in combination with a sequential encoder-decoder network. 
We also validate our method by evaluating the obtained embeddings for a sentiment analysis task. The proposed method results in semantic embeddings and provides competitive results on the paraphrase generation and sentiment analysis tasks on standard datasets. These results are also shown to be statistically significant.\r \r Code: https://github.com/dev-chauhan/PQG-pytorch.\r The PQG dataset is available at https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs.\r The sentiment analysis dataset is available at www.kaggle.com/c/sentiment-analysis-on-movie-reviews.""" ; skos:prefLabel "EDLPS" . :EEND a skos:Concept ; dcterms:source ; skos:altLabel "End-to-End Neural Diarization" ; skos:definition "**End-to-End Neural Diarization** is an approach to speaker diarization in which a neural network directly outputs speaker diarization results given a multi-speaker recording. To realize such an end-to-end model, the speaker diarization problem is formulated as a multi-label classification problem and a permutation-free objective function is introduced to directly minimize diarization errors. The EEND method can explicitly handle speaker overlaps during training and inference. Just by feeding multi-speaker recordings with corresponding speaker segment labels, the model can be adapted to real conversations." ; skos:prefLabel "EEND" . :EESP a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Extremely Efficient Spatial Pyramid of Depth-wise Dilated Separable Convolutions" ; skos:definition """An **EESP Unit**, or Extremely Efficient Spatial Pyramid of Depth-wise Dilated Separable Convolutions, is an image model block designed for edge devices. 
It was proposed as part of the [ESPNetv2](https://paperswithcode.com/method/espnetv2) CNN architecture. \r \r This building block is based on a reduce-split-transform-merge strategy. The EESP unit first projects the high-dimensional input feature maps into low-dimensional space using groupwise pointwise convolutions and then learns the representations in parallel using depthwise dilated separable convolutions with different dilation rates. Different dilation rates in each branch allow the EESP unit to learn the representations from a large effective receptive field. To remove the gridding artifacts caused by dilated convolutions, the EESP fuses the feature maps using [hierarchical feature fusion](https://paperswithcode.com/method/hierarchical-feature-fusion) (HFF).""" ; skos:prefLabel "EESP" . :EGT a skos:Concept ; dcterms:source ; skos:altLabel "Edge-augmented Graph Transformer" ; skos:definition "Transformer neural networks have achieved state-of-the-art results for unstructured data such as text and images but their adoption for graph-structured data has been limited. This is partly due to the difficulty of incorporating complex structural information in the basic transformer framework. We propose a simple yet powerful extension to the transformer - residual edge channels. The resultant framework, which we call Edge-augmented Graph Transformer (EGT), can directly accept, process and output structural information as well as node information. It allows us to use global self-attention, the key element of transformers, directly for graphs and comes with the benefit of long-range interaction among nodes. Moreover, the edge channels allow the structural information to evolve from layer to layer, and prediction tasks on edges/links can be performed directly from the output embeddings of these channels. In addition, we introduce a generalized positional encoding scheme for graphs based on Singular Value Decomposition which can improve the performance of EGT. 
Our framework, which relies on global node feature aggregation, achieves better performance compared to Convolutional/Message-Passing Graph Neural Networks, which rely on local feature aggregation within a neighborhood. We verify the performance of EGT in a supervised learning setting on a wide range of experiments on benchmark datasets. Our findings indicate that convolutional aggregation is not an essential inductive bias for graphs and global self-attention can serve as a flexible and adaptive alternative." ; skos:prefLabel "EGT" . :ELECTRA a skos:Concept ; dcterms:source ; skos:definition "**ELECTRA** is a [transformer](https://paperswithcode.com/method/transformer) with a new pre-training approach which trains two transformer models: the generator and the discriminator. The generator replaces tokens in the sequence - trained as a masked language model - and the discriminator (the ELECTRA contribution) attempts to identify which tokens are replaced by the generator in the sequence. This pre-training task is called replaced token detection, and is a replacement for masking the input." ; skos:prefLabel "ELECTRA" . :ELMo a skos:Concept ; dcterms:source ; skos:definition """**Embeddings from Language Models**, or **ELMo**, is a type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus.\r \r A biLM combines both a forward and backward LM. ELMo jointly maximizes the log likelihood of the forward and backward directions. 
To add ELMo to a supervised model, we freeze the weights of the biLM and then concatenate the ELMo vector $\\textbf{ELMO}^{task}_k$ with $\\textbf{x}_k$ and pass the ELMO enhanced representation $[\\textbf{x}_k; \\textbf{ELMO}^{task}_k]$ into the task RNN. Here $\\textbf{x}_k$ is a context-independent token representation for each token position. \r \r Image Source: [here](https://medium.com/@duyanhnguyen_38925/create-a-strong-text-classification-with-the-help-from-elmo-e90809ba29da)""" ; skos:prefLabel "ELMo" . :ELR a skos:Concept ; dcterms:source ; skos:altLabel "Early Learning Regularization" ; skos:definition "" ; skos:prefLabel "ELR" . :ELU a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Exponential Linear Unit" ; skos:definition """The **Exponential Linear Unit** (ELU) is an activation function for neural networks. In contrast to [ReLUs](https://paperswithcode.com/method/relu), ELUs have negative values which allows them to push mean unit activations closer to zero like [batch normalization](https://paperswithcode.com/method/batch-normalization) but with lower computational complexity. Mean shifts toward zero speed up learning by bringing the normal gradient closer to the unit natural gradient because of a reduced bias shift effect. While [LReLUs](https://paperswithcode.com/method/leaky-relu) and [PReLUs](https://paperswithcode.com/method/prelu) have negative values, too, they do not ensure a noise-robust deactivation state. ELUs saturate to a negative value with smaller inputs and thereby decrease the forward propagated variation and information.\r \r The exponential linear unit (ELU) with $0 < \\alpha$ is:\r \r $$f\\left(x\\right) = x \\text{ if } x > 0$$\r $$\\alpha\\left(\\exp\\left(x\\right) − 1\\right) \\text{ if } x \\leq 0$$""" ; skos:prefLabel "ELU" . 
:ELiSH a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Exponential Linear Squashing Activation" ; skos:definition """The **Exponential Linear Squashing Activation Function**, or **ELiSH**, is an activation function used for neural networks. It shares common properties with [Swish](https://paperswithcode.com/method/swish), being made up of an [ELU](https://paperswithcode.com/method/elu) and a [Sigmoid](https://paperswithcode.com/method/sigmoid-activation):\r \r $$f\\left(x\\right) = \\frac{x}{1+e^{-x}} \\text{ if } x \\geq 0 $$\r $$f\\left(x\\right) = \\frac{e^{x} - 1}{1+e^{-x}} \\text{ if } x < 0 $$\r \r The Sigmoid part of **ELiSH** improves information flow, while the linear parts solve issues of vanishing gradients.""" ; skos:prefLabel "ELiSH" . :EMEA a skos:Concept ; dcterms:source ; skos:altLabel "Entropy Minimized Ensemble of Adapters" ; skos:definition "**Entropy Minimized Ensemble of Adapters**, or **EMEA**, is a method that optimizes the ensemble weights of the pretrained language adapters for each test sentence by minimizing the entropy of its predictions. The intuition behind the method is that a good [adapter](https://paperswithcode.com/method/adapter) weight $\\alpha$ for a test input $x$ should make the model more confident in its prediction for $x$; that is, it should lead to lower model entropy over the input." ; skos:prefLabel "EMEA" . :EMF a skos:Concept ; dcterms:source ; skos:altLabel "Enhanced-Multimodal Fuzzy Framework" ; skos:definition "A BCI MI framework to classify brain signals using a multimodal decision-making phase, with an additional differentiation of the signal." ; skos:prefLabel "EMF" . :EMQAP a skos:Concept ; dcterms:source ; skos:definition "**EMQAP**, or **E-Manual Question Answering Pipeline**, is an approach for answering questions pertaining to electronics devices. 
Built upon the pretrained [RoBERTa](https://paperswithcode.com/method/roberta), it harbors a supervised multi-task learning framework which efficiently performs the dual tasks of identifying the section in the E-manual where the answer can be found and the exact answer span within that section." ; skos:prefLabel "EMQAP" . :ENIGMA a skos:Concept ; dcterms:source ; skos:definition "**ENIGMA** is an evaluation framework for dialog systems based on Pearson and Spearman's rank correlations between the estimated rewards and the true rewards. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation, making automatic evaluations feasible. More importantly, ENIGMA is model-free and agnostic to the behavior policies for collecting the experience data, which significantly alleviates the technical difficulties of modeling complex dialogue environments and human behaviors." ; skos:prefLabel "ENIGMA" . :ENet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**ENet** is a semantic segmentation architecture which utilises a compact encoder-decoder architecture. Some design choices include:\r \r 1. Using the [SegNet](https://paperswithcode.com/method/segnet) approach to downsampling by saving indices of elements chosen in max\r pooling layers, and using them to produce sparse upsampled maps in the decoder.\r 2. Early downsampling to optimize the early stages of the network and reduce the cost of processing large input frames. The first two blocks of ENet heavily reduce the input size, and use only a small set of feature maps. \r 3. Using PReLUs as an activation function\r 4. Using dilated convolutions \r 5. Using Spatial [Dropout](https://paperswithcode.com/method/dropout)""" ; skos:prefLabel "ENet" . 
:ENetBottleneck a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**ENet Bottleneck** is an image model block used in the [ENet](https://paperswithcode.com/method/enet) semantic segmentation architecture. Each block consists of three convolutional layers: a 1 × 1 projection that reduces the dimensionality, a main convolutional layer, and a 1 × 1 expansion. We place [Batch Normalization](https://paperswithcode.com/method/batch-normalization) and [PReLU](https://paperswithcode.com/method/prelu) between all convolutions. If the bottleneck is downsampling, a [max pooling](https://paperswithcode.com/method/max-pooling) layer is added to the main branch.\r Also, the first 1 × 1 projection is replaced with a 2 × 2 [convolution](https://paperswithcode.com/method/convolution) with stride 2 in both dimensions. We zero pad the activations, to match the number of feature maps.""" ; skos:prefLabel "ENet Bottleneck" . :ENetDilatedBottleneck a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**ENet Dilated Bottleneck** is an image model block used in the [ENet](https://paperswithcode.com/method/enet) semantic segmentation architecture. It is the same as a regular [ENet Bottleneck](https://paperswithcode.com/method/enet-bottleneck) but employs dilated convolutions instead." ; skos:prefLabel "ENet Dilated Bottleneck" . :ENetInitialBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "The **ENet Initial Block** is an image model block used in the [ENet](https://paperswithcode.com/method/enet) semantic segmentation architecture. [Max Pooling](https://paperswithcode.com/method/max-pooling) is performed with non-overlapping 2 × 2 windows, and the [convolution](https://paperswithcode.com/method/convolution) has 13 filters, which sums up to 16 feature maps after concatenation. This is heavily inspired by Inception Modules." ; skos:prefLabel "ENet Initial Block" . 
:ERNIE a skos:Concept ; dcterms:source ; skos:definition "ERNIE is a transformer-based model consisting of two stacked modules: 1) textual encoder and 2) knowledgeable encoder, which is responsible for integrating extra token-oriented knowledge information into textual information. This layer consists of stacked aggregators, designed for encoding both tokens and entities as well as fusing their heterogeneous features. To integrate this layer of enhancing representations via knowledge, a special pre-training task is adopted for ERNIE - it involves randomly masking token-entity alignments and training the model to predict all corresponding entities based on aligned tokens (aka denoising entity auto-encoder)." ; skos:prefLabel "ERNIE" . :ERNIE-GEN a skos:Concept ; dcterms:source ; skos:definition "**ERNIE-GEN** is a multi-flow sequence-to-sequence pre-training and fine-tuning framework which bridges the discrepancy between training and inference with an infilling generation mechanism and a noise-aware generation method. To make generation closer to human writing patterns, this framework introduces a span-by-span generation flow that trains the model to predict semantically-complete spans consecutively rather than predicting word by word. Unlike existing pre-training methods, ERNIE-GEN incorporates multi-granularity target sampling to construct pre-training data, which enhances the correlation between encoder and decoder." ; skos:prefLabel "ERNIE-GEN" . :ERU a skos:Concept ; dcterms:source ; skos:altLabel "Efficient Recurrent Unit" ; skos:definition "An **Efficient Recurrent Unit (ERU)** extends [LSTM](https://paperswithcode.com/method/mrnn)-based language models by replacing linear transforms for processing the input vector with the [EESP](https://paperswithcode.com/method/eesp) unit inside the [LSTM](https://paperswithcode.com/method/lstm) cell." ; skos:prefLabel "ERU" . 
:ESACL a skos:Concept ; dcterms:source ; skos:altLabel "Enhanced Seq2Seq Autoencoder via Contrastive Learning" ; skos:definition "**ESACL**, or **Enhanced Seq2Seq Autoencoder via Contrastive Learning**, is a denoising sequence-to-sequence (seq2seq) autoencoder via contrastive learning for abstractive text summarization. The model adopts a standard [Transformer](https://paperswithcode.com/method/transformer)-based architecture with a multilayer bi-directional encoder and an autoregressive decoder. To enhance its denoising ability, self-supervised contrastive learning is incorporated along with various sentence-level document augmentation." ; skos:prefLabel "ESACL" . :ESIM a skos:Concept ; dcterms:source ; skos:altLabel "Enhanced Sequential Inference Model" ; skos:definition "**Enhanced Sequential Inference Model** or **ESIM** is a sequential NLI model proposed in the [Enhanced LSTM for Natural Language Inference](https://www.aclweb.org/anthology/P17-1152) paper." ; skos:prefLabel "ESIM" . :ESP a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Efficient Spatial Pyramid" ; skos:definition "An **Efficient Spatial Pyramid (ESP)** is an image model block based on a factorization principle that decomposes a standard [convolution](https://paperswithcode.com/method/convolution) into two steps: (1) point-wise convolutions and (2) a spatial pyramid of dilated convolutions. The point-wise convolutions help in reducing the computation, while the spatial pyramid of dilated convolutions re-samples the feature maps to learn the representations from a large effective receptive field. This allows for increased efficiency compared to other image blocks such as [ResNeXt](https://paperswithcode.com/method/resnext) blocks and Inception modules." ; skos:prefLabel "ESP" . :ESPNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**ESPNet** is a convolutional neural network for semantic segmentation of high resolution images under resource constraints. 
ESPNet is based on a convolutional module, efficient spatial pyramid ([ESP](https://paperswithcode.com/method/esp)), which is efficient in terms of computation, memory, and power." ; skos:prefLabel "ESPNet" . :ESPNetv2 a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**ESPNetv2** is a convolutional neural network that utilises group point-wise and depth-wise dilated separable convolutions to learn representations from a large effective receptive field with fewer FLOPs and parameters." ; skos:prefLabel "ESPNetv2" . :ETC a skos:Concept ; dcterms:source ; skos:altLabel "Extended Transformer Construction" ; skos:definition "**Extended Transformer Construction**, or **ETC**, is an extension of the [Transformer](https://paperswithcode.com/method/transformer) architecture with a new attention mechanism that extends the original in two main ways: (1) it allows scaling up the input length from 512 to several thousands; and (2) it can ingest structured inputs instead of just linear sequences. The key ideas that enable ETC to achieve these are a new [global-local attention mechanism](https://paperswithcode.com/method/global-local-attention), coupled with [relative position encodings](https://paperswithcode.com/method/relative-position-encodings). ETC also allows lifting weights from existing [BERT](https://paperswithcode.com/method/bert) models, saving computational resources while training." ; skos:prefLabel "ETC" . :EVM a skos:Concept ; dcterms:source ; skos:altLabel "Extreme Value Machine" ; skos:definition "" ; skos:prefLabel "EVM" . :EWC a skos:Concept ; dcterms:source ; skos:altLabel "Elastic Weight Consolidation" ; skos:definition "A method to overcome catastrophic forgetting in neural networks during continual learning" ; skos:prefLabel "EWC" . :EarlyDropout a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "Introduced by Hinton et al. 
in 2012, dropout has stood the test of time as a regularizer for preventing overfitting in neural networks. In this study, we demonstrate that dropout can also mitigate underfitting when used at the start of training. During the early phase, we find dropout reduces the directional variance of gradients across mini-batches and helps align the mini-batch gradients with the entire dataset's gradient. This helps counteract the stochasticity of SGD and limit the influence of individual batches on model training. Our findings lead us to a solution for improving performance in underfitting models - early dropout: dropout is applied only during the initial phases of training, and turned off afterwards. Models equipped with early dropout achieve lower final training loss compared to their counterparts without dropout. Additionally, we explore a symmetric technique for regularizing overfitting models - late dropout, where dropout is not used in the early iterations and is only activated later in training. Experiments on ImageNet and various vision tasks demonstrate that our methods consistently improve generalization accuracy. Our results encourage more research on understanding regularization in deep learning and our methods can be useful tools for future neural network training, especially in the era of large data. Code is available at https://github.com/facebookresearch/dropout." ; skos:prefLabel "Early Dropout" . :EarlyStopping a skos:Concept ; skos:definition """**Early Stopping** is a regularization technique for deep neural networks that stops training when parameter updates no longer yield improvements on a validation set. In essence, we store and update the current best parameters during training, and when parameter updates no longer yield an improvement (after a set number of iterations) we stop training and use the last best parameters. 
It works as a regularizer by restricting the optimization procedure to a smaller volume of parameter space.\r \r Image Source: [Ramazan Gençay](https://www.researchgate.net/figure/Early-stopping-based-on-cross-validation_fig1_3302948)""" ; skos:prefLabel "Early Stopping" . :Earlyexiting a skos:Concept ; dcterms:source ; skos:altLabel "Early exiting using confidence measures" ; skos:definition "Exits whenever the model is confident enough, allowing early exiting from hidden layers" ; skos:prefLabel "Early exiting" . :EdgeBoxes a skos:Concept ; skos:definition """**EdgeBoxes** is an approach for generating object bounding box proposals directly from edges. Similar to segments, edges provide a simplified but informative representation of an image. In fact, line drawings of an image can accurately convey the high-level information contained in an image\r using only a small fraction of the information. \r \r The main insight behind the method is the observation: the number of contours wholly enclosed by a bounding box is indicative of the likelihood of the box containing an object. We say a contour is wholly enclosed by a box if all edge pixels belonging to the contour lie within the interior of the box. Edges tend to correspond to object boundaries, and as such boxes that tightly enclose a set of edges are likely to contain an object. However, some edges that lie within an object’s bounding box may not be part of the contained object. Specifically, edge pixels that belong to contours straddling the box’s boundaries are likely to correspond to objects or structures that lie outside the box.\r \r Source: [Zitnick and Dollar](https://pdollar.github.io/files/papers/ZitnickDollarECCV14edgeBoxes.pdf)""" ; skos:prefLabel "EdgeBoxes" . :EdgeFlow a skos:Concept ; dcterms:source ; skos:definition """**EdgeFlow** is an interactive segmentation architecture that fully utilizes interactive information of user clicks with edge-guided flow. 
Edge guidance is the idea that interactive segmentation improves segmentation masks progressively with user clicks. Based on user clicks, an edge mask scheme is used, which takes the object edges estimated from the previous iteration as prior information, instead of direct mask estimation (using the previous mask directly as input could lead to poor segmentation results).\r \r The architecture consists of a coarse-to-fine network including CoarseNet and FineNet. For CoarseNet, [HRNet](https://paperswithcode.com/method/hrnet)-18+OCR is utilized as the base segmentation model and the edge-guided flow is appended to deal with interactive information. For FineNet, three [atrous convolution](https://paperswithcode.com/method/dilated-convolution) blocks are utilized to refine the coarse masks.""" ; skos:prefLabel "EdgeFlow" . :EffectiveSqueeze-and-ExcitationBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Effective Squeeze-and-Excitation Block** is an image model block based on squeeze-and-excitation, the difference being that one less FC layer is used. The authors note the SE module has a limitation: channel information loss due to dimension reduction. To avoid a high model complexity burden, the two FC layers of the SE module need to reduce the channel dimension. Specifically, while the first FC layer reduces the input feature channels $C$ to $C/r$ using reduction ratio $r$, the second FC layer expands the reduced channels to the original channel size $C$. As a result, this channel dimension reduction causes channel information loss. Therefore, effective SE (eSE) uses only one FC layer with $C$ channels instead of two FCs without channel dimension reduction, which maintains channel information." ; skos:prefLabel "Effective Squeeze-and-Excitation Block" . 
:EfficientChannelAttention a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Efficient Channel Attention** is an architectural unit based on [squeeze-and-excitation](https://paperswithcode.com/method/squeeze-and-excitation-block) blocks that reduces model complexity without dimensionality reduction. It was proposed as part of the [ECA-Net](https://paperswithcode.com/method/eca-net) CNN architecture. \r \r After channel-wise [global average pooling](https://paperswithcode.com/method/global-average-pooling) without dimensionality reduction, the ECA captures local cross-channel interaction by considering every channel and its $k$ neighbors. The ECA can be efficiently implemented by fast $1D$ [convolution](https://paperswithcode.com/method/convolution) of size $k$, where kernel size $k$ represents the coverage of local cross-channel interaction, i.e., how many neighbors participate in attention prediction of one channel.""" ; skos:prefLabel "Efficient Channel Attention" . :EfficientDet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**EfficientDet** is a type of object detection model, which utilizes several optimization and backbone tweaks, such as the use of a [BiFPN](https://paperswithcode.com/method/bifpn), and a compound scaling method that uniformly scales the resolution, depth and width for all backbones, feature networks and box/class prediction networks at the same time." ; skos:prefLabel "EfficientDet" . :EfficientNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**EfficientNet** is a convolutional neural network architecture and scaling method that uniformly scales all dimensions of depth/width/resolution using a *compound coefficient*. Unlike conventional practice that arbitrarily scales these factors, the EfficientNet scaling method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients. 
For example, if we want to use $2^N$ times more computational resources, then we can simply increase the network depth by $\\alpha ^ N$, width by $\\beta ^ N$, and image size by $\\gamma ^ N$, where $\\alpha, \\beta, \\gamma$ are constant coefficients determined by a small grid search on the original small model. EfficientNet uses a compound coefficient $\\phi$ to uniformly scale network width, depth, and resolution in a principled way.\r \r The compound scaling method is justified by the intuition that if the input image is bigger, then the network needs more layers to increase the receptive field and more channels to capture more fine-grained patterns on the bigger image.\r \r The base EfficientNet-B0 network is based on the inverted bottleneck residual blocks of [MobileNetV2](https://paperswithcode.com/method/mobilenetv2), in addition to squeeze-and-excitation blocks.\r \r EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters.""" ; skos:prefLabel "EfficientNet" . :EfficientNetV2 a skos:Concept ; dcterms:source ; skos:definition """**EfficientNetV2** is a type of convolutional neural network that has faster training speed and better parameter efficiency than [previous models](https://paperswithcode.com/method/efficientnet). To develop these models, the authors use a combination of training-aware [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) and scaling, to jointly optimize training speed. 
The models were searched from the search space enriched with new ops such as [Fused-MBConv](https://ai.googleblog.com/2019/08/efficientnet-edgetpu-creating.html).\r \r Architecturally the main differences are:\r \r - EfficientNetV2 extensively uses both [MBConv](https://paperswithcode.com/method/inverted-residual-block) and the newly added fused-MBConv in the early layers.\r - EfficientNetV2 prefers smaller expansion ratios for [MBConv](https://paperswithcode.com/method/inverted-residual-block) since smaller expansion ratios tend to have less memory access overhead.\r - EfficientNetV2 prefers smaller 3x3 kernel sizes, but it adds more layers to compensate for the reduced receptive field resulting from the smaller kernel size. \r - EfficientNetV2 completely removes the last stride-1 stage in the original EfficientNet, perhaps due to its large parameter size and memory access overhead.""" ; skos:prefLabel "EfficientNetV2" . a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """Decoder architecture inspired by the [UNet++](https://paperswithcode.com/method/unet) structure and the [EfficientNet](https://paperswithcode.com/method/efficientnet) building blocks. Keeping the UNet++ structure, the EfficientUNet++ achieves higher performance and significantly lower computational complexity through two simple modifications:\r \r * Replaces the 3x3 convolutions of the UNet++ with residual bottleneck blocks with depthwise convolutions\r * Applies channel and spatial attention to the bottleneck feature maps using [concurrent spatial and channel squeeze & excitation (scSE)](https://paperswithcode.com/method/scse) blocks""" ; skos:prefLabel "EfficientUNet++" . 
:ElasticDenseBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Elastic Dense Block** is a skip connection block that modifies the [Dense Block](https://paperswithcode.com/method/dense-block) with downsamplings and upsamplings in parallel branches at each layer to let the network learn from a data scaling policy in which inputs are processed at different resolutions in each layer. It is called \"elastic\" because each layer in the network is flexible in terms of choosing the best scale by a soft policy." ; skos:prefLabel "Elastic Dense Block" . :ElasticFace a skos:Concept ; dcterms:source ; skos:altLabel "Elastic Margin Loss for Deep Face Recognition" ; skos:definition "" ; skos:prefLabel "ElasticFace" . :ElasticResNeXtBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "An **Elastic ResNeXt Block** is a modification of the [ResNeXt Block](https://paperswithcode.com/method/resnext-block) that adds downsamplings and upsamplings in parallel branches at each layer. It is called \"elastic\" because each layer in the network is flexible in terms of choosing the best scale by a soft policy." ; skos:prefLabel "Elastic ResNeXt Block" . :Electric a skos:Concept ; dcterms:source ; skos:definition """**Electric** is an energy-based cloze model for representation learning over text. Like BERT, it is a conditional generative model of tokens given their contexts. However, Electric does not use masking or output a full distribution over tokens that could occur in a context. Instead, it assigns a scalar energy score to each input token indicating how likely it is given its context.\r \r Specifically, like BERT, Electric also models $p\\_{\\text {data }}\\left(x\\_{t} \\mid \\mathbf{x}\\_{\\backslash t}\\right)$, but does not use masking or a softmax layer. 
Electric first maps the unmasked input $\\mathbf{x}=\\left[x\\_{1}, \\ldots, x\\_{n}\\right]$ into contextualized vector representations $\\mathbf{h}(\\mathbf{x})=\\left[\\mathbf{h}\\_{1}, \\ldots, \\mathbf{h}\\_{n}\\right]$ using a transformer network. The model assigns a given position $t$ an energy score\r \r $$\r E(\\mathbf{x})\\_{t}=\\mathbf{w}^{T} \\mathbf{h}(\\mathbf{x})\\_{t}\r $$\r \r using a learned weight vector $\\mathbf{w}$. The energy function defines a distribution over the possible tokens at position $t$ as\r \r $$\r p\\_{\\theta}\\left(x\\_{t} \\mid \\mathbf{x}\\_{\\backslash t}\\right)=\\exp \\left(-E(\\mathbf{x})\\_{t}\\right) / Z\\left(\\mathbf{x}\\_{\\backslash t}\\right) \r $$\r \r $$\r =\\frac{\\exp \\left(-E(\\mathbf{x})\\_{t}\\right)}{\\sum\\_{x^{\\prime} \\in \\mathcal{V}} \\exp \\left(-E\\left(\\operatorname{REPLACE}\\left(\\mathbf{x}, t, x^{\\prime}\\right)\\right)\\_{t}\\right)}\r $$\r \r where $\\text{REPLACE}\\left(\\mathbf{x}, t, x^{\\prime}\\right)$ denotes replacing the token at position $t$ with $x^{\\prime}$ and $\\mathcal{V}$ is the vocabulary, in practice usually word pieces. Unlike with BERT, which produces the probabilities for all possible tokens $x^{\\prime}$ using a softmax layer, a candidate $x^{\\prime}$ is passed in as input to the transformer. As a result, computing $p\\_{\\theta}$ is prohibitively expensive because the partition function $Z\\_{\\theta}\\left(\\mathbf{x}\\_{\\backslash t}\\right)$ requires running the transformer $|\\mathcal{V}|$ times; unlike most EBMs, the intractability of $Z\\_{\\theta}(\\mathbf{x}\\_{\\backslash t})$ is due more to the expensive scoring function than to a large sample space.""" ; skos:prefLabel "Electric" . :EligibilityTrace a skos:Concept ; skos:definition """An **Eligibility Trace** is a memory vector $\\textbf{z}\\_{t} \\in \\mathbb{R}^{d}$ that parallels the long-term weight vector $\\textbf{w}\\_{t} \\in \\mathbb{R}^{d}$. 
The idea is that when a component of $\\textbf{w}\\_{t}$ participates in producing an estimated value, the corresponding component of $\\textbf{z}\\_{t}$ is bumped up and then begins to fade away. Learning will then occur in that component of $\\textbf{w}\\_{t}$ if a nonzero TD error occurs before the trace falls back to zero. The trace-decay parameter $\\lambda \\in \\left[0, 1\\right]$ determines the rate at which the trace falls.\r \r Intuitively, they tackle the credit assignment problem by capturing both a frequency heuristic - states that are visited more often deserve more credit - and a recency heuristic - states that are visited more recently deserve more credit.\r \r $$E\\_{0}\\left(s\\right) = 0 $$\r $$E\\_{t}\\left(s\\right) = \\gamma\\lambda{E}\\_{t-1}\\left(s\\right) + \\textbf{1}\\left(S\\_{t} = s\\right) $$\r \r Source: Sutton and Barto, Reinforcement Learning, 2nd Edition""" ; skos:prefLabel "Eligibility Trace" . :EmbeddedDotProductAffinity a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Embedded Dot Product Affinity** is a type of affinity or self-similarity function between two points $\\mathbb{x\\_{i}}$ and $\\mathbb{x\\_{j}}$ that uses a dot product function in an embedding space:\r \r $$ f\\left(\\mathbb{x\\_{i}}, \\mathbb{x\\_{j}}\\right) = \\theta\\left(\\mathbb{x\\_{i}}\\right)^{T}\\phi\\left(\\mathbb{x\\_{j}}\\right) $$\r \r Here $\\theta\\left(x\\_{i}\\right) = W\\_{θ}x\\_{i}$ and $\\phi\\left(x\\_{j}\\right) = W\\_{φ}x\\_{j}$ are two embeddings.\r \r The main difference between the dot product and [embedded Gaussian affinity](https://paperswithcode.com/method/embedded-gaussian-affinity) functions is the presence of [softmax](https://paperswithcode.com/method/softmax), which plays the role of an activation function.""" ; skos:prefLabel "Embedded Dot Product Affinity" . 
:EmbeddedGaussianAffinity a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Embedded Gaussian Affinity** is a type of affinity or self-similarity function between two points $\\mathbf{x\\_{i}}$ and $\\mathbf{x\\_{j}}$ that uses a Gaussian function in an embedding space:\r \r $$ f\\left(\\mathbf{x\\_{i}}, \\mathbf{x\\_{j}}\\right) = e^{\\theta\\left(\\mathbf{x\\_{i}}\\right)^{T}\\phi\\left(\\mathbf{x\\_{j}}\\right)} $$\r \r Here $\\theta\\left(x\\_{i}\\right) = W\\_{θ}x\\_{i}$ and $\\phi\\left(x\\_{j}\\right) = W\\_{φ}x\\_{j}$ are two embeddings.\r \r Note that the self-attention module used in the original [Transformer](https://paperswithcode.com/method/transformer) model is a special case of non-local operations in the embedded Gaussian version. This can be seen from the fact that for a given $i$, $\\frac{1}{\\mathcal{C}\\left(\\mathbf{x}\\right)}\\sum\\_{\\forall{j}}f\\left(\\mathbf{x}\\_{i}, \\mathbf{x}\\_{j}\\right)g\\left(\\mathbf{x}\\_{j}\\right)$ becomes the [softmax](https://paperswithcode.com/method/softmax) computation along the dimension $j$. So we have $\\mathbf{y} = \\text{softmax}\\left(\\mathbf{x}^{T}W^{T}\\_{\\theta}W\\_{\\phi}\\mathbf{x}\\right)g\\left(\\mathbf{x}\\right)$, which is the self-attention form in the Transformer model. This shows how we can relate this recent self-attention model to the classic computer vision method of non-local means.""" ; skos:prefLabel "Embedded Gaussian Affinity" . :EmbeddingDropout a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Embedding Dropout** is equivalent to performing [dropout](https://paperswithcode.com/method/dropout) on the embedding matrix at a word level, where the dropout is broadcast across all the word vector’s embedding. The remaining non-dropped-out word embeddings are scaled by $\\frac{1}{1-p\\_{e}}$ where $p\\_{e}$ is the probability of embedding dropout. 
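A minimal NumPy sketch of this word-level masking (our own illustration, not the authors' code; the function and variable names are hypothetical):

```python
import numpy as np

def embedding_dropout(embed, p_e, rng):
    # Drop entire rows (words) of the embedding matrix and rescale the
    # survivors by 1 / (1 - p_e), as described above. Real implementations
    # reuse one row mask for a full forward and backward pass.
    keep = rng.random(embed.shape[0]) >= p_e   # one Bernoulli draw per word
    mask = keep[:, None] / (1.0 - p_e)         # broadcast across embedding dims
    return embed * mask

rng = np.random.default_rng(0)
E = rng.standard_normal((10, 4))               # toy 10-word, 4-dim embedding
E_dropped = embedding_dropout(E, p_e=0.5, rng=rng)
# every word vector is now either all zeros or scaled by 1 / (1 - 0.5) = 2
```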
As the dropout occurs on the embedding matrix that is used for a full forward and backward pass, this means that all occurrences of a specific word will disappear within that pass, equivalent to performing [variational dropout](https://paperswithcode.com/method/variational-dropout) on the connection between the one-hot embedding and the embedding lookup.\r \r Source: Merity et al, Regularizing and Optimizing [LSTM](https://paperswithcode.com/method/lstm) Language Models""" ; skos:prefLabel "Embedding Dropout" . :EmbraceNet a skos:Concept ; dcterms:source ; skos:altLabel "EmbraceNet: A robust deep learning architecture for multimodal classification" ; skos:definition "" ; skos:prefLabel "EmbraceNet" . :EncAttAgg a skos:Concept ; rdfs:seeAlso ; skos:altLabel "Encoder-Attender-Aggregator" ; skos:definition "EncAttAgg introduced two attenders to tackle two problems: 1) We introduce a mutual attender layer to efficiently obtain the entity-pair-specific mention representations. 2) We introduce an integration attender to weight mention pairs of a target entity pair." ; skos:prefLabel "EncAttAgg" . :End-To-EndMemoryNetwork a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """An **End-to-End Memory Network** is a neural network with a recurrent attention model over a possibly large external memory. The architecture is a form of [Memory Network](https://paperswithcode.com/method/memory-network), but unlike the model in that work, it is trained end-to-end, and hence requires significantly less supervision during training. It can also be seen as an extension of RNNsearch to the case where multiple computational steps (hops) are performed per output symbol.\r \r The model takes a discrete set of inputs $x\\_{1}, \\dots, x\\_{n}$ that are to be stored in the memory, a query $q$, and outputs an answer $a$. Each of the $x\\_{i}$, $q$, and $a$ contains symbols coming from a dictionary with $V$ words. 
The model writes all $x$ to the memory up to a fixed buffer size, and then finds a continuous representation for the $x$ and $q$. The continuous representation is then processed via multiple hops to output $a$.""" ; skos:prefLabel "End-To-End Memory Network" . :EnergyBasedProcess a skos:Concept ; dcterms:source ; skos:definition "**Energy Based Processes** extend energy based models to exchangeable data while allowing neural network parameterizations of the energy function. They extend the previously separate stochastic process and latent variable model perspectives in a common framework. The result is a generalization of [Gaussian processes](https://paperswithcode.com/method/gaussian-process) and Student-t processes that exploits EBMs for greater flexibility." ; skos:prefLabel "Energy Based Process" . :EnhancedFusionFramework a skos:Concept ; dcterms:source ; skos:definition """The **Enhanced Fusion Framework** proposes three different ideas to improve the existing MI-based BCI frameworks.\r \r Image source: [Fumanal-Idocin et al.](https://arxiv.org/pdf/2101.06968v1.pdf)""" ; skos:prefLabel "Enhanced Fusion Framework" . :EnsembleClustering a skos:Concept ; dcterms:source ; skos:definition """Ensemble clustering, also called consensus clustering, has\r been attracting much attention in recent years, aiming to combine multiple base clusterings into a better and more robust consensus clustering. Due to its good performance, ensemble clustering plays a vital role in many research areas, such as community detection and bioinformatics.""" ; skos:prefLabel "Ensemble Clustering" . :EntropyRegularization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Entropy Regularization** is a type of regularization used in [reinforcement learning](https://paperswithcode.com/methods/area/reinforcement-learning). 
For on-policy policy gradient based methods like [A3C](https://paperswithcode.com/method/a3c), the same mutual reinforcement behaviour leads to a highly-peaked $\\pi\\left(a\\mid{s}\\right)$ towards a few actions or action sequences, since it is easier for the actor and critic to overoptimise to a small portion of the environment. To reduce this problem, entropy regularization adds an entropy term to the loss to promote action diversity:\r \r $$H(X) = -\\sum\\pi\\left(x\\right)\\log\\left(\\pi\\left(x\\right)\\right) $$\r \r Image Credit: Wikipedia""" ; skos:prefLabel "Entropy Regularization" . :EpsilonGreedyExploration a skos:Concept ; skos:definition """**$\\epsilon$-Greedy Exploration** is an exploration strategy in reinforcement learning that takes an exploratory action with probability $\\epsilon$ and a greedy action with probability $1-\\epsilon$. It tackles the exploration-exploitation tradeoff in reinforcement learning algorithms: balancing the desire to explore the state space against the desire to seek an optimal policy. Despite its simplicity, it is still commonly used as a behaviour policy $\\pi$ in several state-of-the-art reinforcement learning models.\r \r Image Credit: [Robin van Embden](https://cran.r-project.org/web/packages/contextual/vignettes/sutton_barto.html)""" ; skos:prefLabel "Epsilon Greedy Exploration" . :EsViT a skos:Concept ; dcterms:source ; skos:definition "**EsViT** proposes two techniques for developing efficient self-supervised vision transformers for visual representation learning: a multi-stage architecture with sparse self-attention and a new pre-training task of region matching. The multi-stage architecture reduces modeling complexity but with a cost of losing the ability to capture fine-grained correspondences between image regions. The new pretraining task allows the model to capture fine-grained region dependencies and as a result significantly improves the quality of the learned vision representations." ; skos:prefLabel "EsViT" . 
:EstimationStatistics a skos:Concept ; skos:definition "Estimation statistics is a data analysis framework that uses a combination of effect sizes, confidence intervals, precision planning, and meta-analysis to plan experiments, analyze data and interpret results. It is distinct from null hypothesis significance testing (NHST), which is considered to be less informative. The primary aim of estimation methods is to report an effect size (a point estimate) along with its confidence interval, the latter of which is related to the precision of the estimate. The confidence interval summarizes a range of likely values of the underlying population effect. Proponents of estimation see reporting a P value as an unhelpful distraction from the important business of reporting an effect size with its confidence intervals, and believe that estimation should replace significance testing for data analysis." ; skos:prefLabel "Estimation Statistics" . :EuclideanNormRegularization a skos:Concept ; dcterms:source ; skos:definition """**Euclidean Norm Regularization** is a regularization step used in [generative adversarial networks](https://paperswithcode.com/methods/category/generative-adversarial-networks), and is typically added to both the generator and discriminator losses:\r \r $$ R\\_{z} = w\\_{r} \\cdot ||\\Delta{z}||^{2}\\_{2} $$\r \r where the scalar weight $w\\_{r}$ is a parameter.\r \r Image: [LOGAN](https://paperswithcode.com/method/logan)""" ; skos:prefLabel "Euclidean Norm Regularization" . :EvoNorms a skos:Concept ; dcterms:source ; skos:definition "**EvoNorms** are a set of normalization-activation layers that go beyond existing design patterns. Normalization and activation are unified into a single computation graph, and its structure is evolved starting from low-level primitives. EvoNorms consist of two series: B series and S series. The B series are batch-dependent and were discovered by our method without any constraint. 
The S series work on individual samples, and were discovered by rejecting any batch-dependent operations." ; skos:prefLabel "EvoNorms" . :ExactFusionModel a skos:Concept ; dcterms:source ; skos:definition "**Exact Fusion Model (EFM)** is a method for aggregating a feature pyramid. The EFM is based on [YOLOv3](https://paperswithcode.com/method/yolov3), which assigns exactly one bounding-box prior to each ground truth object. Each ground truth bounding box corresponds to one anchor box that surpasses the threshold IoU. If the size of an anchor box is equivalent to the field-of-view of the grid cell, then for the grid cells of the $s$-th scale, the corresponding bounding box will be lower bounded by the $(s − 1)$th scale and upper bounded by the $(s + 1)$th scale. Therefore, the EFM assembles features from the three scales." ; skos:prefLabel "Exact Fusion Model" . :ExpectedSarsa a skos:Concept ; skos:definition """**Expected Sarsa** is like [Q-learning](https://paperswithcode.com/method/q-learning) but instead of taking the maximum over next state-action pairs, we use the expected value, taking into account how likely each action is under the current policy.\r \r $$Q\\left(S\\_{t}, A\\_{t}\\right) \\leftarrow Q\\left(S\\_{t}, A\\_{t}\\right) + \\alpha\\left[R\\_{t+1} + \\gamma\\sum\\_{a}\\pi\\left(a\\mid{S\\_{t+1}}\\right)Q\\left(S\\_{t+1}, a\\right) - Q\\left(S\\_{t}, A\\_{t}\\right)\\right] $$\r \r Except for this change to the update rule, the algorithm otherwise follows the scheme of Q-learning. It is more computationally expensive than [Sarsa](https://paperswithcode.com/method/sarsa) but it eliminates the variance due to the random selection of $A\\_{t+1}$.\r \r Source: Sutton and Barto, Reinforcement Learning, 2nd Edition""" ; skos:prefLabel "Expected Sarsa" . 
:ExperienceReplay a skos:Concept ; skos:definition """**Experience Replay** is a replay memory technique used in reinforcement learning where we store the agent’s experiences at each time-step, $e\\_{t} = \\left(s\\_{t}, a\\_{t}, r\\_{t}, s\\_{t+1}\\right)$ in a data-set $D = e\\_{1}, \\cdots, e\\_{N}$ , pooled over many episodes into a replay memory. We then usually sample the memory randomly for a minibatch of experience, and use this to learn off-policy, as with Deep Q-Networks. This tackles the problem of autocorrelation leading to unstable training, by making the problem more like a supervised learning problem.\r \r Image Credit: [Hands-On Reinforcement Learning with Python, Sudharsan Ravichandiran](https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781788836524)""" ; skos:prefLabel "Experience Replay" . :ExplanationvsAttention a skos:Concept ; skos:altLabel "Explanation vs Attention: A Two-Player Game to Obtain Attention for VQA" ; skos:definition "In this paper, we aim to obtain improved attention for a visual question answering (VQA) task. It is challenging to provide supervision for attention. An observation we make is that visual explanations as obtained through class activation mappings (specifically Grad-[CAM](https://paperswithcode.com/method/cam)) that are meant to explain the performance of various networks could form a means of supervision. However, as the distributions of attention maps and that of Grad-CAMs differ, it would not be suitable to directly use these as a form of supervision. Rather, we propose the use of a discriminator that aims to distinguish samples of visual explanation and attention maps. The use of adversarial training of the attention regions as a two-player game between attention and explanation serves to bring the distributions of attention maps and visual explanations closer. 
Significantly, we observe that providing such a means of supervision also results in attention maps that are more closely related to human attention, resulting in a substantial improvement over baseline stacked attention network (SAN) models. It also results in a good improvement in the rank correlation metric on the VQA task. This method can also be combined with recent MCB based methods and results in consistent improvement. We also provide comparisons with other means for learning distributions such as based on Correlation Alignment (Coral), Maximum Mean Discrepancy (MMD) and Mean Square Error (MSE) losses and observe that the adversarial loss outperforms the other forms of learning the attention maps. Visualization of the results also confirms our hypothesis that attention maps improve using this form of supervision." ; skos:prefLabel "Explanation vs Attention" . :ExponentialDecay a skos:Concept ; skos:definition """**Exponential Decay** is a learning rate schedule where we decay the learning rate with more iterations using an exponential function:\r \r $$ \\text{lr} = \\text{lr}\\_{0}\\exp\\left(-kt\\right) $$\r \r Image Credit: [Suki Lau](https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1)""" ; skos:prefLabel "Exponential Decay" . :ExtremeNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**ExtremeNet** is a bottom-up object detection framework that detects four extreme points (top-most, left-most, bottom-most, right-most) of an object. It uses a keypoint estimation framework to find extreme points, by predicting four multi-peak heatmaps for each object category. In addition, it uses one [heatmap](https://paperswithcode.com/method/heatmap) per category predicting the object center, as the average of two bounding box edges in both the x and y dimension. We group extreme points into objects with a purely geometry-based approach. 
We group four extreme points, one from each map, if and only if their\r geometric center is predicted in the center heatmap with a score higher than a pre-defined threshold. We enumerate all $O\\left(n^{4}\\right)$ combinations of extreme point prediction, and select the valid ones.""" ; skos:prefLabel "ExtremeNet" . :F2DNet a skos:Concept ; dcterms:source ; skos:altLabel "Fast Focal Detection Network" ; skos:definition """F2DNet is a novel two-stage object detection architecture that eliminates the redundancy of classical two-stage detectors by replacing the region proposal network with a focal detection network and the\r bounding box head with a fast suppression head.""" ; skos:prefLabel "F2DNet" . :FA a skos:Concept ; dcterms:source ; skos:altLabel "Feedback Alignment" ; skos:definition "" ; skos:prefLabel "FA" . :FASFA a skos:Concept ; skos:altLabel "FASFA: A Novel Next-Generation Backpropagation Optimizer" ; skos:definition "This paper introduces the fast adaptive stochastic function accelerator (FASFA) for gradient-based optimization of stochastic objective functions. It works based on Nesterov-enhanced first and second momentum estimates. The method is simple and effective during implementation because it has intuitive/familiar hyperparameterization. The training dynamics can be progressive or conservative depending on the decay rate sum. It works well with a low learning rate and mini batch size. Experiments and statistics showed convincing evidence that FASFA could be an ideal candidate for optimizing stochastic objective functions, particularly those generated by multilayer perceptrons with convolution and dropout layers. In addition, the convergence properties and regret bound provide results aligning with the online convex optimization framework. In a first of its kind, FASFA addresses the growing need for diverse optimizers by providing next-generation training dynamics for artificial intelligence algorithms. 
Future experiments could modify FASFA based on the infinity norm." ; skos:prefLabel "FASFA" . a skos:Concept ; dcterms:source ; skos:altLabel "Fast Attention Via Positive Orthogonal Random Features" ; skos:definition """**FAVOR+**, or **Fast Attention Via Positive Orthogonal Random Features**, is an efficient attention mechanism used in the [Performer](https://paperswithcode.com/method/performer) architecture which leverages approaches such as kernel methods and random features approximation for approximating [softmax](https://paperswithcode.com/method/softmax) and Gaussian kernels. \r \r FAVOR+ works for attention blocks using matrices $\\mathbf{A} \\in \\mathbb{R}^{L \\times L}$ of the form $\\mathbf{A}(i, j) = K(\\mathbf{q}\\_{i}^{T}, \\mathbf{k}\\_{j}^{T})$, with $\\mathbf{q}\\_{i}/\\mathbf{k}\\_{j}$ standing for the $i^{th}/j^{th}$ query/key row-vector in $\\mathbf{Q}/\\mathbf{K}$ and kernel $K : \\mathbb{R}^{d} \\times \\mathbb{R}^{d} \\rightarrow \\mathbb{R}\\_{+}$ defined for the (usually randomized) mapping: $\\phi : \\mathbb{R}^{d} \\rightarrow \\mathbb{R}^{r}\\_{+}$ (for some $r > 0$) as:\r \r $$K(\\mathbf{x}, \\mathbf{y}) = E[\\phi(\\mathbf{x})^{T}\\phi(\\mathbf{y})] $$\r \r We call $\\phi(\\mathbf{u})$ a random feature map for $\\mathbf{u} \\in \\mathbb{R}^{d}$. For $\\mathbf{Q}^{'}, \\mathbf{K}^{'} \\in \\mathbb{R}^{L \\times r}$ with rows given as $\\phi(\\mathbf{q}\\_{i}^{T})^{T}$ and $\\phi(\\mathbf{k}\\_{i}^{T})^{T}$ respectively, this leads directly to the efficient attention mechanism of the form:\r \r $$ \\hat{Att\\_{\\leftrightarrow}}\\left(\\mathbf{Q}, \\mathbf{K}, \\mathbf{V}\\right) = \\hat{\\mathbf{D}}^{-1}(\\mathbf{Q^{'}}((\\mathbf{K^{'}})^{T}\\mathbf{V}))$$\r \r where\r \r $$\\mathbf{\\hat{D}} = \\text{diag}(\\mathbf{Q^{'}}((\\mathbf{K^{'}})^{T}\\mathbf{1}\\_{L})) $$\r \r The above scheme constitutes the [FA](https://paperswithcode.com/method/dfa)-part of the FAVOR+ mechanism. 
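As a sanity check, the FA step above can be sketched in NumPy, assuming the feature-mapped matrices are already given (function and variable names are ours, not from the paper):

```python
import numpy as np

def favor_fa(Qp, Kp, V):
    # Linearised attention: D^{-1} (Qp ((Kp)^T V)), costing O(L * r * d)
    # instead of the O(L^2 * d) of explicit attention. The random feature
    # map producing Qp/Kp (the R and OR+ parts) is assumed given.
    KV = Kp.T @ V                   # (r, d), shared by all queries
    norm = Qp @ Kp.sum(axis=0)      # (L,), equals Qp ((Kp)^T 1_L)
    return (Qp @ KV) / norm[:, None]

rng = np.random.default_rng(0)
L, r, d = 6, 5, 3
Qp, Kp, V = rng.random((L, r)), rng.random((L, r)), rng.random((L, d))
# agrees with explicit attention under the kernel phi(q)^T phi(k)
A = Qp @ Kp.T
explicit = (A / A.sum(axis=1, keepdims=True)) @ V
assert np.allclose(favor_fa(Qp, Kp, V), explicit)
```

The saving comes from associativity: the $r \times d$ product $K'^{T}V$ is computed once and reused for every query row.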
The other parts are achieved by:\r \r - The R part: The softmax kernel is approximated through trigonometric functions, in the form of a regularized softmax-kernel SMREG, that employs positive random features (PRFs).\r - The OR+ part: To reduce the variance of the estimator, so we can use a smaller number of random features, different samples are entangled to be exactly orthogonal using the Gram-Schmidt orthogonalization procedure.\r \r The details are quite technical, so it is recommended you read the paper for further information on these steps.""" ; skos:prefLabel "FAVOR+" . :FBNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**FBNet** is a family of convolutional neural architectures discovered through [DNAS](https://paperswithcode.com/method/dnas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search). It utilises a basic type of image model block inspired by [MobileNetv2](https://paperswithcode.com/method/mobilenetv2) that utilises depthwise convolutions and an inverted residual structure (see components)." ; skos:prefLabel "FBNet" . :FBNetBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**FBNet Block** is an image model block used in the [FBNet](https://paperswithcode.com/method/fbnet) architectures discovered through [DNAS](https://paperswithcode.com/method/dnas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search). The basic building blocks employed are [depthwise convolutions](https://paperswithcode.com/method/depthwise-convolution) and a [residual connection](https://paperswithcode.com/method/residual-connection)." ; skos:prefLabel "FBNet Block" . :FCN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Fully Convolutional Network" ; skos:definition """**Fully Convolutional Networks**, or **FCNs**, are an architecture used mainly for semantic segmentation. 
They employ solely locally connected layers, such as [convolution](https://paperswithcode.com/method/convolution), pooling and upsampling. Avoiding the use of dense layers means fewer parameters (making the networks faster to train). It also means an FCN can work for variable image sizes given all connections are local.\r \r The network consists of a downsampling path, used to extract and interpret the context, and an upsampling path, which allows for localization. \r \r FCNs also employ skip connections to recover the fine-grained spatial information lost in the downsampling path.""" ; skos:prefLabel "FCN" . :FCOS a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**FCOS** is an anchor-box free, proposal free, single-stage object detection model. By eliminating the predefined set of anchor boxes, FCOS avoids computation related to anchor boxes such as calculating overlapping during training. It also avoids all hyper-parameters related to anchor boxes, which are often very sensitive to the final detection performance." ; skos:prefLabel "FCOS" . :FCPose a skos:Concept ; dcterms:source ; skos:definition """**FCPose** is a fully convolutional multi-person [pose estimation framework](https://paperswithcode.com/methods/category/pose-estimation-models) using dynamic instance-aware convolutions. Different from existing methods, which often require ROI (Region of Interest) operations and/or grouping post-processing, FCPose eliminates the ROIs and grouping post-processing with dynamic instance aware keypoint estimation heads. The dynamic keypoint heads are conditioned on each instance (person), and can encode the instance concept in the dynamically-generated weights of their filters. \r \r Overall, FCPose is built upon the one-stage object detector [FCOS](https://paperswithcode.com/method/fcos). The controller that generates the weights of the keypoint heads is attached to the FCOS heads. 
The weights $\\theta\\_{i}$ generated by the controller are used to fulfill the keypoint head $f$ for the instance $i$. Moreover, a keypoint refinement module is introduced to predict the offsets from each location of the heatmaps to the ground-truth keypoints. Finally, the coordinates derived from the predicted heatmaps are refined by the offsets predicted by the keypoint refinement module, resulting in the final keypoint results. "Rel. coord." is a map of the relative coordinates from all the locations of the feature maps $F$ to the location where the weights are generated. The relative coordinate map is concatenated to $F$ as the input to the keypoint head.""" ; skos:prefLabel "FCPose" . :FEFM a skos:Concept ; dcterms:source ; skos:altLabel "Field Embedded Factorization Machine" ; skos:definition """**Field Embedded Factorization Machine**, or **FEFM**, is a factorization machine variant. For each field pair, FEFM introduces symmetric matrix embeddings along with the usual feature vector embeddings that are present in FM. Like FM, $v\\_{i}$ is the vector embedding of the $i^{th}$ feature. However, unlike Field-Aware Factorization Machines (FFMs), FEFM doesn't explicitly learn field-specific feature embeddings. 
The learnable symmetric matrix $W\\_{F(i), F(j)}$ is the embedding for the field pair $F(i)$ and $F(j)$. The interaction between the $i^{th}$ feature and the $j^{th}$ feature is mediated through $W\\_{F(i), F(j)}$.\r \r $$\r \\phi(\\theta, x)=\\phi\\_{FEFM}((w, v, W), x)=w\\_{0}+\\sum\\_{i=1}^{m} w\\_{i} x\\_{i}+\\sum\\_{i=1}^{m} \\sum\\_{j=i+1}^{m} v\\_{i}^{T} W\\_{F(i), F(j)} v\\_{j} x\\_{i} x\\_{j}\r $$\r \r where $W\\_{F(i), F(j)}$ is a $k \\times k$ symmetric matrix ($k$ is the dimension of the feature vector embedding space containing feature vectors $v\\_{i}$ and $v\\_{j}$).\r \r The symmetric property of the learnable matrix $W\\_{F(i), F(j)}$ is ensured by reparameterizing $W\\_{F(i), F(j)}$ as $U\\_{F(i), F(j)}+U\\_{F(i), F(j)}^{T}$, where $U\\_{F(i), F(j)}^{T}$ is the transpose of the learnable matrix $U\\_{F(i), F(j)}$. Note that $W\\_{F(i), F(j)}$ can also be interpreted as a vector transformation matrix which transforms a feature embedding when interacting with a specific field.""" ; skos:prefLabel "FEFM" . :FFB6D a skos:Concept ; dcterms:source ; skos:definition "**FFB6D** is a full flow bidirectional fusion network for 6D pose estimation of known objects from a single RGBD image. Unlike previous works that extract the RGB and point cloud features independently and fuse them in the final stage, FFB6D builds bidirectional fusion modules as communication bridges in the full flow of the two networks. In this way, the two networks can obtain complementary information from the other and learn representations containing rich appearance and geometry information of the scene." ; skos:prefLabel "FFB6D" . :FFF a skos:Concept ; dcterms:source ; skos:altLabel "Fast Feedforward Networks" ; skos:definition "A log-time alternative to feedforward layers that outperforms both the vanilla feedforward and mixture-of-experts approaches." ; skos:prefLabel "FFF" . 
:FFMv1 a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Feature Fusion Module v1" ; skos:definition "**Feature Fusion Module v1** is a feature fusion module from the [M2Det](https://paperswithcode.com/method/m2det) object detection model, and feature fusion modules are crucial for constructing the final multi-level feature pyramid. They use [1x1 convolution](https://paperswithcode.com/method/1x1-convolution) layers to compress the channels of the input features and use a concatenation operation to aggregate these feature maps. FFMv1 takes two feature maps with different scales in the backbone as input, and adopts one upsample operation to rescale the deep features to the same scale before the concatenation operation." ; skos:prefLabel "FFMv1" . :FFMv2 a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Feature Fusion Module v2" ; skos:definition "**Feature Fusion Module v2** is a feature fusion module from the [M2Det](https://paperswithcode.com/method/m2det) object detection model, and is crucial for constructing the final multi-level feature pyramid. They use [1x1 convolution](https://paperswithcode.com/method/1x1-convolution) layers to compress the channels of the input features and use a concatenation operation to aggregate these feature maps. FFMv2 takes the base feature and the largest output feature map of the previous [Thinned U-Shape Module](https://paperswithcode.com/method/tum) (TUM) – these two are of the same scale – as input, and produces the fused feature for the next TUM." ; skos:prefLabel "FFMv2" . :FGA a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Factor Graph Attention" ; skos:definition "A general multimodal attention unit for any number of modalities. Inspired by graphical models, it infers several attention beliefs via aggregated interaction messages." ; skos:prefLabel "FGA" . 
:FIERCE a skos:Concept ; dcterms:source ; skos:altLabel "Feature Information Entropy Regularized Cross Entropy" ; skos:definition "FIERCE is an entropic regularization on the **feature** space." ; skos:prefLabel "FIERCE" . :FINCHClustering a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "First Integer Neighbor Clustering Hierarchy (FINCH)" ; skos:definition "FINCH is a parameter-free, fast and scalable clustering algorithm. It stands out for its speed and clustering quality." ; skos:prefLabel "FINCH Clustering" . :FLAVA a skos:Concept ; dcterms:source ; skos:definition "FLAVA aims at building a single holistic universal model that targets all modalities at once. FLAVA is a language vision alignment model that learns strong representations from multimodal data (image-text pairs) and unimodal data (unpaired images and text). The model consists of an image encoder transformer to capture unimodal image representations, a text encoder transformer to process unimodal text information, and a multimodal encoder transformer that takes as input the encoded unimodal image and text and integrates their representations for multimodal reasoning. During pretraining, masked image modeling (MIM) and masked language modeling (MLM) losses are applied onto the image and text encoders over a single image or a text piece, respectively, while contrastive, masked multimodal modeling (MMM), and image-text matching (ITM) losses are used over paired image-text data. For downstream tasks, classification heads are applied on the outputs from the image, text, and multimodal encoders respectively for visual recognition, language understanding, and multimodal reasoning tasks. It can be applied to a broad scope of tasks from three domains (visual recognition, language understanding, and multimodal reasoning) under a common transformer model architecture." ; skos:prefLabel "FLAVA" . 
:FLAVR a skos:Concept ; dcterms:source ; skos:definition """**FLAVR** is an architecture for video frame interpolation. It uses 3D space-time convolutions to enable end-to-end learning and inference for video frame interpolation. Overall, it consists of a [U-Net](https://paperswithcode.com/method/u-net) style architecture with 3D space-time convolutions and\r deconvolutions (yellow blocks). Channel gating is used after all (de-)[convolution](https://paperswithcode.com/method/convolution) layers (blue blocks). The final prediction layer (the purple block) is implemented as a convolution layer to project the 3D feature maps into $(k−1)$ frame predictions. This design allows FLAVR to predict multiple frames in one inference forward pass.""" ; skos:prefLabel "FLAVR" . :FLICA a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "A Framework for Leader Identification in Coordinated Activity" ; skos:definition """An agreement of a group to follow a common purpose is manifested by its coalescence into a coordinated behavior. The process of initiating this behavior and the period of decision-making by the group members necessarily precedes the coordinated behavior. Given time series of group members’ behavior, the goal is to find these periods of decision-making and identify the initiating individual, if one exists.\r \r Image Source: [Amornbunchornvej et al.](https://arxiv.org/pdf/1603.01570v2.pdf)""" ; skos:prefLabel "FLICA" . :FMix a skos:Concept ; dcterms:source ; skos:definition "A variant of [CutMix](https://paperswithcode.com/method/cutmix) which randomly samples masks from Fourier space." ; skos:prefLabel "FMix" . :FMwithsplines a skos:Concept ; dcterms:source ; skos:altLabel "Factorization machines with cubic splines for numerical features" ; skos:definition "Using cubic splines to improve factorization machine accuracy with numerical features" ; skos:prefLabel "FM with splines" . 
:FORK a skos:Concept ; dcterms:source ; skos:altLabel "Forward-Looking Actor" ; skos:definition """**FORK**, or **Forward-Looking Actor**, is a type of actor for actor-critic algorithms. In particular, FORK includes a neural network that forecasts the next state given the current state and current action, called the system network; and a neural network that forecasts the\r reward given a (state, action) pair, called the reward network. With the system network and reward network, FORK can forecast the next state and consider the value of the next state when improving the policy.""" ; skos:prefLabel "FORK" . :FPG a skos:Concept ; dcterms:source ; skos:altLabel "Feature Pyramid Grid" ; skos:definition """**Feature Pyramid Grids**, or **FPG**, is a deep multi-pathway feature pyramid that represents the feature scale-space as a regular grid of parallel bottom-up pathways which are fused by multi-directional lateral connections. It connects the backbone features, $C$, of a ConvNet with a regular structure of $p$ parallel top-down pyramid pathways which are fused by multi-directional lateral connections, AcrossSame, AcrossUp, AcrossDown, and AcrossSkip. AcrossSkip connections are direct, while all other types use [convolutional](https://paperswithcode.com/method/convolution) and [ReLU](https://paperswithcode.com/method/relu) layers.\r \r At a high level, FPG is a deep generalization of [FPN](https://paperswithcode.com/method/fpn) from one to $p$ pathways under a dense lateral connectivity structure.""" ; skos:prefLabel "FPG" . :FPN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Feature Pyramid Network" ; skos:definition """A **Feature Pyramid Network**, or **FPN**, is a feature extractor that takes a single-scale image of an arbitrary size as input, and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion. This process is independent of the backbone convolutional architectures. 
It therefore acts as a generic solution for building feature pyramids inside deep convolutional networks to be used in tasks like object detection.\r \r The construction of the pyramid involves a bottom-up pathway and a top-down pathway.\r \r The bottom-up pathway is the feedforward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. For the feature\r pyramid, one pyramid level is defined for each stage. The output of the last layer of each stage is used as a reference set of feature maps. For [ResNets](https://paperswithcode.com/method/resnet) we use the feature activations output by each stage’s last [residual block](https://paperswithcode.com/method/residual-block). \r \r The top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. These features are then enhanced with features from the bottom-up pathway via lateral connections. Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times.""" ; skos:prefLabel "FPN" . :FRILL a skos:Concept ; dcterms:source ; skos:definition "**FRILL** is a non-semantic speech embedding model trained via knowledge distillation that is fast enough to be run in real-time on a mobile device. The fastest model runs at 0.9 ms, which is 300x faster than TRILL and 25x faster than TRILL-distilled." ; skos:prefLabel "FRILL" . :FSAF a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**FSAF**, or Feature Selective Anchor-Free, is a building block for single-shot object detectors. It can be plugged into single-shot detectors with feature pyramid structure. 
The FSAF module addresses two limitations brought up by the conventional anchor-based detection: 1) heuristic-guided feature selection; 2) overlap-based anchor sampling. The general concept of the FSAF module is online feature selection applied to the training of multi-level anchor-free branches. Specifically, an anchor-free branch is attached to each level of the feature pyramid, allowing box encoding and decoding in the anchor-free manner at an arbitrary level. During training, we dynamically assign each instance to the most suitable feature level. At the time of inference, the FSAF module can work jointly with anchor-based branches by outputting predictions in parallel. We instantiate this concept with simple implementations of anchor-free branches and an online feature selection strategy.\r \r The general concept is presented in the Figure to the right. An anchor-free branch is built per level of feature pyramid, independent of the anchor-based branch. Similar to the anchor-based branch, it consists of a classification subnet and a regression subnet (not shown in figure). An instance can be assigned to an arbitrary level of the anchor-free branch. During training, we dynamically select the most suitable level of feature for each instance based on the instance content instead of just the size of the instance box. The selected level of feature then learns to detect the assigned instances. At inference, the FSAF module can run independently or jointly with anchor-based branches. The FSAF module is agnostic to the backbone network and can be applied to single-shot detectors with a structure of feature pyramid. Additionally, the instantiation of anchor-free branches and online feature selection can take various forms.""" ; skos:prefLabel "FSAF" . :FT-Transformer a skos:Concept ; dcterms:source ; skos:definition "FT-Transformer (Feature Tokenizer + Transformer) is a simple adaptation of the [Transformer](/method/transformer) architecture for the tabular domain. 
The model (Feature Tokenizer component) transforms all features (categorical and numerical) to tokens and runs a stack of Transformer layers over the tokens, so every Transformer layer operates on the feature level of one object. (This model is similar to [AutoInt](/method/autoint)). In the Transformer component, the `[CLS]` token is appended to $T$. Then $L$ Transformer layers are applied. PreNorm is used for easier optimization and good performance. The final representation of the `[CLS]` token is used for prediction." ; skos:prefLabel "FT-Transformer" . :FactorizedDenseSynthesizedAttention a skos:Concept ; dcterms:source ; skos:definition """**Factorized Dense Synthesized Attention** is a synthesized attention mechanism, similar to [dense synthesized attention](https://paperswithcode.com/method/dense-synthesized-attention), but we factorize the outputs to reduce parameters and prevent overfitting. It was proposed as part of the [Synthesizer](https://paperswithcode.com/method/synthesizer) architecture. The factorized variant of the dense synthesizer can be expressed as follows:\r \r $$A, B = F\\_{A}\\left(X\\_{i}\\right), F\\_{B}\\left(X\\_{i}\\right)$$\r \r where $F\\_{A}\\left(.\\right)$ projects input $X\\_{i}$ into $a$ dimensions, $F\\_B\\left(.\\right)$ projects $X\\_{i}$ to $b$ dimensions, and $a \\text{ x } b = l$. The output of the factorized module is now written as:\r \r $$ Y = \\text{Softmax}\\left(C\\right)G\\left(X\\right) $$\r \r where $C = H\\_{A}\\left(A\\right) * H\\_{B}\\left(B\\right)$, where $H\\_{A}$, $H\\_{B}$ are tiling functions and $C \\in \\mathbb{R}^{l \\text{ x } l}$. The tiling function simply duplicates the vector $k$ times, i.e., $\\mathbb{R}^{l} \\rightarrow \\mathbb{R}^{lk}$. In this case, $H\\_{A}\\left(\\right)$ is a projection of $\\mathbb{R}^{a} \\rightarrow \\mathbb{R}^{ab}$ and $H\\_{B}\\left(\\right)$ is a projection of $\\mathbb{R}^{b} \\rightarrow \\mathbb{R}^{ba}$. 
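As a rough single-head NumPy sketch of this factorized dense synthesizer (illustrative only: random matrices stand in for the learned projections $F\_{A}$, $F\_{B}$ and $G$, and NumPy tiling plays the role of the tiling functions $H\_{A}$ and $H\_{B}$):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

l, d, a, b = 12, 16, 3, 4  # sequence length l = a * b (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(l, d))
Fa = rng.normal(size=(d, a))  # stands in for the learned F_A
Fb = rng.normal(size=(d, b))  # stands in for the learned F_B
G = rng.normal(size=(d, d))   # stands in for the value projection G(.)

A = X @ Fa  # (l, a): a-dimensional projection per token
B = X @ Fb  # (l, b): b-dimensional projection per token
# Tiling: repeat A b times and B a times along the last axis, so each
# row becomes a length-l (= a * b) vector of synthesized logits.
C = np.tile(A, b) * np.tile(B, a)  # (l, l)
Y = softmax(C) @ (X @ G)           # (l, d)
assert Y.shape == (l, d)
```

Note the quadratic $l \text{ x } l$ matrix is synthesized from only $a + b$ values per token, rather than computed from query-key dot products.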
To avoid having similar values within the same block, we compose the outputs of $H\_{A}$ and $H\_{B}$.""" ; skos:prefLabel "Factorized Dense Synthesized Attention" . :FactorizedRandomSynthesizedAttention a skos:Concept ; dcterms:source ; skos:definition """**Factorized Random Synthesized Attention**, introduced with the [Synthesizer](https://paperswithcode.com/method/synthesizer) architecture, is similar to [factorized dense synthesized attention](https://paperswithcode.com/method/factorized-dense-synthesized-attention) but for random synthesizers. Letting $R$ be a randomly initialized matrix, we factorize $R$ into low rank matrices $R\_{1}, R\_{2} \in \mathbb{R}^{l \text{ x } k}$ in the attention function:\r \r $$ Y = \text{Softmax}\left(R\_{1}R\_{2}^{T}\right)G\left(X\right) . $$\r \r Here $G\left(.\right)$ is a parameterized function that is equivalent to $V$ in [Scaled Dot-Product Attention](https://paperswithcode.com/method/scaled).\r \r For each head, the factorization reduces the parameter costs from $l^{2}$ to $2\left(lk\right)$ where\r $k \ll l$ and hence helps prevent overfitting. In practice, we use a small value of $k = 8$.\r \r The basic idea of a Random Synthesizer is to not rely on pairwise token interactions or any information from individual tokens but rather to learn a task-specific alignment that works well globally across many samples.""" ; skos:prefLabel "Factorized Random Synthesized Attention" . :FairMOT a skos:Concept ; dcterms:source ; skos:definition "**FairMOT** is a model for multi-object tracking which consists of two homogeneous branches to predict pixel-wise objectness scores and re-ID features. The achieved fairness between the tasks is used to achieve high levels of detection and tracking accuracy. The detection branch is implemented in an anchor-free style which estimates object centers and sizes represented as position-aware measurement maps. 
Similarly, the re-ID branch estimates a re-ID feature for each pixel to characterize the object centered at the pixel. Note that the two branches are completely homogeneous which essentially differs from the previous methods which perform detection and re-ID in a cascaded style. It is also worth noting that FairMOT operates on high-resolution feature maps of stride four while the previous anchor-based methods operate on feature maps of stride 32. The elimination of anchors as well as the use of high-resolution feature maps better aligns re-ID features to object centers which significantly improves the tracking accuracy." ; skos:prefLabel "FairMOT" . :FashionCLIP a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "FashionCLIP is a fine-tuned CLIP model on fashion data (more than 800K pairs). It is the first foundation model for fashion." ; skos:prefLabel "FashionCLIP" . :Fast-OCR a skos:Concept ; dcterms:source ; skos:definition "Fast-OCR is a new lightweight detection network that incorporates features from existing models focused on the speed/accuracy trade-off, such as [YOLOv2](https://paperswithcode.com/method/yolov2), [CR-NET](https://paperswithcode.com/method/cr-net), and Fast-[YOLOv4](https://paperswithcode.com/method/yolov4)." ; skos:prefLabel "Fast-OCR" . :Fast-YOLOv2 a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "Fast-YOLOv2" . :Fast-YOLOv3 a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "Fast-YOLOv3" . :Fast-YOLOv4-SmallObj a skos:Concept ; dcterms:source ; skos:definition "The Fast-YOLOv4-SmallObj model is a modified version of Fast-[YOLOv4](https://paperswithcode.com/method/yolov4) to improve the detection of small objects. Seven layers were added so that it predicts bounding boxes at 3 different scales instead of 2." ; skos:prefLabel "Fast-YOLOv4-SmallObj" . 
:FastAutoAugment a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Fast AutoAugment** is an image data augmentation algorithm that finds effective augmentation policies via a search strategy based on density matching, motivated by Bayesian DA. The strategy is to improve the generalization performance of a given network by learning the augmentation policies which treat augmented data as missing data points of training data. However, different from Bayesian DA, the proposed method recovers those missing data points by the exploitation-and-exploration of a family of inference-time augmentations via Bayesian optimization in the policy search phase. This is realized by using an efficient density matching algorithm that does not require any back-propagation for network training for each policy evaluation." ; skos:prefLabel "Fast AutoAugment" . :FastGCN a skos:Concept ; dcterms:source ; skos:definition """FastGCN is a fast improvement of the GCN model recently proposed by Kipf & Welling (2016a) for learning graph embeddings. It generalizes transductive training to an inductive manner and also addresses the memory bottleneck issue of GCN caused by recursive expansion of neighborhoods. The crucial ingredient is a sampling scheme in the reformulation of the loss and the gradient, well justified through an alternative view of graph convolutions in the form of integral transforms of embedding functions.\r \r Description and image from: [FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling](https://arxiv.org/pdf/1801.10247.pdf)""" ; skos:prefLabel "FastGCN" . :FastMinimum-NormAttack a skos:Concept ; dcterms:source ; skos:definition "**Fast Minimum-Norm Attack**, or **FMN**, is a type of adversarial attack that works with different $\\ell_{p}$-norm perturbation models ($p=0,1,2,\\infty$), is robust to hyperparameter choices, does not require adversarial starting points, and converges within a few lightweight steps. 
It works by iteratively finding the sample misclassified with maximum confidence within an $\\ell_{p}$-norm constraint of size $\\epsilon$, while adapting $\\epsilon$ to minimize the distance of the current sample to the decision boundary." ; skos:prefLabel "Fast Minimum-Norm Attack" . :FastMoE a skos:Concept ; dcterms:source ; skos:definition "**FastMoE** is a distributed MoE training system based on PyTorch with common accelerators. The system provides a hierarchical interface for both flexible model design and adaptation to different applications, such as [Transformer-XL](https://paperswithcode.com/method/transformer-xl) and Megatron-LM." ; skos:prefLabel "FastMoE" . :FastPitch a skos:Concept ; dcterms:source ; skos:definition """**FastPitch** is a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The architecture of FastPitch is shown in the Figure. It is based on FastSpeech and composed mainly of two feed-forward [Transformer](https://paperswithcode.com/method/transformer) (FFTr) stacks. The first one operates in the resolution of input tokens, the second one in the resolution of the output frames. Let $\\mathbf{x}=\\left(x\\_{1}, \\ldots, x\\_{n}\\right)$ be the sequence of input lexical units, and $\\mathbf{y}=\\left(y\\_{1}, \\ldots, y\\_{t}\\right)$ be the sequence of target mel-scale spectrogram frames. The first FFTr stack produces the hidden representation $\\mathbf{h}=\\operatorname{FFTr}(\\mathbf{x})$. The hidden representation $\\mathbf{h}$ is used to make predictions about the duration and average pitch of every character with a 1-D CNN:\r \r $$\r \\hat{\\mathbf{d}}=\\operatorname{DurationPredictor}(\\mathbf{h}), \\quad \\hat{\\mathbf{p}}=\\operatorname{PitchPredictor}(\\mathbf{h})\r $$\r \r where $\\hat{\\mathbf{d}} \\in \\mathbb{N}^{n}$ and $\\hat{\\mathbf{p}} \\in \\mathbb{R}^{n}$. 
Next, the pitch is projected to match the dimensionality of the hidden representation $\\mathbf{h} \\in \\mathbb{R}^{n \\times d}$ and added to $\\mathbf{h}$. The resulting sum $\\mathbf{g}$ is discretely upsampled and passed to the output FFTr, which produces the output mel-spectrogram sequence\r \r $$\r \\mathbf{g}=\\mathbf{h}+\\operatorname{PitchEmbedding}(\\mathbf{p})\r $$\r \r $$\r \\hat{\\mathbf{y}}=\\operatorname{FFTr}\\left([\\underbrace{g\\_{1}, \\ldots, g\\_{1}}\\_{d\\_{1}}, \\ldots \\underbrace{g\\_{n}, \\ldots, g\\_{n}}\\_{d\\_{n}}]\\right)\r $$\r \r \r Ground truth $\\mathbf{p}$ and $\\mathbf{d}$ are used during training, and predicted $\\hat{\\mathbf{p}}$ and $\\hat{\\mathbf{d}}$ are used during inference. The model optimizes mean-squared error (MSE) between the predicted and ground-truth modalities\r \r $$\r \\mathcal{L}=\\|\\hat{\\mathbf{y}}-\\mathbf{y}\\|\\_{2}^{2}+\\alpha\\|\\hat{\\mathbf{p}}-\\mathbf{p}\\|\\_{2}^{2}+\\gamma\\|\\hat{\\mathbf{d}}-\\mathbf{d}\\|\\_{2}^{2}\r $$""" ; skos:prefLabel "FastPitch" . :FastR-CNN a skos:Concept ; dcterms:source ; skos:definition "**Fast R-CNN** is an object detection model that improves on its predecessor [R-CNN](https://paperswithcode.com/method/r-cnn) in a number of ways. Instead of extracting CNN features independently for each region of interest, Fast R-CNN aggregates them into a single forward pass over the image; i.e. regions of interest from the same image share computation and memory in the forward and backward passes." ; skos:prefLabel "Fast R-CNN" . :FastSGT a skos:Concept ; dcterms:source ; skos:definition """**Fast Schema Guided Tracker**, or **FastSGT**, is a fast and robust [BERT](https://paperswithcode.com/method/bert)-based model for state tracking in goal-oriented dialogue systems. The model employs carry-over mechanisms for transferring the values between slots, enabling switching between services and accepting the values offered by the system during dialogue. 
It also uses [multi-head attention](https://paperswithcode.com/method/multi-head-attention) projections in some of the decoders to better model the encoder outputs.\r \r The model architecture is illustrated in the Figure. It consists of four main modules: (1) Utterance Encoder, (2) Schema Encoder, (3) State Decoder, and (4) State Tracker. The first three modules constitute the NLU component and are based on neural networks, whereas the state tracker is a rule-based module. [BERT](https://paperswithcode.com/method/bert) was used for both encoders in the model.\r \r The Utterance Encoder is a BERT model which encodes the user and system utterances at each turn. The Schema Encoder is also a BERT model which encodes the schema descriptions of intents, slots, and values into schema embeddings. These schema embeddings help the decoders to transfer or share knowledge between different services by having some language understanding of each slot, intent, or value. The schema and utterance embeddings are passed to the State Decoder - a multi-task module. This module consists of five sub-modules producing the information necessary to track the state of the dialogue. Finally, the State Tracker module takes the previous state along with the current outputs of the State Decoder and predicts the current state of the dialogue by aggregating and summarizing the information across turns.""" ; skos:prefLabel "FastSGT" . :FastSampleRe-Weighting a skos:Concept ; dcterms:source ; skos:definition "**Fast Sample Re-Weighting**, or **FSR**, is a sample re-weighting strategy to tackle problems such as dataset biases, noisy labels and imbalanced classes. It leverages a dictionary (essentially an extra buffer) to monitor the training history reflected by the model updates during meta optimization periodically, and utilises a valuation function to discover meaningful samples from training data as the proxy of reward data. 
The unbiased dictionary keeps being updated and provides reward signals to optimize sample weights. Additionally, instead of maintaining model states for both model and sample weight updates separately, feature sharing is enabled for saving the computation cost used for maintaining respective states." ; skos:prefLabel "Fast Sample Re-Weighting" . :FastSpeech2 a skos:Concept ; dcterms:source ; skos:definition """**FastSpeech2** is a text-to-speech model that aims to improve upon FastSpeech by better solving the one-to-many mapping problem in TTS, i.e., multiple speech variations corresponding to the same text. It attempts to solve this problem by 1) directly training the model with the ground-truth target instead of the simplified output from a teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. Specifically, in FastSpeech 2, we extract duration, pitch and energy from the speech waveform and directly take them as conditional inputs in training and use predicted values in inference.\r \r The encoder converts the phoneme embedding sequence into the phoneme hidden sequence, then the variance adaptor adds different variance information such as duration, pitch and energy into the hidden sequence, and finally the mel-spectrogram decoder converts the adapted hidden sequence into the mel-spectrogram sequence in parallel. FastSpeech 2 uses a feed-forward [Transformer](https://paperswithcode.com/method/transformer) block, which is a stack of [self-attention](https://paperswithcode.com/method/multi-head-attention) and 1D-[convolution](https://paperswithcode.com/method/convolution) as in FastSpeech, as the basic structure for the encoder and mel-spectrogram decoder.""" ; skos:prefLabel "FastSpeech 2" . 
:FastSpeech2s a skos:Concept ; dcterms:source ; skos:definition """**FastSpeech 2s** is a text-to-speech model that abandons mel-spectrograms as intermediate output completely and directly generates speech waveform from text during inference. In other words, there is no cascaded mel-spectrogram generation (acoustic model) and waveform generation (vocoder). FastSpeech 2s generates the waveform conditioned on the intermediate hidden representation, which makes it more compact in inference by discarding the mel-spectrogram decoder.\r \r Two main design changes are made to the waveform decoder. \r \r First, considering that the phase information is difficult to predict using a variance predictor, [adversarial training](https://paperswithcode.com/methods/category/adversarial-training) is used in the waveform decoder to force it to implicitly recover the phase information by itself. \r \r Secondly, the mel-spectrogram decoder of [FastSpeech 2](https://paperswithcode.com/method/fastspeech-2) is leveraged, which is trained on the full text sequence to help with the text feature extraction. As shown in the Figure, the waveform decoder is based on the structure of [WaveNet](https://paperswithcode.com/method/wavenet) including non-causal convolutions and gated activation. The waveform decoder takes a sliced hidden sequence corresponding to a short audio clip as input and upsamples it with transposed 1D-convolution to match the length of the audio clip. The discriminator in the adversarial training adopts the same structure as in Parallel WaveGAN, which consists of ten layers of non-causal [dilated 1-D convolutions](https://paperswithcode.com/method/dilated-convolution) with [leaky ReLU](https://paperswithcode.com/method/leaky-relu) activation function. The waveform decoder is optimized by the multi-resolution STFT loss and the [LSGAN discriminator](https://paperswithcode.com/method/lsgan) loss following Parallel WaveGAN. 
\r \r In inference, the mel-spectrogram decoder is discarded and only the waveform decoder is used to synthesize speech audio.""" ; skos:prefLabel "FastSpeech 2s" . :FastVoxelQuery a skos:Concept ; dcterms:source ; skos:definition "**Fast Voxel Query** is a module used in the [Voxel Transformer](https://paperswithcode.com/method/votr) 3D object detection model to implement self-attention, specifically Local and Dilated Attention. For each querying index $v\\_{i}$, an attending voxel index $v\\_{j}$ is determined by Local and Dilated Attention. Then we can look up the non-empty index $j$ in the hash table with hashed $v\\_{j}$ as the key. Finally, the non-empty index $j$ is used to gather the attending feature $f\\_{j}$ from $\\mathcal{F}$ for [multi-head attention](https://paperswithcode.com/method/multi-head-attention)." ; skos:prefLabel "Fast Voxel Query" . :Fast_BAT a skos:Concept ; dcterms:source ; skos:altLabel "Fast Bi-level Adversarial Training" ; skos:definition "Fast-BAT is a new method for accelerated adversarial training." ; skos:prefLabel "Fast_BAT" . :FasterR-CNN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Faster R-CNN** is an object detection model that improves on [Fast R-CNN](https://paperswithcode.com/method/fast-r-cnn) by utilising a region proposal network ([RPN](https://paperswithcode.com/method/rpn)) with the CNN model. The RPN shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. It is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by [Fast R-CNN](https://paperswithcode.com/method/fast-r-cnn) for detection. 
RPN and Fast [R-CNN](https://paperswithcode.com/method/r-cnn) are merged into a single network by sharing their convolutional features: the RPN component tells the unified network where to look.\r \r As a whole, Faster R-CNN consists of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector that uses the proposed regions.""" ; skos:prefLabel "Faster R-CNN" . :Fastformer a skos:Concept ; dcterms:source ; skos:definition "**Fastformer** is a type of [Transformer](https://paperswithcode.com/method/transformer) which uses [additive attention](https://www.paperswithcode.com/method/additive-attention) as a building block. Instead of modeling the pair-wise interactions between tokens, [additive attention](https://paperswithcode.com/method/additive-attention) is used to model global contexts, and then each token representation is further transformed based on its interaction with global context representations." ; skos:prefLabel "Fastformer" . :Fawkes a skos:Concept ; dcterms:source ; skos:definition "**Fawkes** is an image cloaking system that helps individuals inoculate their images against unauthorized facial recognition models. Fawkes achieves this by helping users add imperceptible pixel-level changes (\"cloaks\") to their own photos before releasing them. When used to train facial recognition models, these \"cloaked\" images produce functional models that consistently cause normal images of the user to be misidentified." ; skos:prefLabel "Fawkes" . :FcaNet a skos:Concept ; dcterms:source ; skos:altLabel "Frequency channel attention networks" ; skos:definition """FCANet contains a novel multi-spectral channel attention module. Given an input feature map $X \\in \\mathbb{R}^{C \\times H \\times W}$, multi-spectral channel attention first splits $X$ into many parts $x^{i} \\in \\mathbb{R}^{C' \\times H \\times W}$. Then it applies a 2D DCT to each part $x^{i}$. 
Note that a 2D DCT can use pre-processing results to reduce computation. After processing each part, all results are concatenated into a vector. Finally, fully connected layers, ReLU activation and a sigmoid are used to get the attention vector as in an SE block. This can be formulated as:\r \\begin{align}\r s = F_\\text{fca}(X, \\theta) & = \\sigma (W_{2} \\delta (W_{1}[(\\text{DCT}(\\text{Group}(X)))]))\r \\end{align}\r \\begin{align}\r Y & = s X\r \\end{align}\r where $\\text{Group}(\\cdot)$ indicates dividing the input into many groups and $\\text{DCT}(\\cdot)$ is the 2D discrete cosine transform. \r \r This work, based on information compression and discrete cosine transforms, achieves excellent performance on the classification task.""" ; skos:prefLabel "FcaNet" . :Feature-CentricVoting a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "Feature-Centric Voting" . :FeatureIntertwiner a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Feature Intertwiner** is an object detection module that leverages the features from a more reliable set to help guide the feature learning of another, less reliable set. The mutual learning process helps the two sets to lie closer together within each class cluster. The intertwiner is applied on the object detection task, where a historical buffer is proposed to address the sample missing problem during one mini-batch and optimal transport (OT) theory is introduced to enforce the similarity between the two sets." ; skos:prefLabel "Feature Intertwiner" . :FeatureNMS a skos:Concept ; dcterms:source ; skos:definition "**Feature Non-Maximum Suppression**, or **FeatureNMS**, is a post-processing step for object detection models that removes duplicates when multiple detections are output per object. FeatureNMS recognizes duplicates not only based on the intersection over union between the bounding boxes, but also based on the difference between feature vectors. 
These feature vectors can encode more information like visual appearance." ; skos:prefLabel "FeatureNMS" . :FeatureSelection a skos:Concept ; dcterms:source ; skos:definition "Feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction." ; skos:prefLabel "Feature Selection" . :FeedbackMemory a skos:Concept ; dcterms:source ; skos:definition """**Feedback Memory** is a type of attention module used in the [Feedback Transformer](https://paperswithcode.com/method/feedback-transformer) architecture. It allows a [transformer](https://paperswithcode.com/method/transformer) to use the most abstract representations from the past directly as inputs for the current timestep. This means that the model does not form its representation in parallel, but sequentially token by token. More precisely, we replace the context inputs to attention modules with memory vectors that are computed over the past, i.e.:\r \r $$ \\mathbf{z}^{l}\\_{t} = \\text{Attn}\\left(\\mathbf{x}^{l}\\_{t}, \\left[\\mathbf{m}\\_{t-\\tau}, \\dots, \\mathbf{m}\\_{t-1}\\right]\\right) $$\r \r where a memory vector $\\mathbf{m}\\_{t}$ is computed by summing the representations of each layer at the $t$-th time step:\r \r $$ \\mathbf{m}\\_{t} = \\sum^{L}\\_{l=0}\\text{Softmax}\\left(w^{l}\\right)\\mathbf{x}\\_{t}^{l} $$\r \r where $w^{l}$ are learnable scalar parameters. Here $l = 0$ corresponds to token embeddings. The weighting of different layers by a [softmax](https://paperswithcode.com/method/softmax) output gives the model more flexibility as it can average them or select one of them. This modification of the self-attention input adapts the computation of the Transformer from parallel to sequential, summarized in the Figure. 
Indeed, it gives the ability to formulate the representation $\\mathbf{x}^{l}\\_{t+1}$ based on past representations from any layer $l'$, while in a standard Transformer this is only true for $l > l'$. This change can be viewed as exposing all previous computations to all future computations, providing better representations of the input. Such capacity would allow much shallower models to capture the same level of abstraction as a deeper architecture.""" ; skos:prefLabel "Feedback Memory" . :FeedbackTransformer a skos:Concept ; dcterms:source ; skos:definition "A **Feedback Transformer** is a type of sequential transformer that exposes all previous representations to all future representations, meaning the lowest representation of the current timestep is formed from the highest-level abstract representation of the past. This feedback nature allows this architecture to perform recursive computation, building stronger representations iteratively upon previous states. To achieve this, the self-attention mechanism of the standard [Transformer](https://paperswithcode.com/method/transformer) is modified so it attends to higher level representations rather than lower ones." ; skos:prefLabel "Feedback Transformer" . :FeedforwardNetwork a skos:Concept ; skos:definition """A **Feedforward Network**, or a **Multilayer Perceptron (MLP)**, is a neural network with solely densely connected layers. This is the classic neural network architecture of the literature. It consists of inputs $x$ passed through units $h$ (of which there can be many layers) to predict a target $y$. Activation functions are generally chosen to be non-linear to allow for flexible functional approximation.\r \r Image Source: Deep Learning, Goodfellow et al""" ; skos:prefLabel "Feedforward Network" . :FiLMModule a skos:Concept ; dcterms:source ; skos:definition """The **Feature-wise linear modulation** (**FiLM**) module combines information from both noisy waveform and input mel-spectrogram. 
It is used in the [WaveGrad](https://paperswithcode.com/method/wavegrad) model. The authors also added an iteration index $n$, which indicates the noise level of the input waveform, by using the [Transformer](https://paperswithcode.com/method/transformer) sinusoidal positional embedding. To condition on the noise level directly, $n$ is replaced by $\\sqrt{\\bar{\\alpha}}$ and a linear scale $C = 5000$ is applied. The FiLM module produces both scale and bias vectors given inputs, which are used in a UBlock for feature-wise affine transformation as:\r \r $$ \\gamma\\left(D, \\sqrt{\\bar{\\alpha}}\\right) \\odot U + \\zeta\\left(D, \\sqrt{\\bar{\\alpha}}\\right) $$\r \r where $\\gamma$ and $\\zeta$ correspond to the scaling and shift vectors from the FiLM module, $D$ is the output from the corresponding [DBlock](https://paperswithcode.com/method/dblock), and $U$ is an intermediate output in the UBlock.""" ; skos:prefLabel "FiLM Module" . :FilterResponseNormalization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Filter Response Normalization (FRN)** is a type of normalization that combines normalization and an activation function, which can be used as a replacement for other normalizations and activations. It operates on each activation channel of each batch element independently, eliminating the dependency on other batch elements. \r \r To demonstrate, assume we are dealing with a feed-forward convolutional neural network. We follow the usual convention that the filter responses (activation maps) produced after a [convolution](https://paperswithcode.com/method/convolution) operation are a [4D](https://paperswithcode.com/method/4d-a) tensor $X$ with shape $[B, W, H, C]$, where $B$ is the mini-batch size, $W, H$ are the spatial extents of the map, and $C$ is the number of filters used in convolution. $C$ is also referred to as the number of output channels. 
Let $x = X_{b,:,:,c} \\in \\mathcal{R}^{N}$, where $N = W \\times H$, be the vector of filter responses for the $c^{th}$ filter for the $b^{th}$ batch point. \r Let $\\nu^2 = \\sum\\_i x_i^2/N$ be the mean squared norm of $x$. \r \r Then Filter Response Normalization is defined as the following:\r \r $$\r \\hat{x} = \\frac{x}{\\sqrt{\\nu^2 + \\epsilon}},\r $$\r \r where $\\epsilon$ is a small positive constant to prevent division by zero. \r \r A lack of mean centering in FRN can lead to activations having an arbitrary bias away from zero. Such a bias in conjunction with [ReLU](https://paperswithcode.com/method/relu) can have a detrimental effect on learning and lead to poor performance and dead units. To address this, the authors augment ReLU with a learned threshold $\\tau$ to yield:\r \r $$\r z = \\max(y, \\tau)\r $$\r \r where $y = \\gamma\\hat{x} + \\beta$ is the result of applying the learned affine transform to the normalized responses $\\hat{x}$. Since $\\max(y, \\tau){=}\\max(y-\\tau,0){+}\\tau{=}\\text{ReLU}{(y{-}\\tau)}{+}\\tau$, the effect of this activation is the same as having a shared bias before and after ReLU.""" ; skos:prefLabel "Filter Response Normalization" . :FireModule a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "A **Fire Module** is a building block for convolutional neural networks, notably used as part of [SqueezeNet](https://paperswithcode.com/method/squeezenet). A Fire module comprises a squeeze [convolution](https://paperswithcode.com/method/convolution) layer (which has only 1x1 filters) feeding into an expand layer that has a mix of 1x1 and 3x3 convolution filters. We expose three tunable dimensions (hyperparameters) in a Fire module: $s\\_{1x1}$, $e\\_{1x1}$, and $e\\_{3x3}$. In a Fire module, $s\\_{1x1}$ is the number of filters in the squeeze layer (all 1x1), $e\\_{1x1}$ is the number of 1x1 filters in the expand layer, and $e\\_{3x3}$ is the number of 3x3 filters in the expand layer. 
When we use Fire modules we set $s\\_{1x1}$ to be less than ($e\\_{1x1}$ + $e\\_{3x3}$), so the squeeze layer helps to limit the number of input channels to the 3x3 filters." ; skos:prefLabel "Fire Module" . :Fireflyalgorithm a skos:Concept ; dcterms:source ; skos:definition "The **Firefly algorithm** is a nature-inspired metaheuristic optimization algorithm, modeled on the flashing behaviour of fireflies, in which candidate solutions are attracted toward brighter (better-scoring) solutions." ; skos:prefLabel "Firefly algorithm" . :Fisher-BRC a skos:Concept ; dcterms:source ; skos:definition "**Fisher-BRC** is an actor-critic algorithm for offline reinforcement learning that encourages the learned policy to stay close to the data by parameterizing the critic as the $\\log$-behavior-policy, which generated the offline dataset, plus a state-action value offset term, which can be learned using a neural network. Behavior regularization then corresponds to an appropriate regularizer on the offset term. A gradient penalty regularizer is used for the offset term, which is equivalent to Fisher divergence regularization, suggesting connections to the score matching and generative energy-based model literature." ; skos:prefLabel "Fisher-BRC" . :Fishr a skos:Concept ; dcterms:source ; skos:definition "**Fishr** is a learning scheme to enforce domain invariance in the space of the gradients of the loss function: specifically, it introduces a regularization term that matches the domain-level variances of gradients across training domains. Critically, the strategy exhibits close relations with the Fisher Information and the Hessian of the loss. Forcing domain-level gradient covariances to be similar during the learning procedure eventually aligns the domain-level loss landscapes locally around the final weights." ; skos:prefLabel "Fishr" . :FixMatch a skos:Concept ; dcterms:source ; skos:definition """FixMatch is an algorithm that first generates pseudo-labels using the model's predictions on weakly-augmented unlabeled images. For a given image, the pseudo-label is only retained if the model produces a high-confidence prediction. 
The model is then trained to predict the pseudo-label when fed a strongly-augmented version of the same image.\r \r Description from: [FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence](https://paperswithcode.com/paper/fixmatch-simplifying-semi-supervised-learning)\r \r Image credit: [FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence](https://paperswithcode.com/paper/fixmatch-simplifying-semi-supervised-learning)""" ; skos:prefLabel "FixMatch" . :FixRes a skos:Concept ; dcterms:source ; skos:definition "**FixRes** is an image scaling strategy that seeks to optimize classifier performance. It is motivated by the observation that data augmentations induce a significant discrepancy between the size of the objects seen by the classifier at train and test time: in fact, a lower train resolution improves the classification at test time! FixRes is a simple strategy to optimize the classifier performance, that employs different train and test resolutions. The calibrations are: (a) calibrating the object sizes by adjusting the crop size and (b) adjusting statistics before spatial pooling." ; skos:prefLabel "FixRes" . :FixedFactorizedAttention a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Fixed Factorized Attention** is a factorized attention pattern where specific cells summarize previous locations and propagate that information to all future cells. It was proposed as part of the [Sparse Transformer](https://paperswithcode.com/method/sparse-transformer) architecture.\r \r \r A self-attention layer maps a matrix of input embeddings $X$ to an output matrix and is parameterized by a connectivity pattern $S = \\text{set}\\left(S\\_{1}, \\dots, S\\_{n}\\right)$, where $S\\_{i}$ denotes the set of indices of the input vectors to which the $i$th output vector attends. 
The output vector is a weighted sum of transformations of the input vectors:\r \r $$ \\text{Attend}\\left(X, S\\right) = \\left(a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right)\\right)\\_{i\\in\\text{set}\\left(1,\\dots,n\\right)}$$\r \r $$ a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right) = \\text{softmax}\\left(\\frac{\\left(W\\_{q}\\mathbf{x}\\_{i}\\right)K^{T}\\_{S\\_{i}}}{\\sqrt{d}}\\right)V\\_{S\\_{i}} $$\r \r $$ K\\_{Si} = \\left(W\\_{k}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r \r $$ V\\_{Si} = \\left(W\\_{v}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r \r Here $W\\_{q}$, $W\\_{k}$, and $W\\_{v}$ represent the weight matrices which transform a given $x\\_{i}$ into a query, key, or value, and $d$ is the inner dimension of the queries and keys. The output at each position is a sum of the values weighted by the scaled dot-product similarity of the keys and queries.\r \r Full self-attention for autoregressive models defines $S\\_{i} = \\text{set}\\left(j : j \\leq i\\right)$, allowing every element to attend to all previous positions and its own position.\r \r Factorized self-attention instead has $p$ separate attention heads, where the $m$th head defines a subset of the indices $A\\_{i}^{(m)} ⊂ \\text{set}\\left(j : j \\leq i\\right)$ and lets $S\\_{i} = A\\_{i}^{(m)}$. The goal with the Sparse [Transformer](https://paperswithcode.com/method/transformer) was to find efficient choices for the subset $A$.\r \r Formally for Fixed Factorized Attention, $A^{(1)}\\_{i} = ${$j : \\left(\\lfloor{j/l\\rfloor}=\\lfloor{i/l\\rfloor}\\right)$}, where the brackets denote the floor operation, and $A^{(2)}\\_{i} = ${$j : j \\mod l \\in ${$t, t+1, \\ldots, l$}}, where $t=l-c$ and $c$ is a hyperparameter. The $i$-th output vector of the attention head attends to all input vectors either from $A^{(1)}\\_{i}$ or $A^{(2)}\\_{i}$. 
This pattern can be visualized in the figure to the right.\r \r If the stride is 128 and $c = 8$, then all future positions greater than 128 can attend to positions 120-128, all positions greater than 256 can attend to 248-256, and so forth. \r \r A fixed-attention pattern with $c = 1$ limits the expressivity of the network significantly, as many representations in the network are only used for one block whereas a small number of locations are used by all blocks. The authors found choosing $c \\in ${$8, 16, 32$} for typical values of $l \\in ${$128, 256$} performs well, although this increases the computational cost of this method by $c$ in comparison to the [strided attention](https://paperswithcode.com/method/strided-attention).\r \r Additionally, the authors found that when using multiple heads, having them attend to distinct subblocks of length $c$ within the block of size $l$ was preferable to having them attend to the same subblock.""" ; skos:prefLabel "Fixed Factorized Attention" . :FixupInitialization a skos:Concept ; dcterms:source ; skos:definition """**FixUp Initialization**, or **Fixed-Update Initialization**, is an initialization method that rescales the standard initialization of [residual branches](https://paperswithcode.com/method/residual-block) by adjusting for the network architecture. Fixup aims to enable training very deep [residual networks](https://paperswithcode.com/method/resnet) stably at a maximal learning rate without [normalization](https://paperswithcode.com/methods/category/normalization).\r \r The steps are as follows:\r \r 1. Initialize the classification layer and the last layer of each residual branch to 0.\r \r 2. Initialize every other layer using a standard method, e.g. [Kaiming Initialization](https://paperswithcode.com/method/he-initialization), and scale only the weight layers inside residual branches by $L^{-\\frac{1}{2m-2}}$.\r \r 3. 
Add a scalar multiplier (initialized at 1) in every branch and a scalar bias (initialized at 0) before each [convolution](https://paperswithcode.com/method/convolution), [linear](https://paperswithcode.com/method/linear-layer), and element-wise activation layer.""" ; skos:prefLabel "Fixup Initialization" . :Flan-T5 a skos:Concept ; dcterms:source ; skos:definition "**Flan-T5** is the instruction fine-tuned version of **T5**, the **Text-to-Text Transfer Transformer** language model." ; skos:prefLabel "Flan-T5" . :FlexFlow a skos:Concept ; skos:definition """**FlexFlow** is a deep learning engine that uses guided randomized search of the SOAP (Sample, Operator, Attribute, and Parameter) space to find a fast parallelization strategy for a specific parallel machine. To accelerate this search, FlexFlow introduces a novel execution simulator that can accurately predict a parallelization strategy’s performance and is three orders of magnitude faster than prior approaches that execute each strategy. \r \r FlexFlow uses two main components: a fast, incremental execution simulator to evaluate different parallelization strategies, and a Markov Chain Monte Carlo (MCMC) search algorithm that takes advantage of the incremental simulator to rapidly explore the large search space.""" ; skos:prefLabel "FlexFlow" . :Florence a skos:Concept ; dcterms:source ; skos:definition "Florence is a computer vision foundation model aiming to learn universal visual-language representations that can be adapted to various computer vision tasks such as visual question answering, image captioning, and video retrieval, among other tasks. Florence's workflow consists of data curation, unified learning, Transformer architectures, and adaptation. Florence is pre-trained in an image-label-description space, utilizing unified image-text contrastive learning. It involves a two-tower architecture: a 12-layer Transformer for the language encoder, and a Vision Transformer for the image encoder. 
Two linear projection layers are added on top of the image encoder and language encoder to match the dimensions of image and language features. Compared to previous methods for cross-modal shared representations, Florence expands beyond simple classification and retrieval capabilities to advanced representations that support object-level, multi-modality, and video tasks, respectively." ; skos:prefLabel "Florence" . :FlowAlignmentModule a skos:Concept ; dcterms:source ; skos:definition """**Flow Alignment Module**, or **FAM**, is a flow-based alignment module for scene parsing to learn Semantic Flow between feature maps of adjacent levels and broadcast high-level features to high resolution features effectively and efficiently. The concept of Semantic Flow is inspired by optical flow, which is widely used in video processing tasks to represent the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by relative motion. The authors postulate that the relationship between two feature maps of arbitrary resolutions from the same image can also be represented with the “motion” of every pixel from one feature map to the other one. Once precise Semantic Flow is obtained, the network is able to propagate semantic features with minimal information loss.\r \r In the FAM module, the transformed high-resolution feature map is combined with the low-resolution feature map to generate the semantic flow field, which is utilized to warp the low-resolution feature map to the high-resolution one.""" ; skos:prefLabel "Flow Alignment Module" . :FocalLoss a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **Focal Loss** function addresses class imbalance during training in tasks like object detection. Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard misclassified examples. 
It is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples. \r \r Formally, the Focal Loss adds a factor $(1 - p\\_{t})^\\gamma$ to the standard cross entropy criterion. Setting $\\gamma>0$ reduces the relative loss for well-classified examples ($p\\_{t}>.5$), putting more focus on hard, misclassified examples. Here there is a tunable *focusing* parameter $\\gamma \\ge 0$. \r \r $$ {\\text{FL}(p\\_{t}) = - (1 - p\\_{t})^\\gamma \\log\\left(p\\_{t}\\right)} $$""" ; skos:prefLabel "Focal Loss" . :FocalTransformers a skos:Concept ; dcterms:source ; skos:definition "**Focal self-attention** is built to make Transformer layers scalable to high-resolution inputs. Instead of attending to all tokens at fine granularity, the approach attends to fine-grained tokens only locally, and to the summarized ones globally. As such, it can cover as many regions as standard self-attention but at much lower cost. An image is first partitioned into patches, resulting in visual tokens. A patch embedding layer, consisting of a convolutional layer with filter and stride of the same size, then projects the patches into hidden features. This spatial feature map is then passed to four stages of focal Transformer blocks. Each focal Transformer block consists of $N_i$ focal Transformer layers. Patch embedding layers are used in between to reduce the spatial size of the feature map by a factor of 2, while the feature dimension is increased by a factor of 2." ; skos:prefLabel "Focal Transformers" . :Focus a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "Focus" . 
:Forwardgradient a skos:Concept ; dcterms:source ; skos:definition """Forward gradients are unbiased estimators of the gradient $\\nabla f(\\theta)$ for a function $f: \\mathbb{R}^n \\rightarrow \\mathbb{R}$, given by $g(\\theta) = \\langle \\nabla f(\\theta) , v \\rangle v$. \r \r Here $v = (v_1, \\ldots, v_n)$ is a random vector, which must satisfy the following conditions in order for $g(\\theta)$ to be an unbiased estimator of $\\nabla f(\\theta)$:\r \r * $v_i \\perp v_j$ for all $i \\neq j$\r * $\\mathbb{E}[v_i] = 0$ for all $i$\r * $\\mathbb{V}[v_i] = 1$ for all $i$\r \r Forward gradients can be computed with a single JVP (Jacobian-vector product), which enables the use of the forward mode of automatic differentiation instead of the usual reverse mode, whose need to store intermediate values gives it a larger memory footprint.""" ; skos:prefLabel "Forward gradient" . :FourierContourEmbedding a skos:Concept ; dcterms:source ; skos:definition "**Fourier Contour Embedding** is a text instance representation that allows networks to learn diverse text geometry variances. Most existing methods model text instances in the image spatial domain via masks or contour point sequences in the Cartesian or the polar coordinate system. However, the mask representation might lead to expensive post-processing, while the point sequence one may have limited capability to model texts with highly-curved shapes. This motivates modeling text instances in the Fourier domain." ; skos:prefLabel "Fourier Contour Embedding" . :FoveaBox a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**FoveaBox** is an anchor-free framework for object detection. Instead of using predefined anchors to enumerate possible locations, scales and aspect ratios for the search of the objects, FoveaBox directly learns the object existing possibility and the bounding box coordinates without anchor reference. 
This is achieved by: (a) predicting category-sensitive semantic maps for the object existing possibility, and (b) producing a category-agnostic bounding box for each position that potentially contains an object. The scales of target boxes are naturally associated with feature pyramid representations for each input image.\r \r It is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-shelf convolutional network. The first subnet performs per-pixel classification on the backbone’s output; the second subnet performs bounding box prediction for the corresponding position.""" ; skos:prefLabel "FoveaBox" . :FractalBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **Fractal Block** is an image model block that utilizes an expansion rule that yields a structural layout of truncated fractals. For the base case where $f\\_{1}\\left(z\\right) = \\text{conv}\\left(z\\right)$ is a convolutional layer, we then have recursive fractals of the form:\r \r $$ f\\_{C+1}\\left(z\\right) = \\left[\\left(f\\_{C}\\circ{f\\_{C}}\\right)\\left(z\\right)\\right] \\oplus \\left[\\text{conv}\\left(z\\right)\\right]$$\r \r where $C$ is the number of columns. For the join layer (green in the Figure), we use the element-wise mean rather than concatenation or addition.""" ; skos:prefLabel "Fractal Block" . :FractalNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**FractalNet** is a type of convolutional neural network that eschews [residual connections](https://paperswithcode.com/method/residual-connection) in favour of a \"fractal\" design. They involve repeated application of a simple expansion rule to generate deep networks whose structural layouts are precisely truncated fractals. 
These networks contain interacting subpaths of different lengths, but do not include any pass-through or residual connections; every internal signal is transformed by a filter and nonlinearity before being seen by subsequent layers." ; skos:prefLabel "FractalNet" . :Fragmentation a skos:Concept ; dcterms:source ; skos:definition "Given a pattern $P$ that is more complicated than patterns whose exact counts are known, fragmentation decomposes $P$ into simpler subpatterns whose exact counts are known. Subgraph GNNs, which operate on subgraphs of the host graph, are scalable on large graphs and are more expressive and efficient than traditional GNNs; fragmentation explores this expressivity when the pattern is fragmented into smaller subpatterns." ; skos:prefLabel "Fragmentation" . :FraternalDropout a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Fraternal Dropout** is a regularization method for recurrent neural networks that trains two identical copies of an RNN (that share parameters) with different [dropout](https://paperswithcode.com/method/dropout) masks while minimizing the difference between their (pre-[softmax](https://paperswithcode.com/method/softmax)) predictions. This encourages the representations of RNNs to be invariant to the dropout mask, thus being robust." ; skos:prefLabel "Fraternal Dropout" . :FreeAnchor a skos:Concept ; dcterms:source ; skos:definition "**FreeAnchor** is an anchor supervision method for object detection. Many CNN-based object detectors assign anchors for ground-truth objects under the restriction of object-anchor Intersection-over-Union (IoU). In contrast, FreeAnchor is a learning-to-match approach that breaks the IoU restriction, allowing objects to match anchors in a flexible manner. It updates hand-crafted anchor assignment to free anchor matching by formulating detector training as a maximum likelihood estimation (MLE) procedure. 
FreeAnchor targets learning features that best explain a class of objects in terms of both classification and localization." ; skos:prefLabel "FreeAnchor" . :FunnelTransformer a skos:Concept ; dcterms:source ; skos:definition """**Funnel Transformer** is a type of [Transformer](https://paperswithcode.com/methods/category/transformers) that gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. By re-investing the saved FLOPs from length reduction in constructing a deeper or wider model, the model capacity is further improved. In addition, to perform token-level predictions as required by common pretraining objectives, Funnel-[transformer](https://paperswithcode.com/method/transformer) is able to recover a deep representation for each token from the reduced hidden sequence via a decoder.\r \r The proposed model keeps the same overall skeleton of interleaved S-[Attn](https://paperswithcode.com/method/scaled) and P-[FFN](https://paperswithcode.com/method/dense-connections) sub-modules wrapped by [residual connection](https://paperswithcode.com/method/residual-connection) and [layer normalization](https://paperswithcode.com/method/layer-normalization). But differently, to achieve representation compression and computation reduction, the model employs an encoder that gradually reduces the sequence length of the hidden states as the layer gets deeper. In addition, for tasks involving per-token predictions like pretraining, a simple decoder is used to reconstruct a full sequence of token-level representations from the compressed encoder output. Compression is achieved via a pooling operation.""" ; skos:prefLabel "Funnel Transformer" . 
:FuseFormer a skos:Concept ; dcterms:source ; skos:definition "**FuseFormer** is a [Transformer](https://paperswithcode.com/method/transformer)-based model designed for video inpainting via fine-grained feature fusion based on novel [Soft Split and Soft Composition](https://paperswithcode.com/method/soft-split-and-soft-composition) operations. The soft split divides a feature map into many patches with a given overlapping interval, while the soft composition stitches them back into a whole feature map where pixels in overlapping regions are summed up. FuseFormer builds soft composition and soft split into its [feedforward network](https://paperswithcode.com/method/feedforward-network) to further enhance sub-patch level feature fusion." ; skos:prefLabel "FuseFormer" . :FuseFormerBlock a skos:Concept ; dcterms:source ; skos:definition "A **FuseFormer block** is used in the [FuseFormer](https://paperswithcode.com/method/fuseformer) model for video inpainting. It is the same as the standard [Transformer](https://paperswithcode.com/method/transformer) block except that the feed forward network is replaced with a Fusion Feed Forward Network (F3N). F3N brings no extra parameters into the standard feed forward net; the difference is that F3N inserts a soft split and a soft composition operation between the two MLP layers." ; skos:prefLabel "FuseFormer Block" . :G-GLN a skos:Concept ; dcterms:source ; skos:altLabel "Gaussian Gated Linear Network" ; skos:definition """**Gaussian Gated Linear Network**, or **G-GLN**, is a multi-variate extension of the recently proposed [GLN](https://paperswithcode.com/method/gln) family of deep neural networks, obtained by reformulating the GLN neuron as a gated product of Gaussians. This Gaussian Gated Linear Network (G-GLN) formulation exploits the fact that exponential family densities are closed under multiplication, a property that has seen much use in [Gaussian Process](https://paperswithcode.com/method/gaussian-process) and related literature. 
Similar to the Bernoulli GLN, every neuron in the G-GLN directly predicts the target distribution. \r \r Precisely, a G-GLN is a feed-forward network of data-dependent distributions. Each neuron calculates the sufficient statistics $\\left(\\mu, \\sigma^{2}\\right)$ for its associated PDF using its active weights, given those emitted by neurons in the preceding layer. It consists of $L+1$ layers indexed by $i \\in\\{0, \\ldots, L\\}$ with $K\\_{i}$ neurons in each layer. The weight space for a neuron in layer $i$ is denoted by $\\mathcal{W}\\_{i}$; the subscript is needed since the dimension of the weight space depends on $K_{i-1}$. Each neuron/distribution is indexed by its position in the network when laid out on a grid; for example, $f\\_{i k}$ refers to the family of PDFs defined by the $k$ th neuron in the $i$ th layer. Similarly, $c\\_{i k}$ refers to the context function associated with each neuron in layers $i \\geq 1$, while $\\mu\\_{i k}$ and $\\sigma\\_{i k}^{2}$ (or $\\Sigma\\_{i k}$ in the multivariate case) refer to the sufficient statistics for each Gaussian PDF.\r \r There are two types of input to neurons in the network. The first is the side information, which can be thought of as the input features, and is used to determine the weights used by each neuron via half-space gating. The second is the input to the neuron, which is the PDFs output by the previous layer, or in the case of layer 0, some provided base models. To apply a G-GLN in a supervised learning setting, we need to map the sequence of input-label pairs $\\left(x\\_{t}, y\\_{t}\\right)$ for $t=1,2, \\ldots$ onto a sequence of (side information, base Gaussian PDFs, label) triplets $\\left(z\\_{t},\\left(f\\_{0 i}\\right)\\_{i}, y\\_{t}\\right)$. The side information $z\\_{t}$ is set to the (potentially normalized) input features $x\\_{t}$. 
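As an illustration of the product-of-Gaussians computation each neuron performs (a minimal sketch, not the paper's code): a weighted product of Gaussian PDFs is itself Gaussian, with precision-weighted sufficient statistics.

```python
def product_of_gaussians(mus, sigma2s, weights):
    # A weighted product of Gaussian PDFs is itself Gaussian:
    # precisions (weighted) add, and the mean is a precision-weighted
    # average of the input means.
    precision = sum(w / s2 for w, s2 in zip(weights, sigma2s))
    sigma2 = 1.0 / precision
    mu = sigma2 * sum(w * m / s2 for w, m, s2 in zip(weights, mus, sigma2s))
    return mu, sigma2
```

For example, combining two unit-variance Gaussians with means 0 and 2 and unit weights yields mean 1 and variance 0.5; in a G-GLN the weights would be the row of the neuron's weight matrix selected by the active context.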
The Gaussian PDFs for layer 0 will generally include the necessary base Gaussian PDFs to span the target range, and optionally some base prediction PDFs that capture domain-specific knowledge.""" ; skos:prefLabel "G-GLN" . :G-GLNNeuron a skos:Concept ; dcterms:source ; skos:definition """A **G-GLN Neuron** is a type of neuron used in the [G-GLN](https://paperswithcode.com/method/g-gln) architecture. The key idea is that further representational power can be added to a weighted product of Gaussians via a contextual gating procedure. This is achieved by extending a weighted product of Gaussians model with an additional type of input called side information. The side information will be used by a neuron to select a weight vector to apply for a given example from a table of weight vectors. In typical applications to regression, the side information is defined as the (normalized) input features for an input example: i.e. $z=(x-\\bar{x}) / \\sigma\\_{x}$.\r \r More formally, associated with each neuron is a context function $c: \\mathcal{Z} \\rightarrow \\mathcal{C}$, where $\\mathcal{Z}$ is the set of possible side information and $\\mathcal{C}=\\{0, \\ldots, k-1\\}$ for some $k \\in \\mathbb{N}$ is the context space. Each neuron $i$ is now parameterized by a weight matrix $W\\_{i}=\\left[w\\_{i, 0} \\ldots w\\_{i, k-1}\\right]^{\\top}$ with each row vector $w\\_{i j} \\in \\mathcal{W}$ for $0 \\leq j < k$.""" ; skos:prefLabel "G-GLN Neuron" . :G-NIA a skos:Concept ; dcterms:source ; skos:altLabel "Generalizable Node Injection Attack" ; skos:definition """**Generalizable Node Injection Attack**, or **G-NIA**, is an attack scenario for graph neural networks where the attacker injects malicious nodes rather than modifying original nodes or edges to affect the performance of GNNs. G-NIA generates the discrete edges also by Gumbel-Top-$k$ following OPTI and captures the coupling effect between network structure and node features with a carefully designed model. 
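The Gumbel-Top-$k$ trick used for discrete edge selection can be illustrated with a minimal sketch (my own illustration, not G-NIA's actual implementation): perturb each edge score with Gumbel noise and keep the $k$ largest.

```python
import math
import random

def gumbel_top_k(scores, k, rng=random):
    # Add Gumbel(0, 1) noise to each score, then keep the indices of the
    # k largest perturbed scores; this samples k items without
    # replacement with probabilities proportional to softmax(scores).
    noisy = []
    for i, s in enumerate(scores):
        g = -math.log(-math.log(rng.random()))  # Gumbel(0, 1) sample
        noisy.append((s + g, i))
    noisy.sort(reverse=True)
    return [i for _, i in noisy[:k]]
```

With widely separated scores the selection is almost always the top-$k$ by score; with close scores the noise makes the selection stochastic, which is what allows gradient-based training via the Gumbel relaxation.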
\r \r G-NIA explicitly models the most critical feature propagation by modeling attributes and edges jointly. Specifically, the malicious attributes are used to guide the generation of edges, capturing the influence of attributes on edges. G-NIA also adopts a model-based framework, which exploits useful information about the attack during model training and saves computational cost during inference, since no re-optimization is needed.""" ; skos:prefLabel "G-NIA" . :G3D a skos:Concept ; dcterms:source ; skos:definition "**G3D** is a unified spatial-temporal graph convolutional operator that directly models cross-spacetime joint dependencies. It leverages dense cross-spacetime edges as skip connections for direct information propagation across the 3D spatial-temporal graph." ; skos:prefLabel "G3D" . :GA a skos:Concept ; dcterms:source ; skos:altLabel "Genetic Algorithms" ; skos:definition "Genetic Algorithms are search algorithms that mimic Darwinian biological evolution in order to select and propagate better solutions." ; skos:prefLabel "GA" . :GAGNN a skos:Concept ; dcterms:source ; skos:altLabel "Group-Aware Neural Network" ; skos:definition "**GAGNN**, or **Group-aware Graph Neural Network**, is a hierarchical model for nationwide city air quality forecasting. The model constructs a city graph and a city group graph to model the spatial and latent dependencies between cities, respectively. GAGNN introduces a differentiable grouping network to discover the latent dependencies among cities and generate city groups. Based on the generated city groups, a group correlation encoding module is introduced to learn the correlations between them, which can effectively capture the dependencies between city groups. After the graph construction, GAGNN implements a message passing mechanism to model the dependencies between cities and city groups." ; skos:prefLabel "GAGNN" . 
:GAIL a skos:Concept ; dcterms:source ; skos:altLabel "Generative Adversarial Imitation Learning" ; skos:definition "**Generative Adversarial Imitation Learning** presents a new general framework for directly extracting a policy from data, as if it were obtained by reinforcement learning following inverse reinforcement learning." ; skos:prefLabel "GAIL" . :GALA a skos:Concept ; dcterms:source ; skos:altLabel "Global-and-Local attention" ; skos:definition """Most attention mechanisms learn where to focus using only weak supervisory signals from class labels, which inspired Linsley et al. to investigate how explicit human supervision can affect the performance and interpretability of attention models. As a proof of concept, Linsley et al. proposed the global-and-local attention (GALA) module, which extends an SE block with a spatial attention mechanism.\r \r Given the input feature map $X$, GALA uses an attention mask that combines global and local attention to tell the network where and on what to focus. As in SE blocks, global attention aggregates global information by global average pooling and then produces a channel-wise attention weight vector using a multilayer perceptron. In local attention, two consecutive $1\\times 1$ convolutions are conducted on the input to produce a positional weight map. The outputs of the local and global pathways are combined by addition and multiplication. 
Formally, GALA can be represented as:\r \\begin{align}\r s_g &= W_{2} \\delta (W_{1}\\text{GAP}(X))\r \\end{align}\r \r \\begin{align}\r s_l &= Conv_2^{1\\times 1} (\\delta(Conv_1^{1\\times1}(X)))\r \\end{align}\r \r \\begin{align}\r s_g^* &= \\text{Expand}(s_g)\r \\end{align}\r \r \\begin{align}\r s_l^* &= \\text{Expand}(s_l) \r \\end{align}\r \r \\begin{align}\r s &= \\tanh(a(s_g^\\* + s_l^\\*) +m \\cdot (s_g^\\* s_l^\\*) )\r \\end{align}\r \r \\begin{align}\r Y &= sX\r \\end{align}\r \r where $a,m \\in \\mathbb{R}^{C}$ are learnable parameters representing channel-wise weight vectors. \r \r Supervised by human-provided feature importance maps, GALA has significantly improved representational power and can be combined with any CNN backbone.""" ; skos:prefLabel "GALA" . :GAM a skos:Concept ; skos:altLabel "Generalized additive models" ; skos:definition "Generalized Additive Models (GAMs) are interpretable statistical models in which the prediction is an additive combination of smooth functions, each depending on a single input feature." ; skos:prefLabel "GAM" . :GAN-TTS a skos:Concept ; dcterms:source ; skos:definition """**GAN-TTS** is a generative adversarial network for text-to-speech synthesis. The architecture is composed of a conditional feed-forward generator producing raw speech audio, and an ensemble of discriminators which operate on random windows of different sizes. The discriminators analyze the audio both in terms of general realism and in terms of how well the audio corresponds to the utterance that should be pronounced.\r \r The generator architecture consists of several GBlocks, which are residual-based (dilated) [convolution](https://paperswithcode.com/method/convolution) blocks. GBlocks 3–7 gradually upsample the temporal dimension of hidden representations by factors of 2, 2, 2, 3, 5, while the number of channels is reduced by GBlocks 3, 6 and 7 (by a factor of 2 each). 
The final convolutional layer with [Tanh activation](https://paperswithcode.com/method/tanh-activation) produces a single-channel audio waveform.\r \r Instead of a single discriminator, GAN-TTS uses an ensemble of Random Window Discriminators (RWDs) which operate on randomly sub-sampled fragments of the real or generated samples. The ensemble allows for the evaluation of audio in different complementary ways.""" ; skos:prefLabel "GAN-TTS" . :GANFeatureMatching a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Feature Matching** is a regularizing objective for a generator in [generative adversarial networks](https://paperswithcode.com/methods/category/generative-adversarial-networks) that prevents it from overtraining on the current discriminator. Instead of directly maximizing the output of the discriminator, the new objective requires the generator to generate data that matches the statistics of the real data, where we use the discriminator only to specify the statistics that we think are worth matching. Specifically, we train the generator to match the expected value of the features on an intermediate layer of the discriminator. This is a natural choice of statistics for the generator to match, since by training the discriminator we ask it to find those features that are most discriminative of real data versus data generated by the current model.\r \r Letting $\\mathbf{f}\\left(\\mathbf{x}\\right)$ denote activations on an intermediate layer of the discriminator, our new objective for the generator is defined as: $ ||\\mathbb{E}\\_{\\mathbf{x}\\sim p\\_{data}} \\mathbf{f}\\left(\\mathbf{x}\\right) - \\mathbb{E}\\_{\\mathbf{z}\\sim p\\_{\\mathbf{z}}\\left(\\mathbf{z}\\right)}\\mathbf{f}\\left(G\\left(\\mathbf{z}\\right)\\right)||^{2}\\_{2} $. The discriminator, and hence\r $\\mathbf{f}\\left(\\mathbf{x}\\right)$, are trained as with vanilla GANs. 
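This objective reduces to a squared L2 distance between the batch means of intermediate discriminator features; a minimal sketch (illustrative only, assuming the features have already been extracted for a real and a generated batch):

```python
def feature_matching_loss(real_feats, fake_feats):
    # real_feats, fake_feats: lists of feature vectors f(x) taken from an
    # intermediate discriminator layer for real and generated batches.
    dim = len(real_feats[0])
    mean_real = [sum(f[d] for f in real_feats) / len(real_feats) for d in range(dim)]
    mean_fake = [sum(f[d] for f in fake_feats) / len(fake_feats) for d in range(dim)]
    # Squared L2 norm of the difference between the two feature means.
    return sum((r - g) ** 2 for r, g in zip(mean_real, mean_fake))
```

In practice the batch means are Monte Carlo estimates of the two expectations, and only the generator is updated with this loss.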
As with regular [GAN](https://paperswithcode.com/method/gan) training, the objective has a fixed point where G exactly matches the distribution of training data.""" ; skos:prefLabel "GAN Feature Matching" . :GANHingeLoss a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """The **GAN Hinge Loss** is a hinge loss based loss function for [generative adversarial networks](https://paperswithcode.com/methods/category/generative-adversarial-networks):\r \r $$ L\\_{D} = -\\mathbb{E}\\_{\\left(x, y\\right)\\sim{p}\\_{data}}\\left[\\min\\left(0, -1 + D\\left(x, y\\right)\\right)\\right] -\\mathbb{E}\\_{z\\sim{p\\_{z}}, y\\sim{p\\_{data}}}\\left[\\min\\left(0, -1 - D\\left(G\\left(z\\right), y\\right)\\right)\\right] $$\r \r $$ L\\_{G} = -\\mathbb{E}\\_{z\\sim{p\\_{z}}, y\\sim{p\\_{data}}}D\\left(G\\left(z\\right), y\\right) $$""" ; skos:prefLabel "GAN Hinge Loss" . :GANLeastSquaresLoss a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**GAN Least Squares Loss** is a least squares loss function for generative adversarial networks. Minimizing this objective function is equivalent to minimizing the Pearson $\\chi^{2}$ divergence. 
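Numerically, with the common 0-1 coding ($a=0$ for fake, $b=1$ for real, $c=1$), the least squares objectives stated next reduce to the following sketch (my own illustration over batches of raw discriminator outputs, not a reference implementation):

```python
def lsgan_d_loss(d_real, d_fake, a=0.0, b=1.0):
    # Discriminator: push D(x) toward the real label b and
    # D(G(z)) toward the fake label a.
    n, m = len(d_real), len(d_fake)
    return (0.5 * sum((d - b) ** 2 for d in d_real) / n
            + 0.5 * sum((d - a) ** 2 for d in d_fake) / m)

def lsgan_g_loss(d_fake, c=1.0):
    # Generator: push D(G(z)) toward c, the value it wants D to believe.
    return 0.5 * sum((d - c) ** 2 for d in d_fake) / len(d_fake)
```

Unlike the sigmoid cross-entropy loss, these quadratic penalties keep gradients alive even for samples that the discriminator already classifies correctly but that lie far from the decision boundary.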
The objective function (here for [LSGAN](https://paperswithcode.com/method/lsgan)) can be defined as:\r \r $$ \\min\\_{D}V\\_{LS}\\left(D\\right) = \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{x} \\sim p\\_{data}\\left(\\mathbf{x}\\right)}\\left[\\left(D\\left(\\mathbf{x}\\right) - b\\right)^{2}\\right] + \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{z}\\sim p\\_{\\mathbf{z}}\\left(\\mathbf{z}\\right)}\\left[\\left(D\\left(G\\left(\\mathbf{z}\\right)\\right) - a\\right)^{2}\\right] $$\r \r $$ \\min\\_{G}V\\_{LS}\\left(G\\right) = \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{z} \\sim p\\_{\\mathbf{z}}\\left(\\mathbf{z}\\right)}\\left[\\left(D\\left(G\\left(\\mathbf{z}\\right)\\right) - c\\right)^{2}\\right] $$\r \r where $a$ and $b$ are the labels for fake data and real data and $c$ denotes the value that $G$ wants $D$ to believe for fake data.""" ; skos:prefLabel "GAN Least Squares Loss" . :GANformer a skos:Concept ; dcterms:source ; skos:altLabel "Generative Adversarial Transformer" ; skos:definition """GANformer is a novel and efficient type of [transformer](https://paperswithcode.com/method/transformer) which can be used for visual generative modeling. The network employs a bipartite structure that enables long-range interactions across an image while maintaining linear computational efficiency, and can readily scale to high-resolution synthesis. It iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and encourage the emergence of compositional representations of objects and scenes.\r \r Source: [Generative Adversarial Transformers](https://arxiv.org/pdf/2103.01209v2.pdf)\r \r Image source: [Generative Adversarial Transformers](https://arxiv.org/pdf/2103.01209v2.pdf)""" ; skos:prefLabel "GANformer" . 
:GAP-Layer a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Spectral Gap Rewiring Layer" ; skos:definition """**TL;DR: GAP-Layer is a GNN layer which is able to rewire a graph in an inductive and parameter-free way, optimizing the spectral gap (minimizing or maximizing the bottleneck size) and learning a differentiable way to compute the Fiedler vector and the Fiedler value of the graph.**\r \r ## Summary\r **GAP-Layer** is a rewiring layer based on minimizing or maximizing the spectral gap (or graph bottleneck size) in an inductive way. Depending on the mining task we want to perform in our graph, we would like to maximize or minimize the size of the bottleneck, aiming for more connected or more separated communities. \r \r ## GAP-Layer: Spectral Gap Rewiring\r \r #### Loss and derivatives using $\\mathbf{L}$ or $\\mathbf{\\cal L}$\r For this explanation, we are going to suppose we want to minimize the spectral gap, i.e. make the graph bottleneck size smaller. For minimizing the spectral gap we minimize this loss:\r \r $$\r L\\_{Fiedler} = \\|\\tilde{\\mathbf{A}}-\\mathbf{A}\\|\\_F + \\alpha(\\lambda\\_2)^2\r $$\r \r The gradients of this cost function w.r.t. each element of $\\mathbf{A}$ are not trivial. Depending on whether we use the Laplacian, $\\mathbf{L}$, or the normalized Laplacian, $\\cal L$, the derivatives are going to be different. For the former case ($\\mathbf{L}$), we will use the derivatives presented in Kang et al. 2019. In the latter scenario ($\\cal L$), we present the **Spectral Gradients**: derivatives of the spectral gap w.r.t. the Normalized Laplacian. 
However, whatever option we choose, $\\lambda_2$ can be seen as a function of $\\tilde{\\mathbf{A}}$ and, hence, $\\nabla\\_{\\tilde{\\mathbf{A}}}\\lambda\\_2$, the gradient of $\\lambda\\_2$ w.r.t. each component of $\\tilde{\\mathbf{A}}$ (*how does the bottleneck change with each change in our graph?*), comes from the chain rule of the matrix derivative $Tr\\left[\\left(\\nabla\\_{\\tilde{\\mathbf{L}}}\\lambda\\_2\\right)^T\\cdot\\nabla\\_{\\tilde{\\mathbf{A}}}\\tilde{\\mathbf{L}}\\right]$ if using the Laplacian or $Tr\\left[\\left(\\nabla\\_{\\tilde{\\mathbf{\\cal L}}}\\lambda\\_2\\right)^T\\cdot\\nabla\\_{\\tilde{\\mathbf{A}}}\\tilde{\\mathbf{\\cal L}}\\right]$ if using the normalized Laplacian. Both of these derivatives rely on the Fiedler vector (2nd eigenvector: $\\mathbf{f}\\_2$ if we use $\\mathbf{L}$, and $\\mathbf{g}\\_2$ if using $\\mathbf{\\cal L}$ instead). For more details on those derivatives, and for the sake of simplicity in this blog explanation, I suggest going to the original paper.\r \r #### Differentiable approximation of $\\mathbf{f}_2$ and $\\lambda_2$\r Once we have those derivatives, the problem is still not that trivial. Note that our cost function $L\\_{Fiedler}$ relies on an eigenvalue $\\lambda\\_2$. In addition, the derivatives also depend on the Fiedler vector $\\mathbf{f}\\_2$ or $\\mathbf{g}\\_2$, which is the eigenvector corresponding to the aforementioned eigenvalue. However, we **DO NOT COMPUTE IT SPECTRALLY**, as its computation has a complexity of $O(n^3)$ and would need to be computed in every learning iteration. Instead, **we learn an approximation of $\\mathbf{f}\\_2$ and use its Dirichlet energy ${\\cal E}(\\mathbf{f}\\_2)$ to approximate $\\lambda_2$**. 
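For intuition, the Dirichlet energy itself is cheap to evaluate: ${\\cal E}(\\mathbf{f}) = \\mathbf{f}^T\\mathbf{L}\\mathbf{f}$, which for a unit-norm $\\mathbf{f}$ close to the Fiedler vector approximates $\\lambda\\_2$. A tiny sketch on a dense adjacency matrix (my own illustration, not the paper's code):

```python
def dirichlet_energy(adj, f):
    # E(f) = f^T L f = 0.5 * sum over all u, v of A[u][v] * (f[u] - f[v])^2.
    # For a unit-norm f approximating the Fiedler vector, this
    # approximates the spectral gap lambda_2.
    n = len(adj)
    return 0.5 * sum(adj[u][v] * (f[u] - f[v]) ** 2
                     for u in range(n) for v in range(n))
```

On the two-node graph $K\\_2$ with $\\mathbf{f} = (1/\\sqrt{2}, -1/\\sqrt{2})$, this returns 2, which is exactly $\\lambda\\_2$ of its Laplacian.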
\r $$\r \\mathbf{f}\\_2(u) = \\left\\{\\begin{array}{cl}\r +1/\\sqrt{n} & \\text{if}\\;\\; u\\;\\; \\text{belongs to the first cluster} \\\\\r -1/\\sqrt{n} & \\text{if}\\;\\; u\\;\\; \\text{belongs to the second cluster} \r \\end{array}\\right. \r $$\r In addition, if using $\\mathbf{\\cal L}$, since $\\mathbf{g}\\_2=\\mathbf{D}^{1/2}\\mathbf{f}_2$, we first approximate $\\mathbf{g}_2$ and then approximate $\\lambda_2$ from ${\\cal E}(\\mathbf{g}\\_2)$. With this approximation, we can easily compute the nodes belonging to each cluster with a simple MLP. In addition, since the Fiedler vector must satisfy orthogonality and normality, these restrictions must be added to that MLP clustering.\r \r ### GAP-Layer\r To sum up, **GAP-Layer** can be defined as follows. Given the matrix $\\mathbf{X}\\_{n\\times F}$ encoding the features of the nodes after any message passing (MP) layer, $\\mathbf{S}\\_{n\\times 2}=\\textrm{Softmax}(\\textrm{MLP}(\\mathbf{X}))$ learns the association $\\mathbf{X}\\rightarrow \\mathbf{S}$ while $\\mathbf{S}$ is optimized according to the loss:\r \r $$\r L\\_{Cut} = -\\frac{Tr[\\mathbf{S}^T\\mathbf{A}\\mathbf{S}]}{Tr[\\mathbf{S}^T\\mathbf{D}\\mathbf{S}]} + \\left\\|\\frac{\\mathbf{S}^T\\mathbf{S}}{\\|\\mathbf{S}^T\\mathbf{S}\\|\\_F} - \\frac{\\mathbf{I}\\_n}{\\sqrt{2}}\\right\\|\\_F\r $$\r Then, $\\mathbf{f}\\_2$ is approximated from $\\mathbf{S}$ using the $\\mathbf{f}\\_2(u)$ equation. Once $\\mathbf{f}\\_2$ and $\\lambda\\_2$ are calculated, we consider the loss:\r \r $$\r L\\_{Fiedler} = \\|\\tilde{\\mathbf{A}}-\\mathbf{A}\\|\\_F + \\alpha(\\lambda\\_2)^2\r $$\r $$\\mathbf{\\tilde{A}} = \\mathbf{A} - \\mu \\nabla_\\mathbf{\\tilde{A}}\\lambda\\_2$$\r returning $\\tilde{\\mathbf{A}}$. Then the GAP diffusion $\\mathbf{T}^{GAP} = \\tilde{\\mathbf{A}}(\\mathbf{S}) \\odot \\mathbf{A}$ results from minimizing \r \r $$L_{GAP}= L\\_{Cut} + L\\_{Fiedler}$$\r \r \r **References**\r (Kang et al. 2019) Kang, J., & Tong, H. (2019, November). N2n: Network derivative mining. 
In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (pp. 861-870).""" ; skos:prefLabel "GAP-Layer" . :GAT a skos:Concept ; dcterms:source ; skos:altLabel "Graph Attention Network" ; skos:definition """A **Graph Attention Network (GAT)** is a neural network architecture that operates on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods’ features, a GAT enables (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront.\r \r See [here](https://docs.dgl.ai/en/0.4.x/tutorials/models/1_gnn/9_gat.html) for an explanation by DGL.""" ; skos:prefLabel "GAT" . :GATv2 a skos:Concept ; dcterms:source ; skos:altLabel "Graph Attention Network v2" ; skos:definition """The __GATv2__ operator from the [“How Attentive are Graph Attention Networks?”](https://arxiv.org/abs/2105.14491) paper, which fixes the static attention problem of the standard [GAT](https://paperswithcode.com/method/gat) layer: since the linear layers in the standard GAT are applied right after each other, the ranking of attended nodes is unconditioned on the query node. In contrast, in GATv2, every node can attend to any other node.\r \r GATv2 scoring function:\r \r $e_{i,j} =\\mathbf{a}^{\\top}\\mathrm{LeakyReLU}\\left(\\mathbf{W}[\\mathbf{h}_i \\, \\Vert \\,\\mathbf{h}_j]\\right)$""" ; skos:prefLabel "GATv2" . :GBO a skos:Concept ; rdfs:seeAlso ; skos:altLabel "Gradient-based optimization" ; skos:definition """GBO is a novel metaheuristic optimization algorithm. 
The GBO, inspired by the gradient-based Newton’s method, uses two main operators: gradient search rule (GSR) and local escaping operator (LEO) and a set of vectors to explore the search space. The GSR employs the gradient-based method to enhance the exploration tendency and accelerate the convergence rate to achieve better positions in the search space. The LEO enables the proposed GBO to escape from local optima. The performance of the new algorithm was evaluated in two phases. 28 mathematical test functions were first used to evaluate various characteristics of the GBO, and then six engineering problems were optimized by the GBO. In the first phase, the GBO was compared with five existing optimization algorithms, indicating that the GBO yielded very promising results due to its enhanced capabilities of exploration, exploitation, convergence, and effective avoidance of local optima. The second phase also demonstrated the superior performance of the GBO in solving complex real-world engineering problems. \r \r * The source codes of GBO are publicly available at https://imanahmadianfar.com/codes/.""" ; skos:prefLabel "GBO" . :GBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**GBlock** is a type of [residual block](https://paperswithcode.com/method/residual-block) used in the [GAN-TTS](https://paperswithcode.com/method/gan-tts) text-to-speech architecture - it is a stack of two residual blocks. As the generator is producing raw audio (e.g. a 2s training clip corresponds\r to a sequence of 48000 samples), dilated convolutions are used to ensure that the receptive field of $G$ is large enough to capture long-term dependencies. The four kernel size-3 convolutions in each GBlock have increasing dilation factors: 1, 2, 4, 8. 
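As a back-of-the-envelope check (my own snippet, not from the paper), the receptive field of such a stack of stride-1 dilated convolutions can be computed as:

```python
def receptive_field(kernel_size, dilations):
    # Each stride-1 convolution with kernel size k and dilation d adds
    # (k - 1) * d samples to the receptive field of the stack.
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf
```

For kernel size 3 and dilations 1, 2, 4, 8 this gives a receptive field of 31 samples per GBlock, which is why stacking dilated GBlocks lets the generator capture long-term dependencies in raw audio.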
Convolutions are preceded by Conditional Batch Normalisation, conditioned on the linear embeddings of the noise term $z \\sim N\\left(0, \\mathbf{I}\\_{128}\\right)$ in the single-speaker case, or the concatenation of $z$ and a one-hot representation of the speaker ID in the multi-speaker case. The embeddings are different for\r each BatchNorm instance. \r \r A GBlock contains two skip connections, the first of which in [GAN](https://paperswithcode.com/method/gan)-TTS performs upsampling if the output frequency is higher than the input, and it also contains a size-1 [convolution](https://paperswithcode.com/method/convolution)\r if the number of output channels is different from the input.""" ; skos:prefLabel "GBlock" . :GCN a skos:Concept ; dcterms:source ; skos:altLabel "Graph Convolutional Network" ; skos:definition "A **Graph Convolutional Network**, or **GCN**, is an approach for semi-supervised learning on graph-structured data. It is based on an efficient variant of [convolutional neural networks](https://paperswithcode.com/methods/category/convolutional-neural-networks) which operate directly on graphs. The choice of convolutional architecture is motivated via a localized first-order approximation of spectral graph convolutions. The model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes." ; skos:prefLabel "GCN" . :GCNFN a skos:Concept ; dcterms:source ; skos:altLabel "Graph Convolutional Networks for Fake News Detection" ; skos:definition "Social media are nowadays one of the main news sources for millions of people around the globe due to their low cost, easy access and rapid dissemination. This however comes at the cost of dubious trustworthiness and significant risk of exposure to 'fake news', intentionally written to mislead the readers. Automatically detecting fake news poses challenges that defy existing content-based analysis approaches. 
One of the main reasons is that often the interpretation of the news requires the knowledge of political or social context or 'common sense', which current NLP algorithms are still missing. Recent studies have shown that fake and real news spread differently on social media, forming propagation patterns that could be harnessed for automatic fake news detection. Propagation-based approaches have multiple advantages compared to their content-based counterparts, among which are language independence and better resilience to adversarial attacks. In this paper we present a novel automatic fake news detection model based on geometric deep learning. The underlying core algorithms are a generalization of classical CNNs to graphs, allowing the fusion of heterogeneous data such as content, user profile and activity, social graph, and news propagation. Our model was trained and tested on news stories, verified by professional fact-checking organizations, that were spread on Twitter. First, our experiments indicate that social network structure and propagation are important features allowing highly accurate (92.7% ROC AUC) fake news detection. Second, we observe that fake news can be reliably detected at an early stage, after just a few hours of propagation. Third, we test the aging of our model on training and testing data separated in time. Our results point to the promise of propagation-based approaches for fake news detection as an alternative or complementary strategy to content-based approaches." ; skos:prefLabel "GCNFN" . :GCNII a skos:Concept ; dcterms:source ; skos:definition "**GCNII** is an extension of [Graph Convolution Networks](https://www.paperswithcode.com/method/gcn) with two new techniques, initial residual and identity mapping, to tackle the problem of oversmoothing -- where stacking more layers and adding non-linearity tends to degrade performance. 
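One propagation step with these two techniques can be sketched as follows (my own illustrative pseudo-implementation in plain Python, with the nonlinearity and normalization omitted):

```python
def matmul(X, Y):
    # Plain-Python matrix product, to keep the sketch dependency-free.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def gcnii_layer(A_hat, H, H0, W, alpha, beta):
    # Initial residual: blend the propagated signal A_hat @ H with the
    # input-layer representation H0.
    AH = matmul(A_hat, H)
    P = [[(1 - alpha) * p + alpha * h for p, h in zip(pr, hr)]
         for pr, hr in zip(AH, H0)]
    # Identity mapping: apply (1 - beta) * I + beta * W to P, i.e. mix
    # the weight matrix with an identity matrix.
    PW = matmul(P, W)
    return [[(1 - beta) * p + beta * q for p, q in zip(pr, qr)]
            for pr, qr in zip(P, PW)]
```

With alpha = 0 and beta = 1 this collapses to a plain GCN propagation step; small beta at deep layers is what keeps the mapping close to the identity and counteracts oversmoothing.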
At each layer, initial residual constructs a skip connection from the input layer, while identity mapping adds an identity matrix to the weight matrix." ; skos:prefLabel "GCNII" . :GCNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "A **Global Context Network**, or **GCNet**, utilises global context blocks to model long-range dependencies in images. It is based on the [Non-Local Network](https://paperswithcode.com/method/non-local-block), but it modifies the architecture so less computation is required. Global context blocks are applied to multiple layers in a backbone network to construct the GCNet." ; skos:prefLabel "GCNet" . :GCT a skos:Concept ; dcterms:source ; skos:altLabel "Gated Channel Transformation" ; skos:definition """Unlike previous methods, GCT first collects global information by computing the $l_{2}$-norm of each channel. \r Next, a learnable vector $\\alpha$ is applied to scale the feature.\r Then a competition mechanism is adopted by \r channel normalization to interact between channels. \r Like other common normalization methods, \r a learnable scale parameter $\\gamma$ and bias $\\beta$ are applied to \r rescale the normalization.\r However, unlike previous methods,\r GCT adopts tanh activation to control the attention vector.\r Finally, it not only multiplies the input by the attention vector but also adds an identity connection. GCT can be written as: \r \\begin{align}\r s = F_\\text{gct}(X, \\theta) & = \\tanh (\\gamma CN(\\alpha \\text{Norm}(X)) + \\beta)\r \\end{align}\r \\begin{align}\r Y & = s X + X\r \\end{align}\r \r where $\\alpha$, $\\beta$ and $\\gamma$ are trainable parameters. $\\text{Norm}(\\cdot)$ indicates the $L2$-norm of each channel. 
$CN$ is channel normalization.\r \r A GCT block has fewer parameters than an SE block, and as it is lightweight, \r can be added after each convolutional layer of a CNN.""" ; skos:prefLabel "GCT" . :GCU a skos:Concept ; dcterms:source ; skos:altLabel "Growing Cosine Unit" ; skos:definition "An oscillatory function defined as $x \\cdot cos(x)$ that reports better performance than Sigmoid, Mish, Swish, and ReLU on several benchmarks." ; skos:prefLabel "GCU" . :GECO a skos:Concept ; dcterms:source ; skos:altLabel "Generalized ELBO with Constrained Optimization" ; skos:definition "" ; skos:prefLabel "GECO" . :GEE a skos:Concept ; dcterms:source ; skos:altLabel "Generative Emotion Estimator" ; skos:definition "" ; skos:prefLabel "GEE" . :GELU a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Gaussian Error Linear Units" ; skos:definition """The **Gaussian Error Linear Unit**, or **GELU**, is an activation function. The GELU activation function is $x\\Phi(x)$, where $\\Phi(x)$ the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their percentile, rather than gates inputs by their sign as in [ReLUs](https://paperswithcode.com/method/relu) ($x\\mathbf{1}_{x>0}$). Consequently the GELU can be thought of as a smoother ReLU.\r \r $$\\text{GELU}\\left(x\\right) = x{P}\\left(X\\leq{x}\\right) = x\\Phi\\left(x\\right) = x \\cdot \\frac{1}{2}\\left[1 + \\text{erf}(x/\\sqrt{2})\\right],$$\r if $X\\sim \\mathcal{N}(0,1)$.\r \r One can approximate the GELU with\r $0.5x\\left(1+\\tanh\\left[\\sqrt{2/\\pi}\\left(x + 0.044715x^{3}\\right)\\right]\\right)$ or $x\\sigma\\left(1.702x\\right),$\r but PyTorch's exact implementation is sufficiently fast such that these approximations may be unnecessary. 
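A quick self-contained comparison of the exact form and the tanh approximation (illustrative only):

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF, written via erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh approximation quoted above.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))
```

Over typical activation ranges the two forms agree to roughly three decimal places, which is why the approximation was historically used in fast implementations.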
(See also the [SiLU](https://paperswithcode.com/method/silu) $x\\sigma(x)$, which was also coined in the paper that introduced the GELU.)\r \r GELUs are used in [GPT-3](https://paperswithcode.com/method/gpt-3), [BERT](https://paperswithcode.com/method/bert), and most other Transformers.""" ; skos:prefLabel "GELU" . :GENet a skos:Concept ; dcterms:source ; skos:altLabel "GPU-Efficient Network" ; skos:definition "**GENets**, or **GPU-Efficient Networks**, are a family of efficient models found through [neural architecture search](https://paperswithcode.com/methods/category/neural-architecture-search). The search occurs over several types of convolutional block, which include [depth-wise convolutions](https://paperswithcode.com/method/depthwise-convolution), [batch normalization](https://paperswithcode.com/method/batch-normalization), [ReLU](https://paperswithcode.com/method/relu), and an [inverted bottleneck](https://paperswithcode.com/method/inverted-residual-block) structure." ; skos:prefLabel "GENet" . :GEOMANCER a skos:Concept ; dcterms:source ; skos:altLabel "Geometric Manifold Component Estimator" ; skos:definition "**Geomancer** is a nonparametric algorithm for symmetry-based disentangling of data manifolds. It learns a set of subspaces to assign to each point in the dataset, where each subspace is the tangent space of one disentangled submanifold. This means that Geomancer can be used to disentangle manifolds for which there may not be a global axis-aligned coordinate system." ; skos:prefLabel "GEOMANCER" . :GER a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Gait Emotion Recognition" ; skos:definition """We present a novel classifier network called STEP to classify perceived human emotion from gaits, based on a Spatial Temporal Graph Convolutional Network (ST-[GCN](https://paperswithcode.com/method/gcn)) architecture. 
Given an RGB video of an individual walking, our formulation implicitly exploits the gait features to classify the perceived emotion of the human into one of four emotions: happy, sad, angry, or neutral. We train STEP on annotated real-world gait videos, augmented with annotated synthetic gaits generated using a novel generative network called STEP-Gen, built on an ST-GCN based Conditional Variational Autoencoder (CVAE). We incorporate a novel push-pull regularization loss in the CVAE formulation of STEP-Gen to generate realistic gaits and improve the classification accuracy of STEP.\r We also release a novel dataset (E-Gait), which consists of 4,227 human gaits annotated with perceived emotions along with thousands of synthetic gaits. In practice, STEP can learn the affective features and exhibits classification accuracy of 88\\% on E-Gait, which is 14--30\\% more accurate than prior methods.""" ; skos:prefLabel "GER" . :GFP-GAN a skos:Concept ; dcterms:source ; skos:definition "**GFP-GAN** is a generative adversarial network for blind face restoration that leverages a generative facial prior (GFP). This Generative Facial Prior (GFP) is incorporated into the face restoration process via channel-split spatial feature transform layers, which allow for a good balance between realness and fidelity. As a whole, the GFP-GAN consists of a degradation removal module ([U-Net](https://paperswithcode.com/method/u-net)) and a pretrained face [StyleGAN](https://paperswithcode.com/method/stylegan) as a facial prior. They are bridged by a latent code mapping and several Channel-Split [Spatial Feature Transform](https://paperswithcode.com/method/spatial-feature-transform) (CS-SFT) layers. During training, 1) intermediate restoration losses are employed to remove complex degradation, 2) a facial component loss with discriminators is used to enhance facial details, and 3) an identity-preserving loss is used to retain face identity." ; skos:prefLabel "GFP-GAN" . 
:GFSA a skos:Concept ; dcterms:source ; skos:altLabel "Graph Finite-State Automaton" ; skos:definition "**Graph Finite-State Automaton**, or **GFSA**, is a differentiable layer for learning graph structure that adds a new edge type (expressed as a weighted adjacency matrix) to a base graph. This layer can be trained end-to-end to add derived relationships (edges) to arbitrary graph-structured data based on performance on a downstream task." ; skos:prefLabel "GFSA" . :GGS-NNs a skos:Concept ; dcterms:source ; skos:altLabel "Gated Graph Sequence Neural Networks" ; skos:definition """Gated Graph Sequence Neural Networks (GGS-NNs) are a novel graph-based neural network model. GGS-NNs modify Graph Neural Networks (Scarselli et al., 2009) to use gated recurrent units and modern optimization techniques, and then extend them to output sequences.\r \r Source: [Li et al.](https://arxiv.org/pdf/1511.05493v4.pdf)\r \r Image source: [Li et al.](https://arxiv.org/pdf/1511.05493v4.pdf)""" ; skos:prefLabel "GGS-NNs" . :GHM-C a skos:Concept ; dcterms:source ; skos:altLabel "Gradient Harmonizing Mechanism C" ; skos:definition "**GHM-C** is a loss function designed to balance the gradient flow for anchor classification. The GHM first performs statistics on the number of examples with similar attributes w.r.t their gradient density and then attaches a harmonizing parameter to the gradient of each example according to the density. The modification of gradient can be equivalently implemented by reformulating the loss function. Embedding the GHM into the classification loss is denoted as GHM-C loss. Since the gradient density is a statistical variable depending on the examples distribution in a mini-batch, GHM-C is a dynamic loss that can adapt to the change of data distribution in each batch as well as to the updating of the model." ; skos:prefLabel "GHM-C" . 
:GHM-R a skos:Concept ; dcterms:source ; skos:altLabel "Gradient Harmonizing Mechanism R" ; skos:definition "**GHM-R** is a loss function designed to balance the gradient flow for bounding box refinement. The GHM first performs statistics on the number of examples with similar attributes w.r.t their gradient density and then attaches a harmonizing parameter to the gradient of each example according to the density. The modification of gradient can be equivalently implemented by reformulating the loss function. Embedding the GHM into the bounding box regression branch is denoted as GHM-R loss." ; skos:prefLabel "GHM-R" . :GIC a skos:Concept ; dcterms:source ; skos:altLabel "Graph InfoClust" ; skos:definition "" ; skos:prefLabel "GIC" . :GIN a skos:Concept ; dcterms:source ; skos:altLabel "Graph Isomorphism Network" ; skos:definition "Per the authors, Graph Isomorphism Network (GIN) generalizes the WL test and hence achieves maximum discriminative power among GNNs." ; skos:prefLabel "GIN" . :GLIDE a skos:Concept ; dcterms:source ; skos:altLabel "Guided Language to Image Diffusion for Generation and Editing" ; skos:definition "GLIDE is a generative model based on text-guided diffusion models for more photorealistic image generation. Guided diffusion is applied to text-conditional image synthesis and the model is able to handle free-form prompts. The diffusion model uses a text encoder to condition on natural language descriptions. The model is provided with editing capabilities in addition to zero-shot generation, allowing for iterative improvement of model samples to match more complex prompts. The model is fine-tuned to perform image inpainting." ; skos:prefLabel "GLIDE" . :GLM a skos:Concept ; dcterms:source ; skos:definition "**GLM** is a bilingual (English and Chinese) pre-trained transformer-based language model that follows the traditional architecture of decoder-only autoregressive language modeling. 
It leverages autoregressive blank infilling as its training objective." ; skos:prefLabel "GLM" . :GLN a skos:Concept ; dcterms:source ; skos:altLabel "Gated Linear Network" ; skos:definition """A **Gated Linear Network**, or **GLN**, is a type of backpropagation-free neural architecture. What distinguishes GLNs from contemporary neural networks is the distributed and local nature of their credit assignment mechanism; each neuron directly predicts the target, forgoing the ability to learn feature representations in favor of rapid online learning. Individual neurons can model nonlinear functions via the use of data-dependent gating in conjunction with online convex optimization. \r \r GLNs are feedforward networks composed of many layers of gated geometric mixing neurons as shown in the Figure. Each neuron in a given layer outputs a gated geometric mixture of the predictions from the previous layer, with the final layer consisting of just a single neuron. In a supervised learning setting, a $\\mathrm{GLN}$ is trained on (side information, base predictions, label) triplets $\\left(z\\_{t}, p\\_{t}, x\\_{t}\\right)_{t=1,2,3, \\ldots}$ derived from input-label pairs $\\left(z\\_{t}, x\\_{t}\\right)$. There are two types of input to neurons in the network: the first is the side information $z\\_{t}$, which can be thought of as the input features; the second is the input to the neuron, which will be the predictions output by the previous layer, or in the case of layer 0, some (optionally) provided base predictions $p\\_{t}$ that typically will be a function of $z\\_{t}$. Each neuron will also take in a constant bias prediction, which helps empirically and is essential for universality guarantees.\r \r Weights are learnt in a Gated Linear Network using Online Gradient Descent (OGD) locally at each neuron. 
The key observation is that, as each neuron $(i, k)$ in layers $i>0$ is itself a gated geometric mixture, all of these neurons can be thought of as individually predicting the target. Given side information $z$, each neuron $(i, k)$ suffers a loss convex in its active weights $u:=w\\_{i k c\\_{i k}(z)}$ of\r $$\r \\ell\\_{t}(u):=-\\log \\left(\\operatorname{GEO}\\_{u}\\left(x\\_{t} ; p\\_{i-1}\\right)\\right)\r $$""" ; skos:prefLabel "GLN" . :GLOW a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**GLOW** is a type of flow-based generative model that is based on an invertible $1 \\times 1$ [convolution](https://paperswithcode.com/method/convolution). This builds on the flows introduced by [NICE](https://paperswithcode.com/method/nice) and [RealNVP](https://paperswithcode.com/method/realnvp). It consists of a series of steps of flow, combined in a multi-scale architecture; see the Figure to the right. Each step of flow consists of an activation normalization (actnorm) layer followed by an *invertible $1 \\times 1$ convolution* followed by an [affine coupling](https://paperswithcode.com/method/affine-coupling) layer." ; skos:prefLabel "GLOW" . :GLU a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Gated Linear Unit" ; skos:definition """A **Gated Linear Unit**, or **GLU**, computes:\r \r $$ \\text{GLU}\\left(a, b\\right) = a\\otimes \\sigma\\left(b\\right) $$\r \r It is used in natural language processing architectures, for example the [Gated CNN](https://paperswithcode.com/method/gated-convolution-network), because here $b$ is the gate that controls what information from $a$ is passed up to the following layer. Intuitively, for a language modeling task, the gating mechanism allows selection of words or features that are important for predicting the next word. The GLU also has non-linear capabilities, but has a linear path for the gradient, so it diminishes the vanishing gradient problem.""" ; skos:prefLabel "GLU" . 
:GMI a skos:Concept ; dcterms:source ; skos:altLabel "Graphic Mutual Information" ; skos:definition "**Graphic Mutual Information**, or **GMI**, measures the correlation between input graphs and high-level hidden representations. GMI generalizes the idea of conventional mutual information computations from vector space to the graph domain where measuring mutual information from two aspects of node features and topological structure is indispensable. GMI exhibits several benefits: First, it is invariant to the isomorphic transformation of input graphs---an inevitable constraint in many existing graph representation learning algorithms; Besides, it can be efficiently estimated and maximized by current mutual information estimation methods such as MINE." ; skos:prefLabel "GMI" . :GMVAE a skos:Concept ; dcterms:source ; skos:altLabel "Gaussian Mixture Variational Autoencoder" ; skos:definition "**GMVAE**, or **Gaussian Mixture Variational Autoencoder**, is a stochastic regularization layer for [transformers](https://paperswithcode.com/methods/category/transformers). A GMVAE layer is trained using a 700-dimensional internal representation of the first MLP layer. For every output from the first MLP layer, the GMVAE layer first computes a latent low-dimensional representation sampling from the GMVAE posterior distribution to then provide at the output a reconstruction sampled from a generative model." ; skos:prefLabel "GMVAE" . :GNNCL a skos:Concept ; dcterms:source ; skos:altLabel "Graph Neural Networks with Continual Learning" ; skos:definition "Although significant effort has been applied to fact-checking, the prevalence of fake news over social media, which has profound impact on justice, public trust and our society, remains a serious problem. In this work, we focus on propagation-based fake news detection, as recent studies have demonstrated that fake news and real news spread differently online. 
Specifically, considering the capability of graph neural networks (GNNs) in dealing with non-Euclidean data, we use GNNs to differentiate between the propagation patterns of fake and real news on social media. In particular, we concentrate on two questions: (1) Without relying on any text information, e.g., tweet content, replies and user descriptions, how accurately can GNNs identify fake news? Machine learning models are known to be vulnerable to adversarial attacks, and avoiding the dependence on text-based features can make the model less susceptible to the manipulation of advanced fake news fabricators. (2) How to deal with new, unseen data? In other words, how does a GNN trained on a given dataset perform on a new and potentially vastly different dataset? If it achieves unsatisfactory performance, how do we solve the problem without re-training the model on the entire data from scratch? We study the above questions on two datasets with thousands of labelled news items, and our results show that: (1) GNNs can achieve comparable or superior performance without any text information to state-of-the-art methods. (2) GNNs trained on a given dataset may perform poorly on new, unseen data, and direct incremental training cannot solve the problem---this issue has not been addressed in the previous work that applies GNNs for fake news detection. In order to solve the problem, we propose a method that achieves balanced performance on both existing and new datasets, by using techniques from continual learning to train GNNs incrementally." ; skos:prefLabel "GNNCL" . :GNS a skos:Concept ; dcterms:source ; skos:altLabel "Graph Network-based Simulators" ; skos:definition "**Graph Network-Based Simulators** is a type of graph neural network that represents the state of a physical system with particles, expressed as nodes in a graph, and computes dynamics via learned message-passing." ; skos:prefLabel "GNS" . 
:GPFL a skos:Concept ; dcterms:source ; skos:altLabel "Graph Path Feature Learning" ; skos:definition "**Graph Path Feature Learning** is a probabilistic rule learner optimized to mine instantiated first-order logic rules from knowledge graphs. Instantiated rules contain constants extracted from KGs. Compared to abstract rules that contain no constants, instantiated rules are capable of explaining and expressing concepts in more detail. GPFL utilizes a novel two-stage rule generation mechanism that first generalizes extracted paths into templates that are acyclic abstract rules until a certain degree of template saturation is achieved, then specializes the generated templates into instantiated rules." ; skos:prefLabel "GPFL" . :GPS a skos:Concept ; dcterms:source ; skos:altLabel "Greedy Policy Search" ; skos:definition "**Greedy Policy Search** (GPS) is a simple algorithm that learns a policy for test-time data augmentation based on the predictive performance on a validation set. GPS starts with an empty policy and builds it in an iterative fashion. Each step selects a sub-policy that provides the largest improvement in calibrated log-likelihood of ensemble predictions and adds it to the current policy." ; skos:prefLabel "GPS" . :GPSA a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Gated Positional Self-Attention" ; skos:definition "**Gated Positional Self-Attention (GPSA)** is a self-attention module for vision transformers, used in the [ConViT](https://paperswithcode.com/method/convit) architecture, that can be initialized as a convolutional layer -- helping a ViT learn inductive biases about locality." ; skos:prefLabel "GPSA" . :GPT a skos:Concept ; dcterms:source ; skos:definition """**GPT** is a [Transformer](https://paperswithcode.com/method/transformer)-based architecture and training procedure for natural language processing tasks. Training follows a two-stage procedure. 
First, a language modeling objective is used on\r the unlabeled data to learn the initial parameters of a neural network model. Subsequently, these parameters are adapted to a target task using the corresponding supervised objective.""" ; skos:prefLabel "GPT" . :GPT-2 a skos:Concept ; dcterms:source ; skos:definition """**GPT-2** is a [Transformer](https://paperswithcode.com/methods/category/transformers) architecture that was notable for its size (1.5 billion parameters) on its release. The model is pretrained on a WebText dataset - text from 45 million website links. It largely follows the previous [GPT](https://paperswithcode.com/method/gpt) architecture with some modifications:\r \r - [Layer normalization](https://paperswithcode.com/method/layer-normalization) is moved to the input of each sub-block, similar to a\r pre-activation residual network and an additional layer normalization was added after the final self-attention block. \r \r - A modified initialization which accounts for the accumulation on the residual path with model depth\r is used. Weights of residual layers are scaled at initialization by a factor of $1/\\sqrt{N}$ where $N$ is the number of residual layers. \r \r - The vocabulary is expanded to 50,257. The context size is expanded from 512 to 1024 tokens and\r a larger batch size of 512 is used.""" ; skos:prefLabel "GPT-2" . :GPT-3 a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**GPT-3** is an autoregressive [transformer](https://paperswithcode.com/methods/category/transformers) model with 175 billion\r parameters. 
It uses the same architecture/model as [GPT-2](https://paperswithcode.com/method/gpt-2), including the modified initialization, pre-normalization, and reversible tokenization, with the exception that GPT-3 uses alternating dense and locally banded sparse attention patterns in the layers of the [transformer](https://paperswithcode.com/method/transformer), similar to the [Sparse Transformer](https://paperswithcode.com/method/sparse-transformer).""" ; skos:prefLabel "GPT-3" . :GPT-4 a skos:Concept ; dcterms:source ; skos:definition "**GPT-4** is a transformer based model pre-trained to predict the next token in a document." ; skos:prefLabel "GPT-4" . :GPT-Neo a skos:Concept ; skos:definition """An implementation of model & data parallel [GPT3-like](https://paperswithcode.com/method/gpt-3) models using the [mesh-tensorflow](https://github.com/tensorflow/mesh) library.\r \r Source: [EleutherAI/GPT-Neo](https://github.com/EleutherAI/gpt-neo)""" ; skos:prefLabel "GPT-Neo" . :GPT-NeoX a skos:Concept ; dcterms:source ; skos:definition "**GPT-NeoX** is an autoregressive transformer decoder model whose architecture largely follows that of GPT-3, with a few notable deviations. The model has 20 billion parameters with 44 layers, a hidden dimension size of 6144, and 64 heads. The main difference with GPT-3 is the change in tokenizer, the addition of Rotary Positional Embeddings, the parallel computation of attention and feed-forward layers, and a different initialization scheme and hyperparameters." ; skos:prefLabel "GPT-NeoX" . :GPipe a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**GPipe** is a distributed model parallel method for neural networks. With GPipe, each model can be specified as a sequence of layers, and consecutive groups of layers can be partitioned into cells. Each cell is then placed on a separate accelerator. Based on this partitioned setup, batch splitting is applied. 
A mini-batch of training examples is split into smaller micro-batches, then the execution of each set of micro-batches is pipelined over cells. Synchronous mini-batch gradient descent is applied for training, where gradients are accumulated across all micro-batches in a mini-batch and applied at the end of a mini-batch." ; skos:prefLabel "GPipe" . :GRIN a skos:Concept ; skos:altLabel "Graph Recurrent Imputation Network" ; skos:definition "" ; skos:prefLabel "GRIN" . :GRLIA a skos:Concept ; dcterms:source ; skos:definition "**GRLIA** is an incident aggregation framework for online service systems based on graph representation learning over the cascading graph of cloud failures. A representation vector is learned for each unique type of incident in an unsupervised and unified manner, which is able to simultaneously encode the topological and temporal correlations among incidents." ; skos:prefLabel "GRLIA" . :GRU a skos:Concept ; dcterms:source ; skos:altLabel "Gated Recurrent Unit" ; skos:definition """A **Gated Recurrent Unit**, or **GRU**, is a type of recurrent neural network. It is similar to an [LSTM](https://paperswithcode.com/method/lstm), but only has two gates - a reset gate and an update gate - and notably lacks an output gate. Fewer parameters means GRUs are generally easier/faster to train than their LSTM counterparts.\r \r Image Source: [here](https://www.google.com/url?sa=i&url=https%3A%2F%2Fcommons.wikimedia.org%2Fwiki%2FFile%3AGated_Recurrent_Unit%2C_type_1.svg&psig=AOvVaw3EmNX8QXC5hvyxeenmJIUn&ust=1590332062671000&source=images&cd=vfe&ved=0CA0QjhxqFwoTCMiev9-eyukCFQAAAAAdAAAAABAR)""" ; skos:prefLabel "GRU" . :GRoIE a skos:Concept ; dcterms:source ; skos:altLabel "Generic RoI Extractor" ; skos:definition """**GRoIE** is an RoI extractor which intends to overcome the limitation of existing extractors, which select only one (the best) layer from the [FPN](https://paperswithcode.com/method/fpn). 
The intuition is that all the layers of FPN retain useful\r information. Therefore, the proposed layer introduces non-local building blocks and attention mechanisms to boost the performance.""" ; skos:prefLabel "GRoIE" . :GShard a skos:Concept ; dcterms:source ; skos:definition "**GShard** is an intra-layer parallel distributed method. It consists of a set of simple APIs for annotations, and a compiler extension in XLA for automatic parallelization." ; skos:prefLabel "GShard" . :GSoP-Net a skos:Concept ; dcterms:source ; skos:altLabel "Global second-order pooling convolutional networks" ; skos:definition """A GSoP block has a squeeze module and an excitation module, and uses second-order pooling to model high-order statistics while gathering global information.\r In the squeeze module, a GSoP block first reduces the number of channels from $c$ to $c'$ ($c' < c$) using a $1 \\times 1$ convolution, then computes a $c' \\times c'$ covariance matrix for the different channels to obtain their correlation. Next, row-wise normalization is performed on the covariance matrix. Each $(i, j)$ in the normalized covariance matrix explicitly relates channel $i$ to channel $j$. \r \r In the excitation module, a GSoP block performs row-wise convolution to maintain structural information and output a vector. Then a fully-connected layer and a sigmoid function are applied to get a $c$-dimensional attention vector. Finally, it multiplies the input features by the attention vector, as in an SE block. A GSoP block can be formulated as:\r \\begin{align}\r s = F_\\text{gsop}(X, \\theta) & = \\sigma (W \\text{RC}(\\text{Cov}(\\text{Conv}(X))))\r \\end{align}\r \\begin{align}\r Y & = s X\r \\end{align}\r Here, $\\text{Conv}(\\cdot)$ reduces the number of channels,\r $\\text{Cov}(\\cdot)$ computes the covariance matrix and\r $\\text{RC}(\\cdot)$ means row-wise convolution.""" ; skos:prefLabel "GSoP-Net" . 
:GTS a skos:Concept ; dcterms:source ; skos:altLabel "Goal-Driven Tree-Structured Neural Model" ; skos:definition "" ; skos:prefLabel "GTS" . :GTrXL a skos:Concept ; dcterms:source ; skos:altLabel "Gated Transformer-XL" ; skos:definition """**Gated Transformer-XL**, or **GTrXL**, is a [Transformer](https://paperswithcode.com/methods/category/transformers)-based architecture for reinforcement learning. It introduces architectural modifications that improve the stability and learning speed of the original Transformer and XL variant. Changes include:\r \r - Placing the [layer normalization](https://paperswithcode.com/method/layer-normalization) on only the input stream of the submodules. A key benefit to this reordering is that it now enables an identity map from the input of the transformer at the first layer to the output of the transformer after the last layer. This is in contrast to the canonical transformer, where there are a series of layer normalization operations that non-linearly transform the state encoding.\r - Replacing [residual connections](https://paperswithcode.com/method/residual-connection) with gating layers. The authors' experiments found that [GRUs](https://www.paperswithcode.com/method/gru) were the most effective form of gating.""" ; skos:prefLabel "GTrXL" . :GaAN a skos:Concept ; dcterms:source ; skos:altLabel "Gated Attention Networks" ; skos:definition """Gated Attention Networks (GaAN) is a new architecture for learning on graphs. Unlike the traditional multi-head attention mechanism, which equally consumes all attention heads, GaAN uses a convolutional sub-network to control each attention head’s importance.\r \r Image credit: [GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs](https://paperswithcode.com/paper/gaan-gated-attention-networks-for-learning-on)""" ; skos:prefLabel "GaAN" . 
:Galactica a skos:Concept ; dcterms:source ; skos:definition """Galactica is a language model which uses a Transformer architecture in a decoder-only setup with the following modifications:\r \r - It uses GeLU activations on all model sizes\r - It uses a 2048 length context window for all model sizes\r - It does not use biases in any of the dense kernels or layer norms\r - It uses learned positional embeddings for the model\r - A vocabulary of 50k tokens was constructed using BPE. The vocabulary was generated from a randomly selected 2% subset of the training data""" ; skos:prefLabel "Galactica" . :GatedConvolution a skos:Concept ; dcterms:source ; skos:definition "A **Gated Convolution** is a type of temporal [convolution](https://paperswithcode.com/method/convolution) with a gating mechanism. Zero-padding is used to ensure that future context cannot be seen." ; skos:prefLabel "Gated Convolution" . :GatedConvolutionNetwork a skos:Concept ; dcterms:source ; skos:definition "A **Gated Convolutional Network** is a type of language model that combines convolutional networks with a gating mechanism. Zero-padding is used to ensure that future context cannot be seen. Gated convolutional layers can be stacked on top of one another hierarchically. Model predictions are then obtained with an [adaptive softmax](https://paperswithcode.com/method/adaptive-softmax) layer." ; skos:prefLabel "Gated Convolution Network" . :Gather-ExciteNetworks a skos:Concept ; dcterms:source ; skos:definition "A Gather-Excite network combines gathering and excitation operations. In the first step, it aggregates input features over large neighborhoods and models the relationship between different spatial locations. In the second step, it first generates an attention map of the same size as the input feature map, using interpolation. Then each position in the input feature map is scaled by multiplying by the corresponding element in the attention map." ; skos:prefLabel "Gather-Excite Networks" . 
:GaussianAffinity a skos:Concept ; rdfs:seeAlso ; skos:definition """**Gaussian Affinity** is a type of affinity or self-similarity function between two points $\\mathbb{x\\_{i}}$ and $\\mathbb{x\\_{j}}$ that uses a Gaussian function:\r \r $$ f\\left(\\mathbb{x\\_{i}}, \\mathbb{x\\_{j}}\\right) = e^{\\mathbb{x^{T}\\_{i}}\\mathbb{x\\_{j}}} $$\r \r Here $\\mathbb{x^{T}\\_{i}}\\mathbb{x\\_{j}}$ is dot-product similarity.""" ; skos:prefLabel "Gaussian Affinity" . :GaussianProcess a skos:Concept ; skos:definition """**Gaussian Processes** are non-parametric models for approximating functions. They rely upon a measure of similarity between points (the kernel function) to predict the value for an unseen point from training data. The models are fully probabilistic so uncertainty bounds are baked in with the model.\r \r Image Source: Gaussian Processes for Machine Learning, C. E. Rasmussen & C. K. I. Williams""" ; skos:prefLabel "Gaussian Process" . :GeGLU a skos:Concept ; dcterms:source ; skos:definition """**GeGLU** is an activation function which is a variant of [GLU](https://paperswithcode.com/method/glu). The definition is as follows:\r \r $$ \\text{GeGLU}\\left(x, W, V, b, c\\right) = \\text{GELU}\\left(xW + b\\right) \\otimes \\left(xV + c\\right) $$""" ; skos:prefLabel "GeGLU" . :GeneralizedFocalLoss a skos:Concept ; dcterms:source ; skos:definition "**Generalized Focal Loss (GFL)** is a loss function for object detection that combines Quality [Focal Loss](https://paperswithcode.com/method/focal-loss) and Distribution Focal Loss into a general form." ; skos:prefLabel "Generalized Focal Loss" . :GeneralizedMeanPooling a skos:Concept ; skos:definition """**Generalized Mean Pooling (GeM)** computes the generalized mean of each channel in a tensor. Formally:\r \r $$ \\textbf{e} = \\left[\\left(\\frac{1}{|\\Omega|}\\sum\\_{u\\in{\\Omega}}x^{p}\\_{cu}\\right)^{\\frac{1}{p}}\\right]\\_{c=1,\\cdots,C} $$\r \r where $p > 0$ is a parameter. 
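The formula above can be sketched in a few lines of NumPy (a minimal illustration; the function name and the clamp for numerical safety are ours, and features are assumed non-negative, e.g. post-ReLU):

```python
import numpy as np

def gem_pool(x: np.ndarray, p: float = 3.0) -> np.ndarray:
    """Generalized mean pooling of a (C, H, W) feature tensor over its
    spatial positions, returning one pooled value per channel."""
    eps = 1e-6
    flat = np.clip(x, eps, None).reshape(x.shape[0], -1)  # (C, H*W)
    return (flat ** p).mean(axis=1) ** (1.0 / p)
```

Calling `gem_pool(features, p=1.0)` matches plain per-channel average pooling.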
Setting this exponent as $p > 1$ increases the contrast of the pooled feature map and focuses on the salient features of the image. GeM is a generalization of the [average pooling](https://paperswithcode.com/method/average-pooling) commonly used in classification networks ($p = 1$) and of the spatial max-pooling layer ($p = \\infty$).\r \r Source: [MultiGrain](https://paperswithcode.com/method/multigrain)\r \r Image Source: [Eva Mohedano](https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.slideshare.net%2Fxavigiro%2Fd1l5-contentbased-image-retrieval-upc-2018-deep-learning-for-computer-vision&psig=AOvVaw2-9Hx23FNGFDe4GHU22Oo5&ust=1591798200590000&source=images&cd=vfe&ved=0CA0QjhxqFwoTCOiP-9P09OkCFQAAAAAdAAAAABAD)""" ; skos:prefLabel "Generalized Mean Pooling" . :GeniePath a skos:Concept ; dcterms:source ; skos:definition """GeniePath is a scalable approach for learning adaptive receptive fields of neural networks defined on permutation invariant graph data. In GeniePath, we propose an adaptive path layer consisting of two complementary functions designed for breadth and depth exploration respectively, where the former learns the importance of different sized neighborhoods, while the latter extracts and filters signals aggregated from neighbors of different hops away.\r \r Description and image from: [GeniePath: Graph Neural Networks with Adaptive Receptive Paths](https://arxiv.org/pdf/1802.00910.pdf)""" ; skos:prefLabel "GeniePath" . :GhostBottleneck a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **Ghost BottleNeck** is a skip connection block, similar to the basic [residual block](https://paperswithcode.com/method/residual-block) in [ResNet](https://paperswithcode.com/method/resnet) in which several convolutional layers and shortcuts are integrated, but stacks [Ghost Modules](https://paperswithcode.com/method/ghost-module) instead (two stacked Ghost modules). 
It was proposed as part of the [GhostNet](https://paperswithcode.com/method/ghostnet) CNN architecture.\r \r The first Ghost module acts as an expansion layer increasing the number of channels. The ratio between the number of the output channels and that of the input is referred to as the *expansion ratio*. The second Ghost module reduces the number of channels to match the shortcut path. Then the shortcut is connected between the inputs and the outputs of these two Ghost modules. The [batch normalization](https://paperswithcode.com/method/batch-normalization) (BN) and [ReLU](https://paperswithcode.com/method/relu) nonlinearity are applied after each layer, except that ReLU is not used after the second Ghost module as suggested by [MobileNetV2](https://paperswithcode.com/method/mobilenetv2). The Ghost bottleneck described above is for stride=1. As for the case where stride=2, the shortcut path is implemented by a downsampling layer and a [depthwise convolution](https://paperswithcode.com/method/depthwise-convolution) with stride=2 is inserted between the two Ghost modules. In practice, the primary [convolution](https://paperswithcode.com/method/convolution) in Ghost module here is [pointwise convolution](https://paperswithcode.com/method/pointwise-convolution) for its efficiency.""" ; skos:prefLabel "Ghost Bottleneck" . :GhostModule a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **Ghost Module** is an image block for convolutional neural network that aims to generate more features by using fewer parameters. Specifically, an ordinary convolutional layer in deep neural networks is split into two parts. The first part involves ordinary convolutions but their total number is controlled. Given the intrinsic feature maps from the first part, a series of simple linear operations are applied for generating more feature maps. 
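The two-part split can be sketched in plain Python (an illustrative toy only: the elementwise scaling below stands in for the cheap depthwise linear kernels, and `intrinsic_maps` is assumed to be a list of flattened feature maps from the primary convolution):

```python
# Hypothetical Ghost-module sketch: keep each intrinsic map (identity) and
# derive s-1 extra "ghost" maps per intrinsic map via a cheap transform,
# so m intrinsic maps yield n = m * s output maps.
def ghost_module(intrinsic_maps, s):
    outputs = []
    for y in intrinsic_maps:
        outputs.append(y)  # identity preserves the intrinsic map
        for j in range(1, s):
            outputs.append([v * 0.5 ** j for v in y])  # toy cheap operation
    return outputs

ghost_module([[1.0, 2.0], [3.0, 4.0]], s=3)  # m=2, s=3 -> 6 feature maps
```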
\r \r Given the widely existing redundancy in intermediate feature maps calculated by mainstream CNNs, ghost modules aim to reduce this redundancy. In practice, given the input data $X\\in\\mathbb{R}^{c\\times h\\times w}$, where $c$ is the number of input channels and $h$ and $w$ are the height and width of the input data, respectively, the operation of an arbitrary convolutional layer for producing $n$ feature maps can be formulated as\r \r $$\r Y = X*f+b,\r $$\r \r where $*$ is the [convolution](https://paperswithcode.com/method/convolution) operation, $b$ is the bias term, $Y\\in\\mathbb{R}^{h'\\times w'\\times n}$ is the output feature map with $n$ channels, and $f\\in\\mathbb{R}^{c\\times k\\times k \\times n}$ are the convolution filters in this layer. In addition, $h'$ and $w'$ are the height and width of the output data, and $k\\times k$ is the kernel size of the convolution filters $f$. During this convolution procedure, the required number of FLOPs can be calculated as $n\\cdot h'\\cdot w'\\cdot c\\cdot k\\cdot k$, which is often as large as hundreds of thousands since the number of filters $n$ and the channel number $c$ are generally very large (e.g. 256 or 512).\r \r Here, the number of parameters (in $f$ and $b$) to be optimized is explicitly determined by the dimensions of the input and output feature maps. The output feature maps of convolutional layers often contain much redundancy, and some of them could be similar to each other. It is unnecessary to generate these redundant feature maps one by one with a large number of FLOPs and parameters. Suppose instead that the output feature maps are *ghosts* of a handful of intrinsic feature maps obtained with some cheap transformations. These intrinsic feature maps are often of smaller size and produced by ordinary convolution filters.
Specifically, $m$ intrinsic feature maps $Y'\\in\\mathbb{R}^{h'\\times w'\\times m}$ are generated using a primary convolution:\r \r $$\r Y' = X*f',\r $$\r \r where $f'\\in\\mathbb{R}^{c\\times k\\times k \\times m}$ are the utilized filters, $m\\leq n$, and the bias term is omitted for simplicity. The hyper-parameters, such as filter size, stride and padding, are the same as those in the ordinary convolution to keep the spatial size (i.e. $h'$ and $w'$) of the output feature maps consistent. To further obtain the desired $n$ feature maps, we apply a series of cheap linear operations on each intrinsic feature in $Y'$ to generate $s$ ghost features according to the following function:\r \r $$\r y\\_{ij} = \\Phi\\_{i,j}(y'\\_i),\\quad \\forall\\; i = 1,...,m,\\;\\; j = 1,...,s,\r $$\r \r where $y'\\_i$ is the $i$-th intrinsic feature map in $Y'$ and $\\Phi\\_{i,j}$ is the $j$-th linear operation for generating the $j$-th ghost feature map $y\\_{ij}$; that is to say, $y'\\_i$ can have one or more ghost feature maps $\\{y\\_{ij}\\}\\_{j=1}^{s}$. The last operation $\\Phi\\_{i,s}$ is the identity mapping, which preserves the intrinsic feature maps. We can thus obtain $n=m\\cdot s$ feature maps $Y=[y\\_{11},y\\_{12},\\cdots,y\\_{ms}]$ as the output data of a Ghost module. Note that the linear operations $\\Phi$ operate on each channel, and their computational cost is much less than that of the ordinary convolution. In practice, there could be several different linear operations in a Ghost module, e.g. $3\\times 3$ and $5\\times5$ linear kernels, which are analyzed in the experiments of the paper.""" ; skos:prefLabel "Ghost Module" . :GhostNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **GhostNet** is a type of convolutional neural network that is built using Ghost modules, which aim to generate more features by using fewer parameters (allowing for greater efficiency).
\r \r GhostNet mainly consists of a stack of Ghost bottlenecks with the Ghost modules as the building block. The first layer is a standard convolutional layer with 16 filters, then a series of Ghost bottlenecks with gradually increased channels follows. These Ghost bottlenecks are grouped into different stages according to the sizes of their input feature maps. All the Ghost bottlenecks are applied with stride=1 except that the last one in each stage is with stride=2. Finally, a [global average pooling](https://paperswithcode.com/method/global-average-pooling) and a convolutional layer are utilized to transform the feature maps to a 1280-dimensional feature vector for final classification. The squeeze-and-excite (SE) module is also applied to the residual layer in some Ghost bottlenecks. \r \r In contrast to [MobileNetV3](https://paperswithcode.com/method/mobilenetv3), GhostNet does not use the [hard-swish](https://paperswithcode.com/method/hard-swish) nonlinearity due to its large latency.""" ; skos:prefLabel "GhostNet" . :GloVe a skos:Concept ; dcterms:source ; skos:altLabel "GloVe Embeddings" ; skos:definition """**GloVe Embeddings** are a type of word embedding that encode the co-occurrence probability ratio between two words as vector differences. GloVe uses a weighted least squares objective $J$ that minimizes the difference between the dot product of the vectors of two words and the logarithm of their number of co-occurrences:\r \r $$ J=\\sum\\_{i, j=1}^{V}f\\left(X\\_{ij}\\right)(w^{T}\\_{i}\\tilde{w}\\_{j} + b\\_{i} + \\tilde{b}\\_{j} - \\log{X\\_{ij}})^{2} $$\r \r where $w\\_{i}$ and $b\\_{i}$ are the word vector and bias respectively of word $i$, $\\tilde{w}\\_{j}$ and $\\tilde{b}\\_{j}$ are the context word vector and bias respectively of word $j$, $X\\_{ij}$ is the number of times word $i$ occurs in the context of word $j$, and $f$ is a weighting function that assigns lower weights to rare and frequent co-occurrences.""" ; skos:prefLabel "GloVe" . 
:Global-LocalAttention a skos:Concept ; dcterms:source ; skos:definition "**Global-Local Attention** is a type of attention mechanism used in the [ETC](https://paperswithcode.com/method/etc) architecture. ETC receives two separate input sequences: the global input $x^{g} = (x^{g}\\_{1}, \\dots, x^{g}\\_{n\\_{g}})$ and the long input $x^{l} = (x^{l}\\_{1}, \\dots x^{l}\\_{n\\_{l}})$. Typically, the long input contains the input a [standard Transformer](https://paperswithcode.com/method/transformer) would receive, while the global input contains a much smaller number of auxiliary tokens ($n\\_{g} \\ll n\\_{l}$). Attention is then split into four separate pieces: global-to-global (g2g), global-to-long (g2l), long-to-global (l2g), and long-to-long (l2l). Attention in the l2l piece (the most computationally expensive piece) is restricted to a fixed radius $r \\ll n\\_{l}$. To compensate for this limited attention span, the tokens in the global input have unrestricted attention, and thus long input tokens can transfer information to each other through global input tokens. Accordingly, the g2g, g2l, and l2g pieces of attention are unrestricted." ; skos:prefLabel "Global-Local Attention" . :GlobalAveragePooling a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Global Average Pooling** is a pooling operation designed to replace fully connected layers in classical CNNs. The idea is to generate one feature map for each corresponding category of the classification task in the last mlpconv layer. Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the [softmax](https://paperswithcode.com/method/softmax) layer. 
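As a minimal sketch (plain Python, hypothetical helper; each channel is given as a flattened feature map), the operation is just a per-channel spatial mean:

```python
# Hypothetical sketch of global average pooling: one scalar per channel.
def global_average_pool(feature_maps):
    return [sum(channel) / len(channel) for channel in feature_maps]

global_average_pool([[1.0, 3.0], [2.0, 6.0]])  # two channels -> two scalars
```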
\r \r One advantage of global [average pooling](https://paperswithcode.com/method/average-pooling) over the fully connected layers is that it is more native to the [convolution](https://paperswithcode.com/method/convolution) structure by enforcing correspondences between feature maps and categories. Thus the feature maps can be easily interpreted as categories confidence maps. Another advantage is that there is no parameter to optimize in the global average pooling thus overfitting is avoided at this layer. Furthermore, global average pooling sums out the spatial information, thus it is more robust to spatial translations of the input.""" ; skos:prefLabel "Global Average Pooling" . :GlobalContextBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **Global Context Block** is an image model block for global context modeling. The aim is to have both the benefits of the simplified [non-local block](https://paperswithcode.com/method/non-local-block) with effective modeling of long-range dependencies, and the [squeeze-excitation block](https://paperswithcode.com/method/squeeze-and-excitation-block) with lightweight computation. \r \r In the Global Context framework, we have (a) global attention pooling, which adopts a [1x1 convolution](https://paperswithcode.com/method/1x1-convolution) $W_{k}$ and [softmax](https://paperswithcode.com/method/softmax) function to obtain the attention weights, and then performs the attention pooling to obtain the global context features, (b) feature transform via a 1x1 [convolution](https://paperswithcode.com/method/convolution) $W\\_{v}$; (c) feature aggregation, which employs addition to aggregate the global context features to the features of each position. Taken as a whole, the GC block is proposed as a lightweight way to achieve global context modeling.""" ; skos:prefLabel "Global Context Block" . 
:GlobalConvolutionalNetwork a skos:Concept ; dcterms:source ; skos:definition """A **Global Convolutional Network**, or **GCN**, is a semantic segmentation building block that utilizes a large kernel to help perform classification and localization tasks simultaneously. It can be used in an [FCN](https://paperswithcode.com/method/fcn)-like structure, where the [GCN](https://paperswithcode.com/method/gcn) is used to generate semantic score maps. Instead of directly using larger kernels or global [convolution](https://paperswithcode.com/method/convolution), the GCN module employs a combination of $1 \\times k + k \\times 1$ and $k \\times 1 + 1 \\times k$ convolutions, which enables [dense connections](https://paperswithcode.com/method/dense-connections) within a large\r $k\\times{k}$ region in the feature map.""" ; skos:prefLabel "Global Convolutional Network" . :GlobalLocalAttentionModule a skos:Concept ; dcterms:source ; skos:definition """The Global Local Attention Module (GLAM) is an image model block that attends to the feature map's channels and spatial dimensions locally, and also attends to the feature map's channels and spatial dimensions globally. The locally attended feature maps, globally attended feature maps, and the original feature maps are then fused through a weighted sum (with learnable weights) to obtain the final feature map.\r \r Paper:\r \r Song, C. H., Han, H. J., & Avrithis, Y. (2022). All the attention you need: Global-local, spatial-channel attention for image retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 2754-2763).""" ; skos:prefLabel "Global Local Attention Module" . :GlobalSub-SampledAttention a skos:Concept ; dcterms:source ; skos:definition """**Global Sub-Sampled Attention**, or **GSA**, is an [attention mechanism](https://paperswithcode.com/methods/category/attention-mechanisms-1) used in the [Twins-SVT](https://paperswithcode.com/method/twins-svt) architecture.
\r \r A single representative is used to summarize the key information for each of the $m \\times n$ sub-windows, and the representative is used to communicate with other sub-windows (serving as the key in self-attention), which can reduce the cost to $\\mathcal{O}(m n H W d)=\\mathcal{O}\\left(\\frac{H^{2} W^{2} d}{k\\_{1} k\\_{2}}\\right)$. This is essentially equivalent to using the sub-sampled feature maps as the key in attention operations, and thus it is termed global sub-sampled attention (GSA). \r \r If we alternatively use [LSA](https://paperswithcode.com/method/locally-grouped-self-attention) and GSA, analogous to [separable convolutions](https://paperswithcode.com/method/depthwise-separable-convolution) (depth-wise + point-wise), the total computation cost is $\\mathcal{O}\\left(\\frac{H^{2} W^{2} d}{k\\_{1} k\\_{2}}+k\\_{1} k\\_{2} H W d\\right)$, and we have:\r \r $$\\frac{H^{2} W^{2} d}{k\\_{1} k\\_{2}}+k\\_{1} k\\_{2} H W d \\geq 2 H W d \\sqrt{H W} $$ \r \r The minimum is obtained when $k\\_{1} \\cdot k\\_{2}=\\sqrt{H W}$. Note that $H=W=224$ is popular in classification. Without loss of generality, square sub-windows are used, i.e., $k\\_{1}=k\\_{2}$. Therefore, $k\\_{1}=k\\_{2}=15$ is close to the global minimum for $H=W=224$. However, the network is designed to include several stages with variable resolutions. Stage 1 has feature maps of $56 \\times 56$, so the minimum is obtained when $k\\_{1}=k\\_{2}=\\sqrt{56} \\approx 7$. Theoretically, we can calibrate optimal $k\\_{1}$ and $k\\_{2}$ for each of the stages. For simplicity, $k\\_{1}=k\\_{2}=7$ is used everywhere. As for stages with lower resolutions, the summarizing window size of GSA is controlled to avoid generating too few keys. Specifically, sizes of 4, 2 and 1 are used for the last three stages, respectively.""" ; skos:prefLabel "Global Sub-Sampled Attention" . 
:GlobalandSlidingWindowAttention a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Global and Sliding Window Attention** is an attention pattern for attention-based models. It is motivated by the fact that non-sparse attention in the original [Transformer](https://paperswithcode.com/method/transformer) formulation has a [self-attention component](https://paperswithcode.com/method/scaled) with $O\\left(n^{2}\\right)$ time and memory complexity, where $n$ is the input sequence length, and thus does not scale efficiently to long inputs. \r \r Since [windowed](https://paperswithcode.com/method/sliding-window-attention) and [dilated](https://paperswithcode.com/method/dilated-sliding-window-attention) attention patterns are not flexible enough to learn task-specific representations, the authors of the [Longformer](https://paperswithcode.com/method/longformer) add “global attention” at a few pre-selected input locations. This attention operation is symmetric: a token with global attention attends to all tokens across the sequence, and all tokens in the sequence attend to it. The Figure to the right shows an example of sliding window attention with global attention at a few tokens at custom locations. For the example of classification, global attention is used for the [CLS] token, while in the example of Question Answering, global attention is provided on all question tokens.""" ; skos:prefLabel "Global and Sliding Window Attention" . :Glow-TTS a skos:Concept ; dcterms:source ; skos:definition "**Glow-TTS** is a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches for the most probable monotonic alignment between text and the latent representation of speech. The model is directly trained to maximize the log-likelihood of speech with the alignment.
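The dynamic-programming idea can be sketched as follows (a hedged, illustrative implementation rather than the paper's code: `log_probs[i][j]` is an assumed grid of log-likelihoods of text token $i$ generating speech frame $j$, and each frame either stays on the current token or advances to the next):

```python
# Illustrative monotonic alignment search via dynamic programming.
def monotonic_align(log_probs):
    T, F = len(log_probs), len(log_probs[0])  # tokens x frames, F >= T assumed
    NEG = float('-inf')
    Q = [[NEG] * F for _ in range(T)]         # Q[i][j]: best log-prob ending at (i, j)
    Q[0][0] = log_probs[0][0]
    for j in range(1, F):
        for i in range(min(T, j + 1)):
            stay = Q[i][j - 1]
            advance = Q[i - 1][j - 1] if i > 0 else NEG
            Q[i][j] = max(stay, advance) + log_probs[i][j]
    i, align = T - 1, [T - 1]                 # backtrack from the last token
    for j in range(F - 1, 0, -1):
        if i > 0 and Q[i - 1][j - 1] >= Q[i][j - 1]:
            i -= 1
        align.append(i)
    align.reverse()
    return align
```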
Enforcing hard monotonic alignments helps enable robust TTS, which generalizes to long utterances, and employing flows enables fast, diverse, and controllable speech synthesis." ; skos:prefLabel "Glow-TTS" . :Go-Explore a skos:Concept ; dcterms:source ; skos:definition """**Go-Explore** is a family of algorithms aiming to tackle two challenges of effective exploration in reinforcement learning: algorithms forgetting how to reach previously visited states ("detachment") and failing to first return to a state before exploring from it ("derailment").\r \r To avoid detachment, Go-Explore builds an archive of the different states it has visited in the environment, thus ensuring that states cannot be forgotten. Starting with an archive containing only the initial state, the archive is built iteratively. In Go-Explore we:\r \r (a) Probabilistically select a state from the archive, preferring states associated with promising cells. \r \r (b) Return to the selected state, such as by restoring simulator state or by running a goal-conditioned policy. \r \r (c) Explore from that state by taking random actions or sampling from a trained policy. \r \r (d) Map every state encountered during returning and exploring to a low-dimensional cell representation. \r \r (e) Add states that map to new cells to the archive and update other archive entries.""" ; skos:prefLabel "Go-Explore" . :GoodFeatureMatching a skos:Concept ; dcterms:source ; skos:definition "**Good Feature Matching** is an active map-to-frame feature matching method. Feature matching effort is tied to submatrix selection, which has combinatorial time complexity and requires choosing a scoring metric. Via simulation, the Max-logDet matrix revealing metric is shown to perform best." ; skos:prefLabel "Good Feature Matching" . 
:GoogLeNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**GoogLeNet** is a type of convolutional neural network based on the [Inception](https://paperswithcode.com/method/inception-module) architecture. It utilises Inception modules, which allow the network to choose between multiple convolutional filter sizes in each block. An Inception network stacks these modules on top of each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid." ; skos:prefLabel "GoogLeNet" . :GraRep a skos:Concept ; dcterms:source ; skos:altLabel "Graph Representation with Global structure" ; skos:definition "" ; skos:prefLabel "GraRep" . :Grab a skos:Concept ; dcterms:source ; skos:definition "**Grab** is a sensor processing system for cashier-free shopping. Grab needs to accurately identify and track customers, and associate each shopper with items he or she retrieves from shelves. To do this, it uses a keypoint-based pose tracker as a building block for identification and tracking, develops robust feature-based face trackers, and algorithms for associating and tracking arm movements. It also uses a probabilistic framework to fuse readings from camera, weight and RFID sensors in order to accurately assess which shopper picks up which item." ; skos:prefLabel "Grab" . :GradDrop a skos:Concept ; dcterms:source ; skos:altLabel "Gradient Sign Dropout" ; skos:definition """**GradDrop**, or **Gradient Sign Dropout**, is a probabilistic masking procedure which samples gradients at an activation layer based on their level of consistency. It is applied as a layer in any standard network forward pass, usually on the final layer before the prediction head to save on compute overhead and maximize benefits during backpropagation. Below, we develop the GradDrop formalism. 
Throughout, $\\circ$ denotes elementwise multiplication after any necessary tiling operations are completed.\r To implement GradDrop, we first define the Gradient Positive Sign Purity, $\\mathcal{P}$, as\r \r $$\r \\mathcal{P}=\\frac{1}{2}\\left(1+\\frac{\\sum\\_{i} \\nabla L\\_{i}}{\\sum\\_{i}\\left|\\nabla L\\_{i}\\right|}\\right)\r $$\r \r $\\mathcal{P}$ is bounded by $[0,1]$. For multiple gradient values $\\nabla\\_{a} L\\_{i}$ at some scalar $a$, we see that $\\mathcal{P}=0$ if $\\nabla\\_{a} L\\_{i}<0$ $\\forall i$, while $\\mathcal{P}=1$ if $\\nabla\\_{a} L\\_{i}>0$ $\\forall i$. Thus, $\\mathcal{P}$ is a measure of how many positive gradients are present at any given value. We then form a mask for each gradient $\\mathcal{M}\\_{i}$ as follows:\r \r $$\r \\mathcal{M}\\_{i}=\\mathcal{I}[f(\\mathcal{P})>U] \\circ \\mathcal{I}\\left[\\nabla L\\_{i}>0\\right]+\\mathcal{I}[f(\\mathcal{P})<U] \\circ \\mathcal{I}\\left[\\nabla L\\_{i}<0\\right]\r $$\r \r where $\\mathcal{I}$ is the indicator function, $U$ is drawn uniformly from $[0, 1]$, and $f$ is a monotonically increasing function.""" ; skos:prefLabel "GradDrop" . :Gradient-BasedSubwordTokenization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "GBST" ; skos:definition """**GBST**, or **Gradient-based Subword Tokenization Module**, is a soft gradient-based subword tokenization module that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block scoring network. \r \r GBST learns a position-wise soft selection over candidate subword blocks by scoring them with a scoring network. In contrast to prior tokenization-free methods, GBST learns interpretable latent subwords, which enables easy inspection of lexical representations and is more efficient than other byte-based models.""" ; skos:prefLabel "Gradient-Based Subword Tokenization" . :GradientCheckpointing a skos:Concept ; dcterms:source ; skos:definition "**Gradient Checkpointing** is a method used for reducing the memory footprint when training deep neural networks, at the cost of having a small increase in computation time." ; skos:prefLabel "Gradient Checkpointing" . 
:GradientClipping a skos:Concept ; skos:definition """One difficulty that arises with optimization of deep neural networks is that large parameter gradients can lead an [SGD](https://paperswithcode.com/method/sgd) optimizer to update the parameters strongly into a region where the loss function is much greater, effectively undoing much of the work that was needed to get to the current solution.\r \r **Gradient Clipping** clips the size of the gradients to ensure optimization performs more reasonably near sharp areas of the loss surface. It can be performed in a number of ways. One option is to simply clip the parameter gradient element-wise before a parameter update. Another option is to clip the norm $||\\textbf{g}||$ of the gradient $\\textbf{g}$ before a parameter update:\r \r $$\\text{if } ||\\textbf{g}|| > v \\text{ then } \\textbf{g} \\leftarrow \\frac{\\textbf{g}v}{||\\textbf{g}||}$$\r \r where $v$ is a norm threshold.\r \r Source: Deep Learning, Goodfellow et al\r \r Image Source: [Pascanu et al](https://arxiv.org/pdf/1211.5063.pdf)""" ; skos:prefLabel "Gradient Clipping" . :GradientDICE a skos:Concept ; dcterms:source ; skos:definition "**GradientDICE** is a density ratio learning method for estimating the density ratio between the state distribution of the target policy and the sampling distribution in off-policy reinforcement learning. It optimizes a different objective from [GenDICE](https://arxiv.org/abs/2002.09072) by using the Perron-Frobenius theorem and eliminating GenDICE’s use of divergence, such that nonlinearity in parameterization is not necessary for GradientDICE, which is provably convergent under linear function approximation." ; skos:prefLabel "GradientDICE" . 
:GradientNormalization a skos:Concept ; dcterms:source ; skos:definition "**Gradient Normalization** is a normalization method for [Generative Adversarial Networks](https://paperswithcode.com/methods/category/generative-adversarial-networks) to tackle the training instability of generative adversarial networks caused by the sharp gradient space. Unlike existing work such as [gradient penalty](https://paperswithcode.com/method/wgan-gp-loss) and [spectral normalization](https://paperswithcode.com/method/spectral-normalization), the proposed GN only imposes a hard 1-Lipschitz constraint on the discriminator function, which increases the capacity of the network." ; skos:prefLabel "Gradient Normalization" . :GradientSparsification a skos:Concept ; dcterms:source ; skos:definition "**Gradient Sparsification** is a technique for distributed training that sparsifies stochastic gradients to reduce the communication cost, with minor increase in the number of iterations. The key idea behind our sparsification technique is to drop some coordinates of the stochastic gradient and appropriately amplify the remaining coordinates to ensure the unbiasedness of the sparsified stochastic gradient. The sparsification approach can significantly reduce the coding length of the stochastic gradient and only slightly increase the variance of the stochastic gradient." ; skos:prefLabel "Gradient Sparsification" . :GradualSelf-Training a skos:Concept ; dcterms:source ; skos:definition """Gradual self-training is a method for semi-supervised domain adaptation. The goal is to adapt an initial classifier trained on a source domain given only unlabeled data that shifts gradually in distribution towards a target domain. 
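Under toy assumptions (a hypothetical 1-D two-cluster stream and a midpoint-threshold classifier, neither from the paper), the pseudolabel-and-refit loop looks like:

```python
# Illustrative gradual self-training on a 1-D threshold classifier.
def fit_threshold(xs, ys):
    # refit: midpoint between the means of the two pseudolabeled classes
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def gradual_self_train(threshold, domains):
    for xs in domains:  # each unlabeled domain is slightly shifted
        pseudo = [1 if x > threshold else 0 for x in xs]  # pseudolabel
        threshold = fit_threshold(xs, pseudo)             # retrain
    return threshold
```

Starting from a source classifier and shifting both clusters a little per domain, the threshold tracks the shift; a single large jump straight to the target would mislabel many points and break the loop.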
\r \r This comes up for example in applications ranging from sensor networks and self-driving car perception modules to brain-machine interfaces, where machine learning systems must adapt to data distributions that evolve over time.\r \r The gradual self-training algorithm begins with a classifier $w_0$ trained on labeled examples from the source domain (Figure a). For each successive domain $P_t$, the algorithm generates pseudolabels for unlabeled examples from that domain, and then trains a regularized supervised classifier on the pseudolabeled examples. The intuition, visualized in the Figure, is that after a single gradual shift, most examples are pseudolabeled correctly, so self-training learns a good classifier on the shifted data, but the shift from the source to the target can be too large for self-training to correct.""" ; skos:prefLabel "Gradual Self-Training" . a skos:Concept ; dcterms:source ; skos:altLabel "Grammatical evolution and Q-learning" ; skos:definition """This method works as a two-level optimization algorithm.\r The outermost layer uses Grammatical evolution to evolve a grammar to build the agent.\r Then, [Q-learning](https://paperswithcode.com/method/q-learning) is used in the fitness evaluation phase, allowing the agent to perform online learning.""" ; skos:prefLabel "Grammatical evolution + Q-learning" . :Graph2Tree a skos:Concept ; dcterms:source ; skos:altLabel "Graph-to-Tree MWP Solver" ; skos:definition "" ; skos:prefLabel "Graph2Tree" . :GraphContrastiveCoding a skos:Concept ; dcterms:source ; skos:definition "**Graph Contrastive Coding** is a self-supervised graph neural network pre-training framework to capture the universal network topological properties across multiple networks. GCC's pre-training task is designed as subgraph instance discrimination in and across networks and leverages contrastive learning to empower graph neural networks to learn intrinsic and transferable structural representations."
; skos:prefLabel "Graph Contrastive Coding" . :GraphESN a skos:Concept ; skos:altLabel "Graph Echo State Network" ; skos:definition """The **Graph Echo State Network** (**GraphESN**) model is a generalization of the Echo State Network (ESN) approach to graph domains. GraphESNs allow for an efficient approach to Recursive Neural Network (RecNN) modeling, extended to deal with cyclic/acyclic, directed/undirected, labeled graphs. The recurrent reservoir of the network computes a fixed contractive encoding function over graphs and is left untrained after initialization, while a feed-forward readout implements an adaptive linear output function. Contractivity of the state transition function implies a Markovian characterization of state dynamics and stability of the state computation in the presence of cycles. Due to the use of a fixed (untrained) encoding, the model represents both an extremely efficient version of, and a baseline for the performance of, recursive models with trained connections.\r \r Description from: [Graph Echo State Networks](https://ieeexplore.ieee.org/document/5596796)""" ; skos:prefLabel "GraphESN" . :GraphSAGE a skos:Concept ; dcterms:source ; skos:definition """GraphSAGE is a general inductive framework that leverages node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data.\r \r Image from: [Inductive Representation Learning on Large Graphs](https://arxiv.org/pdf/1706.02216v4.pdf)""" ; skos:prefLabel "GraphSAGE" . :GraphSAINT a skos:Concept ; dcterms:source ; skos:altLabel "Graph sampling based inductive learning method" ; skos:definition "Scalable method to train large-scale GNN models via sampling small subgraphs." ; skos:prefLabel "GraphSAINT" . 
:GraphSelf-Attention a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Graph Self-Attention (GSA)** is a self-attention module used in the [BP-Transformer](https://paperswithcode.com/method/bp-transformer) architecture, and is based on the [graph attentional layer](https://paperswithcode.com/method/graph-attentional-layer).\r \r For a given node $u$, we update its representation according to its neighbour nodes, formulated as $\\mathbf{h}\\_{u} \\leftarrow \\text{GSA}\\left(\\mathcal{G}, \\mathbf{h}^{u}\\right)$.\r \r Let $\\mathcal{A}\\left(u\\right)$ denote the set of the neighbour nodes of $u$ in $\\mathcal{G}$; $\\text{GSA}\\left(\\mathcal{G}, \\mathbf{h}^{u}\\right)$ is detailed as follows:\r \r $$ \\mathbf{A}^{u} = \\text{concat}\\left(\\{\\mathbf{h}\\_{v} | v \\in \\mathcal{A}\\left(u\\right)\\}\\right) $$\r \r $$ \\mathbf{Q}^{u}\\_{i} = \\mathbf{h}\\_{u}\\mathbf{W}^{Q}\\_{i},\\mathbf{K}\\_{i}^{u} = \\mathbf{A}^{u}\\mathbf{W}^{K}\\_{i},\\mathbf{V}^{u}\\_{i} = \\mathbf{A}^{u}\\mathbf{W}\\_{i}^{V} $$\r \r $$ \\text{head}^{u}\\_{i} = \\text{softmax}\\left(\\frac{\\mathbf{Q}^{u}\\_{i}\\mathbf{K}\\_{i}^{uT}}{\\sqrt{d}}\\right)\\mathbf{V}\\_{i}^{u} $$\r \r $$ \\text{GSA}\\left(\\mathcal{G}, \\mathbf{h}^{u}\\right) = \\left[\\text{head}^{u}\\_{1}, \\dots, \\text{head}^{u}\\_{h}\\right]\\mathbf{W}^{O}$$\r \r where $d$ is the dimension of $\\mathbf{h}$, and $\\mathbf{W}^{Q}\\_{i}$, $\\mathbf{W}^{K}\\_{i}$ and $\\mathbf{W}^{V}\\_{i}$ are trainable parameters of the $i$-th attention head.""" ; skos:prefLabel "Graph Self-Attention" . 
:GraphTransformer a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """The **Graph Transformer** is a method proposed as a generalization of the [Transformer](https://paperswithcode.com/method/transformer) neural network architecture to arbitrary graphs.\r \r Compared to the original Transformer, the highlights of the presented architecture are:\r \r - The attention mechanism is a function of neighborhood connectivity for each node in the graph. \r - The position encoding is represented by Laplacian eigenvectors, which naturally generalize the sinusoidal positional encodings often used in NLP. \r - The [layer normalization](https://paperswithcode.com/method/layer-normalization) is replaced by a [batch normalization](https://paperswithcode.com/method/batch-normalization) layer. \r - The architecture is extended to have edge representations, which can be critical to tasks with rich information on the edges, or pairwise interactions (such as bond types in molecules, or relationship types in knowledge graphs, etc.).""" ; skos:prefLabel "Graph Transformer" . :Gravity a skos:Concept ; dcterms:source ; skos:definition "Gravity is a kinematic approach to optimization based on gradients." ; skos:prefLabel "Gravity" . :GreedyNAS a skos:Concept ; dcterms:source ; skos:definition """**GreedyNAS** is a one-shot [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method. Previous methods held the assumption that a supernet should give a reasonable ranking over all paths. They thus treat all paths equally and spend much effort training them. However, it is difficult for a single supernet to evaluate paths accurately over such a huge search space (e.g., $7^{21}$). GreedyNAS eases the burden on the supernet by focusing evaluation on potentially-good candidates, which are identified using a surrogate portion of validation data.
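The greedy focus on potentially-good paths can be sketched as a sample-score-filter loop (hypothetical path encoding, pool sizes, and scoring function, for illustration only):

```python
import random

# Illustrative GreedyNAS-style filtering: sample candidate paths, score them
# on a small validation proxy, and keep only the most promising ones.
def greedy_filter(score_fn, num_sampled=10, num_kept=3, seed=0):
    rng = random.Random(seed)
    # a path = one op choice (0..2) per layer, 4 layers here
    paths = [tuple(rng.choice([0, 1, 2]) for _ in range(4)) for _ in range(num_sampled)]
    ranked = sorted(paths, key=score_fn, reverse=True)
    return ranked[:num_kept]  # candidate pool of potentially-good paths

greedy_filter(sum)  # toy proxy score: sum of op indices
```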
\r \r Concretely, during training, GreedyNAS utilizes a multi-path sampling strategy with rejection, and greedily filters the weak paths. The training efficiency is thus boosted since the training space has been greedily shrunk from all paths to those potentially-good ones. An exploration and exploitation policy is adopted by introducing an empirical candidate path pool.""" ; skos:prefLabel "GreedyNAS" . :GreedyNAS-A a skos:Concept ; dcterms:source ; skos:definition "**GreedyNAS-A** is a convolutional neural network discovered using the [GreedyNAS](https://paperswithcode.com/method/greedynas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method. The basic building blocks used are inverted residual blocks (from [MobileNetV2](https://paperswithcode.com/method/mobilenetv2)) and squeeze-and-excitation blocks." ; skos:prefLabel "GreedyNAS-A" . :GreedyNAS-B a skos:Concept ; dcterms:source ; skos:definition "**GreedyNAS-B** is a convolutional neural network discovered using the [GreedyNAS](https://paperswithcode.com/method/greedynas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method. The basic building blocks used are inverted residual blocks (from [MobileNetV2](https://paperswithcode.com/method/mobilenetv2)) and squeeze-and-excitation blocks." ; skos:prefLabel "GreedyNAS-B" . :GreedyNAS-C a skos:Concept ; dcterms:source ; skos:definition "**GreedyNAS-C** is a convolutional neural network discovered using the [GreedyNAS](https://paperswithcode.com/method/greedynas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method. The basic building blocks used are inverted residual blocks (from [MobileNetV2](https://paperswithcode.com/method/mobilenetv2)) and squeeze-and-excitation blocks." ; skos:prefLabel "GreedyNAS-C" . 
:GridMask a skos:Concept ; dcterms:source ; skos:definition """**GridMask** is a data augmentation method that randomly removes some pixels of an input image. Unlike other methods, the region that the algorithm removes is neither a continuous region nor random pixels as in dropout. Instead, the algorithm removes a region with disconnected pixel sets, as shown in the Figure.\r \r We express the setting as\r \r $$\r \\tilde{\\mathbf{x}}=\\mathbf{x} \\times M\r $$\r \r where $\\mathbf{x} \\in R^{H \\times W \\times C}$ represents the input image, $M \\in$ $\\{0,1\\}^{H \\times W}$ is the binary mask that stores pixels to be removed, and $\\tilde{\\mathbf{x}} \\in R^{H \\times W \\times C}$ is the result produced by the algorithm. For the binary mask $M$, if $M_{i, j}=1$ we keep pixel $(i, j)$ in the input image; otherwise we remove it. GridMask is applied after the image normalization operation.\r \r The shape of $M$ looks like a grid, as shown in the Figure. Four numbers $\\left(r, d, \\delta_{x}, \\delta_{y}\\right)$ are used to represent a unique $M$. Every mask is formed by tiling the units. $r$ is the ratio of the shorter gray edge in a unit. $d$ is the length of one unit. $\\delta\\_{x}$ and $\\delta\\_{y}$ are the distances between the first intact unit and the boundary of the image.""" ; skos:prefLabel "GridMask" . :GridR-CNN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Grid R-CNN** is an object detection framework, where the traditional regression\r formulation is replaced by a grid point guided localization mechanism.\r \r Grid R-CNN divides the object bounding box region into grids and employs a fully convolutional network ([FCN](https://paperswithcode.com/method/fcn)) to predict the locations of grid points. Owing to the position sensitive property of the fully convolutional architecture, Grid R-CNN maintains explicit spatial information, and grid point locations can be obtained at the pixel level. 
When a sufficient number of grid points at specified locations are known, the corresponding bounding box is fully determined. Guided by the grid points, Grid R-CNN can determine more accurate object bounding boxes than regression methods, which lack the guidance of explicit spatial information.""" ; skos:prefLabel "Grid R-CNN" . :GridSensitive a skos:Concept ; dcterms:source ; skos:definition """**Grid Sensitive** is a trick for object detection introduced by [YOLOv4](https://paperswithcode.com/method/yolov4). When we decode the coordinates of the bounding box center $x$ and $y$ in the original [YOLOv3](https://paperswithcode.com/method/yolov3), we get them by\r \r $$\r \\begin{aligned}\r &x=s \\cdot\\left(g\\_{x}+\\sigma\\left(p\\_{x}\\right)\\right) \\\\\r &y=s \\cdot\\left(g\\_{y}+\\sigma\\left(p\\_{y}\\right)\\right)\r \\end{aligned}\r $$\r \r where $\\sigma$ is the sigmoid function, $g\\_{x}$ and $g\\_{y}$ are integers and $s$ is a scale factor. Obviously, $x$ and $y$ cannot be exactly equal to $s \\cdot g\\_{x}$ or $s \\cdot\\left(g\\_{x}+1\\right)$. This makes it difficult to predict the centres of bounding boxes that are located exactly on the grid boundary. We can address this problem by changing the equation to\r \r $$\r \\begin{aligned}\r &x=s \\cdot\\left(g\\_{x}+\\alpha \\cdot \\sigma\\left(p\\_{x}\\right)-(\\alpha-1) / 2\\right) \\\\\r &y=s \\cdot\\left(g\\_{y}+\\alpha \\cdot \\sigma\\left(p\\_{y}\\right)-(\\alpha-1) / 2\\right)\r \\end{aligned}\r $$\r \r where $\\alpha$ is a scale hyperparameter slightly larger than $1$. This makes it easier for the model to predict bounding box centers located exactly on the grid boundary. The extra FLOPs introduced by Grid Sensitive are very small and can safely be ignored.""" ; skos:prefLabel "Grid Sensitive" . :Griffin-LimAlgorithm a skos:Concept ; skos:definition """The **Griffin-Lim Algorithm (GLA)** is a phase reconstruction method based on the redundancy of the short-time Fourier transform. 
It promotes the consistency of a spectrogram by iterating two projections, where a spectrogram is said to be consistent when its inter-bin dependency owing to the redundancy of STFT is retained. GLA is based only on the consistency and does not take any prior knowledge about the target signal into account. \r \r This algorithm expects to recover a complex-valued spectrogram, which is consistent and maintains the given amplitude $\\mathbf{A}$, by the following alternative projection procedure:\r \r $$ \\mathbf{X}^{[m+1]} = P\\_{\\mathcal{C}}\\left(P\\_{\\mathcal{A}}\\left(\\mathbf{X}^{[m]}\\right)\\right) $$\r \r where $\\mathbf{X}$ is a complex-valued spectrogram updated through the iteration, $P\\_{\\mathcal{S}}$ is the metric projection onto a set $\\mathcal{S}$, and $m$ is the iteration index. Here, $\\mathcal{C}$ is the set of consistent spectrograms, and $\\mathcal{A}$ is the set of spectrograms whose amplitude is the same as the given one. The metric projections onto these sets $\\mathcal{C}$ and $\\mathcal{A}$ are given by:\r \r $$ P\\_{\\mathcal{C}}(\\mathbf{X}) = \\mathcal{GG}^{†}\\mathbf{X} $$\r $$ P\\_{\\mathcal{A}}(\\mathbf{X}) = \\mathbf{A} \\odot \\mathbf{X} \\oslash |\\mathbf{X}| $$\r \r \r where $\\mathcal{G}$ represents STFT, $\\mathcal{G}^{†}$ is the pseudo inverse of STFT (iSTFT), $\\odot$ and $\\oslash$ are element-wise multiplication and division, respectively, and division by zero is replaced by zero. GLA is obtained as an algorithm for the following optimization problem:\r \r $$ \\min\\_{\\mathbf{X}} || \\mathbf{X} - P\\_{\\mathcal{C}}\\left(\\mathbf{X}\\right) ||^{2}\\_{\\text{Fro}} \\text{ s.t. } \\mathbf{X} \\in \\mathcal{A} $$\r \r where $ || · ||\\_{\\text{Fro}}$ is the Frobenius norm. This equation minimizes the energy of the inconsistent components under the constraint on amplitude which must be equal to the given one. 
Although GLA has been widely utilized because of its simplicity, GLA often involves many iterations until it converges to a certain spectrogram and results in low reconstruction quality. This is because the cost function only requires the consistency, and the characteristics of the target signal are not taken into account.""" ; skos:prefLabel "Griffin-Lim Algorithm" . :GroupDNet a skos:Concept ; dcterms:source ; skos:altLabel "Group Decreasing Network" ; skos:definition """**Group Decreasing Network**, or **GroupDNet**, is a type of convolutional neural network for multi-modal image synthesis. GroupDNet contains one encoder and one decoder. Inspired by the idea of [VAE](https://paperswithcode.com/method/vae) and SPADE, the encoder $E$ produces a\r latent code $Z$ that is supposed to follow a Gaussian distribution $\\mathcal{N}(0,1)$ during training. While testing, the encoder $E$ is discarded. A randomly sampled code from the Gaussian distribution substitutes for $Z$. To fulfill this, the re-parameterization trick is used to enable a differentiable loss function during training. Specifically, the encoder predicts a mean vector and a variance vector through two fully connected layers to represent the encoded distribution. The gap between the encoded distribution and Gaussian distribution can be minimized by imposing a KL-divergence loss.""" ; skos:prefLabel "GroupDNet" . :GroupNormalization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Group Normalization** is a normalization layer that divides channels into groups and normalizes the features within each group. GN does not exploit the batch dimension, and its computation is independent of batch sizes. In the case where the group size is 1, it is equivalent to [Instance Normalization](https://paperswithcode.com/method/instance-normalization).\r \r As motivation for the method, many classical features like SIFT and HOG had *group-wise* features and involved *group-wise normalization*. 
For example, a HOG vector is the outcome of several spatial cells where each cell is represented by a normalized orientation histogram.\r \r Formally, Group Normalization is defined as:\r \r $$ \\mu\\_{i} = \\frac{1}{m}\\sum\\_{k\\in\\mathcal{S}\\_{i}}x\\_{k} $$\r \r $$ \\sigma^{2}\\_{i} = \\frac{1}{m}\\sum\\_{k\\in\\mathcal{S}\\_{i}}\\left(x\\_{k}-\\mu\\_{i}\\right)^{2} $$\r \r $$ \\hat{x}\\_{i} = \\frac{x\\_{i} - \\mu\\_{i}}{\\sqrt{\\sigma^{2}\\_{i}+\\epsilon}} $$\r \r Here $x$ is the feature computed by a layer, $i$ is an index, and $m$ is the size of the set $\\mathcal{S}\\_{i}$. A Group Norm layer computes $\\mu$ and $\\sigma$ over a set $\\mathcal{S}\\_{i}$ defined as:\r \r $$ \\mathcal{S}\\_{i} = \\left\\{k \\mid k\\_{N} = i\\_{N}, \\left\\lfloor\\frac{k\\_{C}}{C/G}\\right\\rfloor = \\left\\lfloor\\frac{i\\_{C}}{C/G}\\right\\rfloor\\right\\} $$\r \r Here $G$ is the number of groups, which is a pre-defined hyper-parameter ($G = 32$ by default). $C/G$ is the number of channels per group. $\\lfloor\\cdot\\rfloor$ is the floor operation, and the final term means that the indexes $i$ and $k$ are in the same group of channels, assuming the channels of each group are stored in sequential order along the $C$ axis.""" ; skos:prefLabel "Group Normalization" . :Grouped-queryattention a skos:Concept ; dcterms:source ; skos:definition "**Grouped-query attention** is an interpolation of multi-query and multi-head attention that achieves quality close to multi-head attention at a speed comparable to multi-query attention." ; skos:prefLabel "Grouped-query attention" . :GroupedConvolution a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "A **Grouped Convolution** uses a group of convolutions - multiple kernels per layer - resulting in multiple channel outputs per layer. This leads to wider networks, helping a network learn a varied set of low-level and high-level features. The original motivation of using Grouped Convolutions in [AlexNet](https://paperswithcode.com/method/alexnet) was to distribute the model over multiple GPUs as an engineering compromise. 
But later, with models such as [ResNeXt](https://paperswithcode.com/method/resnext), it was shown this module could be used to improve classification accuracy: grouped convolutions expose a new dimension, *cardinality* (the size of the set of transformations), and increasing cardinality increases accuracy." ; skos:prefLabel "Grouped Convolution" . :GroupwisePointConvolution a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **Groupwise Point Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) where we apply a [point convolution](https://paperswithcode.com/method/pointwise-convolution) groupwise (using a different set of filters for each group).\r \r Image Credit: [Chi-Feng Wang](https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)""" ; skos:prefLabel "Groupwise Point Convolution" . :GrowNet a skos:Concept ; dcterms:source ; skos:definition "**GrowNet** is an approach that applies gradient boosting to incrementally build complex deep neural networks out of shallow components. It provides a versatile framework that can readily be adapted to a diverse range of machine learning tasks in a wide variety of domains." ; skos:prefLabel "GrowNet" . :GuidedAnchoring a skos:Concept ; dcterms:source ; skos:definition "**Guided Anchoring** is an anchoring scheme for object detection which leverages semantic features to guide the anchoring. The method is motivated by the observation that objects are not distributed evenly over the image. The scale of an object is also closely related to the image content, its location and the geometry of the scene. Following this intuition, the method generates sparse anchors in two steps: first identifying sub-regions that may contain objects and then determining the shapes at different locations." ; skos:prefLabel "Guided Anchoring" . 
:GumbelActivation a skos:Concept ; dcterms:source ; skos:altLabel "Gumbel Cross Entropy" ; skos:definition """The **Gumbel activation function** is defined using the cumulative distribution function of the Gumbel distribution and can be used to perform Gumbel regression. Gumbel activation is an alternative to the sigmoid or softmax activation functions and can be used to transform the unnormalised output of a model into a probability. Gumbel activation $\\eta_{Gumbel}$ is defined as follows:\r \r $\\eta_{Gumbel}(q_i) = \\exp(-\\exp(-q_i))$\r \r It can be combined with the Cross Entropy loss function to solve long-tailed classification problems. Gumbel Cross Entropy (GCE) is defined as follows:\r \r $GCE(\\eta_{Gumbel}(q_i),y_i) = -y_i \\log(\\eta_{Gumbel}(q_i)) - (1-y_i) \\log(1-\\eta_{Gumbel}(q_i))$""" ; skos:prefLabel "Gumbel Activation" . :GumbelSoftmax a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Gumbel-Softmax** is a continuous distribution that has the property that it can be smoothly annealed into a categorical distribution, and whose parameter gradients can be easily computed via the reparameterization trick." ; skos:prefLabel "Gumbel Softmax" . :H-BEMD a skos:Concept ; dcterms:source ; skos:altLabel "Hue — Bi-Dimensional Empirical Mode Decomposition" ; skos:definition "" ; skos:prefLabel "H-BEMD" . :H3DNet a skos:Concept ; rdfs:seeAlso ; skos:definition "Code for paper: H3DNet: 3D Object Detection Using Hybrid Geometric Primitives (ECCV 2020)" ; skos:prefLabel "H3DNet" . :HANet a skos:Concept ; dcterms:source ; skos:altLabel "Height-driven Attention Network" ; skos:definition "**Height-driven Attention Network**, or **HANet**, is a general add-on module for improving semantic segmentation for urban-scene images. It selectively emphasizes informative features or classes according to the vertical position of a pixel. The pixel-wise class distributions are significantly different from each other among horizontally segmented sections in the urban-scene images. 
Likewise, urban-scene images have their own distinct characteristics, but most semantic segmentation networks do not reflect such unique attributes in the architecture. The proposed network architecture incorporates the capability exploiting the attributes to handle the urban scene dataset effectively." ; skos:prefLabel "HANet" . :HAPPIER a skos:Concept ; dcterms:source ; skos:altLabel "Hierarchical Average Precision training for Pertinent ImagE Retrieval" ; skos:definition "" ; skos:prefLabel "HAPPIER" . :HBMP a skos:Concept ; dcterms:source ; skos:altLabel "Hierarchical BiLSTM Max Pooling" ; skos:definition "HBMP is a hierarchy-like structure of [BiLSTM](https://paperswithcode.com/method/bilstm) layers with [max pooling](https://paperswithcode.com/method/max-pooling). All in all, this model improves the previous state of the art for SciTail and achieves strong results for the SNLI and MultiNLI." ; skos:prefLabel "HBMP" . :HDCGAN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "High-resolution Deep Convolutional Generative Adversarial Networks" ; skos:definition """**HDCGAN**, or **High-resolution Deep Convolutional Generative Adversarial Networks**, is a [DCGAN](https://paperswithcode.com/method/dcgan) based architecture that achieves high-resolution image generation through the proper use of [SELU](https://paperswithcode.com/method/selu) activations. Glasses, a mechanism to arbitrarily improve the final [GAN](https://paperswithcode.com/method/gan) generated results by enlarging the input size by a telescope ζ is also set forth. \r \r A video showing the training procedure on CelebA-hq can be found [here](https://youtu.be/1XZB87W0SaY).""" ; skos:prefLabel "HDCGAN" . :HEGCN a skos:Concept ; dcterms:source ; skos:altLabel "Hierarchical Entity Graph Convolutional Network" ; skos:definition """**HEGCN**, or **Hierarchical Entity Graph Convolutional Network** is a model for multi-hop relation extraction across documents. 
Documents in a document chain are encoded using a bi-directional long short-term memory ([BiLSTM](https://paperswithcode.com/method/bilstm)) layer. On top of the BiLSTM layer, two graph convolutional networks ([GCN](https://paperswithcode.com/method/gcn)) are used, one after another in a hierarchy. \r \r In the first level of the GCN hierarchy, a separate entity mention graph is constructed on each document of the chain using all the entities mentioned in that document. Each mention of an entity in a document is considered as a separate node in the graph. A graph convolutional network (GCN) is used to represent the entity mention graph of each document to capture the relations among the entity mentions in the document. A unified entity-level graph is then constructed across all the documents in the chain. Each node of this entity-level graph represents a unique entity in the document chain. Each common entity between two documents in the chain is represented by a single node in the graph. A GCN is used to represent this entity-level graph to capture the relations among the entities across the documents. \r \r The representations of the nodes of the subject entity and object entity are concatenated and passed to a feed-forward layer with [softmax](https://paperswithcode.com/method/softmax) for relation classification.""" ; skos:prefLabel "HEGCN" . :HFPSO a skos:Concept ; rdfs:seeAlso ; skos:altLabel "Hybrid Firefly and Particle Swarm Optimization" ; skos:definition """**Hybrid Firefly and Particle Swarm Optimization (HFPSO)** is a metaheuristic optimization algorithm that combines strong points of firefly and particle swarm optimization. 
HFPSO tries to determine the start of the local search process properly by checking the previous global best fitness values.\r \r [Click Here for the Paper](https://www.sciencedirect.com/science/article/abs/pii/S156849461830084X)\r \r [Codes (MATLAB)](https://www.mathworks.com/matlabcentral/fileexchange/67768-a-hybrid-firefly-and-particle-swarm-optimization-hfpso)""" ; skos:prefLabel "HFPSO" . :HGS a skos:Concept ; rdfs:seeAlso ; skos:altLabel "Hunger Games Search" ; skos:definition """**Hunger Games Search (HGS)** is a general-purpose population-based optimization technique with a simple structure, stable behaviour and competitive performance in solving both constrained and unconstrained problems. HGS is designed according to the hunger-driven activities and behavioural choices of animals. This dynamic, fitness-wise search method follows the simple concept of “Hunger” as the most crucial homeostatic motivation and reason for the behaviours, decisions, and actions in the life of all animals, which makes the process of optimization more understandable and consistent for new users and decision-makers. The Hunger Games Search incorporates the concept of hunger into the search process; in other words, an adaptive weight based on the concept of hunger is designed and employed to simulate the effect of hunger on each search step. It follows the logical rules (games) utilized by almost all animals; these rival activities and games are adaptively evolved, securing higher chances of survival and food acquisition. The method's main features are its dynamic nature, simple structure, and high performance in terms of convergence and quality of solutions, proving more efficient than many current optimization methods. \r \r Implementation of the HGS algorithm is available at [https://aliasgharheidari.com/HGS.html](https://aliasgharheidari.com/HGS.html).""" ; skos:prefLabel "HGS" . 
:HINT a skos:Concept ; dcterms:source ; skos:altLabel "Hierarchical Information Threading" ; skos:definition """**HINT** is an unsupervised approach for identifying Hierarchical Information Threads by analysing the network of related articles in a collection. In particular, HINT leverages article timestamps and the 5W1H questions to identify related articles about an event or discussion. HINT then constructs a network representation of the articles\r and identifies threads as strongly connected hierarchical network communities.""" ; skos:prefLabel "HINT" . :HITNet a skos:Concept ; dcterms:source ; skos:definition """**HITNet** is a framework for neural network based depth estimation which overcomes the computational disadvantages of operating on a 3D volume by integrating image warping, spatial propagation and a fast high resolution initialization step into the network architecture, while keeping the flexibility of a learned representation by allowing features to flow through the network. The main idea of the approach is to represent image tiles as planar patches which have a learned compact feature descriptor attached to them. The basic principle of the approach is to fuse information from the high resolution initialization and the current hypotheses using spatial propagation. The propagation is implemented via a [convolutional neural network](https://paperswithcode.com/methods/category/convolutional-neural-networks) module that updates the estimate of the planar patches and their attached features. \r \r In order for the network to iteratively increase the accuracy of the disparity predictions, the network is provided with a local cost volume in a narrow band (±1 disparity) around the planar patch using in-network image warping, allowing the network to minimize image dissimilarity. To reconstruct fine details while also capturing large texture-less areas, we start at low resolution and hierarchically upsample predictions to higher resolution. 
A critical feature of the architecture is that at each resolution, matches from the initialization module are provided to facilitate recovery of thin structures that cannot be represented at low resolution.""" ; skos:prefLabel "HITNet" . :HMGNN a skos:Concept ; dcterms:source ; skos:altLabel "Heterogeneous Molecular Graph Neural Network" ; skos:definition "As they carry great potential for modeling complex interactions, graph neural network (GNN)-based methods have been widely used to predict quantum mechanical properties of molecules. Most of the existing methods treat molecules as molecular graphs in which atoms are modeled as nodes. They characterize each atom's chemical environment by modeling its pairwise interactions with other atoms in the molecule. Although these methods achieve a great success, limited amount of works explicitly take many-body interactions, i.e., interactions between three and more atoms, into consideration. In this paper, we introduce a novel graph representation of molecules, heterogeneous molecular graph (HMG) in which nodes and edges are of various types, to model many-body interactions. HMGs have the potential to carry complex geometric information. To leverage the rich information stored in HMGs for chemical prediction problems, we build heterogeneous molecular graph neural networks (HMGNN) on the basis of a neural message passing scheme. HMGNN incorporates global molecule representations and an attention mechanism into the prediction process. The predictions of HMGNN are invariant to translation and rotation of atom coordinates, and permutation of atom indices. Our model achieves state-of-the-art performance in 9 out of 12 tasks on the QM9 dataset." ; skos:prefLabel "HMGNN" . :HOC a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "High-Order Consensuses" ; skos:definition "" ; skos:prefLabel "HOC" . 
:HOPE a skos:Concept ; dcterms:source ; skos:altLabel "High-Order Proximity preserved Embedding" ; skos:definition "" ; skos:prefLabel "HOPE" . :HPO a skos:Concept ; dcterms:source ; skos:altLabel "Hyper-parameter optimization" ; skos:definition "In machine learning, a hyperparameter is a parameter whose value is used to control the learning process; **HPO** is the problem of choosing a set of optimal hyperparameters for a learning algorithm." ; skos:prefLabel "HPO" . :HRIpipeline a skos:Concept ; rdfs:seeAlso ; skos:altLabel "Human Robot Interaction Pipeline" ; skos:definition "The pipeline we propose consists of three parts: 1) recognizing the interaction type; 2) detecting the object that the interaction is targeting; and 3) incrementally learning the models from data recorded by the robot sensors. Our main contributions lie in the target object detection, guided by the recognized interaction, and in the incremental object learning. The novelty of our approach is the focus on natural, heterogeneous, and multimodal HRIs to incrementally learn new object models." ; skos:prefLabel "HRI pipeline" . :HRNet a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**HRNet**, or **High-Resolution Net**, is a general purpose convolutional neural network for tasks like semantic segmentation, object detection and image classification. It is able to maintain high resolution representations through the whole process. We start from a high-resolution [convolution](https://paperswithcode.com/method/convolution) stream, gradually add high-to-low resolution convolution streams one by one, and connect the multi-resolution streams in parallel. The resulting network consists of several ($4$ in the paper) stages and\r the $n$-th stage contains $n$ streams corresponding to $n$ resolutions. The authors conduct repeated multi-resolution fusions by exchanging the information across the parallel streams over and over.""" ; skos:prefLabel "HRNet" . 
:HRank a skos:Concept ; dcterms:source ; skos:definition "**HRank** is a filter pruning method that explores the High Rank of the feature map in each layer (HRank). The proposed HRank is inspired by the discovery that the average rank of multiple feature maps generated by a single filter is always the same, regardless of the number of image batches CNNs receive. Based on HRank, the authors develop a method that is mathematically formulated to prune filters with low-rank feature maps." ; skos:prefLabel "HRank" . :HS-ResNet a skos:Concept ; dcterms:source ; skos:definition "**HS-ResNet** is a [convolutional neural network](https://paperswithcode.com/methods/category/convolutional-neural-networks) that employs [Hierarchical-Split Block](https://paperswithcode.com/method/hierarchical-split-block) as its central building block within a [ResNet](https://paperswithcode.com/method/resnet)-like architecture." ; skos:prefLabel "HS-ResNet" . :HTC a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Hybrid Task Cascade" ; skos:definition "**Hybrid Task Cascade**, or **HTC**, is a framework for cascading in instance segmentation. It differs from [Cascade Mask R-CNN](https://paperswithcode.com/method/cascade-mask-r-cnn) in two important aspects: (1) instead of performing cascaded refinement on the two tasks of detection and segmentation separately, it interweaves them for a joint multi-stage processing; (2) it adopts a fully convolutional branch to provide spatial context, which can help distinguishing hard foreground from cluttered background." ; skos:prefLabel "HTC" . :HTCN a skos:Concept ; dcterms:source ; skos:altLabel "Hierarchical Transferability Calibration Network" ; skos:definition "**Hierarchical Transferability Calibration Network** (HTCN) is an adaptive object detector that hierarchically (local-region/image/instance) calibrates the transferability of feature representations for harmonizing transferability and discriminability. 
The proposed model consists of three components: (1) Importance Weighted Adversarial Training with input Interpolation (IWAT-I), which strengthens the global discriminability by re-weighting the interpolated image-level features; (2) Context-aware Instance-Level Alignment (CILA) module, which enhances the local discriminability by capturing the complementary effect between the instance-level feature and the global context information for the instance-level feature alignment; (3) local feature masks that calibrate the local transferability to provide semantic guidance for the following discriminative pattern alignment." ; skos:prefLabel "HTCN" . :HalluciNet a skos:Concept ; dcterms:source ; skos:altLabel "Approximating Spatiotemporal Representations Using a 2DCNN" ; skos:definition "Approximating Spatiotemporal Representations Using a 2DCNN" ; skos:prefLabel "HalluciNet" . :HaloNet a skos:Concept ; dcterms:source ; skos:definition "A **HaloNet** is a self-attention based model for efficient image classification. It relies on a local self-attention architecture that efficiently maps to existing hardware with haloing. The formulation breaks translational equivariance, but the authors observe that it improves throughput and accuracies over the centered local self-attention used in regular self-attention. The approach also utilises a strided self-attentive downsampling operation for multi-scale feature extraction." ; skos:prefLabel "HaloNet" . :Hamburger a skos:Concept ; dcterms:source ; skos:definition "**Hamburger** is a global context module that employs matrix decomposition to factorize the learned representation into sub-matrices so as to recover the clean low-rank signal subspace. The key idea is, if we formulate the inductive bias like the global context into an objective function, the optimization algorithm to minimize the objective function can construct a computational graph, i.e., the architecture we need in the networks." ; skos:prefLabel "Hamburger" . 
:HardELiSH a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**HardELiSH** is an activation function for neural networks. The HardELiSH is a multiplication of the [HardSigmoid](https://paperswithcode.com/method/hard-sigmoid) and [ELU](https://paperswithcode.com/method/elu) in the negative part and a multiplication of the Linear and the HardSigmoid in the positive\r part:\r \r $$f\\left(x\\right) = x\\max\\left(0, \\min\\left(1, \\left(\\frac{x+1}{2}\\right)\\right) \\right) \\text{ if } x \\geq 0$$\r $$f\\left(x\\right) = \\left(e^{x}-1\\right)\\max\\left(0, \\min\\left(1, \\left(\\frac{x+1}{2}\\right)\\right)\\right) \\text{ if } x < 0 $$\r \r Source: [Activation Functions](https://arxiv.org/pdf/1811.03378.pdf)""" ; skos:prefLabel "HardELiSH" . :HardSigmoid a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """The **Hard Sigmoid** is an activation function used for neural networks of the form:\r \r $$f\\left(x\\right) = \\max\\left(0, \\min\\left(1,\\frac{\\left(x+1\\right)}{2}\\right)\\right)$$\r \r Image Source: [Rinat Maksutov](https://towardsdatascience.com/deep-study-of-a-not-very-deep-neural-network-part-2-activation-functions-fd9bd8d406fc)""" ; skos:prefLabel "Hard Sigmoid" . :HardSwish a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Hard Swish** is a type of activation function based on [Swish](https://paperswithcode.com/method/swish), but replaces the computationally expensive sigmoid with a piecewise linear analogue:\r \r $$\\text{h-swish}\\left(x\\right) = x\\frac{\\text{ReLU6}\\left(x+3\\right)}{6} $$""" ; skos:prefLabel "Hard Swish" . 
:HardtanhActivation a skos:Concept ; skos:definition """**Hardtanh** is an activation function used for neural networks:\r \r $$ f\\left(x\\right) = -1 \\text{ if } x < - 1 $$\r $$ f\\left(x\\right) = x \\text{ if } -1 \\leq x \\leq 1 $$\r $$ f\\left(x\\right) = 1 \\text{ if } x > 1 $$\r \r It is a cheaper and more computationally efficient version of the [tanh activation](https://paperswithcode.com/method/tanh-activation).\r \r Image Source: [Zhuan Lan](https://zhuanlan.zhihu.com/p/30385380)""" ; skos:prefLabel "Hardtanh Activation" . :Harm-Net a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "A **Harmonic Network**, or **Harm-Net**, is a type of convolutional neural network that replaces convolutional layers with \"harmonic blocks\" that use [Discrete Cosine Transform](https://paperswithcode.com/method/discrete-cosine-transform) (DCT) filters. These blocks can be useful in truncating high-frequency information (possible due to the redundancies in the spectral domain)." ; skos:prefLabel "Harm-Net" . :HarmonicBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **Harmonic Block** is an image model component that utilizes [Discrete Cosine Transform](https://paperswithcode.com/method/discrete-cosine-transform) (DCT) filters. Convolutional neural networks (CNNs) learn filters in order to capture local correlation patterns in feature space. In contrast, DCT has preset spectral filters, which can be better for compressing information (due to the presence of redundancy in the spectral domain).\r \r DCT has been successfully used for JPEG encoding to transform image blocks into spectral representations to capture the most information with a small number of coefficients. Harmonic blocks learn how to optimally combine spectral coefficients at every layer to produce a fixed size representation defined as a weighted sum of responses to DCT filters. 
The use of DCT filters also makes it possible to address the task of model compression.""" ; skos:prefLabel "Harmonic Block" . :HarrisHawksoptimization\(HHO\) a skos:Concept ; rdfs:seeAlso ; skos:altLabel "Harris Hawks optimization" ; skos:definition """[HHO](https://aliasgharheidari.com/HHO.html) is a popular swarm-based, gradient-free optimization algorithm with several active and time-varying phases of exploration and exploitation. The algorithm was first published in the journal Future Generation Computer Systems (FGCS) in 2019 and has since gained increasing attention among researchers due to its flexible structure, high performance, and high-quality results. The main logic of the HHO method is based on the cooperative behaviour and chasing style of Harris' hawks in nature, called the "surprise pounce". There are many suggestions about how to enhance the functionality of HHO, and several enhanced variants of HHO have appeared in Elsevier and IEEE journals.\r \r From the algorithmic behaviour viewpoint, HHO has several effective features:\r The escaping energy parameter has a dynamic, randomized, time-varying nature, which can further improve and harmonize the exploratory and exploitative patterns of HHO. 
This factor also helps HHO make a smooth transition between exploration and exploitation.\r Different exploration mechanisms with respect to the average location of hawks can increase the exploratory trends of HHO throughout the initial iterations.\r Diverse Lévy-flight (LF) based patterns with short-length jumps enrich the exploitative behaviours of HHO when directing a local search.\r The progressive selection scheme helps search agents progressively advance their position and only select a better position, which can improve the quality of solutions and the intensification powers of HHO throughout the optimization procedure.\r HHO evaluates a series of searching strategies and then selects the best movement step. This feature also has a constructive influence on the exploitation inclinations of HHO.\r The randomized jump strength can assist candidate solutions in harmonising the exploration and exploitation leanings.\r The application of adaptive and time-varying components allows HHO to handle difficulties of a feature space, including local optimal solutions, multi-modality, and deceptive optima.\r \r 🔗 The source code of HHO is publicly available at https://aliasgharheidari.com/HHO.html""" ; skos:prefLabel "Harris Hawks optimization (HHO)" . :Heatmap a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "Heatmap" . :HermiteActivation a skos:Concept ; skos:altLabel "Hermite Polynomial Activation" ; skos:definition """A **Hermite Activation** is a type of activation function which uses a smooth, finite Hermite polynomial basis as a substitute for non-smooth [ReLUs](https://paperswithcode.com/method/relu). \r \r Relevant Paper: [Lokhande et al](https://arxiv.org/pdf/1909.05479.pdf)""" ; skos:prefLabel "Hermite Activation" . :Herring a skos:Concept ; skos:definition "**Herring** is a parameter server based distributed training method. 
It combines AWS's Elastic Fabric [Adapter](https://paperswithcode.com/method/adapter) (EFA) with a novel parameter sharding technique that makes better use of the available network bandwidth. Herring uses EFA and a balanced fusion buffer to optimally use the total bandwidth available across all nodes in the cluster. Herring reduces gradients hierarchically, reducing them inside the node first and then reducing across nodes. This enables more efficient use of PCIe bandwidth in the node and helps keep the gradient-averaging burden on the GPU low." ; skos:prefLabel "Herring" . :HetPipe a skos:Concept ; skos:definition "**HetPipe** is a hybrid parallel method that integrates pipelined model parallelism (PMP) with data parallelism (DP). In HetPipe, a group of multiple GPUs, called a virtual worker, processes minibatches in a pipelined manner, and multiple such virtual workers employ data parallelism for higher performance." ; skos:prefLabel "HetPipe" . :Hi-LANDER a skos:Concept ; dcterms:source ; skos:definition "**Hi-LANDER** is a hierarchical [graph neural network](https://paperswithcode.com/methods/category/graph-models) (GNN) model that learns how to cluster a set of images into an unknown number of identities using images annotated with labels belonging to a disjoint set of identities. The hierarchical GNN uses an approach to merge connected components predicted at each level of the hierarchy to form a new graph at the next level. Unlike fully unsupervised hierarchical clustering, the choice of grouping and complexity criteria stems naturally from supervision in the training set." ; skos:prefLabel "Hi-LANDER" . :HiFi-GAN a skos:Concept ; dcterms:source ; skos:definition """**HiFi-GAN** is a generative adversarial network for speech synthesis. HiFi-GAN consists of one generator and two discriminators: multi-scale and multi-period discriminators. 
The generator and discriminators are trained adversarially, along with two additional losses for improving training stability and model performance.\r \r The generator is a fully convolutional neural network. It uses a mel-spectrogram as input and upsamples it through transposed convolutions until the length of the output sequence matches the temporal resolution of raw waveforms. Every [transposed convolution](https://paperswithcode.com/method/transposed-convolution) is followed by a multi-receptive field fusion (MRF) module.\r \r For the discriminator, a multi-period discriminator (MPD) is used consisting of several sub-discriminators each handling a portion of periodic signals of input audio. Additionally, to capture consecutive patterns and long-term dependencies, the multi-scale discriminator (MSD) proposed in [MelGAN](https://paperswithcode.com/method/melgan) is used, which consecutively evaluates audio samples at different levels.""" ; skos:prefLabel "HiFi-GAN" . :HiSD a skos:Concept ; dcterms:source ; skos:altLabel "Hierarchical Style Disentanglement" ; skos:definition "**Hierarchical Style Disentanglement**, or **HiSD**, aims to disentangle different styles in image-to-image translation models. It organizes the labels into a hierarchical structure, where independent tags, exclusive attributes, and disentangled styles are allocated from top to bottom. To make the styles identified to the tags and attributes, the authors carefully redesign the modules, phases, and objectives." ; skos:prefLabel "HiSD" . :Hierarchical-SplitBlock a skos:Concept ; dcterms:source ; skos:definition """**Hierarchical-Split Block** is a representational block for multi-scale feature representations. It contains many hierarchical split and concatenate connections within one single [residual block](https://paperswithcode.com/methods/category/skip-connection-blocks). \r \r Specifically, ordinary feature maps in deep neural networks are split into $s$ groups, each with $w$ channels. 
As shown in the Figure, only the first group of filters is connected directly to the next layer. The second group of feature maps is sent to a convolution of $3 \\times 3$ filters to extract features first; then the output feature maps are split into two sub-groups in the channel dimension. One sub-group of feature maps is connected directly to the next layer, while the other sub-group is concatenated with the next group of input feature maps in the channel dimension. The concatenated feature maps are processed by a set of $3 \\times 3$ convolutional filters. This process repeats several times until the rest of the input feature maps are processed. Finally, feature maps from all input groups are concatenated and sent to another layer of $1 \\times 1$ filters to rebuild the features.""" ; skos:prefLabel "Hierarchical-Split Block" . :HierarchicalFeatureFusion a skos:Concept ; dcterms:source ; skos:definition "**Hierarchical Feature Fusion (HFF)** is a feature fusion method employed in [ESP](https://paperswithcode.com/method/esp) and [EESP](https://paperswithcode.com/method/eesp) image model blocks for degridding. In the ESP module, concatenating the outputs of dilated convolutions gives the ESP module a large effective receptive field, but it introduces unwanted checkerboard or gridding artifacts. To address the gridding artifact in ESP, the feature maps obtained using kernels of different dilation rates are hierarchically added before concatenating them (HFF). This solution is simple and effective and does not increase the complexity of the ESP module." ; skos:prefLabel "Hierarchical Feature Fusion" . :HierarchicalMTL a skos:Concept ; dcterms:source ; skos:altLabel "Hierarchical Multi-Task Learning" ; skos:definition "Multi-task learning (MTL) introduces an inductive bias, based on a priori relations between tasks: the trainable model is compelled to model more general dependencies by using the above-mentioned relations as an important data feature. 
Hierarchical MTL, in which different tasks use different levels of the deep neural network, provides a more effective inductive bias compared to “flat” MTL. Also, hierarchical MTL helps to solve the vanishing gradient problem in deep learning." ; skos:prefLabel "Hierarchical MTL" . :HierarchicalNetworkDissection a skos:Concept ; dcterms:source ; skos:definition "**Hierarchical Network Dissection** is a pipeline for interpreting the internal representation of face-centric inference models. Using a probabilistic formulation, Hierarchical Network Dissection pairs units of the model with concepts in a \"Face Dictionary\" (a collection of facial concepts with corresponding sample images). Interpretable units are discovered in a [convolution](https://paperswithcode.com/method/convolution) layer through HND to identify multiple instances of unit-concept affinity. The pipeline is inspired by [Network Dissection](https://paperswithcode.com/method/network-dissection), an interpretability model for object-centric and scene-centric models." ; skos:prefLabel "Hierarchical Network Dissection" . :HierarchicalSoftmax a skos:Concept ; skos:definition """**Hierarchical Softmax** is an alternative to [softmax](https://paperswithcode.com/method/softmax) that is faster to evaluate: it is $O\\left(\\log{n}\\right)$ time to evaluate compared to $O\\left(n\\right)$ for softmax. It utilises a multi-layer binary tree, where the probability of a word is calculated through the product of probabilities on each edge on the path to that node. See the Figure to the right for an example of where the product calculation would occur for the word "I'm".\r \r (Introduced by Morin and Bengio)\r \r Image Credit: [Steven Schmatz](https://www.quora.com/profile/Steven-Schmatz)""" ; skos:prefLabel "Hierarchical Softmax" . :HierarchicalVAE a skos:Concept ; dcterms:source ; skos:altLabel "Hierarchical Variational Autoencoder" ; skos:definition "" ; skos:prefLabel "Hierarchical VAE" . 
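The $O\left(\log{n}\right)$ evaluation of Hierarchical Softmax described above can be sketched in a few lines. This is a hypothetical illustration (the heap-style tree layout and the function names are my own assumptions, not from Morin and Bengio): the probability of a word is the product of binary left/right decisions along its root-to-leaf path.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hsoftmax_prob(word_idx, hidden, node_vecs, vocab_size):
    """P(word) as a product of per-edge probabilities along the
    root-to-leaf path; only O(log n) inner nodes are touched.
    Assumes vocab_size is a power of two, so the tree is complete
    (heap layout: inner nodes 1..vocab_size-1, leaves follow)."""
    node = vocab_size + word_idx          # leaf index in the heap
    prob = 1.0
    while node > 1:
        parent = node // 2
        p_left = sigmoid(node_vecs[parent] @ hidden)  # P(go left | parent)
        prob *= p_left if node % 2 == 0 else 1.0 - p_left
        node = parent
    return prob
```

Because each inner node splits its probability mass between its two children, the leaf probabilities sum to one without ever normalizing over the full vocabulary.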
:High-levelbackbone a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "High-level backbone" . :High-resolutioninput a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "High-resolution input" . :HighwayLayer a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **Highway Layer** contains an information highway to other layers that helps with information flow. It is characterised by the use of a gating unit to help this information flow. \r \r A plain feedforward neural network typically consists of $L$ layers where the $l$th layer ($l \\in ${$1, 2, \\dots, L$}) applies a nonlinear transform $H$ (parameterized by $\\mathbf{W\\_{H,l}}$) on its input $\\mathbf{x\\_{l}}$ to produce its output $\\mathbf{y\\_{l}}$. Thus, $\\mathbf{x\\_{1}}$ is the input to the network and $\\mathbf{y\\_{L}}$ is the network’s output. Omitting the layer index and biases for clarity,\r \r $$ \\mathbf{y} = H\\left(\\mathbf{x},\\mathbf{W\\_{H}}\\right) $$\r \r $H$ is usually an affine transform followed by a non-linear activation function, but in general it may take other forms. \r \r For a [highway network](https://paperswithcode.com/method/highway-network), we additionally define two nonlinear transforms $T\\left(\\mathbf{x},\\mathbf{W\\_{T}}\\right)$ and $C\\left(\\mathbf{x},\\mathbf{W\\_{C}}\\right)$ such that:\r \r $$ \\mathbf{y} = H\\left(\\mathbf{x},\\mathbf{W\\_{H}}\\right)·T\\left(\\mathbf{x},\\mathbf{W\\_{T}}\\right) + \\mathbf{x}·C\\left(\\mathbf{x},\\mathbf{W\\_{C}}\\right)$$\r \r We refer to T as the transform gate and C as the carry gate, since they express how much of the output is produced by transforming the input and carrying it, respectively. 
In the original paper, the authors set $C = 1 − T$, giving:\r \r $$ \\mathbf{y} = H\\left(\\mathbf{x},\\mathbf{W\\_{H}}\\right)·T\\left(\\mathbf{x},\\mathbf{W\\_{T}}\\right) + \\mathbf{x}·\\left(1-T\\left(\\mathbf{x},\\mathbf{W\\_{T}}\\right)\\right)$$\r \r The authors set:\r \r $$ T\\left(x\\right) = \\sigma\\left(\\mathbf{W\\_{T}}^{T}\\mathbf{x} + \\mathbf{b\\_{T}}\\right) $$\r \r Image: [Sik-Ho Tsang](https://towardsdatascience.com/review-highway-networks-gating-function-to-highway-image-classification-5a33833797b5)""" ; skos:prefLabel "Highway Layer" . :HighwayNetwork a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "A **Highway Network** is an architecture designed to ease gradient-based training of very deep networks. They allow unimpeded information flow across several layers on \"information highways\". The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions." ; skos:prefLabel "Highway Network" . :Highwaynetworks a skos:Concept ; dcterms:source ; skos:definition "There is plenty of theoretical and empirical evidence that depth of neural networks is a crucial ingredient for their success. However, network training becomes more difficult with increasing depth and training of very deep networks remains an open problem. In this extended abstract, we introduce a new architecture designed to ease gradient-based training of very deep networks. We refer to networks with this architecture as highway networks, since they allow unimpeded information flow across several layers on \"information highways\". The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. 
Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions, opening up the possibility of studying extremely deep and efficient architectures." ; skos:prefLabel "Highway networks" . :Hit-Detector a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Hit-Detector** is a neural architectures search algorithm that simultaneously searches all components of an object detector in an end-to-end manner. It is a hierarchical approach to mine the proper subsearch space from the large volume of operation candidates. It consists of two main procedures. First, given a large search space containing all the operation candidates, we screen out the customized sub search space suitable for each part of detector with the help of group sparsity regularization. Secondly, we search the architectures for each part within the corresponding sub search space by adopting the differentiable manner." ; skos:prefLabel "Hit-Detector" . :HolographicReducedRepresentation a skos:Concept ; skos:definition """**Holographic Reduced Representations** are a simple mechanism to represent an associative array of key-value pairs in a fixed-size vector. Each individual key-value pair is the same size as the entire associative array; the array is represented by the sum of the pairs. Concretely, consider a complex vector key $r = (a\\_{r}[1]e^{iφ\\_{r}[1]}, a\\_{r}[2]e^{iφ\\_{r}[2]}, \\dots)$, which is the same size as the complex vector value x. 
The pair is "bound" together by element-wise complex multiplication, which multiplies the moduli and adds the phases of the elements:\r \r $$ y = r \\otimes x $$\r \r $$ y = \\left(a\\_{r}[1]a\\_{x}[1]e^{i(φ\\_{r}[1]+φ\\_{x}[1])}, a\\_{r}[2]a\\_{x}[2]e^{i(φ\\_{r}[2]+φ\\_{x}[2])}, \\dots\\right) $$\r \r Given keys $r\\_{1}$, $r\\_{2}$, $r\\_{3}$ and input vectors $x\\_{1}$, $x\\_{2}$, $x\\_{3}$, the associative array is:\r \r $$c = r\\_{1} \\otimes x\\_{1} + r\\_{2} \\otimes x\\_{2} + r\\_{3} \\otimes x\\_{3} $$\r \r where we call $c$ a memory trace. Define the key inverse:\r \r $$ r^{-1} = \\left(a\\_{r}[1]^{−1}e^{−iφ\\_{r}[1]}, a\\_{r}[2]^{−1}e^{−iφ\\_{r}[2]}, \\dots\\right) $$\r \r To retrieve the item associated with key $r\\_{k}$, we multiply the memory trace element-wise by the vector $r^{-1}\\_{k}$. For example: \r \r $$ r\\_{2}^{−1} \\otimes c = r\\_{2}^{-1} \\otimes \\left(r\\_{1} \\otimes x\\_{1} + r\\_{2} \\otimes x\\_{2} + r\\_{3} \\otimes x\\_{3}\\right) $$\r \r $$ r\\_{2}^{−1} \\otimes c = x\\_{2} + r^{-1}\\_{2} \\otimes \\left(r\\_{1} \\otimes x\\_{1} + r\\_{3} \\otimes x\\_{3}\\right) $$\r \r $$ r\\_{2}^{−1} \\otimes c = x\\_{2} + noise $$\r \r The product is exactly $x\\_{2}$ together with a noise term. If the phases of the elements of the key vector are randomly distributed, the noise term has zero mean.\r \r Source: [Associative LSTMs](https://arxiv.org/pdf/1602.03032.pdf)""" ; skos:prefLabel "Holographic Reduced Representation" . :HopfieldLayer a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **Hopfield Layer** is a module that enables a network to associate two sets of vectors. 
This general functionality allows for [transformer](https://paperswithcode.com/method/transformer)-like self-attention, for decoder-encoder attention, for time series prediction (maybe with positional encoding), for sequence analysis, for multiple instance learning, for learning with point sets, for combining data sources by associations, for constructing a memory, for averaging and pooling operations, and many more. \r \r In particular, the Hopfield layer can readily be used as a plug-in replacement for existing layers like pooling layers ([max-pooling](https://paperswithcode.com/method/max-pooling) or [average pooling](https://paperswithcode.com/method/average-pooling)), permutation equivariant layers, [GRU](https://paperswithcode.com/method/gru) & [LSTM](https://paperswithcode.com/method/lstm) layers, and attention layers. The Hopfield layer is based on modern Hopfield networks with continuous states that have very high storage capacity and converge after one update.""" ; skos:prefLabel "Hopfield Layer" . :HourglassModule a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """An **Hourglass Module** is an image block module used mainly for pose estimation tasks. The design of the hourglass is motivated by the need to capture information at every scale. While local evidence is essential for identifying features like faces and hands, a final pose estimate requires a coherent understanding of the full body. The person’s orientation, the arrangement of their limbs, and the relationships of adjacent joints are among the many cues that are best recognized at different scales in the image. The hourglass is a simple, minimal design that has the capacity to capture all of these features and bring them together to output pixel-wise predictions.\r \r The network must have some mechanism to effectively process and consolidate features across scales. The Hourglass uses a single pipeline with skip layers to preserve spatial information at each resolution. 
The network reaches its lowest resolution at 4x4 pixels, allowing smaller spatial filters to be applied that compare features across the entire space of the image.\r \r The hourglass is set up as follows: Convolutional and [max pooling](https://paperswithcode.com/method/max-pooling) layers are used to process features down to a very low resolution. At each max pooling step, the network branches off and applies more convolutions at the original pre-pooled resolution. After reaching the lowest resolution, the network begins the top-down sequence of upsampling and combination of features across scales. To bring together information across two adjacent resolutions, we do nearest neighbor upsampling of the lower resolution followed by an elementwise addition of the two sets of features. The topology of the hourglass is symmetric, so for every layer present on the way down there is a corresponding layer going up.\r \r After reaching the output resolution of the network, two consecutive rounds of 1x1 convolutions are applied to produce the final network predictions. The output of the network is a set of heatmaps where for a given [heatmap](https://paperswithcode.com/method/heatmap) the network predicts the probability of a joint’s presence at each and every pixel.""" ; skos:prefLabel "Hourglass Module" . :Huberloss a skos:Concept ; skos:definition """The **Huber loss** function describes the penalty incurred by an estimation procedure $f$. Huber (1964) defines the loss function piecewise by\r \r $$L\\_{\\delta}\\left(a\\right) = \\begin{cases}\\frac{1}{2}a^{2} & \\text{for } |a| \\leq \\delta \\\\ \\delta\\cdot\\left(|a| - \\frac{1}{2}\\delta\\right) & \\text{otherwise}\\end{cases}$$\r \r This function is quadratic for small values of $a$ and linear for large values, with equal values and slopes of the two sections at the points where $|a| = \\delta$. The variable $a$ often refers to the residuals, that is, the difference between the observed and predicted values, $a = y - f\\left(x\\right)$, so the former can be expanded to\r \r $$L\\_{\\delta}\\left(y, f\\left(x\\right)\\right) = \\begin{cases}\\frac{1}{2}\\left(y-f\\left(x\\right)\\right)^{2} & \\text{for } |y-f\\left(x\\right)| \\leq \\delta \\\\ \\delta\\cdot\\left(|y-f\\left(x\\right)| - \\frac{1}{2}\\delta\\right) & \\text{otherwise}\\end{cases}$$\r \r The Huber loss is the convolution of the absolute value function with the rectangular function, scaled and translated. Thus it "smoothens out" the former's corner at the origin.""" ; skos:prefLabel "Huber loss" . :Hydra a skos:Concept ; dcterms:source ; skos:definition "**Hydra** is a multi-headed neural network for model distillation with a shared body network. The shared body network learns a joint feature representation that enables each head to capture the predictive behavior of each ensemble member. Existing distillation methods often train a distillation network to imitate the prediction of a larger network. Hydra instead learns to distill the individual predictions of each ensemble member into separate light-weight head models while amortizing the computation through a shared heavy-weight body network. This retains the diversity of ensemble member predictions, which is otherwise lost in knowledge distillation." ; skos:prefLabel "Hydra" . 
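The piecewise Huber penalty above translates directly into code. A minimal NumPy sketch (illustrative; the function name is my own):

```python
import numpy as np

def huber_loss(y, y_pred, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond, with matching
    value and slope at the threshold |residual| = delta."""
    a = np.abs(y - y_pred)
    return np.where(a <= delta,
                    0.5 * a ** 2,
                    delta * (a - 0.5 * delta))
```

The linear branch grows only proportionally to the residual, which is why the Huber loss is far less sensitive to outliers than the squared error while remaining differentiable everywhere.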
:HypE a skos:Concept ; dcterms:source ; skos:altLabel "Hyperboloid Embeddings" ; skos:definition "Hyperboloid Embeddings (HypE) is a novel self-supervised dynamic reasoning framework that utilizes positive first-order existential queries on a KG to learn representations of its entities and relations as hyperboloids in a Poincaré ball. HypE models the positive first-order queries as geometrical translation (t), intersection ($\\cap$), and union ($\\cup$). For the problem of KG reasoning in real-world datasets, the proposed HypE model significantly outperforms the state-of-the-art results. HypE is also applied to an anomaly detection task on a popular e-commerce website product taxonomy as well as hierarchically organized web articles and demonstrates significant performance improvements compared to existing baseline methods. Finally, HypE embeddings can also be visualized in a Poincaré ball to clearly interpret and comprehend the representation space." ; skos:prefLabel "HypE" . :HyperDenseNet a skos:Concept ; dcterms:source ; skos:definition "Recently, [dense connections](https://paperswithcode.com/method/dense-connections) have attracted substantial attention in computer vision because they facilitate gradient flow and implicit deep supervision during training. In particular, [DenseNet](https://paperswithcode.com/method/densenet) connects each layer to every other layer in a feed-forward fashion and has shown impressive performance in natural image classification tasks. We propose HyperDenseNet, a 3-D fully convolutional neural network that extends the definition of dense connectivity to multi-modal segmentation problems. Each imaging modality has a path, and dense connections occur not only between the pairs of layers within the same path but also between those across different paths. 
This contrasts with the existing multi-modal CNN approaches, in which modeling several modalities relies entirely on a single joint layer (or level of abstraction) for fusion, typically either at the input or at the output of the network. Therefore, the proposed network has total freedom to learn more complex combinations between the modalities, within and in-between all the levels of abstraction, which increases significantly the learning representation. We report extensive evaluations over two different and highly competitive multi-modal brain tissue segmentation challenges, iSEG 2017 and MRBrainS 2013, with the former focusing on six month infant data and the latter on adult images. HyperDenseNet yielded significant improvements over many state-of-the-art segmentation networks, ranking at the top on both benchmarks. We further provide a comprehensive experimental analysis of features re-use, which confirms the importance of hyper-dense connections in multi-modal representation learning." ; skos:prefLabel "HyperDenseNet" . :HyperHyperNetwork a skos:Concept ; dcterms:source ; skos:altLabel "Hyper HyperNetwork" ; skos:definition "" ; skos:prefLabel "HyperHyperNetwork" . :HyperNetwork a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "A **HyperNetwork** is a network that generates weights for a main network. The behavior of the main network is the same with any usual neural network: it learns to map some raw inputs to their desired targets; whereas the hypernetwork takes a set of inputs that contain information about the structure of the weights and generates the weight for that layer." ; skos:prefLabel "HyperNetwork" . :HyperSA a skos:Concept ; dcterms:source ; skos:altLabel "HyperGraph Self-Attention" ; skos:definition """An extension of Self-Attention to hypergraph\r Skeleton-based action recognition aims to recognize human actions given human joint coordinates with skeletal interconnections. 
By defining a graph with joints as vertices and their natural connections as edges, previous works successfully adopted Graph Convolutional Networks (GCNs) to model joint co-occurrences and achieved superior performance. More recently, a limitation of GCNs was identified, i.e., the topology is fixed after training. To relax such a restriction, the Self-Attention (SA) mechanism has been adopted to make the topology of GCNs adaptive to the input, resulting in state-of-the-art hybrid models. Concurrently, attempts with plain Transformers have also been made, but they still lag behind state-of-the-art GCN-based methods due to the lack of a structural prior. Unlike hybrid models, we propose a more elegant solution to incorporate the bone connectivity into the Transformer via a graph distance embedding. Our embedding retains the information of skeletal structure during training, whereas GCNs merely use it for initialization. More importantly, we reveal an underlying issue of graph models in general, i.e., pairwise aggregation essentially ignores the high-order kinematic dependencies between body joints. To fill this gap, we propose a new self-attention (SA) mechanism on hypergraph, termed Hypergraph Self-Attention (HyperSA), to incorporate intrinsic higher-order relations into the model. We name the resulting model Hyperformer, and it beats state-of-the-art graph models w.r.t. accuracy and efficiency on NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA datasets.""" ; skos:prefLabel "HyperSA" . :HyperTreeMetaModel a skos:Concept ; dcterms:source ; skos:definition "Optimizes combinations of various neural network models for multimodal data with Bayesian optimization." ; skos:prefLabel "HyperTree MetaModel" . :I-BERT a skos:Concept ; dcterms:source ; skos:definition """**I-BERT** is a quantized version of [BERT](https://paperswithcode.com/method/bert) that quantizes the entire inference with integer-only arithmetic. 
Based on lightweight integer only approximation methods for nonlinear operations, e.g., [GELU](https://paperswithcode.com/method/gelu), [Softmax](https://paperswithcode.com/method/softmax), and [Layer Normalization](https://paperswithcode.com/method/layer-normalization), it performs an end-to-end integer-only [BERT](https://paperswithcode.com/method/bert) inference without any floating point calculation.\r \r In particular, GELU and Softmax are approximated with lightweight second-order polynomials, which can be evaluated with integer-only arithmetic. For LayerNorm, integer-only computation is performed by leveraging a known algorithm for integer calculation of\r square root.""" ; skos:prefLabel "I-BERT" . :I3DR-Net a skos:Concept ; dcterms:source ; skos:altLabel "Inflated 3D ConvNet Retina Net" ; skos:definition "" ; skos:prefLabel "I3DR-Net" . :IAN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Introspective Adversarial Network" ; skos:definition """The **Introspective Adversarial Network (IAN)** is a hybridization of [GANs](https://paperswithcode.com/method/gan) and [VAEs](https://paperswithcode.com/method/vae) that leverages the power of the adversarial objective while maintaining the VAE’s efficient inference mechanism. It uses the discriminator of the GAN, $D$, as a feature extractor for an inference subnetwork, $E$, which is implemented as a fully-connected layer on top of the final convolutional layer of the discriminator. 
We infer latent values $Z \\sim E\\left(X\\right) = q\\left(Z\\mid{X}\\right)$ for reconstruction and sample random values $Z \\sim p\\left(Z\\right)$ from a standard normal for random image generation using the generator network, $G$.\r \r Three distinct loss functions are used:\r \r - $\\mathcal{L}\\_{img}$, the L1 pixel-wise reconstruction loss, which is preferred to the L2 reconstruction loss for its higher average gradient.\r - $\\mathcal{L\\_{feature}}$, the feature-wise reconstruction loss, evaluated as the L2 difference between the original and reconstruction in the space of the hidden layers of the discriminator.\r - $\\mathcal{L}\\_{adv}$, the ternary adversarial loss, a modification of the adversarial loss that forces the discriminator to label a sample as real, generated, or reconstructed (as opposed to a binary\r real vs. generated label).\r \r Including the VAE’s KL divergence between the inferred latents $E\\left(X\\right)$ and the prior $p\\left(Z\\right)$, the loss function for the generator and encoder network is thus:\r \r $$\\mathcal{L}\\_{E, G} = \\lambda\\_{adv}\\mathcal{L}\\_{G\\_{adv}} + \\lambda\\_{img}\\mathcal{L}\\_{img} + \\lambda\\_{feature}\\mathcal{L}\\_{feature} + D\\_{KL}\\left(E\\left(X\\right) || p\\left(Z\\right)\\right) $$\r \r Where the $\\lambda$ terms weight the relative importance of each loss. We set $\\lambda\\_{img}$ to 3 and leave the other terms at 1. The discriminator is updated solely using the ternary adversarial loss. During each training step, the generator produces reconstructions $G\\left(E\\left(X\\right)\\right)$ (using the standard VAE reparameterization trick) from data $X$ and random samples $G\\left(Z\\right)$, while the discriminator observes $X$ as well as the reconstructions and random samples, and both networks are simultaneously updated.""" ; skos:prefLabel "IAN" . 
:IB-BERT a skos:Concept ; dcterms:source ; skos:altLabel "Inverted Bottleneck BERT" ; skos:definition "**IB-BERT**, or **Inverted Bottleneck BERT**, is a [BERT](https://paperswithcode.com/method/bert) variant that uses an [inverted bottleneck](https://paperswithcode.com/method/inverted-residual-block) structure. It is used as a teacher network to train the [MobileBERT](https://paperswithcode.com/method/mobilebert) models." ; skos:prefLabel "IB-BERT" . :IC-SBP a skos:Concept ; dcterms:source ; skos:altLabel "Instance Colouring Stick-Breaking Process" ; skos:definition "" ; skos:prefLabel "IC-SBP" . :IFBlock a skos:Concept ; dcterms:source ; skos:definition "**IFBlock** is a video model block used in the [IFNet](https://paperswithcode.com/method/ifnet) architecture for video frame interpolation. IFBlocks do not contain expensive operators like cost volume or forward warping and use 3 × 3 convolution and deconvolution as building blocks. Each IFBlock has a feed-forward structure consisting of several convolutional layers and an upsampling operator. Except for the layer that outputs the optical flow residuals and the fusion map, [PReLU](https://paperswithcode.com/method/prelu) activations are used." ; skos:prefLabel "IFBlock" . :IFNet a skos:Concept ; dcterms:source ; skos:definition "**IFNet** is an architecture for video frame interpolation that adopts a coarse-to-fine strategy with progressively increased resolutions: it iteratively updates intermediate flows and soft fusion mask via successive [IFBlocks](https://paperswithcode.com/method/ifblock). Conceptually, according to the iteratively updated flow fields, we can move corresponding pixels from two input frames to the same location in a latent intermediate frame and use a fusion mask to combine pixels from two input frames. 
Unlike most previous optical flow models, IFBlocks do not contain expensive operators like cost volume or forward warping and use 3 × 3 [convolution](https://paperswithcode.com/method/convolution) and deconvolution as building blocks." ; skos:prefLabel "IFNet" . :IGSA a skos:Concept ; dcterms:source ; skos:altLabel "Improved Gravitational Search algorithm" ; skos:definition "Metaheuristic algorithm" ; skos:prefLabel "IGSA" . :IICNet a skos:Concept ; dcterms:source ; skos:definition "**Invertible Image Conversion Net**, or **IICNet**, is a generic framework for reversible image conversion tasks. Unlike previous encoder-decoder based methods, IICNet maintains a highly invertible structure based on invertible neural networks (INNs) to better preserve the information during conversion. It uses a relation module and a channel squeeze layer to improve, respectively, the INN nonlinearity for extracting cross-image relations and the network flexibility." ; skos:prefLabel "IICNet" . :ILVR a skos:Concept ; dcterms:source ; skos:altLabel "Iterative Latent Variable Refinement" ; skos:definition "**Iterative Latent Variable Refinement**, or **ILVR**, is a method to guide the generative process in denoising diffusion probabilistic models (DDPMs) to generate high-quality images based on a given reference image. ILVR conditions the generation process on a well-performing unconditional DDPM. Each transition in the generation process is refined utilizing a given reference image. By matching each latent variable, ILVR ensures the given condition in each transition, thus enabling sampling from a conditional distribution. Thus, ILVR generates high-quality images sharing desired semantics." ; skos:prefLabel "ILVR" . :IMGEP a skos:Concept ; dcterms:source ; skos:altLabel "Intrinsically Motivated Goal Exploration Processes" ; skos:definition "Population-based intrinsically motivated goal exploration algorithms applied to real-world robot learning of complex skills like tool use." 
; skos:prefLabel "IMGEP" . :IMPALA a skos:Concept ; dcterms:source ; skos:definition """**IMPALA**, or the **Importance Weighted Actor Learner Architecture**, is an off-policy actor-critic framework that decouples acting from learning and learns from experience trajectories using [V-trace](https://paperswithcode.com/method/v-trace). Unlike the popular [A3C](https://paperswithcode.com/method/a3c)-based agents, in which workers communicate gradients with respect to the parameters of the policy to a central parameter server, IMPALA actors communicate trajectories of experience (sequences of states, actions, and rewards) to a centralized learner. Since the learner in IMPALA has access to full trajectories of experience we use a GPU to perform updates on mini-batches of trajectories while aggressively parallelising all time independent operations. \r \r This type of decoupled architecture can achieve very high throughput. However, because the policy used to generate a trajectory can lag behind the policy on the learner by several updates at the time of gradient calculation, learning becomes off-policy. The V-trace off-policy actor-critic algorithm is used to correct for this harmful discrepancy.""" ; skos:prefLabel "IMPALA" . :IPA-GNN a skos:Concept ; dcterms:source ; skos:altLabel "Instruction Pointer Attention Graph Neural Network" ; skos:definition "**Instruction Pointer Attention Graph Neural Network**, or **IPA-GNN**, is a learning-interpreter neural network (LNN) based on GNNs for learning to execute programmes. It achieves improved systematic generalization on the task of learning to execute programs using control flow graphs. The model arises by considering RNNs operating on program traces with branch decisions as latent variables. The IPA-GNN can be seen either as a continuous relaxation of the RNN model or as a GNN variant more tailored to execution." ; skos:prefLabel "IPA-GNN" . 
:IPBI a skos:Concept ; dcterms:source ; skos:altLabel "Instances-Pixels Balance Index" ; skos:definition """In a given dataset for semantic image segmentation, the number of samples per class should be the same, so that no classifier would be biased towards the majority class (the background included). It is very difficult, if not impossible, to achieve a perfect balance between the several classes of objects of a dataset. Considering that the segmentation of the objects is accomplished at the pixel level, the number of pixels for each class must be taken into account. As a matter of fact, in image semantic segmentation, \r different classes and the background may have quite different\r sizes. Therefore, the image segmentation problem is naturally unbalanced. The IPBI is based on the concept of entropy, a common measure used in many fields of science. In a general sense, it measures the amount of disorder of a system. For the sake of semantic image segmentation, the ideal dataset should have the same number of instances per class, as well as the same number of pixels in all classes. Similar reasoning can be done considering the number of pixels of all samples in a class, so that we can obtain the\r pixels balance measure for the dataset. Overall, IPBI evaluates the balance of pixels and number of instances of an image semantic segmentation dataset and, so, it is useful to compare different datasets.""" ; skos:prefLabel "IPBI" . :IPL a skos:Concept ; dcterms:source ; skos:altLabel "Iterative Pseudo-Labeling" ; skos:definition "**Iterative Pseudo-Labeling** (IPL) is a semi-supervised algorithm for speech recognition which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves. In particular, IPL fine-tunes an existing model at each iteration using both labeled data and a subset of unlabeled data." ; skos:prefLabel "IPL" . 
:IQ-Learn a skos:Concept ; skos:altLabel "Inverse Q-Learning" ; skos:definition """**Inverse Q-Learning (IQ-Learn)** is a simple, stable & data-efficient framework for Imitation Learning (IL) that directly learns *soft Q-functions* from expert data. IQ-Learn enables **non-adversarial** imitation learning, working in both offline and online IL settings. It is performant even with very sparse expert data, and scales to complex image-based environments, surpassing prior methods by more than **3x**. \r \r It is very simple to implement, requiring ~15 lines of code on top of existing RL methods.\r \r Source: [IQ-Learn: Inverse soft Q-Learning for Imitation](https://arxiv.org/abs/2106.12142)""" ; skos:prefLabel "IQ-Learn" . :IQL a skos:Concept ; dcterms:source ; skos:altLabel "Implicit Q-Learning" ; skos:definition "" ; skos:prefLabel "IQL" . :IRN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Invertible Rescaling Network" ; skos:definition """An **Invertible Rescaling Network (IRN)** is a network for image rescaling. According to the Nyquist-Shannon sampling theorem, high-frequency contents are lost during downscaling. Ideally, we hope to keep all lost information to perfectly recover the original HR image, but storing or transferring the high-frequency information is unacceptable. To address this challenge, the Invertible Rescaling Net (IRN) captures some knowledge of the lost information in the form of its distribution and embeds it into the model’s parameters to mitigate the ill-posedness. Given an HR image $x$, IRN not only downscales it into an LR image $y$, but also embeds the case-specific high-frequency content into an auxiliary case-agnostic latent variable $z$, whose marginal distribution\r obeys a fixed pre-specified distribution (e.g., isotropic Gaussian). 
Based on this model,\r we use a randomly drawn sample of $z$ from the pre-specified distribution for the inverse upscaling procedure, which holds the most information that one could have in upscaling.""" ; skos:prefLabel "IRN" . :ISPL a skos:Concept ; dcterms:source ; skos:altLabel "Implicit Subspace Prior Learning" ; skos:definition "**Implicit Subspace Prior Learning**, or **ISPL**, is a framework to approach dual-blind face restoration, with two major distinctions from previous restoration methods: 1) Instead of assuming an explicit degradation function between the LQ and HQ domains, it establishes an implicit correspondence between both domains via a mutual embedding space, thus avoiding solving the pathological inverse problem directly. 2) It uses a subspace prior decomposition and fusion mechanism to dynamically handle inputs at varying degradation levels with consistent high-quality restoration results." ; skos:prefLabel "ISPL" . :IkshanaNet a skos:Concept ; dcterms:source ; skos:altLabel "The Ikshana Hypothesis of Human Scene Understanding Mechanism" ; skos:definition "" ; skos:prefLabel "IkshanaNet" . :ImageScaleAugmentation a skos:Concept ; skos:definition "Image Scale Augmentation is an augmentation technique where we randomly pick the short side of an image within a dimension range. One use case of this augmentation technique is in object detection tasks." ; skos:prefLabel "Image Scale Augmentation" . :ImplicitPointRend a skos:Concept ; dcterms:source ; skos:definition "**Implicit PointRend** is a modification to the [PointRend](https://paperswithcode.com/method/pointrend) module for instance segmentation. Instead of the coarse mask prediction used in [PointRend](https://paperswithcode.com/method/pointrend) to provide region-level context to distinguish objects, for each object Implicit PointRend generates different parameters for a function that makes the final pointwise mask prediction. 
The new model is more straightforward than PointRend: (1) it does not require an importance point sampling during training and (2) it uses a single point-level mask loss instead of two mask losses. Implicit PointRend can be trained directly with point supervision without any intermediate prediction interpolation steps." ; skos:prefLabel "Implicit PointRend" . :InPlace-ABN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "In-Place Activated Batch Normalization" ; skos:definition "**In-Place Activated Batch Normalization**, or **InPlace-ABN**, substitutes the conventionally used succession of [BatchNorm](https://paperswithcode.com/method/batch-normalization) + Activation layers with a single plugin layer, hence avoiding invasive framework surgery while providing straightforward applicability for existing deep learning frameworks. It approximately halves the memory requirements during training of modern deep learning models." ; skos:prefLabel "InPlace-ABN" . :Inception-A a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Inception-A** is an image model block used in the [Inception-v4](https://paperswithcode.com/method/inception-v4) architecture." ; skos:prefLabel "Inception-A" . :Inception-B a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Inception-B** is an image model block used in the [Inception-v4](https://paperswithcode.com/method/inception-v4) architecture." ; skos:prefLabel "Inception-B" . :Inception-C a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Inception-C** is an image model block used in the [Inception-v4](https://paperswithcode.com/method/inception-v4) architecture." ; skos:prefLabel "Inception-C" . 
:Inception-ResNet-v2 a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Inception-ResNet-v2** is a convolutional neural architecture that builds on the Inception family of architectures but incorporates [residual connections](https://paperswithcode.com/method/residual-connection) (replacing the filter concatenation stage of the Inception architecture)." ; skos:prefLabel "Inception-ResNet-v2" . :Inception-ResNet-v2-A a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Inception-ResNet-v2-A** is an image model block for a 35 x 35 grid used in the [Inception-ResNet-v2](https://paperswithcode.com/method/inception-resnet-v2) architecture." ; skos:prefLabel "Inception-ResNet-v2-A" . :Inception-ResNet-v2-B a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Inception-ResNet-v2-B** is an image model block for a 17 x 17 grid used in the [Inception-ResNet-v2](https://paperswithcode.com/method/inception-resnet-v2) architecture. It largely follows the idea of Inception modules - and grouped convolutions - but also includes residual connections." ; skos:prefLabel "Inception-ResNet-v2-B" . :Inception-ResNet-v2-C a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Inception-ResNet-v2-C** is an image model block for an 8 x 8 grid used in the [Inception-ResNet-v2](https://paperswithcode.com/method/inception-resnet-v2) architecture. It largely follows the idea of Inception modules - and grouped convolutions - but also includes residual connections." ; skos:prefLabel "Inception-ResNet-v2-C" . :Inception-ResNet-v2Reduction-B a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Inception-ResNet-v2 Reduction-B** is an image model block used in the [Inception-ResNet-v2](https://paperswithcode.com/method/inception-resnet-v2) architecture." ; skos:prefLabel "Inception-ResNet-v2 Reduction-B" . 
:Inception-v3 a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Inception-v3** is a convolutional neural network architecture from the Inception family that makes several improvements including using [Label Smoothing](https://paperswithcode.com/method/label-smoothing), factorized 7 x 7 convolutions, and the use of an auxiliary classifier to propagate label information lower down the network (along with the use of [batch normalization](https://paperswithcode.com/method/batch-normalization) for layers in the side head)." ; skos:prefLabel "Inception-v3" . :Inception-v3Module a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Inception-v3 Module** is an image block used in the [Inception-v3](https://paperswithcode.com/method/inception-v3) architecture. This architecture is used on the coarsest (8 × 8) grids to promote high-dimensional representations." ; skos:prefLabel "Inception-v3 Module" . :Inception-v4 a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Inception-v4** is a convolutional neural network architecture that builds on previous iterations of the Inception family by simplifying the architecture and using more inception modules than [Inception-v3](https://paperswithcode.com/method/inception-v3)." ; skos:prefLabel "Inception-v4" . :InceptionModule a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "An **Inception Module** is an image model block that aims to approximate an optimal local sparse structure in a CNN. Put simply, it allows us to use multiple filter sizes in a single image block, instead of being restricted to a single filter size; the resulting feature maps are concatenated and passed onto the next layer." ; skos:prefLabel "Inception Module" . :InceptionTime a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "InceptionTime" . 
:Inceptionv2 a skos:Concept ; dcterms:source ; skos:definition "**Inception v2** is the second generation of Inception convolutional neural network architectures which notably uses [batch normalization](https://paperswithcode.com/method/batch-normalization). Other changes include dropping [dropout](https://paperswithcode.com/method/dropout) and removing [local response normalization](https://paperswithcode.com/method/local-response-normalization), due to the benefits of batch normalization." ; skos:prefLabel "Inception v2" . :InfoGAN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**InfoGAN** is a type of generative adversarial network that modifies the [GAN](https://paperswithcode.com/method/gan) objective to\r encourage it to learn interpretable and meaningful representations. This is done by maximizing the\r mutual information between a fixed small subset of the GAN’s noise variables and the observations.\r \r Formally, InfoGAN is defined as a minimax game with a variational regularization of mutual information and the hyperparameter $\\lambda$:\r \r $$ \\min\\_{G, Q}\\max\\_{D}V\\_{INFOGAN}\\left(D, G, Q\\right) = V\\left(D, G\\right) - \\lambda{L}\\_{I}\\left(G, Q\\right) $$\r \r Where $Q$ is an auxiliary distribution that approximates the posterior $P\\left(c\\mid{x}\\right)$ - the probability of the latent code $c$ given the data $x$ - and $L\\_{I}$ is the variational lower bound of the mutual information between the latent code and the observations.\r \r In the practical implementation, there is another fully-connected layer to output parameters for the conditional distribution $Q$ (negligible computation on top of regular GAN structures). $Q$ is represented with a [softmax](https://paperswithcode.com/method/softmax) non-linearity for a categorical latent code. For a continuous latent code, the authors assume a factored Gaussian.""" ; skos:prefLabel "InfoGAN" . 
:InfoNCE a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**InfoNCE**, where NCE stands for Noise-Contrastive Estimation, is a type of contrastive loss function used for [self-supervised learning](https://paperswithcode.com/methods/category/self-supervised-learning).\r \r Given a set $X = ${$x\\_{1}, \\dots, x\\_{N}$} of $N$ random samples containing one positive sample from $p\\left(x\\_{t+k}|c\\_{t}\\right)$ and $N − 1$ negative samples from the 'proposal' distribution $p\\left(x\\_{t+k}\\right)$, we optimize:\r \r $$ \\mathcal{L}\\_{N} = - \\mathbb{E}\\_{X}\\left[\\log\\frac{f\\_{k}\\left(x\\_{t+k}, c\\_{t}\\right)}{\\sum\\_{x\\_{j}\\in{X}}f\\_{k}\\left(x\\_{j}, c\\_{t}\\right)}\\right] $$\r \r Optimizing this loss will result in $f\\_{k}\\left(x\\_{t+k}, c\\_{t}\\right)$ estimating the density ratio, which is:\r \r $$ f\\_{k}\\left(x\\_{t+k}, c\\_{t}\\right) \\propto \\frac{p\\left(x\\_{t+k}|c\\_{t}\\right)}{p\\left(x\\_{t+k}\\right)} $$""" ; skos:prefLabel "InfoNCE" . :InformativeSampleMiningNetwork a skos:Concept ; dcterms:source ; skos:definition "**Informative Sample Mining Network** is a multi-stage sample training scheme for GANs to reduce sample hardness while preserving sample informativeness. Adversarial Importance Weighting is proposed to select informative samples and assign them greater weight. The authors also propose Multi-hop Sample Training to avoid the potential problems in model training caused by sample mining. Based on the principle of divide-and-conquer, the authors produce target images by multiple hops, which means the image translation is decomposed into several separated steps." ; skos:prefLabel "Informative Sample Mining Network" . :Inpainting a skos:Concept ; dcterms:source ; skos:definition "Train a convolutional neural network to generate the contents of an arbitrary image region conditioned on its surroundings." ; skos:prefLabel "Inpainting" . 
:InstaBoost a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**InstaBoost** is a data augmentation technique for instance segmentation that utilises existing instance mask annotations.\r \r Intuitively in a small neighbor area of $(x_0, y_0, 1, 0)$, the probability map $P(x, y, s, r)$ should be high-valued since images are usually continuous and redundant in pixel level. Based on this, InstaBoost is a form of augmentation where we apply object jittering that randomly samples transformation tuples from the neighboring space of identity transform $(x_0, y_0, 1, 0)$ and paste the cropped object following affine transform $\\mathbf{H}$.""" ; skos:prefLabel "InstaBoost" . :Instance-LevelMetaNormalization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Instance-Level Meta Normalization** is a normalization method that addresses a learning-to-normalize problem. ILM-Norm learns to predict the normalization parameters via both the feature feed-forward and the gradient back-propagation paths. It uses an auto-encoder to predict the weights $\\omega$ and bias $\\beta$ as the rescaling parameters for recovering the distribution of the tensor $x$ of feature maps. Instead of using the entire feature tensor $x$ as the input for the auto-encoder, it uses the mean $\\mu$ and variance $\\gamma$ of $x$ for characterizing its statistics." ; skos:prefLabel "Instance-Level Meta Normalization" . :InstanceNormalization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Instance Normalization** (also known as contrast normalization) is a normalization layer where:\r \r $$\r y_{tijk} = \\frac{x_{tijk} - \\mu_{ti}}{\\sqrt{\\sigma_{ti}^2 + \\epsilon}},\r \\quad\r \\mu_{ti} = \\frac{1}{HW}\\sum_{l=1}^W \\sum_{m=1}^H x_{tilm},\r \\quad\r \\sigma_{ti}^2 = \\frac{1}{HW}\\sum_{l=1}^W \\sum_{m=1}^H (x_{tilm} - \\mu_{ti})^2.\r $$\r \r This prevents instance-specific mean and covariance shift simplifying the learning process. 
Intuitively, the normalization process allows removing instance-specific contrast information from the content image in a task like image stylization, which simplifies generation.""" ; skos:prefLabel "Instance Normalization" . :InterBERT a skos:Concept ; dcterms:source ; skos:definition "InterBERT aims to model interaction between information flows pertaining to different modalities. This new architecture builds multi-modal interaction and preserves the independence of single-modal representations. InterBERT is built with an image embedding layer, a text embedding layer, a single-stream interaction module, and a two-stream extraction module. The model is pre-trained with three tasks: 1) masked segment modeling, 2) masked region modeling, and 3) image-text matching." ; skos:prefLabel "InterBERT" . :InternVideo a skos:Concept ; skos:altLabel "InternVideo: General Video Foundation Models via Generative and Discriminative Learning" ; skos:definition "Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. 
In particular, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo." ; skos:prefLabel "InternVideo" . :InternetExplorer a skos:Concept ; dcterms:source ; skos:definition "Internet Explorer explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a desired target dataset. It cycles between searching for images on the Internet with text queries, self-supervised training on downloaded images, determining which images were useful, and prioritizing what to search for next." ; skos:prefLabel "Internet Explorer" . :InverseSquareRootSchedule a skos:Concept ; skos:definition """**Inverse Square Root** is a learning rate schedule $1 / \\sqrt{\\max\\left(n, k\\right)}$ where\r $n$ is the current training iteration and $k$ is the number of warm-up steps. This sets a constant learning rate for the first $k$ steps, then decays the learning rate proportionally to the inverse square root of the iteration number until pre-training is over.""" ; skos:prefLabel "Inverse Square Root Schedule" . :InvertedResidualBlock a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """An **Inverted Residual Block**, sometimes called an **MBConv Block**, is a type of residual block used for image models that uses an inverted structure for efficiency reasons. It was originally proposed for the [MobileNetV2](https://paperswithcode.com/method/mobilenetv2) CNN architecture. It has since been reused for several mobile-optimized CNNs.\r \r A traditional [Residual Block](https://paperswithcode.com/method/residual-block) has a wide -> narrow -> wide structure with the number of channels. The input has a high number of channels, which are compressed with a [1x1 convolution](https://paperswithcode.com/method/1x1-convolution). 
The number of channels is then increased again with a 1x1 [convolution](https://paperswithcode.com/method/convolution) so input and output can be added. \r \r In contrast, an Inverted Residual Block follows a narrow -> wide -> narrow approach, hence the inversion. We first widen with a 1x1 convolution, then use a 3x3 [depthwise convolution](https://paperswithcode.com/method/depthwise-convolution) (which greatly reduces the number of parameters), then we use a 1x1 convolution to reduce the number of channels so input and output can be added.""" ; skos:prefLabel "Inverted Residual Block" . :Invertible1x1Convolution a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """The **Invertible 1x1 Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) used in flow-based generative models that reverses the ordering of channels. The weight matrix is initialized as a random rotation matrix. The log-determinant of an invertible 1 × 1 convolution of a $h \\times w \\times c$ tensor $h$ with $c \\times c$ weight matrix $\\mathbf{W}$ is straightforward to compute:\r \r $$ \\log | \\text{det}\\left(\\frac{d\\text{conv2D}\\left(\\mathbf{h};\\mathbf{W}\\right)}{d\\mathbf{h}}\\right) | = h \\cdot w \\cdot \\log | \\text{det}\\left(\\mathbf{W}\\right) | $$""" ; skos:prefLabel "Invertible 1x1 Convolution" . :InvertibleNxNConvolution a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "Invertible NxN Convolution" . :Involution a skos:Concept ; dcterms:source ; skos:definition """**Involution** is an atomic operation for deep neural networks that inverts the design principles of convolution. Involution kernels are distinct in the spatial extent but shared across channels. If involution kernels are parameterized as fixed-sized matrices like convolution kernels and updated using the back-propagation algorithm, the learned involution kernels are impeded from transferring between input images with variable resolutions. 
\r \r The authors argue for two benefits of involution over convolution: (i) involution can summarize the context in a wider spatial arrangement, thus overcoming the difficulty of modeling long-range interactions; (ii) involution can adaptively allocate the weights over different positions, so as to prioritize the most informative visual elements in the spatial domain.""" ; skos:prefLabel "Involution" . :IoU-BalancedSampling a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**IoU-Balanced Sampling** is a hard mining method for object detection. Suppose we need to sample $N$ negative samples from $M$ corresponding candidates. The selected probability for each sample under random sampling is:\r \r $$ p = \\frac{N}{M} $$\r \r To raise the selected probability of hard negatives, we evenly split the sampling interval into $K$ bins according to IoU. $N$ demanded negative samples are equally distributed to each bin. Then we select samples from them uniformly. Therefore, we get the selected probability under IoU-balanced sampling:\r \r $$ p\\_{k} = \\frac{N}{K}*\\frac{1}{M\\_{k}}\\text{ , } k\\in\\left[0, K\\right)$$\r \r where $M\\_{k}$ is the number of sampling candidates in the corresponding interval denoted by $k$. $K$ is set to 3 by default in our experiments.\r \r The sampled histogram with IoU-balanced sampling is shown by green color in the Figure to the right. The IoU-balanced sampling can guide the distribution of training samples close to the one of hard negatives.""" ; skos:prefLabel "IoU-Balanced Sampling" . :IoU-Net a skos:Concept ; dcterms:source ; skos:definition "**IoU-Net** is an object detection architecture that introduces localization confidence. IoU-Net learns to predict the IoU between each detected bounding box and the matched ground-truth. The network acquires this confidence of localization, which improves the NMS procedure by preserving accurately localized bounding boxes. 
Furthermore, an optimization-based bounding box refinement method is proposed, where the predicted IoU is formulated as the objective." ; skos:prefLabel "IoU-Net" . :IoU-guidedNMS a skos:Concept ; dcterms:source ; skos:definition "**IoU-guided NMS** is a type of non-maximum suppression that helps to eliminate suppression failures caused by misleading classification confidences. This is achieved by using the predicted IoU instead of the classification confidence as the ranking keyword for bounding boxes." ; skos:prefLabel "IoU-guided NMS" . :IterInpaint a skos:Concept ; dcterms:source ; skos:altLabel "Iterative Inpainting" ; skos:definition "" ; skos:prefLabel "IterInpaint" . :JLA a skos:Concept ; dcterms:source ; skos:altLabel "Joint Learning Architecture" ; skos:definition "**JLA**, or **Joint Learning Architecture**, is an approach for multiple object tracking and trajectory forecasting. It jointly trains a tracking and trajectory forecasting model, and the trajectory forecasts are used for short-term motion estimates in lieu of linear motion prediction methods such as the Kalman filter. It uses a [FairMOT](https://paperswithcode.com/method/fairmot) model as the base model because this architecture already performs detection and tracking. A forecasting branch is added to the network and is trained end-to-end. [FairMOT](https://paperswithcode.com/method/fairmot) consists of a backbone network utilizing [Deep Layer Aggregation](https://www.paperswithcode.com/method/dla), an object detection head, and a reID head." ; skos:prefLabel "JLA" . :Jigsaw a skos:Concept ; dcterms:source ; skos:definition "**Jigsaw** is a self-supervision approach that relies on jigsaw-like puzzles as the pretext task in order to learn image representations." ; skos:prefLabel "Jigsaw" . :Jukebox a skos:Concept ; dcterms:source ; skos:definition """**Jukebox** is a model that generates music with singing in the raw audio domain. 
It tackles the long context of raw audio using a multi-scale [VQ-VAE](https://paperswithcode.com/method/vq-vae) to compress it to discrete codes, and modeling those using [autoregressive Transformers](https://paperswithcode.com/methods/category/autoregressive-transformers). It can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable.\r \r Three separate VQ-VAE models are trained with different temporal resolutions. At each level, the input audio is segmented and encoded into latent vectors $\\mathbf{h}\\_{t}$, which are then quantized to the closest codebook vectors $\\mathbf{e}\\_{z\\_{t}}$. The code $z\\_{t}$ is a discrete representation of the audio that we later train our prior on. The decoder takes the sequence of codebook vectors and reconstructs the audio. The top level learns the highest degree of abstraction, since it is encoding longer audio per token while keeping the codebook size the same. Audio can be reconstructed using the codes at any one of the abstraction levels, where the least abstract bottom-level codes result in the highest-quality audio.""" ; skos:prefLabel "Jukebox" . :K-MaximalWordAllocation a skos:Concept ; dcterms:source ; skos:definition "" ; skos:prefLabel "K-Maximal Word Allocation" . :K-Net a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**K-Net** is a framework for unified semantic and instance segmentation that segments both instances and semantic categories consistently by a group of learnable kernels, where each kernel is responsible for generating a mask for either a potential instance or a stuff class. It begins with a set of kernels that are randomly initialized, and learns the kernels in accordance to the segmentation targets at hand, namely, semantic kernels for semantic categories and instance kernels for instance identities. A simple combination of semantic kernels and instance kernels allows panoptic segmentation naturally. 
In the forward pass, the kernels perform [convolution](https://paperswithcode.com/method/convolution) on the image features to obtain the corresponding segmentation predictions.\r \r K-Net is formulated so that it dynamically updates the kernels to make them conditional to their activations on the image. Such a content-aware mechanism is crucial to ensure that each kernel, especially an instance kernel, responds accurately to varying objects in an image. Through applying this adaptive kernel update strategy iteratively, K-Net significantly improves the discriminative ability of the kernels and boosts the final segmentation performance. It is noteworthy that this strategy universally applies to kernels for all the segmentation tasks.\r \r It also utilises a bipartite matching strategy to assign learning targets for each kernel. This training approach is advantageous over conventional training strategies as it builds a one-to-one mapping between kernels and instances in an image. It thus resolves the problem of dealing with a varying number of instances in an image. In addition, it is purely mask-driven without involving boxes. Hence, K-Net is naturally [NMS](https://paperswithcode.com/method/non-maximum-suppression)-free and box-free, which is appealing to real-time applications.""" ; skos:prefLabel "K-Net" . :K3M a skos:Concept ; dcterms:source ; skos:definition "**K3M** is a multi-modal pretraining method for e-commerce product data that introduces a knowledge modality to correct the noise and supplement the missing information of the image and text modalities. The modal-encoding layer extracts the features of each modality. The modal-interaction layer is capable of effectively modeling the interaction of multiple modalities, where an initial-interactive feature fusion model is designed to maintain the independence of the image and text modalities, and a structure aggregation module is designed to fuse the information of the image, text, and knowledge modalities. 
K3M is pre-trained with three pretraining tasks, including masked object modeling (MOM), masked language modeling (MLM), and link prediction modeling (LPM)." ; skos:prefLabel "K3M" . :KAF a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Kernel Activation Function" ; skos:definition """A **Kernel Activation Function** is a non-parametric activation function defined as a one-dimensional kernel approximator:\r \r $$ f(s) = \\sum_{i=1}^D \\alpha_i \\kappa( s, d_i) $$\r \r where:\r \r 1. The dictionary of the kernel elements $d_1, \\ldots, d_D$ is fixed by sampling the $x$-axis with a uniform step around 0.\r 2. The user selects the kernel function (e.g., Gaussian, [ReLU](https://paperswithcode.com/method/relu), [Softplus](https://paperswithcode.com/method/softplus)) and the number of kernel elements $D$ as a hyper-parameter. A larger dictionary leads to more expressive activation functions and a larger number of trainable parameters.\r 3. The linear coefficients are adapted independently at every neuron via standard back-propagation.\r \r In addition, the linear coefficients can be initialized using kernel ridge regression to behave similarly to a known function at the beginning of the optimization process.""" ; skos:prefLabel "KAF" . :KE-MLM a skos:Concept ; dcterms:source ; skos:altLabel "Knowledge Enhanced Masked Language Model" ; skos:definition "" ; skos:prefLabel "KE-MLM" . :KGRefiner a skos:Concept ; dcterms:source ; skos:altLabel "Knowledge Graph Refiner" ; skos:definition "" ; skos:prefLabel "KGRefiner" . :KIP a skos:Concept ; dcterms:source ; skos:altLabel "Kernel Inducing Points" ; skos:definition "**Kernel Inducing Points**, or **KIP**, is a meta-learning algorithm for learning datasets that can mitigate the challenges which occur for naturally occurring datasets without a significant sacrifice in performance. KIP uses kernel-ridge regression to learn $\\epsilon$-approximate datasets. 
It can be regarded as an adaptation of the inducing point method for Gaussian processes to the case of Kernel Ridge Regression." ; skos:prefLabel "KIP" . :KNNandIOUbasedverification a skos:Concept ; dcterms:source ; skos:definition "**KNN and IoU-based Verification** is used to verify detections and choose between multiple detections of the same underlying object. It was originally used within the context of blood cell counting in medical images. To avoid the double-counting problem, the KNN algorithm is applied to each platelet to determine its closest platelet, and then the intersection over union (IoU) between the two platelets is used to calculate their extent of overlap. The authors allow 10% overlap between a platelet and its closest platelet based on empirical observations. If the overlap is larger than that, they ignore that cell as a double count to get rid of spurious counting." ; skos:prefLabel "KNN and IOU based verification" . :KOVA a skos:Concept ; dcterms:source ; skos:altLabel "Kalman Optimization for Value Approximation" ; skos:definition "**Kalman Optimization for Value Approximation**, or **KOVA**, is a general framework for addressing uncertainties while approximating value-based functions in deep RL domains. KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties. It is feasible when using non-linear approximation functions such as DNNs and can estimate the value in both on-policy and off-policy settings. It can be incorporated as a policy evaluation component in policy optimization algorithms." ; skos:prefLabel "KOVA" . :KP a skos:Concept ; dcterms:source ; skos:altLabel "Kollen-Pollack Learning" ; skos:definition "" ; skos:prefLabel "KP" . :KPE a skos:Concept ; dcterms:source ; skos:altLabel "Keypoint Pose Encoding" ; skos:definition "" ; skos:prefLabel "KPE" . 
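The KNN and IoU-based verification above can be sketched in a few lines of Python. The helper names (`iou`, `filter_double_counts`) and the greedy keep-first policy for overlapping pairs are illustrative assumptions, not the authors' exact implementation; only the 10% IoU threshold comes from the description.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def filter_double_counts(boxes, max_overlap=0.10):
    """Keep a detection only if it overlaps every already-kept detection
    (including its nearest neighbour) by at most the 10% IoU threshold."""
    kept = []
    for box in boxes:
        if all(iou(box, other) <= max_overlap for other in kept):
            kept.append(box)
    return kept
```

For example, `filter_double_counts([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)])` keeps the first and third boxes and discards the second as a double count, since its IoU with the first is about 0.68, far above the threshold.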
:KaimingInitialization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Kaiming Initialization**, or **He Initialization**, is an initialization method for neural networks that takes into account the non-linearity of activation functions, such as [ReLU](https://paperswithcode.com/method/relu) activations.\r \r A proper initialization method should avoid reducing or magnifying the magnitudes of input signals exponentially. Using a derivation, the authors work out that the condition to stop this happening is:\r \r $$\\frac{1}{2}n\\_{l}\\text{Var}\\left[w\\_{l}\\right] = 1 $$\r \r This implies an initialization scheme of:\r \r $$ w\\_{l} \\sim \\mathcal{N}\\left(0, 2/n\\_{l}\\right)$$\r \r That is, a zero-centered Gaussian with standard deviation of $\\sqrt{2/{n}\\_{l}}$ (variance shown in the equation above). Biases are initialized at $0$.""" ; skos:prefLabel "Kaiming Initialization" . :Kaleido-BERT a skos:Concept ; rdfs:seeAlso ; skos:definition "**Kaleido-BERT** (CVPR 2021) is pioneering work that focuses on pre-trained models (PTMs) in the e-commerce field. It achieves SOTA performance compared with many models published in the general domain." ; skos:prefLabel "Kaleido-BERT" . :KnowPrompt a skos:Concept ; dcterms:source ; skos:definition "**KnowPrompt** is a prompt-tuning approach for relational understanding. It injects entity and relation knowledge into prompt construction with learnable virtual template words as well as answer words, and synergistically optimizes their representation with knowledge constraints. To be specific, a TYPED MARKER is utilized around entities, initialized with aggregated entity-type embeddings as learnable virtual template words to inject entity type knowledge. The average embeddings of each token in relation labels are leveraged as virtual answer words to inject relation knowledge. 
Since there exist implicit structural constraints among entities and relations, and virtual words should be consistent with the surrounding contexts, synergistic optimization is introduced to obtain optimized virtual templates and answer words. Concretely, a context-aware prompt calibration method is used with implicit structural constraints to inject structural knowledge implications among relational triples and associate prompt embeddings with each other." ; skos:prefLabel "KnowPrompt" . :KnowledgeDistillation a skos:Concept ; dcterms:source ; skos:definition """A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\r Source: [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531)""" ; skos:prefLabel "Knowledge Distillation" . :KungFu a skos:Concept ; skos:definition """**KungFu** is a distributed ML library for TensorFlow that is designed to enable adaptive training. 
KungFu allows users to express high-level Adaptation Policies (APs) that describe how to change hyper- and system parameters during training. APs take real-time monitored metrics (e.g. signal-to-noise ratios and noise scale) as input and trigger control actions (e.g. cluster rescaling or synchronisation strategy updates). For execution, APs are translated into monitoring and control operators, which are embedded in the dataflow graph. APs exploit an efficient asynchronous collective communication layer, which ensures concurrency and consistency\r of monitoring and adaptation operations.""" ; skos:prefLabel "KungFu" . :L-GCN a skos:Concept ; dcterms:source ; skos:altLabel "Learnable adjacency matrix GCN" ; skos:definition "Graph structure is learnable" ; skos:prefLabel "L-GCN" . :L1Regularization a skos:Concept ; skos:definition """**$L_{1}$ Regularization** is a regularization technique applied to the weights of a neural network. We minimize a loss function compromising both the primary loss function and a penalty on the $L\\_{1}$ Norm of the weights:\r \r $$L\\_{new}\\left(w\\right) = L\\_{original}\\left(w\\right) + \\lambda{||w||}\\_{1}$$\r \r where $\\lambda$ is a value determining the strength of the penalty. In contrast to [weight decay](https://paperswithcode.com/method/weight-decay), $L_{1}$ regularization promotes sparsity; i.e. some parameters have an optimal value of zero.\r \r Image Source: [Wikipedia](https://en.wikipedia.org/wiki/Regularization_(mathematics)#/media/File:Sparsityl1.png)""" ; skos:prefLabel "L1 Regularization" . :L2M a skos:Concept ; dcterms:source ; skos:altLabel "Learning to Match" ; skos:definition "**L2M** is a learning algorithm that can work for most cross-domain distribution matching tasks. It automatically learns the cross-domain distribution matching without relying on hand-crafted priors on the matching loss. 
Instead, L2M reduces the inductive bias by using a meta-network to learn the distribution matching loss in a data-driven way." ; skos:prefLabel "L2M" . :LAMA a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Low-Rank Factorization-based Multi-Head Attention" ; skos:definition "**Low-Rank Factorization-based Multi-head Attention Mechanism**, or **LAMA**, is a type of attention module that uses low-rank factorization to reduce computational complexity. It uses low-rank bilinear pooling to construct a structured sentence representation that attends to multiple aspects of a sentence." ; skos:prefLabel "LAMA" . :LAMB a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**LAMB** is a layerwise adaptive large batch optimization technique. It provides a strategy for adapting the learning rate in large batch settings. LAMB uses [Adam](https://paperswithcode.com/method/adam) as the base algorithm and then forms an update as:\r \r $$r\\_{t} = \\frac{m\\_{t}}{\\sqrt{v\\_{t}} + \\epsilon}$$\r $$x\\_{t+1}^{\\left(i\\right)} = x\\_{t}^{\\left(i\\right)} - \\eta\\_{t}\\frac{\\phi\\left(|| x\\_{t}^{\\left(i\\right)} ||\\right)}{|| m\\_{t}^{\\left(i\\right)} || }\\left(r\\_{t}^{\\left(i\\right)}+\\lambda{x\\_{t}^{\\left(i\\right)}}\\right) $$\r \r Unlike [LARS](https://paperswithcode.com/method/lars), the adaptivity of LAMB is two-fold: (i) per-dimension normalization with respect to the square root of the second moment used in Adam and (ii) layerwise normalization obtained due to layerwise adaptivity.""" ; skos:prefLabel "LAMB" . :LAPGAN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """A **LAPGAN**, or **Laplacian Generative Adversarial Network**, is a type of generative adversarial network that has a [Laplacian pyramid](https://paperswithcode.com/method/laplacian-pyramid) representation. 
In the sampling procedure following training, we have a set of generative convnet models {$G\\_{0}, \\dots , G\\_{K}$}, each of which captures the distribution of coefficients $h\\_{k}$ for natural images at a different level of the Laplacian pyramid. Sampling an image is akin to a reconstruction procedure, except that the generative\r models are used to produce the $h\\_{k}$’s:\r \r $$ \\tilde{I}\\_{k} = u\\left(\\tilde{I}\\_{k+1}\\right) + \\tilde{h}\\_{k} = u\\left(\\tilde{I}\\_{k+1}\\right) + G\\_{k}\\left(z\\_{k}, u\\left(\\tilde{I}\\_{k+1}\\right)\\right)$$\r \r The recurrence starts by setting $\\tilde{I}\\_{K+1} = 0$ and using the model at the final level $G\\_{K}$ to generate a residual image $\\tilde{I}\\_{K}$ using noise vector $z\\_{K}$: $\\tilde{I}\\_{K} = G\\_{K}\\left(z\\_{K}\\right)$. Models at all levels except the final are conditional generative models that take an upsampled version of the current image $\\tilde{I}\\_{k+1}$ as a conditioning variable, in addition to the noise vector $z\\_{k}$.\r \r The generative models {$G\\_{0}, \\dots, G\\_{K}$} are trained using the CGAN approach at each level of the pyramid. Specifically, we construct a Laplacian pyramid from each training image $I$. At each level we make a stochastic choice (with equal probability) to either (i) construct the coefficients $h\\_{k}$ using the standard Laplacian pyramid coefficient generation procedure or (ii) generate them using $G\\_{k}$:\r \r $$ \\tilde{h}\\_{k} = G\\_{k}\\left(z\\_{k}, u\\left(I\\_{k+1}\\right)\\right) $$\r \r Here $G\\_{k}$ is a convnet which uses a coarse scale version of the image $l\\_{k} = u\\left(I\\_{k+1}\\right)$ as an input, as well as noise vector $z\\_{k}$. 
$D\\_{k}$ takes as input $h\\_{k}$ or $\\tilde{h}\\_{k}$, along with the low-pass image $l\\_{k}$ (which is explicitly added to $h\\_{k}$ or $\\tilde{h}\\_{k}$ before the first [convolution](https://paperswithcode.com/method/convolution) layer), and predicts if the image was real or\r generated. At the final scale of the pyramid, the low frequency residual is sufficiently small that it\r can be directly modeled with a standard [GAN](https://paperswithcode.com/method/gan): $\\tilde{h}\\_{K} = G\\_{K}\\left(z\\_{K}\\right)$ and $D\\_{K}$ only has $h\\_{K}$ or $\\tilde{h}\\_{K}$ as input.\r \r Breaking the generation into successive refinements is the key idea. We give up any “global” notion of fidelity; an attempt is never made to train a network to discriminate between the output of a cascade and a real image and instead the focus is on making each step plausible.""" ; skos:prefLabel "LAPGAN" . :LARS a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Layer-wise Adaptive Rate Scaling**, or **LARS**, is a large batch optimization technique. There are two notable differences between LARS and other adaptive algorithms such as [Adam](https://paperswithcode.com/method/adam) or [RMSProp](https://paperswithcode.com/method/rmsprop): first, LARS uses a separate learning rate for each layer and not for each weight. And second, the magnitude of the update is controlled with respect to the weight norm for better control of training speed.\r \r $$m\\_{t} = \\beta\\_{1}m\\_{t-1} + \\left(1-\\beta\\_{1}\\right)\\left(g\\_{t} + \\lambda{x\\_{t}}\\right)$$\r $$x\\_{t+1}^{\\left(i\\right)} = x\\_{t}^{\\left(i\\right)} - \\eta\\_{t}\\frac{\\phi\\left(|| x\\_{t}^{\\left(i\\right)} ||\\right)}{|| m\\_{t}^{\\left(i\\right)} || }m\\_{t}^{\\left(i\\right)} $$""" ; skos:prefLabel "LARS" . :LCC a skos:Concept ; dcterms:source ; skos:altLabel "Lipschitz Constant Constraint" ; skos:definition "" ; skos:prefLabel "LCC" . 
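The two LARS equations above can be exercised numerically for a single layer. This NumPy sketch assumes the identity function for $\phi$ and a small epsilon in the denominator for numerical safety; both are illustrative choices, not part of the definition.

```python
import numpy as np

def lars_step(x, g, m, lr=0.01, beta1=0.9, weight_decay=1e-4, phi=lambda n: n):
    """One LARS step for one layer's weights x with gradient g and momentum m:
    momentum on the weight-decayed gradient, then a layer-wise trust ratio
    phi(||x||) / ||m|| scaling the update."""
    m = beta1 * m + (1 - beta1) * (g + weight_decay * x)
    trust = phi(np.linalg.norm(x)) / (np.linalg.norm(m) + 1e-12)
    x = x - lr * trust * m
    return x, m
```

Note that with a uniform gradient over a layer of ones, the trust ratio exactly cancels the raw momentum magnitude, so every weight moves by the learning rate; this is the layer-wise normalization the definition describes.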
:LFME a skos:Concept ; dcterms:source ; skos:altLabel "Learning From Multiple Experts" ; skos:definition "**Learning From Multiple Experts** is a self-paced knowledge distillation framework that aggregates the knowledge from multiple 'Experts' to learn a unified student model. Specifically, the proposed framework involves two levels of adaptive learning schedules: Self-paced Expert Selection and Curriculum Instance Selection, so that the knowledge is adaptively transferred to the 'Student'. The self-paced expert selection automatically controls the impact of knowledge distillation from each expert, so that the learned student model will gradually acquire the knowledge from the experts, and finally exceed the expert. The curriculum instance selection, on the other hand, designs a curriculum for the unified model where the training samples are organized from easy to hard, so that the unified student model will receive a less challenging learning schedule, and gradually learns from easy to hard samples." ; skos:prefLabel "LFME" . :LFPNet\(TTA\) a skos:Concept ; dcterms:source ; skos:altLabel "LFPNet with test time augmentation" ; skos:definition "" ; skos:prefLabel "LFPNet (TTA)" . :LGCL a skos:Concept ; dcterms:source ; skos:altLabel "Learnable graph convolutional layer" ; skos:definition """Learnable graph convolutional layer (LGCL) automatically selects a fixed number of neighboring nodes for each feature based on value ranking in order to transform graph data into grid-like structures in 1-D format, thereby enabling the use of regular convolutional operations on generic graphs.\r \r Description and image from: [Large-Scale Learnable Graph Convolutional Networks](https://arxiv.org/pdf/1808.03965.pdf)""" ; skos:prefLabel "LGCL" . 
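LGCL's value-ranking step above can be illustrated directly: for one node, each feature dimension independently keeps its k largest neighbour values, producing a (k+1) x d grid on which a regular 1-D convolution can operate. The function name and the zero-padding convention for nodes with fewer than k neighbours are assumptions of this sketch.

```python
import numpy as np

def lgcl_select(node_feat, neighbor_feats, k):
    """Build the (k+1, d) grid for one node: the node's own features followed
    by, for each feature dimension independently, the k largest neighbour
    values (zero-padded when the node has fewer than k neighbours)."""
    d = node_feat.shape[0]
    n = len(neighbor_feats)
    padded = np.zeros((max(n, k), d))
    if n:
        padded[:n] = neighbor_feats
    topk = -np.sort(-padded, axis=0)[:k]  # per-dimension descending sort
    return np.vstack([node_feat[None, :], topk])
```

With neighbours `[[3, 0], [0, 5], [2, 1]]` and k = 2, dimension 0 keeps (3, 2) while dimension 1 keeps (5, 1), so the values in a grid row need not come from the same neighbour; that is what makes the result grid-like rather than graph-like.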
:LIME a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Local Interpretable Model-Agnostic Explanations" ; skos:definition """**LIME**, or **Local Interpretable Model-Agnostic Explanations**, is an algorithm that can explain the predictions of any classifier or regressor in a faithful way, by approximating it locally with an interpretable model. It modifies a single data sample by tweaking the feature values and observes the resulting impact on the output. It performs the role of an "explainer" to explain predictions from each data sample. The output of LIME is a set of explanations representing the contribution of each feature to a prediction for a single sample, which is a form of local interpretability.\r \r Interpretable models in LIME can be, for instance, [linear regression](https://paperswithcode.com/method/linear-regression) or decision trees, which are trained on small perturbations (e.g. adding noise, removing words, hiding parts of the image) of the original model to provide a good local approximation.""" ; skos:prefLabel "LIME" . :LIMix a skos:Concept ; dcterms:source ; skos:altLabel "Lifelong Infinite Mixture" ; skos:definition "**LIMix**, or **Lifelong Infinite Mixture**, is a lifelong learning model which grows a mixture of models to adapt to an increasing number of tasks. LIMix can automatically expand its network architectures or choose an appropriate component to adapt its parameters for learning a new task, while preserving its previously learnt information. Knowledge is incorporated by means of Dirichlet processes by using a gating mechanism which computes the dependence between the knowledge learnt previously and stored in each component, and a new set of data. Besides, a Student model is trained which can accumulate cross-domain representations over time and make quick inferences." ; skos:prefLabel "LIMix" . 
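The LIME procedure described above reduces, for tabular inputs, to three steps: perturb the sample, weight perturbations by proximity, and fit a weighted linear surrogate. The Gaussian perturbation and RBF proximity kernel below are common but assumed choices, and the function is a minimal sketch rather than the reference implementation.

```python
import numpy as np

def lime_explain(predict, x, num_samples=500, scale=0.5, kernel_width=1.0, seed=0):
    """Fit a local linear surrogate around x; return per-feature weights."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(0.0, scale, size=(num_samples, x.size))   # perturb x
    y = predict(Z)
    # proximity weights: perturbations closer to x count more
    w = np.exp(-np.sum((Z - x) ** 2, axis=1) / kernel_width ** 2)
    A = np.hstack([Z, np.ones((num_samples, 1))])                # intercept column
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[:-1]  # drop the intercept; these are the local contributions
```

Applied to a model that is already linear, the surrogate recovers the model's own coefficients, which is a useful sanity check before explaining a genuinely non-linear classifier.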
:LINE a skos:Concept ; dcterms:source ; skos:altLabel "Large-scale Information Network Embedding" ; skos:definition """LINE is a novel network embedding method which is suitable for arbitrary types of information networks: undirected, directed, and/or weighted. The method optimizes a carefully designed objective function that preserves both the local and global network structures.\r \r Source: [Tang et al.](https://arxiv.org/pdf/1503.03578v1.pdf)\r \r Image source: [Tang et al.](https://arxiv.org/pdf/1503.03578v1.pdf)""" ; skos:prefLabel "LINE" . :LLaMA a skos:Concept ; dcterms:source ; skos:definition """**LLaMA** is a collection of foundation language models ranging from 7B to 65B parameters. It is based on the transformer architecture, with various improvements that were subsequently proposed. The main differences from the original architecture are listed below.\r \r - RMSNorm normalizing function is used to improve training stability, by normalizing the input of each transformer sub-layer instead of normalizing the output.\r - The ReLU non-linearity is replaced by the SwiGLU activation function to improve performance.\r - Absolute positional embeddings are removed and instead rotary positional embeddings (RoPE) are added at each layer of the network.""" ; skos:prefLabel "LLaMA" . :LMOT a skos:Concept ; rdfs:seeAlso ; skos:altLabel "LMOT: Efficient Light-Weight Detection and Tracking in Crowds" ; skos:definition """Rana Mostafa, Hoda Baraka and AbdelMoniem Bayoumi\r \r **LMOT**, i.e., Light-weight Multi-Object Tracker, performs joint pedestrian detection and tracking. LMOT introduces a simplified DLA-34 encoder network to extract detection features for the current image that are computationally efficient. Furthermore, we generate efficient tracking features using a linear transformer for the prior image frame and its corresponding detection heatmap. 
After that, LMOT fuses both detection and tracking feature maps in a multi-layer scheme and performs a two-stage online data association relying on the Kalman filter to generate tracklets. We evaluated our model on the challenging real-world MOT16/17/20 datasets, showing LMOT significantly outperforms the state-of-the-art trackers concerning runtime while maintaining high robustness. LMOT is approximately ten times faster than state-of-the-art trackers while being only 3.8% behind in performance accuracy on average, leading to a much computationally lighter model.\r \r Code: https://github.com/RanaMostafaAbdElMohsen/LMOT\r Paper: https://doi.org/10.1109/ACCESS.2022.3197157""" ; skos:prefLabel "LMOT" . :LMU a skos:Concept ; dcterms:source ; skos:altLabel "Legendre Memory Unit" ; skos:definition """The Legendre Memory Unit (LMU) is mathematically derived to orthogonalize\r its continuous-time history – doing so by solving $d$ coupled ordinary differential\r equations (ODEs), whose phase space linearly maps onto sliding windows of\r time via the Legendre polynomials up to degree $d-1$. It is optimal for compressing temporal information.\r \r See the paper for the governing equations.\r \r Official github repo: [https://github.com/abr/lmu](https://github.com/abr/lmu)""" ; skos:prefLabel "LMU" . :LOGAN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**LOGAN** is a generative adversarial network that uses a latent optimization approach using [natural gradient descent](https://paperswithcode.com/method/natural-gradient-descent) (NGD). 
For the Fisher matrix in NGD, the authors use the empirical Fisher $F'$ with Tikhonov damping:\r \r $$ F' = g \\cdot g^{T} + \\beta{I} $$\r \r They also use Euclidean norm regularization for the optimization step.\r \r For LOGAN's base architecture, [BigGAN-deep](https://paperswithcode.com/method/biggan-deep) is used with a few modifications: (1) increasing the size of the latent source from $186$ to $256$, to compensate for the randomness of the source lost\r when optimising $z$; (2) using the uniform distribution $U\\left(−1, 1\\right)$ instead of the standard normal distribution $N\\left(0, 1\\right)$ for $p\\left(z\\right)$, to be consistent with the clipping operation; (3) using leaky [ReLU](https://paperswithcode.com/method/relu) (with a slope of 0.2 for the negative part) instead of ReLU as the non-linearity, for smoother gradient flow for $\\frac{\\delta{f}\\left(z\\right)}{\\delta{z}}$.""" ; skos:prefLabel "LOGAN" . :LPM a skos:Concept ; dcterms:source ; skos:altLabel "Local Prior Matching" ; skos:definition "**Local Prior Matching** is a semi-supervised objective for speech recognition that distills knowledge from a strong prior (e.g. a language model) to provide learning signal to a discriminative model trained on unlabeled speech. The LPM objective minimizes the cross entropy between the local prior and the model distribution, and is minimized when $q\\_{y\\mid{x}} = \\bar{p}\\_{y\\mid{x}}$. Intuitively, LPM encourages the ASR model to assign posterior probabilities proportional to the linguistic probabilities of the proposed hypotheses." ; skos:prefLabel "LPM" . :LR-Net a skos:Concept ; rdfs:seeAlso ; skos:definition "An **LR-Net** is a type of non-convolutional neural network that utilises local relation layers instead of convolutions for image feature extraction. Otherwise, the architecture follows the same design as a [ResNet](https://paperswithcode.com/method/resnet)." ; skos:prefLabel "LR-Net" . 
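With the rank-one empirical Fisher $F' = g g^{T} + \beta I$ above, the NGD step on the latent has a closed form via the Sherman-Morrison identity, so no matrix inverse is needed: $(g g^{T} + \beta I)^{-1} g = g / (g^{T} g + \beta)$. The sketch below uses that identity and clips $z$ back into the support of the uniform source; the step size and damping values are illustrative, not LOGAN's published hyper-parameters.

```python
import numpy as np

def ngd_latent_step(z, g, alpha=0.9, beta=5.0):
    """One latent-optimization step: dz = alpha * (g g^T + beta I)^{-1} g.
    Sherman-Morrison collapses the inverse to a scalar: g / (g.g + beta)."""
    dz = alpha * g / (g @ g + beta)
    # keep z inside the support of the uniform source U(-1, 1)
    return np.clip(z + dz, -1.0, 1.0)
```

The damping term beta also acts as a trust region: for large gradients the step size shrinks like 1/||g||, while for small gradients it behaves like plain gradient ascent scaled by alpha/beta.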
:LRNet a skos:Concept ; dcterms:source ; skos:altLabel "Local Relation Network" ; skos:definition "The **Local Relation Network** (**LR-Net**) is a network built with local relation layers which represent a feature image extractor. This feature extractor adaptively determines aggregation weights based on the compositional relationship of local pixel pairs." ; skos:prefLabel "LRNet" . :LSGAN a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**LSGAN**, or **Least Squares GAN**, is a type of generative adversarial network that adopts the least squares loss function for the discriminator. Minimizing the objective function of LSGAN yields minimizing the Pearson $\\chi^{2}$ divergence. The objective function can be defined as:\r \r $$ \\min\\_{D}V\\_{LSGAN}\\left(D\\right) = \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{x} \\sim p\\_{data}\\left(\\mathbf{x}\\right)}\\left[\\left(D\\left(\\mathbf{x}\\right) - b\\right)^{2}\\right] + \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{z}\\sim p\\_{\\mathbf{z}}\\left(\\mathbf{z}\\right)}\\left[\\left(D\\left(G\\left(\\mathbf{z}\\right)\\right) - a\\right)^{2}\\right] $$\r \r $$ \\min\\_{G}V\\_{LSGAN}\\left(G\\right) = \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{z} \\sim p\\_{\\mathbf{z}}\\left(\\mathbf{z}\\right)}\\left[\\left(D\\left(G\\left(\\mathbf{z}\\right)\\right) - c\\right)^{2}\\right] $$\r \r where $a$ and $b$ are the labels for fake data and real data and $c$ denotes the value that $G$ wants $D$ to believe for fake data.""" ; skos:prefLabel "LSGAN" . :LSHAttention a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Locality Sensitive Hashing Attention" ; skos:definition "**LSH Attention**, or **Locality Sensitive Hashing Attention** is a replacement for [dot-product attention](https://paperswithcode.com/method/scaled) with one that uses locality-sensitive hashing, changing its complexity from O($L^2$) to O($L\\log L$), where $L$ is the length of the sequence. 
LSH refers to a family of functions (known as LSH families) to hash data points into buckets so that data points near each other are located in the same buckets with high probability, while data points far from each other are likely to be in different buckets. It was proposed as part of the [Reformer](https://paperswithcode.com/method/reformer) architecture." ; skos:prefLabel "LSH Attention" . :LSUVInitialization a skos:Concept ; dcterms:source ; skos:altLabel "Layer-Sequential Unit-Variance Initialization" ; skos:definition """**Layer-Sequential Unit-Variance Initialization** (**LSUV**) is a simple method for weight initialization for deep net learning. The initialization strategy involves the following two steps:\r \r 1) First, pre-initialize weights of each [convolution](https://paperswithcode.com/method/convolution) or inner-product layer with\r orthonormal matrices. \r \r 2) Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one.""" ; skos:prefLabel "LSUV Initialization" . :LTLS a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Log-time and Log-space Extreme Classification" ; skos:definition "**LTLS** is a technique for multiclass and multilabel prediction that can perform training and inference in logarithmic time and space. LTLS embeds large classification problems into simple structured prediction problems and relies on efficient dynamic programming algorithms for inference. It tackles extreme multi-class and multi-label classification problems where the size $C$ of the output space is extremely large." ; skos:prefLabel "LTLS" . :LV-ViT a skos:Concept ; dcterms:source ; skos:definition """**LV-ViT** is a type of [vision transformer](https://paperswithcode.com/method/vision-transformer) that uses token labelling as a training objective. 
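As a rough illustrative sketch (not the paper's implementation), the objective can be written as a dense per-patch cross-entropy mixed with the usual class-token loss; the shapes and the `beta` mixing weight here are assumptions:

```python
import numpy as np

def token_labeling_loss(token_logits, cls_logits, token_targets, cls_target, beta=0.5):
    # token_logits: (N, C) one prediction per patch token;
    # token_targets: (N, C) location-specific soft labels from a machine annotator.
    # The dense token-level cross-entropy is averaged over all patch tokens
    # and added to the usual class-token loss (beta is an assumed weight).
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    token_ce = -(token_targets * log_softmax(token_logits)).sum(-1).mean()
    cls_ce = -log_softmax(cls_logits[None])[0, cls_target]
    return cls_ce + beta * token_ce
```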
Different from the standard training\r objective of ViTs that computes the classification loss on an additional trainable class token, token labelling takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token with an individual location-specific supervision generated by a machine annotator.""" ; skos:prefLabel "LV-ViT" . :LXMERT a skos:Concept ; dcterms:source ; skos:altLabel "Learning Cross-Modality Encoder Representations from Transformers" ; skos:definition "LXMERT is a model for learning vision-and-language cross-modality representations. It consists of a Transformer model with three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. The model takes two inputs: an image and its related sentence. The images are represented as a sequence of objects, whereas each sentence is represented as a sequence of words. By combining the self-attention and cross-attention layers, the model is able to generate language representations, image representations, and cross-modality representations from the input. The model is pre-trained with image-sentence pairs via five pre-training tasks: masked language modeling, masked object prediction (via both feature regression and label classification), cross-modality matching, and image question answering. These tasks help the model to learn both intra-modality and cross-modality relationships." ; skos:prefLabel "LXMERT" . :LabelQualityModel a skos:Concept ; dcterms:source ; skos:definition "**Label Quality Model** is an intermediate supervised task aimed at predicting the clean labels from noisy labels by leveraging rater features and a paired subset for supervision. The LQM technique assumes the existence of rater features and a subset of training data with both noisy and clean labels, which we call paired-subset. 
In real world scenarios, some level of label noise may be unavoidable. The LQM approach still works as long as the clean(er) label is less noisy than a label from a rater that is randomly selected from the pool, e.g., clean labels can be from either expert raters or aggregation of multiple raters. LQM is trained on the paired-subset using rater features and noisy label as input, and inferred on the entire training corpus. The output of LQM is used during model training as a more accurate alternative to the noisy labels." ; skos:prefLabel "Label Quality Model" . :LabelSmoothing a skos:Concept ; skos:definition """**Label Smoothing** is a regularization technique that introduces noise for the labels. This accounts for the fact that datasets may have mistakes in them, so maximizing the likelihood of $\\log{p}\\left(y\\mid{x}\\right)$ directly can be harmful. Assume for a small constant $\\epsilon$, the training set label $y$ is correct with probability $1-\\epsilon$ and incorrect otherwise. Label Smoothing regularizes a model based on a [softmax](https://paperswithcode.com/method/softmax) with $k$ output values by replacing the hard $0$ and $1$ classification targets with targets of $\\frac{\\epsilon}{k-1}$ and $1-\\epsilon$ respectively.\r \r Source: Deep Learning, Goodfellow et al\r \r Image Source: [When Does Label Smoothing Help?](https://arxiv.org/abs/1906.02629)""" ; skos:prefLabel "Label Smoothing" . :LambdaLayer a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**Lambda layers** are a building block for modeling long-range dependencies in data. They consist of long-range interactions between a query and a structured set of context elements at a reduced memory cost. Lambda layers transform each available context into a linear function, termed a lambda, which is then directly applied to the corresponding query. 
Whereas self-attention defines a similarity kernel between the query and the context elements, a lambda layer instead summarizes contextual information into a fixed-size linear function (i.e. a matrix), thus bypassing the need for memory-intensive attention maps." ; skos:prefLabel "Lambda Layer" . :LapEigen a skos:Concept ; dcterms:source ; skos:altLabel "Laplacian EigenMap" ; skos:definition "" ; skos:prefLabel "LapEigen" . :LapStyle a skos:Concept ; dcterms:source ; skos:altLabel "Laplacian Pyramid Network" ; skos:definition """**LapStyle**, or **Laplacian Pyramid Network**, is a feed-forward style transfer method. It uses a [Drafting Network](https://paperswithcode.com/method/drafting-network) to transfer global style patterns in low-resolution, and adopts higher resolution [Revision Networks](https://paperswithcode.com/method/revision-network) to revise local styles in a pyramid manner according to outputs of multi-level Laplacian filtering of the content image. Higher resolution details can be generated by stacking Revision Networks with multiple Laplacian pyramid levels. The final stylized image is obtained by aggregating outputs of all pyramid levels.\r \r Specifically, we first generate an image pyramid $\\left\\(\\bar{x}\\_{c}, r\\_{c}\\right\\)$ from the content image $x\\_{c}$ with the help of a Laplacian filter. A rough low-resolution stylized image is then generated by the Drafting Network, and the Revision Network generates a stylized detail image in high resolution. The final stylized image is generated by aggregating the output pyramid. $L$, $C$ and $A$ in an image represent the Laplacian, concatenation and aggregation operations respectively.""" ; skos:prefLabel "LapStyle" . 
:LaplacianPE a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:altLabel "Laplacian Positional Encodings" ; skos:definition """[Laplacian eigenvectors](https://paperswithcode.com/paper/laplacian-eigenmaps-and-spectral-techniques) represent a natural generalization of the [Transformer](https://paperswithcode.com/method/transformer) positional encodings (PE) for graphs, as the eigenvectors of a discrete line (NLP graph) are the cosine and sinusoidal functions. They help encode distance-aware information (i.e., nearby nodes have similar positional features and farther nodes have dissimilar positional features).\r \r Hence, Laplacian Positional Encoding (PE) is a general method to encode node positions in a graph. For each node, its Laplacian PE is given by the $k$ smallest non-trivial eigenvectors.""" ; skos:prefLabel "Laplacian PE" . :LaplacianPyramid a skos:Concept ; skos:definition """A **Laplacian Pyramid** is a linear invertible image representation consisting of a set of band-pass\r images spaced an octave apart, plus a low-frequency residual. Formally, let $d\\left(.\\right)$ be a downsampling operation that blurs and decimates a $j \\times j$ image $I$ so that $d\\left(I\\right)$ is a new image of size $\\frac{j}{2} \\times \\frac{j}{2}$. Also, let $u\\left(.\\right)$ be an upsampling operator which smooths and expands $I$ to be twice the size, so $u\\left(I\\right)$ is a new image of size $2j \\times 2j$. We first build a Gaussian pyramid $G\\left(I\\right) = \\left[I\\_{0}, I\\_{1}, \\dots, I\\_{K}\\right]$, where\r $I\\_{0} = I$ and $I\\_{k}$ is the result of $k$ repeated applications of $d\\left(.\\right)$ to $I$. 
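A minimal NumPy sketch of this construction follows. Note that the 2x2 average pooling and nearest-neighbour upsampling used here are crude stand-ins for the blur-and-decimate operator $d\\left(.\\right)$ and the smooth-and-expand operator $u\\left(.\\right)$, so this is illustrative rather than the classical Gaussian-kernel version:

```python
import numpy as np

def down(img):
    # Crude stand-in for "blur and decimate": 2x2 average pooling.
    return img.reshape(img.shape[0] // 2, 2, img.shape[1] // 2, 2).mean(axis=(1, 3))

def up(img):
    # Crude stand-in for "smooth and expand": nearest-neighbour upsampling.
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels):
    # Gaussian pyramid: I_0 = img, I_k = d applied k times.
    gauss = [img]
    for _ in range(levels):
        gauss.append(down(gauss[-1]))
    # h_k = G_k - u(G_{k+1}); the final level is the low-frequency residual.
    return [g - up(g_next) for g, g_next in zip(gauss[:-1], gauss[1:])] + [gauss[-1]]

def reconstruct(pyr):
    img = pyr[-1]                 # start from the residual, I_K = h_K
    for h in reversed(pyr[:-1]):  # backward recurrence I_k = u(I_{k+1}) + h_k
        img = up(img) + h
    return img
```

Because each `h` stores exactly the detail lost between adjacent levels, the backward recurrence recovers the input image exactly.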
$K$ is the number of levels in the pyramid, selected so that the final level has a minimal spatial extent ($\\leq 8 \\times 8$ pixels).\r \r The coefficients $h\\_{k}$ at each level $k$ of the Laplacian pyramid $L\\left(I\\right)$ are constructed by taking the difference between adjacent levels in the Gaussian pyramid, upsampling the smaller one with $u\\left(.\\right)$ so that the sizes are compatible:\r \r $$ h\\_{k} = \\mathcal{L}\\_{k}\\left(I\\right) = G\\_{k}\\left(I\\right) − u\\left(G\\_{k+1}\\left(I\\right)\\right) = I\\_{k} − u\\left(I\\_{k+1}\\right) $$\r \r Intuitively, each level captures the image structure present at a particular scale. The final level of the\r Laplacian pyramid $h\\_{K}$ is not a difference image, but a low-frequency residual equal to the final\r Gaussian pyramid level, i.e. $h\\_{K} = I\\_{K}$. Reconstruction from the Laplacian pyramid coefficients\r $\\left[h\\_{1}, \\dots, h\\_{K}\\right]$ is performed using the backward recurrence:\r \r $$ I\\_{k} = u\\left(I\\_{k+1}\\right) + h\\_{k} $$\r \r which is started with $I\\_{K} = h\\_{K}$, the reconstructed image being $I = I\\_{0}$. In other words, starting at the coarsest level, we repeatedly upsample and add the difference image $h\\_{k}$ at the next finer level until we return to the full-resolution image.\r Source: [LAPGAN](https://paperswithcode.com/method/lapgan)\r \r Image : [Design of FIR Filters for Fast Multiscale Directional Filter Banks](https://www.researchgate.net/figure/Relationship-between-Gaussian-and-Laplacian-Pyramids_fig2_275038450)""" ; skos:prefLabel "Laplacian Pyramid" . 
:Large-scalespectralclustering a skos:Concept ; dcterms:source ; skos:definition """# [Spectral Clustering](https://paperswithcode.com/method/spectral-clustering)\r \r Spectral clustering aims to partition the data points into $k$ clusters using the spectrum of the graph Laplacian.\r \r Given a dataset $X$ with $N$ data points, the spectral clustering algorithm first constructs a similarity matrix ${W}$, where ${w_{ij}}$ indicates the similarity between data points $x_i$ and $x_j$ via a similarity measure metric.\r \r Let $L=D-W$, where $L$ is called the graph Laplacian and ${D}$ is a diagonal matrix with $d_{ii} = \\sum_{j=1}^N w_{ij}$.\r The objective function of spectral clustering can be formulated based on the graph Laplacian as follows:\r \\begin{equation}\r \\label{eq:SC_obj}\r {\\min_{{U}} \\operatorname{tr}\\left({U}^{T} {L} {U}\\right)}, \\\\ {\\text { s.t. } \\quad {U}^{T} {{U}={I}}},\r \\end{equation}\r where $\\operatorname{tr}(\\cdot)$ denotes the trace of a matrix.\r The rows of matrix ${U}$ are the low dimensional embedding of the original data points.\r Generally, spectral clustering computes ${U}$ as the bottom $k$ eigenvectors of ${L}$, and finally applies $k$-means on ${U}$ to obtain the clustering results.\r \r \r # Large-scale Spectral Clustering\r \r To capture the relationship between all data points in $X$, an $N\\times N$ similarity matrix needs to be constructed in conventional spectral clustering, which costs $O(N^2d)$ time and $O(N^2)$ memory and is not feasible for large-scale clustering tasks.\r Instead of a full similarity matrix, many accelerated spectral clustering methods use a similarity sub-matrix to represent each data point by the cross-similarity between the data points and a set of representative data points (i.e., landmarks) via some similarity measure, as\r \\begin{equation}\r \\label{eq: cross-similarity}\r B = \\Phi(X,R),\r \\end{equation}\r where $R = \\{r_1,r_2,\\dots, r_p \\}$ ($p \\ll N$) is a set of landmarks with 
the same dimension as $X$, $\\Phi(\\cdot)$ indicates a similarity measure metric, and $B\\in \\mathbb{R}^{N\\times p}$ is the similarity sub-matrix that represents $X \\in \\mathbb{R}^{N\\times d}$ with respect to $R\\in \\mathbb{R}^{p\\times d}$.\r \r For large-scale spectral clustering using such a similarity matrix,\r a symmetric similarity matrix $W$ can be designed as \r \\begin{equation}\r \\label{eq: WusedB }\r W=\\left[\\begin{array}{ll}\r \\mathbf{0} & B \\\\\r B^{T} & \\mathbf{0}\r \\end{array}\\right].\r \\end{equation}\r The size of matrix $W$ is $(N+p)\\times (N+p)$. \r Taking advantage of the bipartite structure, some fast eigen-decomposition methods can then be used to obtain the spectral embedding.\r Finally, $k$-means is conducted on the embedding to obtain the clustering results.\r \r The clustering result is directly related to the quality of $B$, which consists of the similarities between data points and landmarks.\r Thus, the performance of landmark selection is crucial to the clustering result.""" ; skos:prefLabel "Large-scale spectral clustering" . :LatentDiffusionModel a skos:Concept ; dcterms:source ; skos:definition "Diffusion models applied to latent spaces, which are normally built with (Variational) Autoencoders." ; skos:prefLabel "Latent Diffusion Model" . :LatentOptimisation a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """**Latent Optimisation** is a technique used in generative adversarial networks to improve sample quality by refining the latent source $z$. Specifically, it exploits knowledge from the discriminator $D$ to refine $z$. Intuitively, the gradient $\\nabla\\_{z}f\\left(z\\right) = \\frac{\\delta{f}\\left(z\\right)}{\\delta{z}}$ points in the direction that better satisfies the discriminator $D$, which implies better samples. 
Therefore, instead of using the randomly sampled $z \\sim p\\left(z\\right)$, we use the optimised latent:\r \r $$ \\Delta{z} = \\alpha\\frac{\\delta{f}\\left(z\\right)}{\\delta{z}} $$\r \r $$ z' = z + \\Delta{z} $$\r \r Source: [LOGAN](https://paperswithcode.com/method/logan)""" ; skos:prefLabel "Latent Optimisation" . :LayerDrop a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition "**LayerDrop** is a form of structured [dropout](https://paperswithcode.com/method/dropout) for [Transformer](https://paperswithcode.com/method/transformer) models which has a regularization effect during training and allows for efficient pruning at inference time. It randomly drops layers from the Transformer according to an \"every other\" strategy where pruning with a rate $p$ means dropping the layers at depth $d$ such that $d \\equiv 0 \\pmod{\\left\\lfloor\\frac{1}{p}\\right\\rfloor}$." ; skos:prefLabel "LayerDrop" . :LayerNormalization a skos:Concept ; dcterms:source ; rdfs:seeAlso ; skos:definition """Unlike [batch normalization](https://paperswithcode.com/method/batch-normalization), **Layer Normalization** directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer so the normalization does not introduce any new dependencies between training cases. It works well for [RNNs](https://paperswithcode.com/methods/category/recurrent-neural-networks) and improves both the training time and the generalization performance of several existing RNN models. More recently, it has been used with [Transformer](https://paperswithcode.com/methods/category/transformers) models.\r \r We compute the layer normalization statistics over all the hidden units in the same layer as follows:\r \r $$ \\mu^{l} = \\frac{1}{H}\\sum^{H}\\_{i=1}a\\_{i}^{l} $$\r \r $$ \\sigma^{l} = \\sqrt{\\frac{1}{H}\\sum^{H}\\_{i=1}\\left(a\\_{i}^{l}-\\mu^{l}\\right)^{2}} $$\r \r where $H$ denotes the number of hidden units in a layer. 
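These statistics can be sketched in a few lines of NumPy; note this minimal version omits the learnable gain and bias of the full method, and the small `eps` for numerical stability is an assumption:

```python
import numpy as np

def layer_norm(a, eps=1e-5):
    # a: (batch, hidden) summed inputs. Normalize over the hidden axis,
    # so every training case gets its own mu and sigma (no batch dependency).
    mu = a.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(((a - mu) ** 2).mean(axis=-1, keepdims=True))
    return (a - mu) / (sigma + eps)
```

Because the statistics are per-example, the same function works unchanged with a batch of size 1.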
Under layer normalization, all the hidden units in a layer share the same normalization terms $\\mu$ and $\\sigma$, but different training cases have different normalization terms. Unlike batch normalization, layer normalization does not impose any constraint on the size of the mini-batch and it can be used in the pure online regime with batch size 1.""" ; skos:prefLabel "Layer Normalization" . :LayerScale a skos:Concept ; dcterms:source ; skos:definition """**LayerScale** is a method used for [vision transformer](https://paperswithcode.com/methods/category/vision-transformer) architectures to help improve training dynamics. It adds a learnable diagonal matrix on the output of each residual block, initialized close to (but not at) 0. Adding this simple layer after each residual block improves the training dynamic, allowing for the training of deeper high-capacity image transformers that benefit from depth.\r \r Specifically, LayerScale is a per-channel multiplication of the vector produced by each residual block, as opposed to a single scalar; see Figure (d). The objective is to group the updates of the weights associated with the same output channel. Formally, LayerScale is a multiplication by a diagonal matrix on the output of each residual block. In other words:\r \r $$\r x\\_{l}^{\\prime} =x\\_{l}+\\operatorname{diag}\\left(\\lambda\\_{l, 1}, \\ldots, \\lambda\\_{l, d}\\right) \\times \\operatorname{SA}\\left(\\eta\\left(x\\_{l}\\right)\\right) \r $$\r \r $$\r x\\_{l+1} =x\\_{l}^{\\prime}+\\operatorname{diag}\\left(\\lambda\\_{l, 1}^{\\prime}, \\ldots, \\lambda\\_{l, d}^{\\prime}\\right) \\times \\operatorname{FFN}\\left(\\eta\\left(x\\_{l}^{\\prime}\\right)\\right)\r $$\r \r where the parameters $\\lambda\\_{l, i}$ and $\\lambda\\_{l, i}^{\\prime}$ are learnable weights. 
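Treating a residual branch (self-attention or FFN) as a black box, the diagonal multiplication reduces to an element-wise per-channel scale. A minimal sketch, where the toy branch and the epsilon value are assumptions for illustration:

```python
import numpy as np

def layer_scale_block(x, branch, lam):
    # x: (tokens, d); lam: (d,) learnable per-channel scales, epsilon-initialized.
    # diag(lam) @ branch(x) is simply an element-wise scale of each output channel.
    return x + lam * branch(x)

d = 4
lam = np.full(d, 0.1)                               # epsilon-initialized diagonal
x = np.ones((2, d))
out = layer_scale_block(x, lambda t: 2.0 * t, lam)  # toy residual branch
```

With `lam` small, the block starts close to the identity function, which is the stated motivation.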
The diagonal values are all initialized to a fixed small value $\\varepsilon$: we set it to $\\varepsilon=0.1$ until depth 18, $\\varepsilon=10^{-5}$ for depth 24 and $\\varepsilon=10^{-6}$ for deeper networks. \r \r This formula is akin to other [normalization](https://paperswithcode.com/methods/category/normalization) strategies such as [ActNorm](https://paperswithcode.com/method/activation-normalization) or [LayerNorm](https://paperswithcode.com/method/layer-normalization), but executed on the output of the residual block. Yet LayerScale seeks a different effect: [ActNorm](https://paperswithcode.com/method/activation-normalization) is a data-dependent initialization that calibrates activations so that they have zero-mean and unit variance, like [BatchNorm](https://paperswithcode.com/method/batch-normalization). In contrast, in LayerScale, we initialize the diagonal with small values so that the initial contribution of the residual branches to the function implemented by the transformer is small. In that respect the motivation is therefore closer to that of [ReZero](https://paperswithcode.com/method/rezero), [SkipInit](https://paperswithcode.com/method/skipinit), [Fixup](https://paperswithcode.com/method/fixup-initialization) and [T-Fixup](https://paperswithcode.com/method/t-fixup): to train closer to the identity function and let the network integrate the additional parameters progressively during the training. LayerScale offers more diversity in the optimization than just adjusting the whole layer by a single learnable scalar as in [ReZero](https://paperswithcode.com/method/rezero)/[SkipInit](https://paperswithcode.com/method/skipinit), [Fixup](https://paperswithcode.com/method/fixup-initialization) and [T-Fixup](https://paperswithcode.com/method/t-fixup).""" ; skos:prefLabel "LayerScale" . :LayoutLMv2 a skos:Concept ; dcterms:source ; skos:definition """**LayoutLMv2** is an architecture and pre-training method for document understanding. 
The model is pre-trained with a great number of unlabeled scanned document images from the IIT-CDIP dataset, where some images in the text-image pairs are randomly replaced with another document image to make the model learn whether the image and OCR texts are correlated or not. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture, so that the model can fully understand the relative positional relationship among different text blocks.\r \r Specifically, an enhanced Transformer architecture is used, i.e. a multi-modal Transformer as the backbone of LayoutLMv2. The multi-modal Transformer accepts inputs of three modalities: text, image, and layout. The input of each modality is converted to an embedding sequence and fused by the encoder. The model establishes deep interactions within and between modalities by leveraging the powerful Transformer layers.""" ; skos:prefLabel "LayoutLMv2" . :LayoutReader a skos:Concept ; dcterms:source ; skos:definition """**LayoutReader** is a sequence-to-sequence model for reading order detection that uses both textual and layout information, where the layout-aware language model [LayoutLM](https://paperswithcode.com/method/layoutlmv2) is leveraged as an encoder. The generation step in the encoder-decoder structure is modified to generate the reading order sequence.\r \r In the encoding stage, LayoutReader packs the pair of source and target segments into a contiguous input sequence of LayoutLM and carefully designs the [self-attention mask](https://paperswithcode.com/methods/category/factorized-attention) to control the visibility between tokens. As shown in the Figure, LayoutReader allows the tokens in the source segment to attend to each other while preventing the tokens in the target segment from attending to the rightward context. If 1 means allowing and 0 means preventing, the detail of the mask $M$ is as follows:\r \r $$ M\\_{i, j}= \\begin{cases}1, & \\text { if } i