1 Introduction
Memory is a central tenet in the model of human intelligence and is crucial to long-term reasoning and planning. Of particular interest is the theory of complementary learning systems (McClelland et al., 1995), which proposes that the brain employs two complementary systems to support the acquisition of complex behaviours: a hippocampal fast-learning system that records events as episodic memory, and a neocortical slow-learning system that learns statistics across events as semantic memory. While the functional dichotomy of the complementary systems is well-established (McClelland et al., 1995; Kumaran et al., 2016), it remains unclear whether they are governed by different computational principles. In this work we introduce a model that bridges this gap by showing that the same statistical learning principle can be applied to the fast-learning system through the construction of a hierarchical Bayesian memory.
While recent work has shown that memory augmented neural networks can drastically improve the performance of generative models (Wu et al., 2018a, b; Weston et al., 2015), meta-learning (Santoro et al., 2016), long-term planning (Graves et al., 2014, 2016) and sample efficiency in reinforcement learning (Zhu et al., 2019), no model has been proposed to exploit the inherent multidimensionality of biological memory (Reimann et al., 2017). Inspired by the traditional (computer science) memory model of heap allocation (Figure 1, Left), we propose a novel differentiable memory allocation scheme called Kanerva++ (K++) that learns to compress an episode of samples, referred to by the set of pointers in Figure 1, into a latent multidimensional memory (Figure 1, Right). The K++ model infers a key distribution as a proxy to the pointers (Marlow et al., 2008) and is able to embed similar samples into an overlapping latent representation space, enabling it to compress input distributions more efficiently. In this work, we focus on applying this novel memory allocation scheme to latent variable generative models, where we improve upon the memory model of the Kanerva Machine (Wu et al., 2018a, b).

2 Related Work
Variational Autoencoders: Variational autoencoders (VAEs) (Kingma and Welling, 2014) are a fundamental part of the modern machine learning toolbox and have wide-ranging applications, from generative modeling (Kingma and Welling, 2014; Kingma et al., 2016; Burda et al., 2016) and learning graphs (Kipf and Welling, 2016) to medical applications (Sedai et al., 2017; Zhao et al., 2019) and video analysis (Fan et al., 2020). As a latent variable model, VAEs infer an approximate posterior over a latent representation and can be used in downstream tasks such as control in reinforcement learning (Nair et al., 2018; Pritzel et al., 2017). VAEs maximize an evidence lower bound (ELBO), L, of the log-marginal likelihood, ln p(x). The produced variational approximation, q(z|x), is typically called the encoder, while the likelihood, p(x|z), comes from the decoder. Methods that aim to improve these latent variable generative models typically fall into two different paradigms: learning more informative priors or leveraging novel decoders. While improved decoder models such as PixelVAE (Gulrajani et al., 2017) and PixelVAE++ (Sadeghi et al., 2019) drastically improve the attainable likelihood, they suffer from a phenomenon called posterior collapse (Lucas et al., 2019), where the decoder can become almost independent of the posterior sample yet still retain the ability to reconstruct the original sample by relying on its autoregressive property (Goyal et al., 2017a). In contrast, VampPrior (Tomczak and Welling, 2018), Associative Compression Networks (ACN) (Graves et al., 2018), VAE-nCRP (Goyal et al., 2017b) and VLAE (Chen et al., 2017) tighten the variational bound by learning more informed priors. VLAE, for example, uses a powerful autoregressive prior; VAE-nCRP learns a nonparametric Chinese restaurant process prior; and VampPrior learns a Gaussian mixture prior representing prototypical virtual samples. ACN, on the other hand, takes a two-stage approach: by clustering real samples in the space of the posterior and using these related samples as inputs to a learned prior, ACN provides an information theoretic alternative to improved code transmission.
Our work falls into this latter paradigm: we parameterize a learned prior by reading from a common memory, built through a transformation of an episode of input samples.
Memory Models: Inspired by the associative nature of biological memory, the Hopfield network (Hopfield, 1982) introduced the notion of content-addressable memory, defined by a set of binary neurons coupled with a Hamiltonian and a dynamical update rule. Iterating the update rule minimizes the Hamiltonian, resulting in patterns being stored at different configurations (Hopfield, 1982; Krotov and Hopfield, 2016). Writing in a Hopfield network thus corresponds to finding weight configurations such that stored patterns become attractors via Hebbian rules (Hebb, 1949). This concept of memory was extended to a distributed, continuous setting by Kanerva (1988) and to a complex valued, holographic convolutional binding mechanism by Plate (1995). The central difference between associative memory models (Hopfield, 1982; Kanerva, 1988) and holographic memory (Plate, 1995) is that the latter decouples the size of the memory from the input word size.

Most recent work with memory augmented neural networks treats memory in a slot-based manner (closer to the associative memory paradigm), where each column of a memory matrix, M, represents a single slot. Reading memory traces, z, entails using a vector of addressing weights, w, to attend to the appropriate columns of M: z = Mw. This paradigm of memory includes models such as the Neural Turing Machine (NTM) (Graves et al., 2014), the Differentiable Neural Computer (DNC) (Graves et al., 2016) (while the DNC is slot-based, it reads rows rather than columns), Memory Networks (Weston et al., 2015), Generative Temporal Models with Memory (GTMM) (Fraccaro et al., 2018), the Variational Memory Encoder-Decoder (VMED) (Le et al., 2018), and Variational Memory Addressing (VMA) (Bornschein et al., 2017). VMA differs from GTMM, VMED, DNC, NTM and Memory Networks by taking a stochastic approach to discrete key-addressing, instead of the deterministic approach of the latter models.

Recently, the Kanerva Machine (KM) (Wu et al., 2018a) and its extension, the Dynamic Kanerva Machine (DKM) (Wu et al., 2018b), interpreted memory writes and reads as inference in a generative model, wherein memory is treated as a distribution, p(M). Under this framework, memory reads and writes are recast as sampling from or updating the memory posterior. The DKM model differs from the KM model by introducing a dynamical addressing rule that can be used throughout training. While providing an intuitive and theoretically sound bound on the data likelihood, the DKM model requires an inner optimization loop which entails solving an ordinary least squares (OLS) problem. Typical OLS solutions require a matrix inversion, which costs O(n^3) time for an n x n memory, preventing the model from scaling to large memory sizes. More recent work has focused on employing a product of smaller Kanerva memories (Marblestone et al., 2020) in an effort to minimize the computational cost of the matrix inversion. In contrast, we propose a simplified view of memory creation by treating memory writes as a deterministic process in a fully feedforward setting. Crucially, we also modify the read operand such that it uses localized sub-regions of the memory, providing an extra dimension of operation in comparison with the KM and DKM models. While the removal of memory stochasticity might be interpreted as reducing the representational power of the model, we empirically demonstrate through our experiments that our model performs better, trains faster and is simpler to optimize. The choice of a deterministic memory is further reinforced by research in psychology, where human visual memory has been shown to change deterministically (Gold et al., 2005; Spencer and Hund, 2002; Hollingworth et al., 2013).

3 Model
To better understand the K++ model, we examine each of its individual components and their roles within the complete generative model. We begin with preliminaries (Section 3.1) and the derivation of a conditional variational lower bound (Section 3.2), describing the optimization objective and probabilistic assumptions. We then describe the write operand (Section 3.3), the generative process (Section 3.4) and, finally, the read and iterative read operands (Section 3.5).
3.1 Preliminaries
K++ operates over an exchangeable episode (Aldous, 1985) of samples, X = {x_1, ..., x_T}, drawn from a dataset D, as in the Kanerva Machine. The ordering of the samples within the episode therefore does not matter, which enables factoring the conditional over the individual samples given the memory, M: p(X|M) = ∏_t p(x_t|M). Our objective in this work is to maximize the expected conditional log-likelihood, as in (Bornschein et al., 2017; Wu et al., 2018a):

(1)  $\mathbb{E}_{p(X)}\big[\ln p(X \mid M)\big] = \mathbb{E}_{p(X)}\Big[\sum_{t=1}^{T} \ln p(x_t \mid M)\Big]$

As alluded to in Barber and Agakov (2004) and Wu et al. (2018a), this objective can be interpreted as maximizing the mutual information, I(X; M), between the memory, M, and the episode, X, since I(X; M) = H(X) - H(X|M) and the entropy of the data, H(X), is constant. In order to actualize Equation 1 we rely on a variational bound, which we derive in the following section.
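This identity can be checked numerically on a toy discrete example (the 2x2 joint distribution below is our own illustrative choice, not from the paper): with H(X) fixed by the data, maximizing the expected conditional log-likelihood, i.e. minimizing H(X|M), maximizes I(X; M).

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Hypothetical joint distribution over (x, m): rows index x, columns index m.
pxm = np.array([[0.3, 0.1],
                [0.1, 0.5]])
px, pm = pxm.sum(1), pxm.sum(0)

# Mutual information computed directly from the joint ...
mi = sum(pxm[i, j] * np.log(pxm[i, j] / (px[i] * pm[j]))
         for i in range(2) for j in range(2))

# ... matches H(X) - H(X|M), the form used in the argument above.
h_x = entropy(px)
h_x_given_m = -sum(pxm[i, j] * np.log(pxm[i, j] / pm[j])
                   for i in range(2) for j in range(2))
assert np.isclose(mi, h_x - h_x_given_m)
```

Since the joint is not a product of its marginals, the computed mutual information is strictly positive, and it agrees exactly with the entropy-difference form.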
3.2 Variational Lower Bound
To efficiently read from the memory, M, we introduce a set of latent variables corresponding to the addressing read heads, Y = {y_1, ..., y_T}, and a set of latent variables corresponding to the readout from the memory, Z = {z_1, ..., z_T}. Given these latent variables, we can decompose the conditional, p(X|M), using the product rule and introduce the variational approximations q(y|x) and q(z|x) via a multiply-by-one trick (we use q(y|x) as our variational approximation instead of the memory-conditioned approximation used in DKM, as it presents a more stable objective; we discuss this in more detail in Section 3.5):
(2)  $\ln p(X \mid M) = \sum_{t=1}^{T} \ln \mathbb{E}_{q(y_t \mid x_t)\, q(z_t \mid x_t)}\left[\frac{p(x_t \mid z_t)\, p(z_t \mid y_t, M)\, p(y_t)}{q(y_t \mid x_t)\, q(z_t \mid x_t)}\right]$

(3)  $\ln p(X \mid M) = \mathcal{L}(X \mid M) + \sum_{t=1}^{T}\Big[ D_{KL}\big(q(z_t \mid x_t) \,\|\, p(z_t \mid x_t, M)\big) + D_{KL}\big(q(y_t \mid x_t) \,\|\, p(y_t \mid x_t, M)\big) \Big]$
Equation 2 assumes that z is conditionally independent of y given x: q(y, z|x) = q(y|x) q(z|x). This decomposition results in Equation 3, which includes two KL-divergences against the true (unknown) posteriors, p(z|x, M) and p(y|x, M). We can then train the model by maximizing the evidence lower bound (ELBO), L(X|M), of the true conditional, ln p(X|M):
(4)  $\mathcal{L}(X \mid M) = \sum_{t=1}^{T}\Big[ \mathbb{E}_{q(z_t \mid x_t)}\big[\ln p(x_t \mid z_t)\big] - \mathbb{E}_{q(y_t \mid x_t)}\big[D_{KL}\big(q(z_t \mid x_t) \,\|\, p(z_t \mid y_t, M)\big)\big] - D_{KL}\big(q(y_t \mid x_t) \,\|\, p(y_t)\big) \Big]$
The bound in Equation 4 is tight if q(z|x) = p(z|x, M) and q(y|x) = p(y|x, M); however, it involves inferring the entire memory M. This prevents us from decoupling the size of the memory from inference and scales the computational complexity with the size of the memory. To alleviate this constraint, we assume a purely deterministic memory, M = f_w(f_e(X)), built by transforming the input episode, X, via a deterministic encoder, f_e, and memory transformation model, f_w. We also assume that the regions of memory which are useful in reconstructing a sample, x_t, can be summarized by a set of localized, contiguous memory sub-blocks, as described in Equation 5 below. The intuition here is that similar samples might occupy a disjoint part of the representation space, and the decoder, p(x|z), would need to read multiple regions to properly handle sample reconstruction. For example, the digit "3" might share part of the representation space with a "2" and another part with a "5".
(5)  $\hat{p}(z_t \mid y_t, M) = \prod_{k=1}^{K} \delta\big(z_t^{k} - \mathrm{ST}(M, y_t^{k})\big)$
Equation 5 represents a set of Dirac-delta memory sub-regions, determined by the addressing key, y_t, and a spatial transformer (ST) network (Jaderberg et al., 2015); we provide a brief review of spatial transformers in Appendix A. Our final optimization objective is attained by approximating p(z_t|y_t, M) from Equation 4 with Equation 5 and is summarized by the graphical model in Figure 2 below.
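To make the read concrete, here is a minimal numpy sketch of a spatial-transformer-style read: a 2x3 affine map (standing in for the key-derived transformation parameters; the function name, single-channel memory, and bare bilinear sampler are our simplifying assumptions) selects a memory sub-region via bilinear interpolation over a normalized [-1, 1] grid.

```python
import numpy as np

def st_read(M, theta, out_hw):
    """Crop-like read: sample an out_hw patch from memory M under the
    2x3 affine map `theta` (normalized [-1, 1] coordinates), using
    bilinear interpolation. A sketch of the spatial-transformer read."""
    H, W = M.shape
    h, w = out_hw
    gx, gy = np.meshgrid(np.linspace(-1, 1, w), np.linspace(-1, 1, h))
    # Map each output-grid point back into the memory's coordinate frame.
    src = theta @ np.stack([gx.ravel(), gy.ravel(), np.ones(h * w)])
    px = (src[0] + 1) * (W - 1) / 2          # normalized -> pixel coords
    py = (src[1] + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(px).astype(int), 0, W - 1)
    y0 = np.clip(np.floor(py).astype(int), 0, H - 1)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    ax, ay = px - x0, py - y0
    out = (M[y0, x0] * (1 - ax) * (1 - ay) + M[y0, x1] * ax * (1 - ay)
           + M[y1, x0] * (1 - ax) * ay + M[y1, x1] * ax * ay)
    return out.reshape(h, w)
```

With the identity transform and an output grid matching the memory size, the read recovers the full memory; shrinking the scale entries of theta reads a smaller, centered sub-block, which is the behaviour Equation 5 relies on.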
3.3 Write Model
Writing to memory in the K++ model (Figure 3) entails encoding the input episode, X, through the encoder, f_e, pooling the representation over the episode, and encoding the pooled representation with the memory writer, f_w. In this work, we employ a Temporal Shift Module (TSM) (Lin et al., 2019) applied to a ResNet-18 (He et al., 2016). TSM works by shifting the feature maps of a two-dimensional vision model along the temporal dimension in order to build richer representations of contextual features. In the case of K++, this allows the encoder to build a better representation of the memory by leveraging intermediary, episode-specific features. Using a TSM encoder over a standard convolutional stack improves the performance of both K++ and DKM, where the latter observes an improvement of 6.32 nats/image over the reported test conditional variational lower bound of 77.2 nats/image (Wu et al., 2018b) on the binarized Omniglot dataset. As far as we are aware, the application of a TSM encoder to memory models has not previously been explored and is a contribution of this work.
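The core of TSM is a parameter-free channel shift along the episode (time) axis; below is a minimal numpy sketch of the zero-padded variant (the 1/8 channel fraction follows the TSM paper's default; the function name is ours).

```python
import numpy as np

def tsm_shift(feats, fold_div=8):
    """Zero-padded temporal shift (Lin et al., 2019): shift the first
    1/fold_div of channels forward in time, the next 1/fold_div backward,
    and leave the remaining channels untouched.
    feats: (T, C, H, W) features for an episode of length T."""
    T, C, H, W = feats.shape
    fold = C // fold_div
    out = np.zeros_like(feats)
    out[1:, :fold] = feats[:-1, :fold]              # shift forward in time
    out[:-1, fold:2 * fold] = feats[1:, fold:2 * fold]  # shift backward
    out[:, 2 * fold:] = feats[:, 2 * fold:]         # identity channels
    return out
```

Because the shift mixes features from neighbouring episode elements into each frame's channels, the subsequent 2D convolutions see (cheap) temporal context, which is what lets the encoder exploit episode-specific features here.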
The memory writer model, f_w, in Figure 3 allows K++ to non-linearly transform the pooled embedding to better summarize the episode. In addition to inferring the deterministic memory, M, we also project the non-pooled embeddings, e_t, through a key model, f_y:

(6)  $q(y_t \mid x_t) = \mathcal{N}\big(y_t;\; \mu(f_y(e_t)),\; \sigma^2(f_y(e_t))\big)$

The reparameterized keys will be used to read sample-specific memory traces, z_t, from the full memory, M. The memory traces are used in training through the learned prior, p(z|y, M), from Equation 4 via the KL divergence, D_KL(q(z|x) || p(z|y, M)). This KL divergence constrains the optimization objective to keep the representation of the amortized approximate posterior, q(z|x), (probabilistically) close to the memory readout representation of the learned prior, p(z|y, M). In the generative setting, this constraint enables memory traces to be routed from the learned prior to the decoder, p(x|z), in a similar manner to standard VAEs. We detail this process in the following section.
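Putting the write path together, the following is a minimal numpy sketch of the deterministic write plus the reparameterized key model (all sizes, the random weight matrices, and the single-linear-layer stand-ins for f_w and f_y are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: episode of T samples, embedding dim E, memory H x W.
T, E, H, W = 8, 32, 16, 16
embeddings = rng.normal(size=(T, E))              # e_t = f_e(x_t), per sample

# Deterministic write: mean-pool the episode embeddings, then a linear
# "memory writer" (a stand-in for f_w) maps the pooled vector to the memory.
W_mem = rng.normal(size=(E, H * W)) / np.sqrt(E)
memory = (embeddings.mean(0) @ W_mem).reshape(H, W)

# Key model (a stand-in for f_y): each non-pooled embedding is projected to
# the mean and log-variance of its Gaussian key, then reparameterized.
key_dim = 3
W_key = rng.normal(size=(E, 2 * key_dim)) / np.sqrt(E)
mu, logvar = np.split(embeddings @ W_key, 2, axis=1)
keys = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)
```

Note the asymmetry the section describes: a single pooled write produces one memory for the whole episode, while each sample keeps its own stochastic key for reading.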
3.4 Sample Generation
The Kanerva++ model, like the original KM and DKM models, enables sample generation given an existing memory or set of memories. Samples from the prior key distribution, y ~ p(y), are used to parameterize the spatial transformer, ST, which indexes the deterministic memory, M. The result of this differentiable indexing is a set of memory sub-regions, z, which are used by the decoder, p(x|z), to generate synthetic samples. Reading samples in this manner forces the encoder to utilize memory sub-regions that are useful for reconstruction, as non-read memory regions receive zero gradients during backpropagation. This insight allows us to use the simple feedforward write process described in Section 3.3, while still retaining the ability to produce locally contiguous, block allocated memory.

3.5 Read / Iterative Read Model
K++ involves two forms of reading (Figure 5): iterative reading and a simpler, more stable read model used for training. During training we actualize q(z|x) from Equation 4 using an amortized isotropic-Gaussian posterior that directly transforms the embedding of the episode, e_t, using a learned neural network (Figure 2b). As mentioned in Section 3.3, the readouts, z_t, of the memory traces are encouraged to learn a meaningful, structured representation through the memory readout KL divergence, D_KL(q(z|x) || p(z|y, M)), which minimizes the (probabilistic) distance between q(z|x) and p(z|y, M).
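Both KL terms in the bound are between diagonal Gaussians and therefore have a closed form; below is a minimal numpy sketch (the function name and log-variance parameterization are our own conventions):

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) between diagonal Gaussians, of the kind used
    for both the key KL, KL(q(y|x) || p(y)), and the readout KL,
    KL(q(z|x) || p(z|y,M)). Inputs are mean / log-variance vectors."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
```

For example, KL between identical Gaussians is 0, while moving the posterior mean one standard deviation away from a unit-variance prior costs 0.5 nats.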
Kanerva memory models also possess the ability to gradually improve a sample through iterative inference (Figure 2c), whereby noisy samples are improved by leveraging contextual information stored in memory. This can be interpreted as approximating the posterior by marginalizing the approximate key distribution:

(7)  $p(z_{t+1} \mid \hat{x}_t, M) = \mathbb{E}_{q(y \mid \hat{x}_t)}\big[\hat{p}(z \mid y, M)\big] \approx \hat{p}(z \mid \hat{y}_t, M), \quad \hat{y}_t \sim q(y \mid \hat{x}_t)$

where Equation 7 uses a single-sample Monte Carlo estimate, evaluated by re-inferring the previous reconstruction, $\hat{x}_t$, through the approximate key posterior. Each subsequent memory readout improves upon its previous representation by absorbing additional information from the memory.

4 Experiments
We contrast K++ against state-of-the-art memory conditional vision models and present empirical results in Table 1. Binarized datasets assume Bernoulli output distributions, while continuous values are modeled by a discretized mixture of logistics (Salimans et al., 2017). As is standard in the literature (Burda et al., 2016; Sadeghi et al., 2019; Ma et al., 2018; Chen et al., 2017), we provide results for binarized MNIST and binarized Omniglot in nats/image and rescale the corresponding results to bits/dim for all other datasets. We describe the model architecture, the optimization procedure and the memory creation protocol in Appendix E and E.1. Finally, extra CelebA generations and test image reconstructions for all experiments are provided in Appendix B and Appendix D, respectively.
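The rescaling between the two reporting units is a simple change of logarithm base plus per-dimension normalization; a small helper (the function name is ours) assuming D observed dimensions per image:

```python
import numpy as np

def nats_per_image_to_bits_per_dim(nats, num_dims):
    """Convert a negative log-likelihood in nats/image to bits/dim:
    divide by ln(2) to change the logarithm base from e to 2, and by
    the number of observed dimensions (e.g. 32*32*3 = 3072 for CIFAR10)."""
    return nats / (np.log(2.0) * num_dims)
```

For instance, a CIFAR10-sized image scoring 3072 * ln(2) nats/image corresponds to exactly 1.0 bits/dim.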
Table 1: Negative test conditional variational bounds, in nats/image for binarized MNIST and binarized Omniglot and in bits/dim for the remaining datasets (lower is better).

| Method | MNIST | Omniglot | Fashion MNIST | CIFAR10 | DMLab Mazes |
|---|---|---|---|---|---|
| VAE (Kingma and Welling, 2014) | 87.86 | 104.75 | 5.84 | 6.3 | - |
| IWAE (Burda et al., 2016) | 85.32 | 103.38 | - | - | - |
| *Improved decoders* | | | | | |
| PixelVAE++ (Sadeghi et al., 2019) | 78.00 | - | - | 2.90 | - |
| MAE (Ma et al., 2018) | 77.98 | 89.09 | - | 2.95 | - |
| DRAW (Gregor et al., 2015) | 87.4 | 96.5 | - | 3.58 | - |
| MatNet (Bachman, 2016) | 78.5 | 89.5 | - | 3.24 | - |
| *Richer priors* | | | | | |
| Ordered ACN (Graves et al., 2018) | 73.9 | - | - | 3.07 | - |
| VLAE (Chen et al., 2017) | 78.53 | 102.11 | - | 2.95 | - |
| VampPrior (Tomczak and Welling, 2018) | 78.45 | 89.76 | - | - | - |
| *Memory conditioned models* | | | | | |
| VMA (Bornschein et al., 2017) | - | 103.6 | - | - | - |
| KM (Wu et al., 2018a) | - | 68.3 | - | 4.37 | - |
| DNC (Graves et al., 2016) | - | 100 | - | - | - |
| DKM (Wu et al., 2018b) | 75.3 | 77.2 | - | 4.79 | 2.75 |
| DKM w/ TSM (our impl.) | 51.84 | 70.88 | 4.15 | 4.31 | 2.92 |
| Kanerva++ (ours) | 41.58 | 66.24 | 3.40 | 3.28 | 2.88 |
K++ presents state-of-the-art results for memory conditioned binarized MNIST and binarized Omniglot, and competitive performance for Fashion MNIST, CIFAR10 and DMLab mazes. The performance gap on the continuous valued datasets can be explained by our use of a simple convolutional decoder, rather than the autoregressive decoders used in models such as PixelVAE++ (Sadeghi et al., 2019). We leave the exploration of more powerful decoder models to future work and note that our model can be integrated with autoregressive decoders.
4.1 Iterative inference
One of the benefits of K++ is that it uses the memory to learn a more informed prior by condensing the information from an episode of samples. One might suspect that, depending on the dimensionality of the memory and the size of the read traces, the memory might only learn prototypical patterns, rather than a full amalgamation of the input episode. This presents a problem for generation, as described in Section 3.4, and can be observed in the first column of Figure 6 (Left), where the first generation from a random key appears blurry. To overcome this limitation, we rely on the iterative inference of Kanerva memory models (Wu et al., 2018a, b). By holding the memory, M, fixed and repeatedly inferring the latents, we are able to clean up the pattern by leveraging the contextual information contained within the memory (Section 3.5). This is visualized in the subsequent columns of Figure 6 (Left), where we observe a slow but clear improvement in generation quality. This property of iterative inference is one of the central benefits of using a memory model over a traditional solution like a VAE. We also present results of iterative inference on more classical image noise distributions, such as salt-and-pepper, speckle and Poisson noise, in Figure 6 (Right). For each original noisy pattern (top rows) we provide the resultant final reconstruction after ten steps of cleanup. The proposed K++ is robust to input noise and is able to clean up most of the patterns in a semantically meaningful way.
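The cleanup loop can be illustrated with a toy associative-memory analogue (a deliberately simplified, hypothetical stand-in for the encode, key-inference, read and decode cycle; the stored patterns and softmax keying are our own construction, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "memory" of stored prototype patterns.
patterns = np.array([[1.0, 1.0, -1.0, -1.0],
                     [-1.0, 1.0, 1.0, -1.0]])

def cleanup_step(x):
    # "Infer the key": soft attention over stored patterns,
    # then "read": key-weighted combination, then re-binarize.
    logits = patterns @ x
    key = np.exp(logits - logits.max())
    key /= key.sum()
    return np.sign(key @ patterns)

x = patterns[0] + rng.normal(scale=0.4, size=4)  # noisy query pattern
for _ in range(10):                               # iterative inference
    x = cleanup_step(x)
```

Holding the stored patterns fixed while repeatedly re-inferring the key mirrors the procedure above: each pass pulls the corrupted query closer to the contextual information held in memory.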
4.1.1 Image Generations
Typical VAEs use high dimensional isotropic Gaussian latent variables (Burda et al., 2016; Kingma and Welling, 2014). A well known property of high dimensional Gaussian distributions is that most of their mass is concentrated in a thin shell around the surface of a high dimensional ball. Perturbations to a sample in a region of valid density can therefore easily move it to an invalid density region (Arvanitidis et al., 2018; White, 2016), causing blurry or irregular generations. In the case of K++, since the key distribution, p(y), occupies a low dimensional space, local perturbations of a key are likely to remain in regions of high probability density. We visualize this form of generation in Figure 7 for DMLab Mazes, Omniglot and CelebA 64x64, as well as the more traditional random key generations, y ~ p(y), in Figure 8. Interestingly, local key perturbations of a trained DMLab Maze K++ model induce generations that provide a natural traversal of the maze, as observed by scanning Figure 7 (Left), row by row, from left to right. In contrast, the random generations for the same task (Figure 8, Left) present a more discontinuous set of generations. We see a similar effect for the Omniglot and CelebA datasets, but observe that the locality is instead tied to character or facial structure, as shown in Figure 7 (Center) and Figure 7 (Right). Finally, in contrast to VAE generations, K++ is able to generate sharper images of ImageNet 32x32, as shown in Appendix C. Future work will investigate this form of locally perturbed generation through an MCMC lens.
4.2 Ablation: Is Block Allocated Spatial Memory Useful?
While Figure 7 demonstrates the advantage of having a low dimensional sampling distribution and Figure 6 demonstrates the benefit of iterative inference, it is unclear whether the performance benefit in Table 1 is achieved from the episodic training, model structure, optimization procedure or memory allocation scheme. To isolate the cause of the performance benefit, we simplify the write architecture from Section 3.3 as shown in Figure 9 (Left). In this scenario, we produce the learned memory readout, z, via an equivalently sized dense model that projects the embedding, e_t, while keeping all other aspects the same. We train both models five times with the exact same TSM-ResNet-18 encoder, decoder, optimizer and learning rate scheduler. As shown in Figure 9 (Right), the test conditional variational lower bound of the K++ model is 20.6 nats/image better than that of the baseline model on the evaluated binarized Omniglot dataset. This confirms that the spatial, block allocated latent memory model proposed in this work is useful when working with image distributions. Future work will explore this dimension for other modalities, such as audio and text.
4.3 Ablation: episode length (T) and memory read steps (K).
To further explore K++, we evaluate the sensitivity of the model to varying episode lengths (T) in Figure 10 (Left) and memory read steps (K) in Figure 10 (Right) using the binarized MNIST dataset. We train K++ five times (each) for episode lengths ranging from 5 to 64 and observe that the model performs within the margin of error for increasing episode lengths, producing negative test conditional variational bounds within one standard deviation of each other (in nats/image). This suggests that, for the specific memory dimensionality used in this experiment, K++ was able to successfully capture the semantics of the binarized MNIST distribution. We suspect that for larger datasets this relationship might not necessarily hold and that the dimensionality of the memory should scale with the size of the dataset, but we leave such capacity analysis for future research.
While ablating the number of memory reads (K) in Figure 10 (Right), we observe that the total test KL-divergence varies within one standard deviation of 0.041 nats/image for a range of memory reads from 1 to 64. A lower KL divergence implies that the model is better able to fit the approximate posteriors, q(z|x) and q(y|x), to their corresponding priors in Equation 4. It should, however, be noted that a lower KL-divergence does not necessarily imply a better generative model (Theis et al., 2016). When qualitatively inspecting the generated samples, we observed that K++ produced more semantically sound generations at lower memory read steps. We suspect that the difficulty of generating realistic samples increases with the number of disjoint reads, and found that a small number of read steps produces high quality results; we use this setting for all experiments in this work.
5 Conclusion
In this work, we propose a novel block allocated memory in a generative model framework and demonstrate its state-of-the-art performance on several memory conditional image generation tasks. We also show that stochasticity in low-dimensional spaces produces higher quality samples in comparison to the high-dimensional latents typically used in VAEs. Furthermore, perturbations to the low-dimensional key generate samples with high variation. Nonetheless, there are still many unanswered questions: would a hard attention based solution to differentiable indexing prove better than a spatial transformer? What is the optimal upper bound on the number of windowed read regions for a given input distribution? Future work will hopefully address these lingering issues and further improve generative memory models.
References
Aldous (1985). Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII — 1983, pp. 1–198.

Arvanitidis et al. (2018). Latent space oddity: on the curvature of deep generative models. In 6th International Conference on Learning Representations (ICLR 2018).

Bachman (2016). An architecture for deep, hierarchical generative models. In Advances in Neural Information Processing Systems 29 (NeurIPS 2016), pp. 4826–4834.

Barber and Agakov (2004). Information maximization in noisy channels: a variational approach. In Advances in Neural Information Processing Systems, pp. 201–208.

Bornschein et al. (2017). Variational memory addressing in generative models. In Advances in Neural Information Processing Systems, pp. 3920–3929.

Burda et al. (2016). Importance weighted autoencoders. In 4th International Conference on Learning Representations (ICLR 2016).

Chen et al. (2017). Variational lossy autoencoder. In International Conference on Learning Representations (ICLR 2017).

Fan et al. (2020). Video anomaly detection and localization via gaussian mixture fully convolutional variational autoencoder. Computer Vision and Image Understanding, p. 102920.

Fraccaro et al. (2018). Generative temporal models with spatial memory for partially observed environments. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), PMLR 80, pp. 1544–1553.

Gold et al. (2005). Visual memory decay is deterministic. Psychological Science 16(10), pp. 769–774.

Goyal et al. (2017a). Z-forcing: training stochastic recurrent networks. In Advances in Neural Information Processing Systems, pp. 6713–6723.

Goyal et al. (2017b). Nonparametric variational autoencoders for hierarchical representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5094–5102.

Goyal et al. (2017). Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR abs/1706.02677.

Graves et al. (2018). Associative compression networks for representation learning. arXiv preprint arXiv:1804.02476.

Graves et al. (2014). Neural Turing machines. arXiv preprint arXiv:1410.5401.

Graves et al. (2016). Hybrid computing using a neural network with dynamic external memory. Nature 538(7626), pp. 471–476.

Gregor et al. (2015). DRAW: a recurrent neural network for image generation. In International Conference on Machine Learning, pp. 1462–1471.

Gulrajani et al. (2017). PixelVAE: a latent variable model for natural images. In 5th International Conference on Learning Representations (ICLR 2017).

He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.

Hebb (1949). The organization of behavior: a neuropsychological theory. J. Wiley; Chapman & Hall.

Hollingworth et al. (2013). Visual working memory modulates rapid eye movements to simple onset targets. Psychological Science 24(5), pp. 790–796.

Hopfield (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79(8), pp. 2554–2558.

Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 448–456.

Jaderberg et al. (2015). Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025.

Kanerva (1988). Sparse distributed memory. MIT Press.

Kingma and Ba (2015). Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).

Kingma et al. (2016). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751.

Kingma and Welling (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR).

Kipf and Welling (2016). Variational graph auto-encoders. CoRR abs/1611.07308.

Krotov and Hopfield (2016). Dense associative memory for pattern recognition. In Advances in Neural Information Processing Systems, pp. 1172–1180.

Kumaran et al. (2016). What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences 20(7), pp. 512–534.

Le et al. (2018). Variational memory encoder-decoder. In Advances in Neural Information Processing Systems, pp. 1508–1518.

Lin et al. (2019). TSM: temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7083–7093.

Liu et al. (2020). Evolving normalization-activation layers. CoRR abs/2004.02967.

Loshchilov and Hutter (2017). SGDR: stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations (ICLR 2017).

Lucas et al. (2019). Understanding posterior collapse in generative latent variable models. In Deep Generative Models for Highly Structured Data, ICLR 2019 Workshop.

Ma et al. (2018). MAE: mutual posterior-divergence regularization for variational autoencoders. In International Conference on Learning Representations.

Marblestone et al. (2020). Product Kanerva machines: factorized Bayesian memory. Bridging AI and Cognitive Science Workshop, ICLR.

Marlow et al. (2008). Parallel generational-copying garbage collection with a block-structured heap. In Proceedings of the 7th International Symposium on Memory Management, pp. 11–20.

McClelland et al. (1995). Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review 102(3), p. 419.

Miyato et al. (2018). Spectral normalization for generative adversarial networks. In 6th International Conference on Learning Representations (ICLR 2018).

Nair et al. (2018). Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pp. 9209–9220.

Nair and Hinton (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 807–814.

Plate (1995). Holographic reduced representations. IEEE Transactions on Neural Networks 6(3), pp. 623–641.

Pritzel et al. (2017). Neural episodic control. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), PMLR 70, pp. 2827–2836.

Ramachandran et al. (2018). Searching for activation functions. In ICLR 2018, Workshop Track.

Reimann et al. (2017). Cliques of neurons bound into cavities provide a missing link between structure and function. Frontiers in Computational Neuroscience 11, p. 48.

Sadeghi et al. (2019). PixelVAE++: improved PixelVAE with discrete prior. CoRR abs/1908.09948.

Salimans et al. (2017). PixelCNN++: improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In 5th International Conference on Learning Representations (ICLR 2017).

Santoro et al. (2016). Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pp. 1842–1850.

Sedai et al. (2017). Semi-supervised segmentation of optic cup in retinal fundus images using variational autoencoder. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 75–82.

Smith and Topin (2017). Super-convergence: very fast training of residual networks using large learning rates. CoRR abs/1708.07120.

Spencer and Hund (2002). Prototypes and particulars: geometric and experience-dependent spatial categories. Journal of Experimental Psychology: General 131(1), p. 16.

Theis et al. (2016). A note on the evaluation of generative models. In 4th International Conference on Learning Representations (ICLR 2016).

VAE with a vampprior.
In
International Conference on Artificial Intelligence and Statistics
, pp. 1214–1223. Cited by: §2, Table 1.  Memory networks. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 79, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §1, §2.
 Sampling generative networks. arXiv preprint arXiv:1609.04468. Cited by: §4.1.1.
 The kanerva machine: a generative distributed memory. ICLR. Cited by: §1, §2, §3.1, §3.1, §4.1, Table 1.
 Learning attractor dynamics for generative memory. In Advances in Neural Information Processing Systems, pp. 9379–9388. Cited by: §1, §2, §3.3, §4.1, Table 1.
 Group normalization. Int. J. Comput. Vis. 128 (3), pp. 742–755. External Links: Link, Document Cited by: Appendix E.
 Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888. Cited by: Appendix E.
 Variational autoencoder with truncated mixture of gaussians for functional connectivity analysis. In International Conference on Information Processing in Medical Imaging, pp. 867–879. Cited by: §2.
 Episodic reinforcement learning with associative memory. In International Conference on Learning Representations, Cited by: §1.
Appendix A Spatial Transformer Review
Indexing a matrix is typically a non-differentiable operation, since it involves a hard crop around an index. Spatial transformers (Jaderberg et al., 2015) provide a solution to this problem by decoupling it into two differentiable operations:

1. Learn an affine transformation of coordinates.
2. Apply a differentiable bilinear transformation.
The affine transformation between the target coordinates, (x_t, y_t), and the source coordinates, (x_s, y_s), is defined as:

\begin{pmatrix} x_s \\ y_s \end{pmatrix} = A_\theta \begin{pmatrix} x_t \\ y_t \\ 1 \end{pmatrix}, \qquad A_\theta = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix}   (8)

Here, the affine transform A_\theta has three learnable scalars, s, t_x, and t_y, which define a scaling and a translation in x and y respectively. In the case of K++, these three scalars represent the components of the key sample, as shown in Equation 8. After transforming the coordinates (not to be confused with the actual data), spatial transformers apply a differentiable bilinear transform, which can be interpreted as learning a differentiable mask that is element-wise multiplied by the original data, U:
V_i = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm} \, \max(0, 1 - |x_i^s - m|) \, \max(0, 1 - |y_i^s - n|)   (9)
Consider the example in Figure 11: the given parameterization differentiably extracts the region shown in Figure 11 (Right) from Figure 11 (Left).
The coordinates are normalized to lie in [-1, 1], with the center of the image at (0, 0).
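As a concrete illustration of the two steps above, the following minimal NumPy sketch builds the affine sampling grid from the three scalars (s, t_x, t_y) and then applies bilinear sampling. The function names and the border-clipping behaviour are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def affine_grid(s, tx, ty, H, W):
    """Step 1: map each normalized target coordinate to a source coordinate.

    Coordinates are normalized so the image spans [-1, 1] with the
    center at (0, 0); s scales, (tx, ty) translate.
    """
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    return s * xs + tx, s * ys + ty

def bilinear_sample(U, x_src, y_src):
    """Step 2: differentiable bilinear sampling of image U (Equation 9)."""
    H, W = U.shape
    # Map normalized source coordinates back to pixel indices.
    x = (x_src + 1) * (W - 1) / 2
    y = (y_src + 1) * (H - 1) / 2
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    # Clip so out-of-range samples reuse the border pixels (a choice we make
    # for this sketch; other boundary conditions are possible).
    x0c, x1c = np.clip(x0, 0, W - 1), np.clip(x1, 0, W - 1)
    y0c, y1c = np.clip(y0, 0, H - 1), np.clip(y1, 0, H - 1)
    # The four bilinear weights implement max(0, 1 - |.|) implicitly,
    # since each sample sits inside its enclosing pixel cell.
    wa = (x1 - x) * (y1 - y)
    wb = (x1 - x) * (y - y0)
    wc = (x - x0) * (y1 - y)
    wd = (x - x0) * (y - y0)
    return (wa * U[y0c, x0c] + wb * U[y1c, x0c]
            + wc * U[y0c, x1c] + wd * U[y1c, x1c])

# The identity parameterization (s=1, tx=ty=0) returns the image unchanged.
U = np.arange(16, dtype=float).reshape(4, 4)
x_src, y_src = affine_grid(1.0, 0.0, 0.0, 4, 4)
V = bilinear_sample(U, x_src, y_src)
```

Shrinking s below 1 zooms into a sub-region around (t_x, t_y), which is how a key can softly "crop" a block out of the larger memory.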
Appendix B CelebA Generations
We present random key generations of CelebA 64x64, trained without center cropping, in Figure 12.
Appendix C VAE vs. K++ ImageNet32x32 Generations
Appendix D Test Image Reconstructions
Appendix E Model Architecture & Training
Encoder: As mentioned in Section 3.3, we use a TSM-Resnet18 encoder with Batchnorm (Ioffe and Szegedy, 2015) and ReLU activations (Nair and Hinton, 2010) for all tasks. We apply a fractional shift of the feature maps by 0.125, as suggested by the authors.

Decoder: Our decoder is a simple conv-transpose network with EvoNormS0 (Liu et al., 2020) interspersed between each layer. EvoNormS0 is similar in spirit to Groupnorm (Wu and He, 2020) combined with the swish activation function (Ramachandran et al., 2018).

Optimizer & LR schedule: We use LARS (You et al., 2017) coupled with ADAM (Kingma and Ba, 2014) and a one-cycle (Smith and Topin, 2017) cosine learning rate schedule (Loshchilov and Hutter, 2017). A linear warmup of 10 epochs (Goyal et al., 2017c) is also used for the schedule. Weight decay is applied to every parameter barring biases and the affine terms of batchnorm. Each task is trained for 500 or 1000 epochs, depending on the size of the dataset.

Dense models: All dense models, such as our key network, are simple three-layer linear dense models with a latent dimension of 512, coupled with spectral normalization (Miyato et al., 2018).
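The learning-rate schedule described above, a linear warmup followed by a single cosine decay, can be sketched as follows. The peak learning rate, step counts, and function name are illustrative placeholders, not values from the paper.

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps):
    """One-cycle cosine schedule with linear warmup (a sketch).

    base_lr and warmup_steps are hypothetical placeholders; the paper
    uses a 10-epoch warmup but does not fix these exact numbers here.
    """
    if step < warmup_steps:
        # Linear warmup: ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Single cosine half-cycle decaying from base_lr toward 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

# e.g. lr_at(0, 1000, 0.1, 100) -> 0.001 (start of warmup)
```

In practice one would wrap this in the framework's scheduler API rather than call it manually per step.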
Memory writer: uses a deep linear conv-transpose decoder on the pooled embedding, with a base feature map projection size of 256, divided by 2 per layer. We use the same memory size for all the experiments in this work.

Learned prior: uses a convolutional encoder that stacks the read traces along the channel dimension and projects them to the target latent dimensionality.
In practice, we observed that K++ is about 2x faster (wall clock) than our reimplementation of DKM. We attribute this mainly to not having to solve an inner OLS optimization loop for memory inference.
E.1 Memory creation protocol
The memory creation protocol of K++ is similar to that of the DKM model, given the deterministic relaxations and addressing mechanism described in Sections 3.3 and 3.4. Each memory is a function of an episode of samples. To efficiently optimize the conditional lower bound in Equation 4, we parallelize the learning objective over a set of minibatches, as is typical in the optimization of neural networks. As with the DKM model, K++ computes the train and test conditional evidence lower bounds in Table 1 by first inferring the memory from the input episode, followed by the read-out procedure described in Section 3.5.
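The protocol above (infer the memory from the episode, then read each sample back out and score it under the bound) can be summarized with the following sketch. Every function passed in (`encode`, `write_memory`, `read`, `log_likelihood`, `kl_terms`) is a hypothetical stand-in for the corresponding network or bound term in Sections 3.3-3.5 and Equation 4, not the actual model.

```python
def episode_elbo(episode, encode, write_memory, read, log_likelihood, kl_terms):
    """Schematic conditional ELBO for one episode.

    First infer the memory M deterministically from the whole episode
    (the write phase), then read each sample back out of M and score the
    reconstruction against the KL penalty (the read phase).
    """
    embeddings = [encode(x) for x in episode]
    M = write_memory(embeddings)          # deterministic write over the episode
    elbo = 0.0
    for x, e in zip(episode, embeddings):
        z = read(M, e)                    # read trace for this sample
        elbo += log_likelihood(x, z) - kl_terms(z, M)
    return elbo / len(episode)
```

With perfect reconstruction and zero KL, the bound is trivially 0; in training, the same routine is batched over episodes and maximized by gradient ascent.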