{"title": "Estimating Mutual Information for Discrete-Continuous Mixtures", "book": "Advances in Neural Information Processing Systems", "page_first": 5986, "page_last": 5997, "abstract": "Estimation of mutual information from observed samples is a basic primitive in machine learning, useful in several learning tasks including correlation mining, information bottleneck, Chow-Liu tree, and conditional independence testing in (causal) graphical models. While mutual information is a quantity well-defined for general probability spaces, estimators have been developed only in the special case of discrete or continuous pairs of random variables. Most of these estimators operate using the 3H-principle, i.e., by calculating the three (differential) entropies of X, Y and the pair (X,Y). However, in general mixture spaces, such individual entropies are not well defined, even though mutual information is. In this paper, we develop a novel estimator for estimating mutual information in discrete-continuous mixtures. We prove the consistency of this estimator theoretically as well as demonstrate its excellent empirical performance. 
This problem is relevant in a wide array of applications, where some variables are discrete, some continuous, and others are a mixture between continuous and discrete components.", "full_text": "Estimating Mutual Information for Discrete-Continuous Mixtures\n\nWeihao Gao\nDepartment of ECE, Coordinated Science Laboratory\nUniversity of Illinois at Urbana-Champaign\nwgao9@illinois.edu\n\nSreeram Kannan\nDepartment of Electrical Engineering\nUniversity of Washington\nksreeram@uw.edu\n\nSewoong Oh\nDepartment of IESE, Coordinated Science Laboratory\nUniversity of Illinois at Urbana-Champaign\nswoh@illinois.edu\n\nPramod Viswanath\nDepartment of ECE, Coordinated Science Laboratory\nUniversity of Illinois at Urbana-Champaign\npramodv@illinois.edu\n\nAbstract\n\nEstimation of mutual information from observed samples is a basic primitive in machine learning, useful in several learning tasks including correlation mining, information bottleneck, Chow-Liu tree, and conditional independence testing in (causal) graphical models. While mutual information is a quantity well-defined for general probability spaces, estimators have been developed only in the special case of discrete or continuous pairs of random variables. Most of these estimators operate using the 3H-principle, i.e., by calculating the three (differential) entropies of X, Y and the pair (X, Y ). However, in general mixture spaces, such individual entropies are not well defined, even though mutual information is. In this paper, we develop a novel estimator for estimating mutual information in discrete-continuous mixtures. We prove the consistency of this estimator theoretically as well as demonstrate its excellent empirical performance. 
This problem is relevant in a wide array of applications, where some variables are discrete, some continuous, and others are a mixture between continuous and discrete components.\n\n1 Introduction\n\nA fundamental quantity of interest in machine learning is mutual information (MI), which characterizes the shared information between a pair of random variables (X, Y ). MI obeys several intuitively appealing properties, including the data-processing inequality, invariance under one-to-one transformations, and the chain rule [10]. Therefore, mutual information is widely used in machine learning for canonical tasks such as classification [35], clustering [32, 49, 8, 29] and feature selection [2, 13]. Mutual information also emerges as the \u201ccorrect\u201d quantity in several graphical model inference problems (e.g., the Chow-Liu tree [9] and conditional independence testing [6]). MI is also pervasively used in many data science application domains, such as sociology [40], computational biology [28], and computational neuroscience [41].\nAn important problem in any of these applications is to estimate mutual information effectively from samples. While mutual information has been the de facto measure of information in several applications for decades, the estimation of mutual information from samples remains an active research problem. Recently, there has been a resurgence of interest in entropy, relative entropy and\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nmutual information estimators, on both the theoretical and practical fronts [46, 31, 44, 45, 22, 19, 7, 15, 14, 17, 16].\nThe previous estimators focus on one of two cases: the data is either purely discrete or purely continuous. In these special cases, the mutual information can be calculated based on the three (differential) entropies of X, Y and (X, Y ). 
We term estimators based on this principle 3H-estimators (since they estimate three entropy terms), and a majority of previous estimators fall under this category [19, 16, 46].\nIn practical downstream applications, we often have to deal with a mixture of continuous and discrete random variables. Random variables can be mixed in several ways. First, one random variable can be discrete while the other is continuous: for example, when measuring the strength of the relationship between children's age and height, age X is discrete and height Y is continuous. Second, a single scalar random variable can itself be a mixture of discrete and continuous components. For example, consider X following a zero-inflated Gaussian distribution, which takes the value 0 with probability 0.1 and follows a Gaussian distribution with mean 10 with probability 0.9. This distribution has both a discrete component and a component with a density, and is a well-known model for gene expression readouts [24, 37]. Finally, X and/or Y can be high-dimensional vectors, each of whose components may be discrete, continuous or mixed.\nIn all of the aforementioned mixed cases, mutual information is well-defined through the Radon-Nikodym derivative (see Section 2) but cannot be expressed as a function of the entropies or differential entropies of the random variables. Crucially, entropy is not well defined when a single scalar random variable comprises both discrete and continuous components, in which case 3H estimators (the vast majority of prior art) cannot be directly employed. In this paper, we address this challenge by proposing an estimator that can handle all these cases of mixture distributions. 
The estimator directly estimates the Radon-Nikodym derivative using the k-nearest-neighbor distances from the samples; we prove \u21132 consistency of the estimator and demonstrate its excellent practical performance through a variety of experiments on both synthetic and real datasets. Most relevantly, it strongly outperforms the natural baselines of discretizing the mixed random variables (by quantization) or making them continuous by adding small Gaussian noise.\nThe rest of the paper is organized as follows. In Section 2, we review the general definition of mutual information via the Radon-Nikodym derivative and show that it is well-defined for all the cases of mixtures. In Section 3, we propose our estimator of mutual information for mixed random variables. In Section 4, we prove that our estimator is \u21132 consistent under certain technical assumptions and verify that the assumptions are satisfied in most practical cases. Section 5 contains the results of our detailed synthetic and real-world experiments testing the efficacy of the proposed estimator.\n\n2 Problem Formulation\n\nIn this section, we define mutual information for general distributions as follows (e.g., [39]).\nDefinition 2.1. Let PXY be a probability measure on the space X \u00d7 Y, where X and Y are both Euclidean spaces. For any measurable sets A \u2286 X and B \u2286 Y, define PX(A) = PXY(A \u00d7 Y) and PY(B) = PXY(X \u00d7 B). Let PXPY be the product measure PX \u00d7 PY. Then the mutual information I(X; Y ) of PXY is defined as\n\nI(X; Y ) \u2261 \u222b_{X\u00d7Y} log ( dPXY / dPXPY ) dPXY ,  (1)\n\nwhere dPXY/dPXPY is the Radon-Nikodym derivative.\nWe prove that for any probability measure P on X \u00d7 Y, the joint measure PXY is absolutely continuous with respect to the product measure PXPY, hence mutual information is well-defined. See the supplementary material for the detailed proof. 
Notice that this general definition includes the following cases of mixtures: (1) X is discrete and Y is continuous (or vice versa); (2) X or Y has several components each, where some components are discrete and some are continuous; (3) X or Y or their joint distribution is a mixture of continuous and discrete distributions.\n\n3 Estimators of Mutual Information\n\nReview of prior work. The estimation problem is quite different depending on whether the underlying distribution is discrete, continuous or mixed. As pointed out earlier, most existing estimators for mutual information are based on the 3H principle: they estimate the three entropy terms first. This 3H principle can be applied only in the purely discrete or purely continuous case.\nDiscrete data: For entropy estimation of a discrete variable X, the straightforward approach of plugging the estimated probabilities \u02c6pX(x) into the formula for entropy has been shown to be suboptimal [33, 1]. Novel entropy estimators with sub-linear sample complexity have been proposed [48, 53, 19, 21, 20, 23]. MI estimation can then be performed using the 3H principle, and such an approach is shown to be worst-case optimal for mutual-information estimation [19].\nContinuous data: There are several estimators for the differential entropy of continuous random variables, which have been exploited within the 3H principle to calculate the mutual information [3]. One family of entropy estimators is based on kernel density estimators [34] followed by re-substitution estimation. 
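For intuition, the discrete plug-in route mentioned above is easy to state concretely. The following is a minimal sketch of our own (for illustration only; it is the naive empirical plug-in, not the sub-linear-sample estimators cited): estimate P(x, y), P(x) and P(y) by empirical frequencies and average the log of their ratio.

```python
import math
from collections import Counter

def plugin_mi(pairs):
    """Naive plug-in MI for purely discrete samples: the average of
    log( P_hat(x, y) / (P_hat(x) * P_hat(y)) ) under the empirical joint."""
    n = len(pairs)
    pxy = Counter(pairs)                 # joint counts
    px = Counter(x for x, _ in pairs)    # marginal counts of X
    py = Counter(y for _, y in pairs)    # marginal counts of Y
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

On samples with X = Y uniform over {0, 1} this returns log 2, and on independent pairs it returns 0 up to floating point; it is exactly this quantity that becomes unusable once a continuous component makes the empirical atoms all distinct.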
An alternate family of entropy estimators is based on k-Nearest Neighbor (k-NN) estimates, beginning with the pioneering work of Kozachenko and Leonenko [26] (the so-called KL estimator). Recent progress involves an inspired mixture of an ensemble of kernel and k-NN estimators [46, 4]. Exponential concentration bounds under certain conditions are given in [43].\nMixed Random Variables: Since the entropies themselves may not be well defined for mixed random variables, there is no direct way to apply the 3H principle. However, once the data is quantized, this principle can be applied in the discrete domain. That mutual information in arbitrary measure spaces can indeed be computed as a maximum over quantizations is a classical result [18, 36, 38]. However, the choice of quantization is complicated: while some quantization schemes are known to be consistent when there is a joint density [11], the mixed case is complex. Estimators of the average of the Radon-Nikodym derivative have been studied in [50, 51]. Very recent work generalizing the ensemble entropy estimator to the case where some components are discrete and others continuous appears in [31].\nBeyond 3H estimation: In an inspired work, [27] proposed a direct method (the KSG estimator) for estimating mutual information when the variables have a joint density. The estimator starts from the 3H estimator built on k-NN differential entropy estimates, and employs a heuristic to couple the three estimates in order to improve the estimator. While the original paper did not contain any theoretical proof, even of consistency, its excellent practical performance has encouraged widespread adoption. Recent work [17] has established the consistency of this estimator along with its convergence rate. Further, recent works [14, 16] involving a combination of kernel density estimators and k-NN methods have been proposed to further improve the KSG estimator. 
[42] extends the KSG estimator to the case when one variable is discrete and the other is scalar continuous.\nNone of these works covers the case where even one of the components has a mixture of a continuous and a discrete distribution, let alone general probability distributions. There are two generic options: (1) add small independent noise to each sample to break ties among repeated samples and apply a continuous-valued MI estimator (like KSG), or (2) quantize and apply a discrete MI estimator, whose performance in the high-dimensional case is poor. These form the baselines to compare against in our detailed simulations.\nMixed Regime. We first examine the behavior of other estimators in the mixed regime, before proceeding to develop our estimator. Consider the case when X is discrete (but real-valued) and Y possesses a density, and examine the consequence of using the 3H principle with the differential entropies estimated by k-nearest neighbors. To do this, fix a parameter k that determines the number of neighbors, and let \u03c1_{i,z} denote the distance from sample i to its k-th nearest neighbor, where z = x or z = y or z = (x, y). Then\n\n\u00ce_{3H}^{(N)}(X; Y ) = (1/N) \u2211_{i=1}^{N} [ log( N c_x \u03c1_{i,x}^{d_x} / k ) + a(k) ] + (1/N) \u2211_{i=1}^{N} [ log( N c_y \u03c1_{i,y}^{d_y} / k ) + a(k) ] \u2212 (1/N) \u2211_{i=1}^{N} [ log( N c_{xy} \u03c1_{i,xy}^{d_x + d_y} / k ) + a(k) ],\n\nwhere \u03c8(\u00b7) is the digamma function, a(\u00b7) = log(\u00b7) \u2212 \u03c8(\u00b7), and c_x, c_y, c_{xy} are the unit-ball volumes of the corresponding spaces. In the case that X is discrete and Y has a density, repeated values of X force \u03c1_{i,x} = 0, so the first term diverges and \u00ce_{3H}(X; Y ) = \u2212\u221e + a \u2212 b = \u2212\u221e, which is clearly wrong.\nThe basic idea of the KSG estimator is to ensure that \u03c1 is the same for x, y and (x, y), with the difference instead in the number of nearest neighbors. 
Let n_{x,i} be the number of samples X_j within distance \u03c1_{i,xy} of X_i and n_{y,i} the number of samples Y_j within distance \u03c1_{i,xy} of Y_i. Then the KSG estimator is given by\n\n\u00ce_{KSG}^{(N)} \u2261 (1/N) \u2211_{i=1}^{N} ( \u03c8(k) + log N \u2212 log(n_{x,i} + 1) \u2212 log(n_{y,i} + 1) ),\n\nwhere \u03c8(\u00b7) is the digamma function.\nIn the case of X being discrete and Y being continuous, it turns out that the KSG estimator does not blow up (unlike the 3H estimator), since the distances do not go to zero. However, in the mixed case the estimator has a non-trivial bias due to the discrete points and is no longer consistent.\nProposed Estimator. We propose the following estimator for general probability distributions, inspired by the KSG estimator. The intuition is as follows. First notice that MI is the average of the logarithm of the Radon-Nikodym derivative, so we compute the Radon-Nikodym derivative at each sample i and take the empirical average. The re-substitution estimator for MI is then\n\n\u00ce(X; Y ) \u2261 (1/n) \u2211_{i=1}^{n} log ( (dPXY/dPXPY)(x_i, y_i) ).\n\nThe basic idea behind our estimate of the Radon-Nikodym derivative at each sample point is as follows:\n\n\u2022 When the point is discrete (which can be detected by checking whether the k-nearest-neighbor distance of sample i is zero), we can assert that sample i lies in a discrete component, and we use a plug-in estimator for the Radon-Nikodym derivative.\n\u2022 When there is a joint density locally around the point, the KSG estimator suggests a natural idea: fix the radius and estimate the Radon-Nikodym derivative by \u03c8(k) + log N \u2212 log(n_{x,i} + 1) \u2212 log(n_{y,i} + 1).\n\u2022 If the k-nearest-neighbor distance is not zero, the point may be either purely continuous or mixed; we show below that the method for the purely continuous case also applies to the mixed case.\n\nPrecisely, let n_{x,i} be the number of samples X_j within distance \u03c1_{i,xy} of X_i and n_{y,i} the number of samples Y_j within distance \u03c1_{i,xy} of Y_i, and let \u02dck_i denote the number of tuples (X_j, Y_j) within distance \u03c1_{i,xy} of (X_i, Y_i). If the k-NN distance is zero, which means that the sample (X_i, Y_i) is a discrete point of the probability measure, we set \u02dck_i to the number of samples that take the same value as (X_i, Y_i); otherwise we keep \u02dck_i = k. 
Our proposed estimator is described in detail in Algorithm 1.\n\nAlgorithm 1 Mixed Random Variable Mutual Information Estimator\nInput: {X_i, Y_i}_{i=1}^{N}, where X_i \u2208 X and Y_i \u2208 Y;\nParameter: k \u2208 Z+;\nfor i = 1 to N do\n  \u03c1_{i,xy} := the k-th smallest distance among [ d_{i,j} := max{\u2016X_j \u2212 X_i\u2016, \u2016Y_j \u2212 Y_i\u2016}, j \u2260 i ];\n  if \u03c1_{i,xy} = 0 then\n    \u02dck_i := number of samples such that d_{i,j} = 0;\n  else\n    \u02dck_i := k;\n  end if\n  n_{x,i} := number of samples such that \u2016X_j \u2212 X_i\u2016 \u2264 \u03c1_{i,xy};\n  n_{y,i} := number of samples such that \u2016Y_j \u2212 Y_i\u2016 \u2264 \u03c1_{i,xy};\n  \u03be_i := \u03c8(\u02dck_i) + log N \u2212 log(n_{x,i} + 1) \u2212 log(n_{y,i} + 1);\nend for\nOutput: \u00ce^{(N)}(X; Y ) := (1/N) \u2211_{i=1}^{N} \u03be_i.\n\nWe note that our estimator recovers previous ideas in several canonical settings. If the underlying distribution is discrete, the k-nearest-neighbor distance \u03c1_{i,xy} equals 0 with high probability, and our estimator recovers the plug-in estimator. If the underlying distribution has no probability masses, then there are no overlapping samples, so \u02dck_i equals k and our estimator recovers the KSG estimator. If X is discrete, Y is one-dimensional continuous, and PX(x) > 0 for all x, then for a sufficiently large dataset the k nearest neighbors of sample (x_i, y_i) share the same value x_i with high probability; our estimator then recovers the discrete-versus-continuous estimator of [42].\n\n4 Proof of Consistency\n\nWe show that, under certain technical conditions on the joint probability measure, the proposed estimator is consistent. 
We begin with the following definitions.\n\nPXY(x, y, r) \u2261 PXY({(a, b) \u2208 X \u00d7 Y : \u2016a \u2212 x\u2016 \u2264 r, \u2016b \u2212 y\u2016 \u2264 r}),  (2)\nPX(x, r) \u2261 PX({a \u2208 X : \u2016a \u2212 x\u2016 \u2264 r}),  (3)\nPY(y, r) \u2261 PY({b \u2208 Y : \u2016b \u2212 y\u2016 \u2264 r}).  (4)\n\nTheorem 1. Suppose that\n1. k is chosen to be a function of N such that kN \u2192 \u221e and kN log N / N \u2192 0 as N \u2192 \u221e.\n2. The set of discrete points {(x, y) : PXY(x, y, 0) > 0} is finite.\n3. PXY(x, y, r) / (PX(x, r) PY(y, r)) converges to f(x, y) as r \u2192 0, and f(x, y) \u2264 C with probability 1.\n4. X \u00d7 Y can be decomposed into countably many disjoint sets {E_i}_{i=1}^{\u221e} such that f(x, y) is uniformly continuous on each E_i.\n5. \u222b_{X\u00d7Y} | log f(x, y) | dPXY < +\u221e.\nThen we have lim_{N\u2192\u221e} E[ \u00ce^{(N)}(X; Y ) ] = I(X; Y ).\n\nNotice that Assumptions 2, 3 and 4 are satisfied whenever (1) the distribution is (finitely) discrete; (2) the distribution is continuous; (3) some dimensions are (countably) discrete and some dimensions are continuous; or (4) a mixture of the previous cases. Most real-world data is covered by these cases. A sketch of the proof is below, with the full proof in the supplementary material.\n\nProof. (Sketch) We start with an explicit form of the Radon-Nikodym derivative dPXY/(dPXPY).\nLemma 4.1. Under Assumptions 3 and 4 of Theorem 1, (dPXY/(dPXPY))(x, y) = f(x, y) = lim_{r\u21920} PXY(x, y, r)/(PX(x, r)PY(y, r)).\n\nNotice that \u00ce^{(N)}(X; Y ) = (1/N) \u2211_{i=1}^{N} \u03be_i, where all \u03be_i are identically distributed. Therefore, E[ \u00ce^{(N)}(X; Y ) ] = E[\u03be_1]. 
Therefore, the bias can be written as\n\n| E[ \u00ce^{(N)}(X; Y ) ] \u2212 I(X; Y ) | = | E_{XY}[ E[\u03be_1 | X, Y] ] \u2212 \u222b log f(X, Y) dPXY | \u2264 \u222b | E[\u03be_1 | X, Y] \u2212 log f(X, Y) | dPXY.  (5)\n\nNow we upper bound | E[\u03be_1 | X, Y] \u2212 log f(X, Y) | for every (x, y) \u2208 X \u00d7 Y by dividing the domain into three parts, X \u00d7 Y = \u03a9_1 \u222a \u03a9_2 \u222a \u03a9_3, where\n\n\u2022 \u03a9_1 = {(x, y) : f(x, y) = 0};\n\u2022 \u03a9_2 = {(x, y) : f(x, y) > 0, PXY(x, y, 0) > 0};\n\u2022 \u03a9_3 = {(x, y) : f(x, y) > 0, PXY(x, y, 0) = 0}.\n\nWe show that lim_{N\u2192\u221e} \u222b_{\u03a9_i} | E[\u03be_1 | X, Y] \u2212 log f(X, Y) | dPXY = 0 for each i \u2208 {1, 2, 3} separately.\n\n\u2022 For (x, y) \u2208 \u03a9_1, we show that \u03a9_1 has zero probability with respect to PXY, i.e. PXY(\u03a9_1) = 0. Hence \u222b_{\u03a9_1} | E[\u03be_1 | X, Y] \u2212 log f(X, Y) | dPXY = 0.\n\u2022 For (x, y) \u2208 \u03a9_2, f(x, y) equals PXY(x, y, 0)/(PX(x, 0)PY(y, 0)), so the point can be viewed as part of a discrete component. We first show that the k-nearest-neighbor distance satisfies \u03c1_{k,1} = 0 with high probability; using the number of samples at (x, y) as \u02dck_i, we then show that the mean of the estimate \u03be_1 is close to log f(x, y).\n\u2022 For (x, y) \u2208 \u03a9_3, the point can be viewed as part of a continuous component. We use a proof technique similar to [27] to show that the mean of the estimate \u03be_1 is close to log f(x, y).\n\nThe following theorem bounds the variance of the proposed estimator.\nTheorem 2. 
Assume in addition that\n\n6. (kN log N)^2 / N \u2192 0 as N \u2192 \u221e.\n\nThen we have\n\nlim_{N\u2192\u221e} Var[ \u00ce^{(N)}(X; Y ) ] = 0.  (6)\n\nProof. (Sketch) We use the Efron-Stein inequality to bound the variance of the estimator. For simplicity, let \u00ce^{(N)}(Z) be the estimate based on the original samples {Z_1, Z_2, . . . , Z_N}, where Z_i = (X_i, Y_i), and let \u00ce^{(N)}(Z\\j) be the estimate from {Z_1, . . . , Z_{j\u22121}, Z_{j+1}, . . . , Z_N}. Then a version of the Efron-Stein inequality states that\n\nVar[ \u00ce^{(N)}(Z) ] \u2264 2 \u2211_{j=1}^{N} ( sup_{Z_1,...,Z_N} | \u00ce^{(N)}(Z) \u2212 \u00ce^{(N)}(Z\\j) | )^2.\n\nNow recall that\n\n\u00ce^{(N)}(Z) = (1/N) \u2211_{i=1}^{N} \u03be_i(Z),  \u03be_i(Z) = \u03c8(\u02dck_i) + log N \u2212 log(n_{x,i} + 1) \u2212 log(n_{y,i} + 1).  (7)\n\nTherefore, we have\n\nsup_{Z_1,...,Z_N} | \u00ce^{(N)}(Z) \u2212 \u00ce^{(N)}(Z\\j) | \u2264 (1/N) sup_{Z_1,...,Z_N} \u2211_{i=1}^{N} | \u03be_i(Z) \u2212 \u03be_i(Z\\j) |.  (8)\n\nTo upper bound the difference | \u03be_i(Z) \u2212 \u03be_i(Z\\j) | created by eliminating sample Z_j, we consider three cases for the different i's: (1) i = j; (2) \u03c1_{k,i} = 0; (3) \u03c1_{k,i} > 0, and conclude that \u2211_{i=1}^{N} | \u03be_i(Z) \u2212 \u03be_i(Z\\j) | \u2264 O(k log N) for all Z_i's. The details of the case study are in Section ?? of the supplementary material. 
Plugging this into the Efron-Stein inequality, we obtain\n\nVar[ \u00ce^{(N)}(Z) ] \u2264 2 \u2211_{j=1}^{N} ( sup_{Z_1,...,Z_N} | \u00ce^{(N)}(Z) \u2212 \u00ce^{(N)}(Z\\j) | )^2 \u2264 2 \u2211_{j=1}^{N} ( (1/N) sup_{Z_1,...,Z_N} \u2211_{i=1}^{N} | \u03be_i(Z) \u2212 \u03be_i(Z\\j) | )^2 = O((k log N)^2 / N).  (9)\n\nBy Assumption 6, we have lim_{N\u2192\u221e} Var[ \u00ce^{(N)}(Z) ] = 0.\n\nCombining Theorem 1 and Theorem 2, we have the \u21132 consistency of \u00ce^{(N)}(X; Y ).\n\n5 Simulations\n\nWe evaluate the performance of our estimator in a variety of (synthetic and real-world) experiments.\nExperiment I. (X, Y ) is a mixture of one continuous distribution and one discrete distribution. The continuous distribution is jointly Gaussian with zero mean and covariance \u03a3 = [[1, 0.9], [0.9, 1]], and the discrete distribution is P(X = 1, Y = 1) = P(X = \u22121, Y = \u22121) = 0.45 and P(X = 1, Y = \u22121) = P(X = \u22121, Y = 1) = 0.05. These two distributions are mixed with equal probability.\n\nFigure 1: Left: An example of samples from a mixture of continuous (blue) and discrete (red) distributions, where red points denote multiple samples. Right: An example of samples from a discrete X and a continuous Y.\n\nThe scatter plot of a set of samples from this distribution is shown in the left panel of Figure 1, where the red squares denote multiple samples from the discrete distribution. 
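A sampler for this mixture is straightforward; the sketch below is our own illustration (the seed and function name are arbitrary choices): the continuous component is drawn via a Cholesky factorization of the stated covariance, the discrete component from the four-point distribution with the stated weights.

```python
import math
import random

def sample_exp1(n, seed=0):
    """Draw n samples: w.p. 1/2 a zero-mean Gaussian pair with correlation 0.9,
    w.p. 1/2 a discrete pair on {-1, +1}^2 with P(1,1) = P(-1,-1) = 0.45 and
    P(1,-1) = P(-1,1) = 0.05."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        if rng.random() < 0.5:
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            # Cholesky factor of [[1, 0.9], [0.9, 1]]
            out.append((z1, 0.9 * z1 + math.sqrt(1.0 - 0.81) * z2))
        else:
            u = rng.random()
            if u < 0.45:
                out.append((1.0, 1.0))
            elif u < 0.90:
                out.append((-1.0, -1.0))
            elif u < 0.95:
                out.append((1.0, -1.0))
            else:
                out.append((-1.0, 1.0))
    return out
```

About half of the resulting points land exactly on the four atoms, which is precisely what defeats estimators that assume either a density everywhere or a finite alphabet.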
For all synthetic experiments, we compare our proposed estimator with a (fixed) partitioning estimator, an adaptive partitioning estimator [11] implemented by [47], the KSG estimator [27], and a noisy KSG estimator (adding Gaussian noise N(0, \u03c3\u00b2 I) to each sample to transform all mixed distributions into continuous ones). We plot the mean squared error versus the number of samples in Figure 2; the mean squared error is averaged over 250 independent trials.\nThe KSG estimator is entirely misled by the discrete samples, as expected. The noisy KSG estimator performs better, but the added noise degrades the estimate; in this experiment the estimate is relatively insensitive to the amount of noise added, and its curve is indistinguishable from the KSG curve. The partitioning and adaptive partitioning methods quantize all samples, incurring an extra quantization error. Note that only the proposed estimator has error decreasing with the sample size.\nExperiment II. X is a discrete random variable and Y is a continuous random variable: X is uniformly distributed over the integers {0, 1, . . . , m \u2212 1} and, given X, Y is uniformly distributed over the range [X, X + 2]. The ground truth is I(X; Y ) = log(m) \u2212 (m \u2212 1) log(2)/m. We choose m = 5; a scatter plot of a set of samples is in the right panel of Figure 1. Notice that in this case (and the following experiments) our proposed estimator degenerates to KSG if the hyperparameter k is chosen the same, hence KSG is not plotted. In this experiment our proposed estimator outperforms the other methods.\nExperiment III. Higher-dimensional mixture. Let (X1, Y1) and (X2, Y2) have the same joint distribution as in Experiment II and be independent of each other. We evaluate the mutual information between X = (X1, X2) and Y = (Y1, Y2). The ground truth is I(X; Y ) = 2(log(m) \u2212 (m \u2212 1) log(2)/m). 
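The Experiment II ground truth can be checked by direct numerical integration of the general definition, since f(x, y) = p(y|x)/p(y) here. A sketch of our own (the grid size is an arbitrary choice):

```python
import math

def exp2_closed_form(m):
    """Ground truth stated in the text: I = log m - (m - 1) log 2 / m."""
    return math.log(m) - (m - 1) * math.log(2.0) / m

def exp2_numeric(m, grid=100000):
    """Midpoint-rule evaluation of I = sum_x P(x) * integral of
    p(y|x) log(p(y|x)/p(y)) dy, for X uniform on {0, ..., m-1}
    and Y | X = x uniform on [x, x + 2] (so p(y|x) = 1/2)."""
    hi = (m - 1) + 2.0
    dy = hi / grid
    total = 0.0
    for i in range(grid):
        y = (i + 0.5) * dy
        cover = sum(1 for x in range(m) if x <= y <= x + 2)  # how many x cover y
        if cover:
            py = cover / (2.0 * m)  # marginal density of Y
            # each covering x contributes P(x) * p(y|x) * log(p(y|x)/p(y))
            total += cover * (1.0 / m) * 0.5 * math.log(0.5 / py) * dy
    return total
```

For m = 5 the two agree to about three decimal places (roughly 1.0549 nats), a useful sanity check before trusting the simulated error curves.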
We also consider X = (X1, X2, X3) and Y = (Y1, Y2, Y3), where (X3, Y3) has the same joint distribution as in Experiment II and is independent of (X1, Y1) and (X2, Y2). The ground truth is I(X; Y ) = 3(log(m) \u2212 (m \u2212 1) log(2)/m). The adaptive partitioning algorithm works only for one-dimensional X and Y and is not compared here.\nWe can see that the performance of the partitioning estimator is very poor because the number of partitions grows exponentially with dimension; the proposed algorithm suffers less from the curse of dimensionality. In the right figure, the noisy KSG method has smaller error, but we point out that it is unstable with respect to the noise level added: as the noise level is varied from \u03c3 = 0.5 to \u03c3 = 0.7, the performance varies significantly (far from convergence).\nExperiment IV. Zero-inflated Poissonization. Here X \u223c Exp(1) is a standard exponential random variable, and Y is a zero-inflated Poissonization of X: Y = 0 with probability p and, given X = x, Y \u223c Poisson(x) with probability 1 \u2212 p. Here the ground truth is I(X; Y ) = (1 \u2212 p)(2 log 2 \u2212 \u03b3 \u2212 \u2211_{k=1}^{\u221e} log k \u00b7 2^{\u2212k}) \u2248 (1 \u2212 p) \u00b7 0.3012, where \u03b3 is the Euler-Mascheroni constant. We repeat the experiment for no zero-inflation (p = 0) and for p = 15%. We find that the proposed estimator is comparable to adaptive partitioning for no zero-inflation and outperforms the others for 15% zero-inflation.\n\nFigure 2: Mean squared error vs. sample size for synthetic experiments. Top row (left to right): Experiment I; Experiment II. Middle row (left to right): Experiment III for 4 dimensions and 6 dimensions. 
Bottom row (left to right): Experiment IV for p = 0 and p = 15%.\n\nWe conclude that our proposed estimator is consistent in all four experiments, and its mean squared error is always the best or comparable to the best. Every other estimator is either inconsistent or has a large mean squared error in at least one experiment.\nFeature Selection Task. Suppose there is a set of features modeled by independent random variables (X1, . . . , Xp), and the data Y depends on a subset of features {Xi}_{i\u2208S}, where card(S) = q < p. We observe the features (X1, . . . , Xp) and the data Y and try to select which features are related to Y. In many biological applications, some of the data is lost for experimental reasons and set to 0, and even the available data is noisy. This setting naturally leads to a mixture of continuous and discrete parts, which we model by supposing that the observations are \u02dcXi and \u02dcY instead of Xi and Y. Here \u02dcXi and \u02dcY equal 0 with probability \u03c3 and, with probability 1 \u2212 \u03c3, follow a distribution parameterized by Xi or Y respectively (corresponding to the noisy observation).\nIn this experiment, (X1, . . . , X20) are i.i.d. standard exponential random variables and Y is simply (X1, . . . , X5). \u02dcXi equals 0 with probability 0.15, and \u02dcXi \u223c Poisson(Xi) with probability 0.85. \u02dcYi equals 0 with probability 0.15, and \u02dcYi \u223c Exp(Yi) with probability 0.85. 
Upon observing the \u02dcXi's and \u02dcY, we evaluate MI_i = I(\u02dcXi; \u02dcY) using different estimators, and select the features with the top-r highest mutual information. Since the underlying number of features is unknown, we iterate over all r \u2208 {0, . . . , p} and obtain a receiver operating characteristic (ROC) curve, shown in the left panel of Figure 3. Compared to the partitioning, noisy KSG and KSG estimators, we conclude that our proposed estimator outperforms the others.\n\nFigure 3: Left: ROC curve for the feature selection task. Right: AUROC versus level of dropout for gene regulatory network inference.\n\nGene regulatory network inference. Gene expressions form a rich source of data from which to infer gene regulatory networks; it is now possible to sequence gene expression data from single cells using a technology called single-cell RNA-sequencing [52]. However, this technology suffers from dropout: sometimes, even when a gene is expressed, it is not sequenced [25, 12]. While we tested our algorithm on real single-cell RNA-seq datasets, it is hard to establish the ground truth on these datasets. Instead we resorted to a challenge dataset for reconstructing regulatory networks, called the DREAM5 challenge [30]. 
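The ROC construction used here is generic: sweep the top-r threshold over features ranked by score, and at each r record the fraction of relevant and irrelevant features selected. A small self-contained sketch of our own (the scores and labels are hypothetical inputs, with scores standing in for the estimated MI values):

```python
def roc_points(scores, labels):
    """Sweep the top-r cutoff over features ranked by score;
    labels[i] == 1 marks a truly relevant feature.
    Returns (false positive rate, true positive rate) pairs."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))
    return pts

def auroc(pts):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

A ranking that puts every relevant feature first gives AUROC 1.0; reversing it gives 0.0, so the area directly summarizes how well an MI estimator orders the features.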
The simulated (in silico) version of this dataset contains gene expression for 20 genes, with 660 data points covering various perturbations. The goal is to reconstruct the true network among the genes. We used mutual information as the test statistic to obtain the AUROC for each method. While the dataset does not contain dropouts, to simulate their effect in real data we introduced various levels of dropout and compared the AUROC (area under the ROC curve) of the different algorithms in the right panel of Figure 3, where we find that the proposed algorithm outperforms the competing ones.

Acknowledgement

This work was partially supported by NSF grants CNS-1527754, CCF-1553452, CCF-1705007, CCF-1651236, CCF-1617745, CNS-1718270 and a Google Faculty Research Award.

References

[1] Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Suresh. Maximum likelihood approach for symmetric distribution property estimation.

[2] R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4):537–550, 1994.

[3] Jan Beirlant, Edward J Dudewicz, László Györfi, and Edward C Van der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17–39, 1997.

[4] Thomas B Berrett, Richard J Samworth, and Ming Yuan. Efficient multivariate entropy estimation via k-nearest neighbour distances. arXiv preprint arXiv:1606.00304, 2016.

[5] Gérard Biau and Luc Devroye. Lectures on the nearest neighbor method. Springer, 2015.

[6] Christopher M Bishop. Pattern recognition. Machine Learning, 128:1–58, 2006.

[7] Yuheng Bu, Shaofeng Zou, Yingbin Liang, and Venugopal V Veeravalli.
Estimation of KL divergence between large-alphabet distributions. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 1118–1122. IEEE, 2016.

[8] C. Chan, A. Al-Bashabsheh, J. B. Ebrahimi, T. Kaced, and T. Liu. Multivariate mutual information inspired by secret-key agreement. Proceedings of the IEEE, 103(10):1883–1913, 2015.

[9] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.

[10] T. M. Cover and J. A. Thomas. Information theory and statistics. Elements of Information Theory, pages 279–335, 1991.

[11] Georges A Darbellay and Igor Vajda. Estimation of the information by an adaptive partitioning of the observation space. IEEE Transactions on Information Theory, 45(4):1315–1321, 1999.

[12] Greg Finak, Andrew McDavid, Masanao Yajima, Jingyuan Deng, Vivian Gersuk, Alex K Shalek, Chloe K Slichter, Hannah W Miller, M Juliana McElrath, Martin Prlic, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biology, 16(1):278, 2015.

[13] F. Fleuret. Fast binary feature selection with conditional mutual information. The Journal of Machine Learning Research, 5:1531–1555, 2004.

[14] S. Gao, G. Ver Steeg, and A. Galstyan. Estimating mutual information by local Gaussian approximation. arXiv preprint arXiv:1508.00536, 2015.

[15] Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics, pages 277–286, 2015.

[16] Weihao Gao, Sewoong Oh, and Pramod Viswanath. Breaking the bandwidth barrier: Geometrical adaptive entropy estimation.
In Advances in Neural Information Processing Systems, pages 2460–2468, 2016.

[17] Weihao Gao, Sewoong Oh, and Pramod Viswanath. Demystifying fixed k-nearest neighbor information estimators. In Information Theory (ISIT), 2017 IEEE International Symposium on, pages 1267–1271. IEEE, 2017.

[18] Izrail Moiseevich Gelfand and A. M. Yaglom. Calculation of the amount of information about a random function contained in another such function. American Mathematical Society, Providence, 1959.

[19] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Adaptive estimation of Shannon entropy. In Information Theory (ISIT), 2015 IEEE International Symposium on, pages 1372–1376. IEEE, 2015.

[20] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax estimation of discrete distributions under ℓ1 loss. IEEE Transactions on Information Theory, 61(11):6343–6354, 2015.

[21] Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, 2015.

[22] Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Maximum likelihood estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 63(10):6774–6798, 2017.

[23] Jiantao Jiao, Kartik Venkat, and Tsachy Weissman. Non-asymptotic theory for the plug-in rule in functional estimation. Available on arXiv, 2014.

[24] Peter V Kharchenko, Lev Silberstein, and David T Scadden. Bayesian approach to single-cell differential expression analysis. Nature Methods, 11(7):740–742, 2014.

[25] Peter V Kharchenko, Lev Silberstein, and David T Scadden. Bayesian approach to single-cell differential expression analysis. Nature Methods, 11(7):740–742, 2014.

[26] L. F. Kozachenko and Nikolai N Leonenko.
Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2):9–16, 1987.

[27] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004.

[28] Smita Krishnaswamy, Matthew H Spitzer, Michael Mingueneau, Sean C Bendall, Oren Litvin, Erica Stone, Dana Pe'er, and Garry P Nolan. Conditional density-based analysis of T cell signaling in single-cell data. Science, 346(6213):1250689, 2014.

[29] Pan Li and Olgica Milenkovic. Inhomogeneous hypergraph clustering with applications. arXiv preprint arXiv:1709.01249, 2017.

[30] Daniel Marbach, James C Costello, Robert Küffner, Nicole M Vega, Robert J Prill, Diogo M Camacho, Kyle R Allison, Manolis Kellis, James J Collins, Gustavo Stolovitzky, et al. Wisdom of crowds for robust gene network inference. Nature Methods, 9(8):796–804, 2012.

[31] Kevin R Moon, Kumar Sricharan, and Alfred O Hero III. Ensemble estimation of mutual information. arXiv preprint arXiv:1701.08083, 2017.

[32] A. C. Müller, S. Nowozin, and C. H. Lampert. Information theoretic clustering using minimum spanning trees. Springer, 2012.

[33] Liam Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.

[34] Liam Paninski and Masanao Yajima. Undersmoothed kernel entropy estimators. IEEE Transactions on Information Theory, 54(9):4384–4388, 2008.

[35] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, 2005.

[36] A. Perez. Information theory with abstract alphabets. Theory of Probability and its Applications, 4(1), 1959.

[37] Emma Pierson and Christopher Yau. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis.
Genome Biology, 16(1):241, 2015.

[38] Mark S Pinsker. Information and information stability of random variables and processes. 1960.

[39] Yury Polyanskiy and Yihong Wu. Strong data-processing inequalities for channels and Bayesian networks. In Convexity and Concentration, pages 211–249. Springer, 2017.

[40] David N Reshef, Yakir A Reshef, Hilary K Finucane, Sharon R Grossman, Gilean McVean, Peter J Turnbaugh, Eric S Lander, Michael Mitzenmacher, and Pardis C Sabeti. Detecting novel associations in large data sets. Science, 334(6062):1518–1524, 2011.

[41] Fred Rieke. Spikes: exploring the neural code. MIT Press, 1999.

[42] B. C. Ross. Mutual information between discrete and continuous data sets. PLoS ONE, 9(2):e87357, 2014.

[43] Shashank Singh and Barnabás Póczos. Exponential concentration of a density functional estimator. In Advances in Neural Information Processing Systems, pages 3032–3040, 2014.

[44] Shashank Singh and Barnabás Póczos. Finite-sample analysis of fixed-k nearest neighbor density functional estimators. In Advances in Neural Information Processing Systems, pages 1217–1225, 2016.

[45] Shashank Singh and Barnabás Póczos. Nonparanormal information estimation. arXiv preprint arXiv:1702.07803, 2017.

[46] K. Sricharan, D. Wei, and A. O. Hero. Ensemble estimators for multivariate entropy estimation. IEEE Transactions on Information Theory, 59(7):4374–4388, 2013.

[47] Zoltán Szabó. Information theoretical estimators toolbox. Journal of Machine Learning Research, 15:283–287, 2014.

[48] Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 685–694. ACM, 2011.

[49] G. Ver Steeg and A. Galstyan.
Maximally informative hierarchical representations of high-dimensional data. stat, 1050:27, 2014.

[50] Q. Wang, S. R. Kulkarni, and S. Verdú. Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Transactions on Information Theory, 51(9):3064–3074, 2005.

[51] Q. Wang, S. R. Kulkarni, and S. Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5):2392–2405, 2009.

[52] Angela R Wu, Norma F Neff, Tomer Kalisky, Piero Dalerba, Barbara Treutlein, Michael E Rothenberg, Francis M Mburu, Gary L Mantalas, Sopheak Sim, Michael F Clarke, et al. Quantitative assessment of single-cell RNA-sequencing methods. Nature Methods, 11(1):41–46, 2014.

[53] Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.