I am an Assistant Professor at Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur

Previous positions

Email: jithin (at) cse.iitk.ac.in

My central research interest is in solving real-world problems in data science with a network perspective. I mostly follow a multifaceted approach of problem solving: theoretical modeling, designing and analyzing low complexity algorithms, and verifying it on real-world data. My research focus on data mining algorithms for large networks with probabilistic guarantees, statistical modeling and inference on networks, and distributed techniques for analyzing big network matrices. I am also interested in application of machine learning to various inference problems on networks, and, in general, stochastic modeling of complex systems.

Peer-reviewed papers

• Sufficiently Informative and Relevant Features: An Information-theoretic and Fourier-based Characterization
Mohsen Heidari, Jithin K. Sreedharan, Gil Shamir, and Wojciech Szpankowski
Accepted to IEEE Transactions on Information Theory, 2022

A fundamental obstacle in learning information from data is the presence of nonlinear redundancies and dependencies in it. To address this, we propose a Fourier-based approach to extract relevant information in the supervised setting. We first develop a novel Fourier expansion for functions of correlated binary random variables. This is a generalization of the standard Fourier expansion on the Boolean cube beyond product probability spaces. We further extend our Fourier analysis to stochastic mappings. As an important application of this analysis, we investigate learning with feature subset selection. We reformulate this problem in the Fourier domain, and introduce a computationally efficient measure for selecting features. Bridging the Bayesian error rate with the Fourier coefficients, we demonstrate that the Fourier expansion provides a powerful tool to characterize nonlinear dependencies in the features-label relation. Via theoretical analysis, we show that our proposed measure finds provably asymptotically optimal feature subsets. Lastly, we present an algorithm based on our measure and verify our findings via numerical experiments on various datasets.

• Finding Relevant Information via a Discrete Fourier Expansion
Mohsen Heidari, Jithin K. Sreedharan, Gil Shamir, and Wojciech Szpankowski
International Conference on Machine Learning (ICML), 2021
[paper] [supplementary material] [code]

A fundamental obstacle in learning information from data is the presence of nonlinear redundancies and dependencies in it. To address this, we propose a Fourier-based approach to extract relevant information in the supervised setting. We first develop a novel Fourier expansion for functions of correlated binary random variables. This is a generalization of the standard Fourier expansion on the Boolean cube beyond product probability spaces. We further extend our Fourier analysis to stochastic mappings. As an important application of this analysis, we investigate learning with feature subset selection. We reformulate this problem in the Fourier domain, and introduce a computationally efficient measure for selecting features. Bridging the Bayesian error rate with the Fourier coefficients, we demonstrate that the Fourier expansion provides a powerful tool to characterize nonlinear dependencies in the features-label relation. Via theoretical analysis, we show that our proposed measure finds provably asymptotically optimal feature subsets. Lastly, we present an algorithm based on our measure and verify our findings via numerical experiments on various datasets.

• Information Sufficiency via Fourier Expansion
Mohsen Heidari, Jithin K. Sreedharan, Gil Shamir, and Wojciech Szpankowski
IEEE International Symposium on Information Theory (ISIT), 2021
[paper]

We take an information-theoretic approach to identify nonlinear feature redundancies in unsupervised learning. We define a subset of features as sufficiently-informative when the joint entropy of all the input features equals to that of the chosen subset. We argue that the rest of the features are redundant as all the accessible information about the data can be captured from sufficiently-informative features. Next, instead of directly estimating the entropy, we propose a Fourier-based characterization. For that, we develop a novel Fourier expansion on the Boolean cube incorporating correlated random variables. This generalization of the standard Fourier analysis is beyond product probability spaces. Based on our Fourier framework, we propose a measure of redundancy for features in the unsupervised settings. We then, consider a variant of this measure with a search algorithm to reduce its computational complexity as low as $$O(nd)$$ with $$n$$ being the number of samples and $$d$$ the number of features. Besides the theoretical justifications, we test our method on various real-world and synthetic datasets. Our numerical results demonstrate that the proposed method outperforms state-of-the-art feature selection techniques.

• Temporal Ordered Clustering in Dynamic Networks: Unsupervised and Semi-supervised Learning Algorithms
Krzysztof Turowski (co-primary author), Jithin K. Sreedharan (co-primary author), and Wojciech Szpankowski
[paper] [data and code]

• IEEE Transactions on Network Science and Engineering, 2021
• IEEE International Symposium on Information Theory (ISIT), 2020
In temporal ordered clustering, given a single snapshot of a dynamic network in which nodes arrive at distinct time instants, we aim at partitioning its nodes into $$K$$ ordered clusters $$\mathcal{C}_1 \prec \cdots \prec \mathcal{C}_K$$ such that for $$i \lt j$$, nodes in cluster $$\mathcal{C}_i$$ arrived before nodes in cluster $$\mathcal{C}_j$$, with $$K$$ being a data-driven parameter and not known upfront. Such a problem is of considerable significance in many applications ranging from tracking the expansion of fake news to mapping the spread of information. We first formulate our problem for a general dynamic graph, and propose an integer programming framework that finds the optimal clustering, represented as a strict partial order set, achieving the best precision (i.e., fraction of successfully ordered node pairs) for a fixed density (i.e., fraction of comparable node pairs). We then develop a sequential importance procedure and design unsupervised and semi-supervised algorithms to find temporal ordered clusters that efficiently approximate the optimal solution. To illustrate the techniques, we apply our methods to the vertex copying (duplication-divergence) model which exhibits some edge-case challenges in inferring the clusters as compared to other network models. Finally, we validate the performance of the proposed algorithms on synthetic and real-world networks.

• Revisiting Parameter Estimation in Biological Networks: Influence of Symmetries
Jithin K. Sreedharan (co-primary author), Krzysztof Turowski (co-primary author), and Wojciech Szpankowski
[paper] [data and code] [poster] [slides]

• IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2020
• BioKDD 2019 - in conjection with SIGKDD 2019 (oral presentation)
• ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2019 (poster presentation)
Graph models often give us a deeper understanding of real-world networks. In the case of biological networks they help in predicting the evolution and history of biomolecule interactions, provided we map properly real networks into the corresponding graph models. In this paper, we show that for biological graph models many of the existing parameter estimation techniques overlook the critical property of graph symmetry (also known formally as graph automorphisms), thus the estimated parameters give statistically insignificant results with respect to the observed network. To demonstrate it and to develop accurate estimation procedures, we focus on the biologically inspired duplication-divergence model, and the up-to-date data of protein-protein interactions of six species including human and yeast. Using exact recurrence relations of some prominent graph properties, we devise a parameter estimation technique that provides right order of number of symmetries, and use phylogenetically old proteins as the choice of seed graph nodes. We also find that our results are consistent with the ones obtained from maximum likelihood estimation (MLE). However, the MLE approach is significantly slower than our methods in practice.

• Inferring Temporal Information from a Snapshot of a Dynamic Network
Jithin K. Sreedharan (co-primary author), Abram Magner (co-primary author), Ananth Grama, and Wojciech Szpankowski
Nature Scientific Reports 2019
[paper (with Supplementary Material)] [code] [brain network dataset]

The problem of reverse-engineering the evolution of a dynamic network, known broadly as network archaeology, is one of profound importance in diverse application domains. In analysis of infection spread, it reveals the spatial and temporal processes underlying infection. In analysis of biomolecular interaction networks (e.g., protein interaction networks), it reveals early molecules that are known to be differentially implicated in diseases. In economic networks, it reveals flow of capital and associated actors. Beyond these recognized applications, it provides analytical substrates for novel studies - for instance, on the structural and functional evolution of the human brain connectome. In this paper, we model, formulate, and rigorously analyze the problem of inferring the arrival order of nodes in a dynamic network from a single snapshot. We derive limits on solutions to the problem, present methods that approach this limit, and demonstrate the methods on a range of applications, from inferring the evolution of the human brain connectome to conventional citation and social networks, where ground truth is known.

• TIMES: Temporal Information Maximally Extracted from Structures
Abram Magner (co-primary author), Jithin K. Sreedharan (co-primary author), Ananth Grama, and Wojciech Szpankowski
ACM International Conference on World Wide Web (WWW) 2018, (acceptance rate: 14.8%)
[paper] [slides] [code]

Inferring the node arrival sequence from a snapshot of a dynamic network is an important problem, with applications ranging from identifying sources of contagion to flow of capital in financial transaction networks. Variants of this problem have received significant recent research attention, including results on infeasibility of solution for prior formulations. We present a new formulation of the problem that admits probabilistic solutions for broad classes of dynamic network models. Instantiating our framework for a preferential attachment model, we present effectively computable and practically tight bounds on the tradeoff curve between optimal achievable precision and density/recall. We also present efficient algorithms for partial recovery of node arrival orders and derive theoretical and empirical performance bounds on the precision and density/recall of our methods in comparison to the best possible. We validate our methods through experiments on both synthetic and real networks to show that their performance is robust to model changes, and that they yield excellent results in practice. We also demonstrate their utility in the context of a novel application in analysis of the human brain connectome to draw new insights into the functional and structural organization and evolution of the human brain.

• Recovery of Vertex Orderings in Dynamic Graphs
Abram Magner, Ananth Grama, Jithin K. Sreedharan, and Wojciech Szpankowski
IEEE International Symposium on Information Theory (ISIT) 2017
[paper]

Many networks in the real world are dynamic in nature: nodes enter, exit, and make and break connections with one another as time passes. Several random graph models of these networks are such that nodes have well-defined arrival times. It is natural to ask if, for a given random graph model, we can recover the arrival order of nodes, given information about the structure of the graph. In this work, we give a rigorous formulation of the problem in a statistical learning framework and tie its feasibility, for a broad class of models, to several sets of permutations associated with the symmetries of the random graph model and graphs generated by it. Moreover, we show how the same quantities are fundamental to the study of the information content of graph structures. We then apply our general results to the special cases of the Erd˝os-R´enyi and preferential attachment models to derive strong inapproximability results.

• ✱ Revisiting Random Walk based Sampling in Networks: Evasion of Burn-in Period and Frequent Regenerations,
(alphabetical order) Konstantin Avrachenkov, Vivek S. Borkar, Arun Kadavankandy and Jithin K. Sreedharan
Computational Social Networks, Springer, 2018
[paper]

In the framework of network sampling, random walk (RW) based estimation techniques provide many pragmatic solutions while uncovering the unknown network as little as possible. Despite several theoretical advances in this area, RW based sampling techniques usually make a strong assumption that the samples are in stationary regime, and hence are impelled to leave out the samples collected before the burn-in period. This work proposes two sampling schemes without burn-in constraint to estimate the average of an arbitrary function defined on the network nodes, for e.g. the average age of users in a social network. The central idea of the algorithms lies in exploiting regeneration of RWs at revisits to an aggregated super-node or to a set of nodes and in strategies to enhance the frequency of such regenerations either by contracting the graph or by making the hitting set larger. Our first algorithm, which is based on Reinforcement Learning (RL), takes advantage of the regeneration of RWs, and it uses stochastic approximation to derive an estimator. This method can be seen as intermediate between purely stochastic Markov Chain Monte Carlo iterations and deterministic relative value iterations. We study this method via simulations on real networks and observe that its trajectories are much more stable than those of standard random walk based estimation procedures, and its error performance is comparable to that of respondent driven sampling (RDS) which has a smaller asymptotic variance than many other estimators. The second algorithm, which we call the RT estimator, is a modified form of RDS that accommodates the idea of regeneration. Simulation studies show that the mean squared error of RT estimator decays much faster than that of RDS with time.

• ✱ Hamiltonian System Approach to Eigenvalue-Eigenvector Problem in Networks
(alphabetical order) Konstantin Avrachenkov, Philippe Jacquet and Jithin K. Sreedharan*
IEEE International Workshop on Multidimensional (nD) Systems (nDS) 2017
[paper]

Because of the significant increase in size and complexity of the networks, the distributed computation of eigenvalues and eigenvectors of graph matrices has become very challenging and yet it remains as important as before. In this paper we develop efficient distributed algorithms to detect, with higher resolution, closely situated eigenvalues and corresponding eigenvectors of symmetric graph matrices. We model the system of graph spectral computation as physical systems with Lagrangian and Hamiltonian dynamics. The spectrum of Laplacian matrix, in particular, is framed as a classical spring-mass system with Lagrangian dynamics. The spectrum of any general symmetric graph matrix turns out to have a simple connection with quantum systems and it can be thus formulated as a solution to a Schr\"odinger-type differential equation. Taking into account the higher resolution requirement in the spectrum computation and the related stability issues in the numerical solution of the underlying differential equation, we propose the application of symplectic integrators to the calculation of eigenspectrum. The effectiveness of the proposed techniques is demonstrated with numerical simulations on real-world networks of different sizes and complexities.

• ✱ Inference in OSNs via Lightweight Partial Crawls
(alphabetical order) Konstantin Avrachenkov, Bruno Ribeiro and Jithin K. Sreedharan*
ACM SIGMETRICS / PERFORMANCE 2016, (acceptance rate: 13.5%)
[paper] [slides] [code]

Are Online Social Network (OSN) A users more likely to form friendships with those with similar attributes? Do users at an OSN B score content more favorably than OSN C users? Such questions frequently arise in the context of Social Network Analysis (SNA) but often crawling an OSN network via its Application Programming Interface (API) is the only way to gather data from a third party. To date, these partial API crawls are the majority of public datasets and the synonym of lack of statistical guarantees in incomplete-data comparisons, severely limiting SNA research progress. Using regenerative properties of the random walks, we propose estimation techniques based on short crawls that have proven statistical guarantees. Moreover, our short crawls can be implemented in massively distributed algorithms. We also provide an adaptive crawler that makes our method parameter-free, significantly improving our statistical guarantees. We then derive the Bayesian approximation of the posterior of the estimates, and in addition, obtain an estimator for the expected value of node and edge statistics in an equivalent configuration model or Chung-Lu random graph model of the given network (where nodes are connected randomly) and use it as a basis for testing null hypotheses. The theoretical results are supported with simulations on a variety of real-world networks.

• ✱ Distributed Spectral Decomposition in Networks by Complex Diffusion and Quantum Random Walk
(alphabetical order) Konstantin Avrachenkov, Philippe Jacquet and Jithin K. Sreedharan*
IEEE International Conference on Computer Communication (INFOCOM) 2016, (acceptance rate: 18.25%)
[paper] [slides] [poster (Bell Labs Future days, Paris)]

In this paper we address the problem of finding top $k$ eigenvalues and corresponding eigenvectors of symmetric graph matrices in networks in a distributed way. We propose a novel idea called complex power iterations in order to decompose the eigenvalues and eigenvectors at node level, analogous to time-frequency analysis in signal processing. At each node, eigenvalues correspond to the frequencies of spectral peaks and respective eigenvector components are the amplitudes at those points. Based on complex power iterations and motivated from fluid diffusion processes in networks, we devise distributed algorithms with different orders of approximation. We also introduce a Monte Carlo technique with gossiping which substantially reduces the computational overhead. An equivalent parallel random walk algorithm is also presented. We validate the algorithms with simulations on real-world networks. Our formulation of the spectral decomposition can be easily adapted to a simple algorithm based on quantum random walks. With the advent of quantum computing, the proposed quantum algorithm will be extremely useful.

• ✱ Comparison of random walk based techniques for estimating network averages
(alphabetical order) Konstantin Avrachenkov, Vivek S. Borkar, Arun Kadavankandy and and Jithin K. Sreedharan*
5th International Conference on Computational Social Networks (CSoNet) 2016
[paper] [slides] [code]

Function estimation on Online Social Networks (OSN) is an important field of study in complex network analysis. An efficient way to do function estimation on large networks is to use random walks. We can then defer to the extensive theory of Markov chains to do error analysis of these estimators. In this work we compare two existing techniques, Metropolis-Hastings MCMC and Respondent-Driven Sampling, that use random walks to do function estimation and compare them with a new reinforcement learning based technique. We provide both theoretical and empirical analyses for the estimators we consider.

• ✱ Distribution and Dependence of Extremes in Network Sampling Processes
(alphabetical order) Konstantin Avrachenkov, Natalia M. Markovich and Jithin K. Sreedharan*
[paper] [slides]

• Computational Social Networks, Springer, 2015
• Third International IEEE Workshop on Complex Networks and their Applications, Nov 2014
We explore the dependence structure in the sampled sequence of complex networks. We consider randomized algorithms to sample the nodes and study extremal properties in any associated stationary sequence of characteristics of interest like node degrees, number of followers or income of the nodes in Online Social Networks etc, which satisfy two mixing conditions. Several useful extremes of the sampled sequence like $k$th largest value, clusters of exceedances over a threshold, first hitting time of a large value etc are investigated. We abstract the dependence and the statistics of extremes into a single parameter that appears in Extreme Value Theory, called Extremal Index (EI). In this work, we derive this parameter analytically and also estimate it empirically. We propose the use of EI as a parameter to compare different sampling procedures. As a specific example, degree correlations between neighboring nodes are studied in detail with three prominent random walks as sampling techniques.

• Spectrum sensing using distributed sequential detection via noisy reporting MAC
Jithin K. Sreedharan and Vinod Sharma
Signal Processing (Elsevier), Jan 2015
[paper]

This paper considers cooperative spectrum sensing algorithms for Cognitive Radios which focus on reducing the number of samples to make a reliable detection. We propose algorithms based on decentralized sequential hypothesis testing in which the Cognitive Radios sequentially collect the observations, make local decisions and send them to the fusion center for further processing to make a final decision on spectrum usage. The reporting channel between the Cognitive Radios and the fusion center is assumed more realistically as a Multiple Access Channel (MAC) with receiver noise. Furthermore the communication for reporting is limited, thereby reducing the communication cost. We start with an algorithm where the fusion center uses an SPRT-like (Sequential Probability Ratio Test) procedure and theoretically analyze its performance. Asymptotically, its performance is close to the optimal centralized test without fusion center noise. We further modify this algorithm to improve its performance at practical operating points. Later we generalize these algorithms to handle uncertainties in SNR and fading.

• Nonparametric distributed sequential detection via universal source coding
Jithin K. Sreedharan and Vinod Sharma
Information Theory and Applications Workshop (ITA), Feb 2013
[paper]

We consider nonparametric or universal sequential hypothesis testing when the distribution under the null hypothesis is fully known but the alternate hypothesis corresponds to some other unknown distribution. These algorithms are primarily motivated from spectrum sensing in Cognitive Radios and intruder detection in wireless sensor networks. We use easily implementable universal lossless source codes to propose simple algorithms for such a setup. The algorithms are first proposed for discrete alphabet. Their performance and asymptotic properties are studied theoretically. Later these are extended to continuous alphabets. Their performance with two well known universal source codes, Lempel-Ziv code and KT-estimator with Arithmetic Encoder are compared. These algorithms are also compared with the tests using various other nonparametric estimators. Finally a decentralized version utilizing spatial diversity is also proposed and analysed.

• Spectrum Sensing via Universal Source Coding
Jithin K. Sreedharan and Vinod Sharma
IEEE Global Communications Conference (GLOBECOM), Dec 2012
[paper]

We consider nonparametric sequential hypothesis testing when the distribution under null hypothesis is fully known and the alternate hypothesis corresponds to some other unknown distribution. We use easily implementable universal lossless source codes to propose simple algorithms for such a setup. These algorithms are motivated from spectrum sensing application in Cognitive Radios. Universal sequential hypothesis testing using Lempel Ziv codes and Krichevsky-Trofimov estimator with Arithmetic Encoder are considered and compared for different distributions. Cooperative spectrum sensing with multiple Cognitive Radios using universal codes is also considered.

• Novel algorithms for distributed sequential hypothesis testing
K. S. Jithin and Vinod Sharma
49th Annual Allerton Conference on Communication, Control and Computing, Sep 2011
[paper]

This paper considers sequential hypothesis testing in a decentralized framework. We start with two simple decentralized sequential hypothesis testing algorithms. One of which is later proved to be asymptotically Bayes optimal. We also consider composite versions of decentralized sequential hypothesis testing. A novel nonparametric version for decentralized sequential hypothesis testing using universal source coding theory is developed. Finally we design a simple decentralized multihypothesis sequential detection algorithm.

• A novel algorithm for cooperative distributed sequential spectrum sensing in Cognitive Radio
Jithin K. Sreedharan and Vinod Sharma
IEEE Wireless Communications and Networking Conference (WCNC), Mar 2011.
[paper]

This paper considers cooperative spectrum sensing in Cognitive Radios. In our previous work we have developed DualSPRT, a distributed algorithm for cooperative spectrum sensing using Sequential Probability Ratio Test (SPRT) at the Cognitive Radios as well as at the fusion center. This algorithm works well, but is not optimal. In this paper we propose an improved algorithm- SPRT-CSPRT, which is motivated from Cumulative Sum Procedures (CUSUM). We analyse it theoretically. We also modify this algorithm to handle uncertainties in SNR’s and fading.

• Cooperative distributed sequential spectrum sensing
K. S. Jithin, Vinod Sharma, and Raghav Gopalarathnam
IEEE National Conference on Communication (NCC), Jan 2011
[paper]

We consider cooperative spectrum sensing for cognitive radios. We develop an energy efficient detector with low detection delay using sequential hypothesis testing. Sequential Probability Ratio Test (SPRT) is used at both the local nodes and the fusion center. We also analyse the performance of this algorithm and compare with the simulations. Modelling uncertainties in the distribution parameters are considered. Slow fading with and without perfect channel state information at the cognitive radios is taken into account.

• Ph.D. in Computer Science
Institut national de recherche en informatique et en automatique (INRIA), INRIA-Bell Labs joint lab, and Univ. of Nice Sophia Antipolis, France
Thesis title: Sampling and Inference in Complex Networks
Thesis supervisor: Dr. Konstantin Avrachenkov (INRIA, France)
Thesis jury: Reviewers - Prof. Don Towsley and Prof. Nelly Litvak, Examinators - Dr. Alain Jean-Marie and Dr. Philippe Jacquet
[abstract] [thesis] [slides] [report of reviewers]

• M.Sc. (Engg.)
Dept. of Electrical Communication Engineering, Indian Institute of Science (IISc), Bangalore, India.
Thesis title: Spectrum Sensing in Cognitive Radios using Distributed Sequential Detection
Thesis supervisor: Prof. Vinod Sharma
External reviewer: Prof. Shankar Prakriya
[abstract] [thesis] [slides] [report of reviewer]
(Best MS thesis medal from the Dept. of Electrical Communication Engineering, IISc)