Using Neural Networks for Reducing the Dimensions of Single-Cell RNA-Seq Data

While only recently developed, the ability to profile expression data in single cells (scRNA-Seq) has already led to several important studies and findings. However, this technology has also raised several new computational challenges. These include questions about the best methods for clustering scRNA-Seq data, how to identify unique group of cells in such experiments, and how to determine the state or function of specific cells based on their expression profile. To address these issues we develop and test a method based on neural networks (NN) for the analysis and retrieval of single cell RNA-Seq data. We tested various NN architectures, some of which incorporate prior biological knowledge, and used these to obtain a reduced dimension representation of the single cell expression data. We show that the NN method improves upon prior methods in both, the ability to correctly group cells in experiments not used in the training and the ability to correctly infer cell type or state by querying a database of tens of thousands of single cell profiles. Such database queries (which can be performed using our web server) will enable researchers to better characterize cells when analyzing heterogeneous scRNA-Seq samples. Read more about it here.

Reconstructing temporal progression of signaling pathways

A stimulant (like a virus) to a cell can interact with proteins in the cell and cause a cascade of signaling pathways that activate/repress the expression of genes downstream. The differential expression of such genes can cause further signaling cascades that cause the differential expression of another set of genes. While the gene expression at different time points can be observed, it is not possible to directly observe the signaling pathways responsible for their differential expression. We developed a tool, TimePath, to learn the progression of such signaling pathways across time based on time series gene expression data and protein-protein interaction data. Read more about it here. We applied the tool to the analysis of HIV-1 dementia which you can read about here.

Multitask learning of signaling and regulatory networks

We developed a tool, MT-SDREM, to jointly learn signaling pathways and regulatory networks using time series gene expression data from multiple related conditions, TF-gene interaction data, and a protein-protein signaling network. Our tool built on an existing tool called SDREM which learnt the signaling pathways and regulatory network for just a single condition. For joint learning, we shared information the following information :- (1) we ensured that the condition-specific networks learnt were consistent – i.e. the edge directions were the same for all networks (2) If a TF was predicted to regulate more than one condition, it’s prior for regulating any condition was increased. Validation with respect to RNAi screen hits, GO analysis and manual examination of the predicted signaling proteins demonstrated the advantage of joint inference. Read more about it here.

Heuristics for Mixed Integer Programming

We programmed a Mixed Integer programming solver from scratch (using existing LP solvers) and experimented with various node selection and branching heuristics with application to the Warehouse Location problem.

Large neighborhood search for Dial-a-ride problem

The Dial-a-ride problem involves a set of passengers that have to be picked up and dropped off at various locations within certain time windows. We developed an algorithm for this problem that was able to obtain close to optimal solutions much more quickly than the state of the art algorithms while still being able to prove if the problem was infeasible – something that most other algorithms did not do. Read more about it here

Clause Learning for Constraint Programs

Redundant clause learning is a popular technique used by SAT solvers to speed up search for a solution or proof of infeasibility for a boolean formula. We developed an algorithm to effectively learn clauses for a class of problems called multi-valued satisfiability problems which are a generalization of boolean formulas. Experimental results showed that our algorithm was several times faster than the standard technique of simply converting the multi-valued formula into its boolean equivalent. Read more about it here and here.

Solution Counting

Solution counting for integer programs with only binary variables is a #P-complete problem. The motivation for this project was that several good techniques for lower bounding the number of solutions for such problems had been developed but the only non-trivial upper bounding technique was extremely slow (in fact, in experiments with the benchmarks we were looking at, it always timed out). We developed an upper bounding technique based on dynamic programming that was able to obtain non-trivial upper bounds much faster for several problems. Read more about it here

Automated Failure Recovery

We examined a programming language abstraction called Stabilizers which had been developed for Concurrent ML for automatically recovering from failures in program execution. We wrote a Javascript library for doing the equivalent for web applications (a use case would be if the network connection suddenly died).