We’re a computational genomics group at the Genome Institute of Singapore. Our main focus is transcriptomics data: we’re developing algorithms for processing and analysing thousands of samples.



Alternative splicing in ESCs (see Lu et al. (2013) Nat. Cell Biol.)

The main focus of our group is the analysis of transcriptomics data. We’re developing algorithms for large-scale data analysis, modelling of batch effects, and normalisation of technical biases. Our work aims to identify alternative splicing events, retrotransposons, and novel RNAs that contribute to human diseases (see Lu et al. (2013) Nat. Cell Biol., Lu et al. (2014) Nat. Struct. Mol. Biol., Göke et al. (2015) Cell Stem Cell, Göke and Ng (2016) EMBO Reports). In collaboration with wet labs at GIS and worldwide, we’re focusing on cancer and neurodegenerative disease models (see Lin et al. (2016) Cell Reports, Jo et al. (2016) Cell Stem Cell). Our current research focuses on large-scale data sets: we’re analysing thousands of samples from bulk tissues and single cells.


Retrotransposons and their contribution to the coding and non-coding transcriptome (see Göke and Ng (2016) EMBO Reports)

 Cell Identity and Cellular Heterogeneity


Conversion of embryonic stem cell states (see Chan et al. (2013) Cell Stem Cell)

Even though all cells of the human body essentially share the same genetic information, the cell types that form the organs and tissues are very diverse and have distinct properties. Together with wet lab groups at the GIS, we aim to understand the molecular determinants of cellular identity during development and differentiation, and how cellular identity is impaired in disease. We investigate transcript diversity, alternative splicing, gene regulation, and epigenetics to identify key elements and mechanisms involved in the maintenance and conversion of cellular identities. We have been working extensively with embryonic stem cells to understand how cell states and complex cellular systems can be induced. Our group has contributed to the discovery of naive human pluripotency (Chan et al. (2013) Cell Stem Cell) and the characterisation of the first human midbrain organoids (Jo et al. (2016) Cell Stem Cell).

 Machine Learning

All information is encoded in the DNA. Using string kernels and support vector machines, we learn models to predict functional elements from the DNA sequence alone. Our group has developed the first mismatch string kernel using a statistical model that corrects for the expected sequence composition (Göke et al. (2012) Bioinformatics). We are also applying machine learning techniques to clinical data to understand the power and limitations of genomics data for personalised medicine.
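To give a rough idea of what a string kernel does, here is a minimal Python sketch of a k-mer spectrum kernel, the simplest member of that family. It is not the composition-corrected mismatch kernel from Göke et al. (2012): it only counts exact k-mer matches, and the function names and the default k=3 are assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


def normalised_kernel(a, b, k=3):
    """Cosine-normalised kernel value between 0 and 1."""
    denom = sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0


def gram_matrix(seqs, k=3):
    """Pairwise normalised kernel values for a list of sequences."""
    return [[normalised_kernel(a, b, k) for b in seqs] for a in seqs]
```

A Gram matrix like this could then be handed to any support vector machine implementation that accepts a precomputed kernel, so the classifier sees only sequence similarities and never an explicit feature table.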


Machine learning with genomics data: classification of regulatory function from DNA sequences (left), and stratification of breast cancer patients into groups with different outcomes using genomics data (right).