About this site
As a budding scientist, it is one of my core duties to communicate my results to experts and non-experts alike. Not all results are conference or paper-worthy, and preprint servers only go so far to help disseminate information.
I plan to use this site to discuss random, interesting things I learn, find, discover, disprove, or correct, be it negative results from my work, results from reproducing other people's work, my ongoing research interests, programming tricks, papers I read, and so forth.
I take information quality and validity very seriously. Anyone who finds incorrect information on this site is graciously invited to send me an email so that I may fix the offending post as soon as possible.
In addition, I'm always on the lookout for interesting resources relevant to this site's topics. Know of any nice blog, teaching resource or article? Send me an email.
Comments? Questions? Hatemail? You know what to do. The address is mail@ on this domain.
Ankhs are also called the Keys of Life
My interests
This site will consist of a mix of content I will post for myself, effectively for archival purposes, and assorted content relevant to my interests, which are:
• Programming and Software;
• Bioinformatics (as a catchall for all computer + biology, including computational biology, cheminformatics, etc.);
• Biology, especially regenerative medicine;
• Cell state determination;
• Microscopy, especially atomic force.
The latter two are subject to change and represent my current academic interests.
About the author
I am currently a (hopefully) finishing PhD student in bioinformatics. During my PhD, I have developed a new peptide identification platform, the first such platform designed primarily for computational research (as opposed to a one-through analysis tool for a biological experiment).
I then developed and integrated deep learning models to improve identification rates beyond the apparent state of the art. I developed 3 main models:
• A novel model to show that it is possible to extract peptide statistics from mass spectra beyond what is usually done during peptide-spectrum match evaluation (which relies almost exclusively on correlation-like scoring between the experimental spectrum and a primitive theoretical spectrum generated from the sequence, its charge and its potential post-translational modifications);
• A model for full spectrum prediction (like PredFull, and unlike Prosit which only predicts the intensity for predetermined m/z from the aforementioned theoretical spectrum). It differs from PredFull in that it's simpler, shallower, performs better, and is specifically designed for integration in a search engine;
• A rescoring model using random forests and a pretrained synthetic peptide-based "true peptide" discriminator. Preliminary results show that it improves identification rates across 3 datasets by about 20% compared to Percolator at 1% false discovery rate.
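To make the "correlation-like scoring" mentioned in the first bullet concrete, here is a minimal sketch of how a conventional peptide-spectrum match might be scored. It is a generic textbook illustration, not the scoring used in my platform or in any particular search engine. The function names (`theoretical_spectrum`, `bin_spectrum`, `cosine_score`, `score_psm`) and the bin width are assumptions made for this example, and it only handles unmodified peptides with singly charged b and y ions.

```python
import math

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def theoretical_spectrum(sequence):
    """Return sorted m/z values of singly charged b and y ions.

    Simplification for illustration: no modifications, no higher
    charge states, no neutral losses.
    """
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    peaks = []
    for i in range(1, len(masses)):
        peaks.append(sum(masses[:i]) + PROTON)          # b ion (prefix)
        peaks.append(sum(masses[i:]) + WATER + PROTON)  # y ion (suffix)
    return sorted(peaks)


def bin_spectrum(mz_values, intensities, bin_width=1.0005):
    """Sum intensities into fixed-width m/z bins (sparse dict)."""
    binned = {}
    for mz, intensity in zip(mz_values, intensities):
        index = int(mz / bin_width)
        binned[index] = binned.get(index, 0.0) + intensity
    return binned


def cosine_score(experimental, theoretical):
    """Normalised dot product between two binned spectra."""
    dot = sum(v * theoretical.get(k, 0.0) for k, v in experimental.items())
    norm_e = math.sqrt(sum(v * v for v in experimental.values()))
    norm_t = math.sqrt(sum(v * v for v in theoretical.values()))
    if norm_e == 0.0 or norm_t == 0.0:
        return 0.0
    return dot / (norm_e * norm_t)


def score_psm(sequence, exp_mz, exp_intensity):
    """Score a candidate peptide against an experimental spectrum."""
    theo_mz = theoretical_spectrum(sequence)
    # The "primitive" theoretical spectrum: every fragment gets intensity 1.
    theo = bin_spectrum(theo_mz, [1.0] * len(theo_mz))
    exp = bin_spectrum(exp_mz, exp_intensity)
    return cosine_score(exp, theo)
```

Note that the theoretical spectrum carries no intensity information at all, which is precisely the limitation that intensity and full-spectrum prediction models try to address.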
Currently, I am busying myself with pushing my papers through, finishing my thesis text, and planning for what comes next...
Mumble mumble...
Having previously earned a Master's degree in deep learning, I had a brief stint in industry, where I worked on dialogue research.
After experiencing my first dose of office politics, unsatisfied with the lack of technical challenges and opportunities in the tech sector, and being young and naive, I jumped back into the viper's nest to learn something I had been passionate about since my teenage years.
For someone who claims to have a passion for bioinformatics, it may seem strange that my degree was in deep learning instead, or that I went to industry on dialogue systems at all. The reason is that I see computers as the most effective accelerator for research: well-designed computer tools can let a single biologist do the work of ten. Needless to say, when I discovered deep learning during my undergrad, I was fascinated by its potential for the same purpose.
Another major hurdle is the state of the intersection between computing and biology. There simply isn't enough collaboration both ways, resulting in very noisy data and poor-quality software (it's neither group's fault: computational researchers do their best to develop biological models, analysis software and prediction algorithms while relying on poor data, whereas biologists don't quite yet grasp the core value proposition due to the limited quality of existing tools, and thus continue to function in a biology-first workflow where bioinformatics is an afterthought).
The problem is the absence of a bridge between the computational and biology camps. A real bridge, not just throwing biologists and bioinformaticians on the same team and expecting things to work themselves out.
I had hoped the situation would improve while I was working on deep learning, providing an opportunity to jump in when I would feel confident in both what I can contribute, and the direction bioinformatics is headed. Unfortunately, I grew frustrated and impatient (with the tech industry) before such a thing could happen.
My primary objective during this PhD was to immerse myself in bioinformatics, provide deep learning expertise, and get a better feel for what can and cannot be done yet by applying data-first principles to biology. While I was objectively quite successful, this exercise only revealed to me how many obstacles remain to be conquered...
Should I be so lucky, I am aiming to found a lab which rectifies the situation. No more hodgepodge of "pipeline monkeys" and "lab techs". I will convince the world that computational methods come first: they direct how the data should be collected, can sanity check hypotheses before expending significant time, money and effort in wetlab experiments, and can analyse large amounts of data to provide hints as to which research direction may prove more fruitful. This means no more equating bioinformatics to a mere post-processing step over already completed work. But the other side (especially deep learning researchers) is doing the same thing on its end: running pointless computational experiments with no practical use, because they don't spend time to understand the problem. The data does not live in a vacuum; it is an artifact of a deliberate experiment, thought out for a specific purpose.
The biologist cannot analyse the mass data modern experiments generate without robust computational tools. The bioinformatician cannot build robust computational tools without quality mass data from well-controlled experiments. It is time to wake up to this reality and to realise the workflow from 20, 30 years ago does not cut it anymore.
References
[Prosit] Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Siegfried Gessulat, Tobias Schmidt, Daniel Paul Zolg, Patroklos Samaras, Karsten Schnatbaum, Johannes Zerweck, Tobias Knaute, Julia Rechenberger, Bernard Delanghe, Andreas Huhmer, Ulf Reimer, Hans-Christian Ehrlich, Stephan Aiche, Bernhard Kuster & Mathias Wilhelm. Nature Methods volume 16, pages 509–518 (2019)
[PredFull] Full-Spectrum Prediction of Peptides Tandem Mass Spectra using Deep Neural Network. Kaiyuan Liu, Sujun Li, Lei Wang, Yuzhen Ye & Haixu Tang. Analytical Chemistry volume 92, issue 6, pages 4275–4283 (2020)
[Percolator] Semi-supervised learning for peptide identification from shotgun proteomics datasets. Lukas Käll, Jesse D Canterbury, Jason Weston, William Stafford Noble & Michael J MacCoss. Nature Methods volume 4, pages 923–925 (2007)