When reading about deep learning, I found the word2vec manuscript awesome. It's a relatively simple concept: transform a word, via its context, into a vector representation. Still, I was amazed that the mathematical distance between these vectors actually turns out to preserve meaning.
So, great. We can all agree it's impressive that computers can learn how France->Paris is like Italy->Rome. But how useful is it if we give it a brief shot on medical genetic data?
I decided to make use of the NCBI OMIM database as a text corpus to build the word2vec model.
OMIM: Online Mendelian Inheritance In Man
OMIM is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available. Authored and edited by the Institute of Genetic Medicine, Johns Hopkins University School of Medicine.
Wanting to be productive as quickly as possible, I decided to work with deeplearning4j, as I have been familiar with Java for the last 10 years. I am also pretty fond of spring-boot these days, so I could easily share the outcome of this experiment as a service in the future.
I first got up to speed with deeplearning4j through the tutorials on their home page, more specifically the one about word2vec.
OK, so I downloaded the whole OMIM database, a 178MB txt file, and made it available as a Java ClassPathResource that is fed into a SentenceIterator from deeplearning4j.
-rw-r--r-- 1 kennyhelsens staff 178M May 28 15:00 omim.txt
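A minimal sketch of that wiring with the deeplearning4j sentence iterators (the file name and the choice of BasicLineIterator are my assumptions):

// load the OMIM dump from the classpath
File omim = new ClassPathResource("omim.txt").getFile();
// iterate over the corpus line by line, each line treated as a sentence
SentenceIterator iter = new BasicLineIterator(omim);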
Next, just following the instructions from deeplearning4j, I decided to make use of the default TokenizerFactory, and we're already set to give it a first shot with a minimal configuration. (I'm running this during my daily train ride on my 3-year-old MacBook Pro.)
We set minWordFrequency to 20 to leave out very sparse words.
I've decreased iterations and increased the minLearningRate a little bit to make faster progress. I don't intend to write an academic paper here, I'm just looking for some low-hanging fruit.
layerSize was also decreased to 100 instead of the 300 from the word2vec manuscript, also for time considerations. A 100-dimensional vector for a word still feels like a whole lot.
In the end, I’m serializing the Word2Vec object to disk, such that I can play a bit with it afterwards without retraining over and over again.
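Put together, the configuration described above looks roughly like this; the iteration count, learning rates and output file name are my guesses, the rest follows the deeplearning4j word2vec tutorial:

// default tokenizer, with the common preprocessor lowercasing and stripping punctuation
TokenizerFactory tokenizer = new DefaultTokenizerFactory();
tokenizer.setTokenPreProcessor(new CommonPreprocessor());

Word2Vec vec = new Word2Vec.Builder()
        .minWordFrequency(20)     // leave out very sparse words
        .layerSize(100)           // 100-dimensional vectors instead of 300
        .iterations(1)            // fewer iterations for faster progress
        .learningRate(0.025)
        .minLearningRate(1e-3)    // raised a bit from the default
        .windowSize(5)
        .iterate(iter)            // the OMIM SentenceIterator from above
        .tokenizerFactory(tokenizer)
        .build();

vec.fit();

// persist the trained vectors so we can reload them later without retraining
WordVectorSerializer.writeWordVectors(vec, "omim-word2vec.txt");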
So in another Java class I deserialize the file and make use of the wordsNearest methods on the Word2Vec instance.
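Reloading and querying then takes only a couple of lines; the file name matches the one used above, and the query term is just an example, the associations I actually looked at are listed below:

// reload the vectors serialized after training
WordVectors vec = WordVectorSerializer.loadTxtVectors(new File("omim-word2vec.txt"));

// the ten terms whose vectors are closest to "angiogenesis"
Collection<String> nearest = vec.wordsNearest("angiogenesis", 10);
System.out.println(nearest);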
signaling is associated with a list of critical pathway genes
The brca gene has well-known mutations driving breast and ovarian cancer. As expected, but still very clever of the word2vec model.
telomere is associated with systolic and diastolic; I do not fully grasp these associations. Telomerase is clear though.
angiogenesis is associated with adhesion, migration, invasion, healing, tnf. Again very strong of the model.
Conclusion I
It seems like the word2vec model, far from optimally trained on my macbook, did learn quite a few good associations from the OMIM corpus.
Let's do some negative testing with nonsense words.
Term        | word2vec similar terms
kenny       | harano, moo-penn, hamel, stevenson, male
university  | pp, press, ed.), (pub.)
the         | /
why         | /
So my first name is associated with some author names, and university is associated with press and other publication terms. As expected, 'the' and 'why', which occur at random, don't return any associations. Great, it's pretty good.
Finally, the word2vec examples are known for their analogies: France is to Paris what Italy is to X, and word2vec can fill in Rome here by crunching Wikipedia. So can we try to find analogous terms for genotype-phenotype associations?
Once again very impressive. Adding the breast vector and negating the alk vector yields a vector near 'colorectal'. Indeed, in a cancer setting, brca means to breast what alk means to colorectal.
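For reference, deeplearning4j expresses such an analogy query by passing positive and negative term lists to wordsNearest; the exact terms below are my reconstruction of the query described above:

// add the breast vector, negate the alk vector, and look at the closest terms
Collection<String> analogy = vec.wordsNearest(
        Arrays.asList("breast"),   // positive terms
        Arrays.asList("alk"),      // negative terms
        5);
System.out.println(analogy);       // 'colorectal' shows up among the nearest terms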
I have been wondering what all the noise about deep learning is about. It's still neural networks, right? I have not had much experience with neural networks, because they're supposed to be hard to get right due to parameter tuning, which is a downer if you're used to good all-round performers like random forests. Still, I decided to set out on a series of blog posts using h2o (R) and deeplearning4j (Java) on biotech datasets.
We’ll be working with the BreastCancer dataset from the mlbench package.
From the package description:
The objective is to identify each of a number of benign or malignant
classes. Samples arrive periodically as Dr. Wolberg reports his
clinical cases. The database therefore reflects this chronological
grouping of the data. This grouping information appears immediately
below, having been removed from the data itself. Each variable except
for the first was converted into 11 primitive numerical attributes
with values ranging from 0 through 10. There are 16 missing attribute
values. See cited below for more details.
data("BreastCancer",package="mlbench")
Let’s do some data munging.
BreastCancer %<>% as.data.table
# remove NA values for simplicity
BreastCancer %<>% na.omit
# get all nominal values as numeric (x holds the predictor columns, y the Class labels)
x <- apply(x, 2, as.numeric) %>% data.table
Prepare test/training splits.
# split the data in test/train
set.seed(10000)
splits <- sample(x = c("train", "test"), size = NROW(x), replace = T, prob = c(0.7, 0.3))
features <- split(x, splits)
response <- split(y, splits)
As a reference, how good does it get with minimum effort using random forest classification?
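The fit itself is a one-liner; this mirrors the call echoed in the output below:

library(randomForest)
# train on the training split, evaluate on the held-out test split
rf <- randomForest(x = features$train, y = as.factor(response$train),
                   xtest = features$test, ytest = response$test,
                   proximity = TRUE, keep.forest = TRUE)
rf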
##
## Call:
## randomForest(x = features$train, y = as.factor(response$train), xtest = features$test, ytest = response$test, proximity = TRUE, keep.forest = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 3.21%
## Confusion matrix:
## benign malignant class.error
## benign 294 8 0.02649007
## malignant 7 158 0.04242424
## Test set error rate: 2.31%
## Confusion matrix:
## benign malignant class.error
## benign 139 3 0.02112676
## malignant 2 72 0.02702703
Test set error rate is at 2.31% without a lot of effort. There is not a
lot of room for improvement. (So maybe this is not the best dataset.)
One of the nice things about random forests is that they're easy to understand by looking at the variable importance plot.
varImpPlot(rf)
As expected, this demonstrates that cell size and shape are the most predictive features for the breast cancer RF classifier. We can also inspect per-feature decision surfaces, as plotted below, where the malignant weight increases with higher values of cell size.
So how good does it get using h2o deep learning without much fine-tuning?
Before this analysis, I had already set up the h2o R package. Instructions for running h2o are nicely summarized here. So I can now simply fire up a local instance for testing with the following command.
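A minimal sketch, assuming the defaults are fine (h2o.init starts, or attaches to, a local H2O cluster):

library(h2o)
# fire up a local H2O instance with default settings
localH2O <- h2o.init()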
Now build a model with default parameters. With h2o, you have to
specify predictor and response variables by column index or by column
name. Here, we are using column names. (Have a look at the magrittr R package if you’re confused by the ‘%>%’ operator.)
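A sketch of what that can look like. Note that I am assuming the current h2o R API here (the as.h2o conversion and the training_frame/validation_frame argument names); the response column is named Class, as in the BreastCancer data:

# push the train/test splits into H2O, keeping the Class label as response
train_hex <- as.h2o(data.frame(features$train, Class = response$train))
test_hex  <- as.h2o(data.frame(features$test,  Class = response$test))

# deep learning model with default parameters, validated on the test split
model <- h2o.deeplearning(x = colnames(features$train), y = "Class",
                          training_frame = train_hex,
                          validation_frame = test_hex)
model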
## Deep Learning Model Key: DeepLearning_824431d8d17d566b8e213c76233147cf
##
## Training classification error: 0.0235546
##
## Validation classification error: 0.01388889
##
## Confusion matrix:
## Reported on Last.value.1
## Predicted
## Actual benign malignant Error
## benign 139 3 0.02113
## malignant 0 74 0.00000
## Totals 139 77 0.01389
##
## AUC = 0.9942901 (on validation)
Clearly better than the RF, out of the box and without much fine-tuning. The overall error rate dropped to 1.38%, but more importantly, the sensitivity for detecting malignant cases went up to 100%.
Let's turn some knobs and evaluate what happens.
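The varimp table plotted below can be pulled out of the deep learning model roughly like this (h2o.varimp and the renaming to feature/importance columns are my assumptions):

# variable importances from the deep learning model
vi <- as.data.frame(h2o.varimp(model))
varimp <- data.frame(feature = vi$variable, importance = vi$scaled_importance)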
p <- ggplot(data = varimp, aes(x = feature, y = importance))
p <- p + geom_bar(stat = "identity")
p <- p + coord_flip()
p
So while cell shape mattered a lot for the RF, it's of minor importance in the DL model. And while mitoses did not contribute at all to the RF, they carry a lot of weight in the DL model. Actually, all of the features seem to be used by the DL model, so is it making better use of all the available information?