Building a simple word2vec model on OMIM database.

When reading about deep learning, I found the word2vec manuscript awesome. Its a relatively simple concept to transform a word by its context into a vector representation, but I was amazed that the mathematical distance between these vectors actually turned out to keep actual meaning.

word2vec manuscript

So, great. Now can we all agree its impressive that computers can learn how France->Paris is like Italy->Rome. but how useful is it if we give it a brief shot on medical genetic data?

I decided to make use of the NCBI OMIM database as a text corpus to build the word2vec model.

OMIM: Online Mendelian Inheritance In Man

OMIM is a comprehensive, autorative compendium of human genes and genetic phenotypes that is freely available. Authored and edited by Institute of Genetic Medicine, Johns Hopkins University School of Medicine

Willing to be productive as quick as possible, I decided to work with deeplearning4j as I am familiar with Java for the last 10 years. And I am pretty fond of spring-boot these days, so I could easily share the outcome of this experiment as a service in the future.

I first got up to speed with deeplearning4j by their tutorials on their home page, more specifically the one about word2vec.

Ok, so I downloaded the whole omim database, which is a 178MB txt file that I made available as a Java ClassPathResource which is fed into a SentenceIterator from deeplearning4j.

-rw-r--r--   1 kennyhelsens  staff   178M May 28 15:00 omim.txt
ClassPathResource resource = new ClassPathResource("omim.txt");
SentenceIterator iter = new LineSentenceIterator(
iter.setPreProcessor(new SentencePreProcessor() {
public String preProcess(String sentence) {
return sentence.toLowerCase();

Next, just following the instructions from deeplearning4j, I decided to make use of the default TokenizerFactory, and we’re already fine to give it a first shot with a minimum configuration. (I’m running this during my daily train ride on my 3 year old macbook pro.)

TokenizerFactory t = new DefaultTokenizerFactory();

Word2Vec vec = new Word2Vec.Builder().

SerializationUtils.saveObject(vec, new File("vec-omim.ser"));

So what does this configuration mean?

  • We set minWordFrequency to 20 to leave out very sparse words.
  • I’ve decreased iterations and increased the minLearningRate a little bit to make faster progress. I don’t intend to write an academic paper here, just looking for some low hanging fruit.
  • layerSize was also decreased to 100 instead of the 300 from the word2vec manuscript, also for time considerations. A 100 dimensional vector for a word still feels like a whole lot.

In the end, I’m serializing the Word2Vec object to disk, such that I can play a bit with it afterwards without retraining over and over again.

So in another Java class I deserialze the file, and make use of the wordsNearest methods on the Word2Vec instance.

Word2Vec vec =
SerializationUtils.readObject(new File("vec-omim.ser"));

similar = vec.wordsNearest("alk");
System.out.println("alk" + ":" + similar);
Term word2vec similar terms
alk nonsmall, carcinomas, carcinoma, mapdkd

Wow. This is pretty cool. Alk is a known oncogene in non-small cell lung carcinoma.

Lets try a few more.

Term word2vec similar terms
invasion metastasis, adhesion, invasiveness, migration, factor-alpha, colony, anchorage-independent, tumorigenicity, proliferation
angiogenesis adhesion, apoptosis, proliferation, migration, invasion, tnf, healing, anchorage-independent, factor-alpha
signaling mapkd, tgfbd, mapkdd, shh, erkd, cdknda, wnt, nfkb, erk
aneurysm valve, arteriosu, ductus, dissection
telomere break, fork, systolic, diastolic, clavicles, telomerase
lyme erythematosus, rheum, allergy
brca breast-ovarian, nonsmall, carcinoma, cancers, nonpolyposi
alzheimer ad, pd, crohn, parkinson, celiac
  • signaling is associated with a list of critical pathway genes
  • brca gene has well known mutation driving breast and ovarian and cancer. As expected, but still very clever of the word2vec model.
  • telomere is associated with systolic and diastolic, I do not fully grasp these associations. Telomerase is clear though.
  • angiogenesis is associated with adhesion, migration, invasion, healing, tnf. Again very strong of the model.

Conclusion I
It seems like the word2vec model, far from optimally trained on my macbook, did learn to make quite a few good associations from the model.

Lets do some negative testing with nonsense words.

Term word2vec similar terms
kenny harano, moo-penn, hamel, stevenson, male
university pp, press, ed.), (pub.)
the /
why /

So my first name is associated with some authors, and university is shared with press and other. As expected, the and why which occur at random, don’t return any associtions. Great, its pretty good.

Finally, the word2vec examples are known for their analogies. France is to Paris, what Italy is to X. Word2vec can fill in Rome here by crunching Wikipedia. So can we try to find analogous terms for genotype-phenotype associations?

Positive Terms Negative Terms Word2vec analogous Terms
+brca +breast -alk nonpolyposis, carcinoma, nonsmall, squamous, colorectal

Once again very impressive. The addition of the breast vector and negation of the alk vector, yields a vector nearby ‘colorectal’. Indeed, in a cancer setting, brca means to breast what alk means to colorectal.

Any remarks or suggestions, let me know below!

Testing randomForest and H2O deep learning

I have been wondering what all the noise about deep learning is about. Its still neural networks, right? I have had not so much experience with NN because they’re supposed to be hard to get right due to paramater tuning, which is a downer if you’re used to good alround performers like random forests. Still I decided to set out on a series of blogposts using h2o (R) and deeplearning4j (R) on biotech datasets.

We’ll be working with the BreastCancer dataset from the mlbench package. From the package description:

The objective is to identify each of a number of benign or malignant classes. Samples arrive periodically as Dr. Wolberg reports his clinical cases. The database therefore reflects this chronological grouping of the data. This grouping information appears immediately below, having been removed from the data itself. Each variable except for the first was converted into 11 primitive numerical attributes with values ranging from 0 through 10. There are 16 missing attribute values. See cited below for more details.

data("BreastCancer", package = "mlbench")

Let’s do some data munging.

BreastCancer %<>%
# remove NA values for simplicity
BreastCancer %<>% na.omit

y <- BreastCancer$Class %>% as.character() %>% as.factor
x <- BreastCancer %>% select(2:(NCOL(.)-1))

# get all nominal values as numeric
x <- apply(x, 2, as.numeric) %>% data.table

Prepare test/training splits.

# split the data in test/train
splits <-
sample(x = c("train","test"),
size = NROW(x),
replace = T,
prob = c(0.7,0.3))
features <- split(x, splits)
response <- split(y, splits)

As a reference, how good does it get with minimum effort using random forest classification?

rf <- randomForest(x = features$train,
y = as.factor(response$train),
xtest = features$test,
ytest = response$test,
keep.forest = TRUE,
proximity = TRUE)
## Call:
##  randomForest(x = features$train, y = as.factor(response$train),      xtest = features$test, ytest = response$test, proximity = TRUE,      keep.forest = TRUE)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
##         OOB estimate of  error rate: 3.21%
## Confusion matrix:
##           benign malignant class.error
## benign       294         8  0.02649007
## malignant      7       158  0.04242424
##                 Test set error rate: 2.31%
## Confusion matrix:
##           benign malignant class.error
## benign       139         3  0.02112676
## malignant      2        72  0.02702703

Test set error rate is at 2.31% without a lot of effort. There is not a lot of room for improvement. (So maybe this is not the best dataset.) One of the nice things about randomforests is that they’re easy to understand by looking at the variable importance plot.


Demonstrates as expected that cell size and shape are most predictive features for the breast cancer RF classifier. We can inspect per feature decision surfaces, as plotted below where malignant weight increases with higher value of cell size.

response = features$train,
features = "Cell.size",
main = "",
which.class = "malignant")

So how good does it get using h2o deeplearning without much finetuning?

Before this analysis, I had already setup the h2o R package. Instructions for running h2o are nicely summarized here. So I can now simply fire up a local instance for testing with the following command.

instance <- h2o.init()

Lets load the training and test data into h2o.

h2orefs <- list()
h2orefs$train <- as.h2o(instance, cbind(features$train, Class=response$train))

h2orefs$test <- as.h2o(instance, cbind(features$test, Class=response$test))

Now build a model with default parameters. With h2o, you have to specify predictor and response variables by column index or by column name. Here, we are using column names. (Have a look at the magrittr R package if you’re confused by the ‘%>%’ operator.)

h2orefs$model <-
h2o.deeplearning(x = features$train %>% colnames,
y = "Class",
data = h2orefs$train,
validation = h2orefs$test,
classification = TRUE)

## Deep Learning Model Key: DeepLearning_824431d8d17d566b8e213c76233147cf
## Training classification error: 0.0235546
## Validation classification error: 0.01388889
## Confusion matrix:
## Reported on Last.value.1
##            Predicted
## Actual      benign malignant   Error
##   benign       139         3 0.02113
##   malignant      0        74 0.00000
##   Totals       139        77 0.01389
## AUC =  0.9942901 (on validation)

Clearly better than the RF, out of the box without many finetuning. The overall error rate dropped to 1.38%, but more importantly the sensitivity detecting malignant cases went up to 100%.

Lets try and turn some knobs and evaluate what happens.

Adding more layers and neurons?

h2orefs$model <-
h2o.deeplearning(x = features$train %>% colnames,
y = "Class",
data = h2orefs$train,
validation = h2orefs$test,
classification = TRUE,
hidden = c(1000, 500, 300))

## Deep Learning Model Key: DeepLearning_aa147b54007becb9e8434eb8100a4ab9
## Training classification error: 0.02141328
## Validation classification error: 0.01388889
## Confusion matrix:
## Reported on Last.value.1
##            Predicted
## Actual      benign malignant   Error
##   benign       139         3 0.02113
##   malignant      0        74 0.00000
##   Totals       139        77 0.01389
## AUC =  0.9943852 (on validation)

Not much effect here.

Decrease the number of neurons, and add some regularization?

h2orefs$model <-
h2o.deeplearning(x = features$train %>% colnames,
y = "Class",
data = h2orefs$train,
validation = h2orefs$test,
classification = TRUE,
hidden = c(10, 20, 30),
input_dropout_ratio = 0.3,
hidden_dropout_ratios = c(0.3,0.3,0.3),
epochs = 50,
l1 = 0.0005,
l2 = 0.0005)

## Deep Learning Model Key: DeepLearning_ae8df876b491d8bdbaf64926c7b64c59
## Training classification error: 0.0235546
## Validation classification error: 0.01851852
## Confusion matrix:
## Reported on Last.value.1
##            Predicted
## Actual      benign malignant   Error
##   benign       138         4 0.02817
##   malignant      0        74 0.00000
##   Totals       138        78 0.01852
## AUC =  0.9943852 (on validation)

Not much effect. It would be interesting to whether the false positives evolved into malignant tissue after the data was collected for this study.

Now lets have a look at the variable importance derived from the DL classifier.

h2orefs$model <-
h2o.deeplearning(x = features$train %>% colnames,
y = "Class",
data = h2orefs$train,
validation = h2orefs$test,
classification = TRUE,
variable_importances = TRUE)

t(data.frame(h2orefs$model@model$varimp)) %>%
data.table(feature = rownames(.), importance = .[,1]) %>%

varimp$feature %<>%
factor(., levels = unique(.))
p <- ggplot(data = varimp, aes(x = feature, y = importance))
p <- p + geom_bar(stat = "identity")
p <- p + coord_flip()

So while cell shape was determining the RF a lot, its of minor importance in the DL model. And while the mitoses did not contribute at all to the RF, it has a lot of importance in the DL model. Actually all of the features seem to be used by the DL model, so is it making better use of all available information?

First post

The goal of this blog is to bring more focus into those little projects one does as an extra. Learning a new package, a couple of random thoughts or a strong opinion about a particular story. We’ll not try to focus to much about technology, but put our minds into the data, week by week.