GeneWalk determines for individual genes the functions that are relevant in a particular biological context and experimental condition. GeneWalk quantifies the similarity between vector representations of a gene and annotated GO terms through representation learning with random walks on a condition-specific gene regulatory network. Similarity significance is determined through comparison with node similarities from randomized networks.



1. Prepare a gene list

GeneWalk requires as an input a text file containing a list with genes of interest relevant to the biological context. In this tutorial we use differentially expressed genes that result from the Qki gene deletion (context) in an RNA sequencing experiment on mouse brains.

We prepare a text (.csv) file with MGI gene IDs of all mouse DE genes. The file can be found here. Each line contains a gene identifier.

TIP (optional): order the genes according to differential expression strength so that your top hits are listed first. The GeneWalk output file maintains this gene order for your convenience.

In general, GeneWalk currently supports gene list files containing HGNC human gene symbols, HGNC IDs, or mouse gene IDs (with or without MGI: prefix).

2. Install GeneWalk

GeneWalk is built in Python 3 (version 3.5 or higher). Check python.org on how to install Python on your computer. Then simply install GeneWalk by running in your terminal:

pip install genewalk

TIP: If this installation gave rise to any errors, this is most likely due to conflicting version dependencies with other Python packages installed on your system. We recommend to activate a new virtual environment and install (and run) GeneWalk from there.

3. Run GeneWalk

GeneWalk can be run as a stand alone program in your terminal:

genewalk --project qki --genes /home/QKI_forGW.csv --id_type mgi_id

Replace /home/QKI_forGW.csv with your input file path. The expected run time will be 6-12 hours.
TIP: reduce the runtime to 1-2 hours through parallelization. Many computers have 4 (or more) processors, make use of them by adding --nproc 4 to the above command.

GeneWalk will run and generate a folder according to the chosen project name /home/genewalk/qki/ to save all results files. For example, it provides a log text file genewalk_all.log to follow its progress.

genewalk --help
provides explanation of various optional arguments. For more details on how to customize the command to your needs, check out our GitHub page.

Below we explain the GeneWalk algorithm in more detail. If you are more interested in the output results, feel free to proceed with step 4.

GeneWalk network assembly

First, the mouse genes are mapped to human orthologs. It is expected that not all mouse genes can be mapped. The program provides warnings for these cases to make you aware but continues to run as expected. In our case of QKI, we find that for ~5% of our 1861 input genes no human homolog was found and are not included for further analysis.

Next, all reactions between the input genes are extracted from a knowledge base, for instance INDRA or in this case Pathway Commons, and automatically assembled into a gene network. This key step enables function relevance predictions that are specific to the experimental context.

Some input genes have no reactions with other input genes so there is insufficient context-specific information on that gene to make relevance predictions and thereby not included in the GeneWalk network. Overall we find for QKI that ~85% of the input genes are present in the network and end up having annotation relevance scores. The other genes will not have any GO annotations as output.

Lastly, the GO ontology and annotations are added to the network, resulting in the full GeneWalk network. For QKI, the subnetwork related to DE genes Mal, Pllp and Plp1 and their annotations is visualized to give an impression of the amount of knowledge available on just three genes.

Network representation learning

To determine how genes and GO terms constituting the GeneWalk Network relate to one another, we apply a network representation learning algorithm based on random walks (DeepWalk) that was first developed for social media networks. In essence it converts network nodes to vectors based on the structure of the GeneWalk network: groups of interconnected genes and GO terms that are mechanistically or functionally linked to each other occur most frequently as gene–GO term neighboring pairs on the sampled random walks and will look similar in vector representation as quantified by their cosine similarity. For more details see our bioRxiv paper.

Significance testing

Next, GeneWalk calculates whether a cosine similarity value between a gene and GO term is higher than expected by chance using a significance test. By comparing the similarity to similarity values between node vectors arising from random networks, we obtain a p-value. Because a gene can have multiple GO annotations, we apply a multiple testing correction (Benjamin-Hochberg false discovery rate FDR procedure), yielding a corrected p-adjust value. This procedure is repeated multiple times to retrieve a robust mean estimate (mean p-adjust) and its uncertainty (95% confidence interval) as final statistical significance relevance results.

GeneWalk subnetwork of three input genes. Highlighted edges are the GO annotations for  MaL.

GeneWalk subnetwork of three input genes. Highlighted edges are the GO annotations for MaL.


4. Interpretation of GeneWalk results

The output of GeneWalk is a comma separated text file genewalk_results.csv with GO annotations (go_name and go_id columns) ranked quantitatively by statistical significance, the mean FDR adjusted p-value (mean_padj), for each input mouse gene (mgi_id). Its mapped human ortholog (hgnc_symbol, hgnc_id) is also indicated and the annotations are grouped by the three GO domains.
So from here you can search for your gene of interest and directly see the most relevant GO terms.

Some input genes do not have any GO annotations or relevance scores listed, because no reactions with other input genes could be retrieved from the knowledge base. So for these genes there is insufficient context-specific information to make functional relevance predictions. Overall we find for QKI that ~85% of the input genes do have annotation relevance scores.


To indicate the uncertainty on our FDR adjusted p-value estimates, we provide 95% confidence intervals (cilow_padj and ciupp_padj). An FDR significance level (e.g. mean_padj < 0.05) can be used as a threshold to classify GO annotations as significant or not in this particular experimental context. Also shown is the number of connections the gene and GO term have in the GeneWalk network (ncon_gene and ncon_go respectively).

For completeness (not shown in image above), the results file also provides the above statistics for the mean p-value (not adjusted for multiple annotation testing: mean_pval, cilow_pval, ciupp_pval) and the mean cosine similarity (and standard error: mean_sim and sem_sim).

TIP (optional): to perform a connectivity analysis as described in our biorxiv paper programmatically: the gene connectivity with other genes equals ncon_gene minus the number of GO annotations (number of rows per gene). The GeneWalk network (networkx format) itself is also output as a multi_graph.pkl file (pickle binary format), which can be loaded into Python for further analysis.

More details on the output can be found on our GitHub page.

5. Visualization of function relevance ranking

To visualize the relevance ranking for individual genes, we plot a bar chart of GO terms ranked according to the -log10(mean_padj). Here we show the results for the Mal and Plp1 gene.

TIP: the error bars correspond to the 95% confidence intervals as described above. The FDR significance level of 0.05 (red dashed line) is also shown. It becomes apparent that multiple annotations can have similar relevance levels. So have a look beyond the top GO annotation in your GeneWalk results list.

Below we provide the corresponding Python code to generate the plots in a jupyter notebook. This is followed by some R code, to plot this chart in R.


Python code to plot bar charts


import other packages (numpy, pandas, matplotlib and seaborn)

Install these with pip install before importing for plotting purposes below.

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['pdf.fonttype'] = 42 #embed fonts into pdf figure
import seaborn as sns

Load GeneWalk results file

In [ ]:
path = '/home/genewalk/qki/'            
filename = 'genewalk_results.csv'

GW = pd.read_csv(path+filename) 

GW['mlog10padj'] = -np.log10(GW['mean_padj']) #calculate -log10 values
GW['mlog10padj_err'] = -np.log10(GW['cilow_padj']) - GW['mlog10padj'] #calculate -log10 error bars

Plot GeneWalk annotation relevance scores bar charts

In [ ]:
alpha_FDR_plot = 0.05
GENES = ['MAL','PLP1']

for gene in GENES:
    dplot = GW[GW['hgnc_symbol']==gene]    
    dplot = dplot.sort_values(by=['mean_padj','mean_sim'],ascending=[True,False])
    sns.set(style="whitegrid", color_codes=True)
    f, ax = plt.subplots(figsize=(2,0.25*len(dplot)))#units: inch
    ax = sns.barplot(x='mlog10padj', 
    plt.axvline(x=-np.log10(alpha_FDR_plot), color='r', linestyle='--')
    font_sz = 12
    plt.xticks(range(4), size=font_sz)
    plt.xlim(0, max(dplot['mlog10padj']+dplot['mlog10padj_err'])+0.3)
    plt.ylabel('', size=font_sz)

    filename = 'web_tutorial_qki_x_mlog10padj_y_GO_'+gene+'.'

R code to plot bar charts


Import relevant library (ggplot2)

Install ggplot2 with install.packages before importing for plotting purposes below.

path = '/home/genewalk/qki/'
filename = 'genewalk_results.csv' 

Load GeneWalk results file

GW = read.csv(paste(path,filename,sep=""),header = TRUE)

GW$mlog10padj = -log10(GW$mean_padj) #calculate -log10 values
GW$mlog10padj_err = -log10(GW$cilow_padj) - GW$mlog10padj #calculate -log10 error bars

Plot GeneWalk annotation relevance scores bar charts

alpha_FDR_plot = 0.05
GENES = c('MAL','PLP1')

for (gene in GENES){
  dplot = GW[which(GW$hgnc_symbol==gene),]    
  #for plotting: order go terms (from all GO domains) according to mean_padj (and if equal then sort by mean_sim)
  dplot$go_name = factor(dplot$go_name, 
                         levels = dplot$go_name[with(dplot,order(-mean_padj,mean_sim))])
  ymax=max(dplot$mlog10padj + dplot$mlog10padj_err) + 0.3
  f = ggplot(dplot, aes(go_name,mlog10padj)) + 
  theme_light(base_size=font_sz) +
  geom_bar(stat='identity',fill='royalblue',width=0.8) +
  geom_errorbar(aes(ymin=mlog10padj-mlog10padj_err, ymax=mlog10padj+mlog10padj_err),
                width=0.2, color='grey') +
  geom_hline(yintercept=-log10(alpha_FDR_plot), color='red', linetype='dashed') +
  xlab('') +
  ylab('-log10(p-adjust)') +
  theme(panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_text(size=font_sz*0.8)) +
                     limits = c(0, ymax)) +
  gonames = as.character(dplot$go_name)
  lchar = sapply(gonames, nchar)
  w0 = max(lchar)
  fname = paste('web_tutorialR_qki_x_mlog10padj_y_GO_',gene,'.',sep="") 
         width = (ymax*0.3+w0*0.1), 
         width = (ymax*0.3+w0*0.1), 


## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] ggplot2_2.2.1
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0       digest_0.6.15    plyr_1.8.4       grid_3.5.0      
##  [5] gtable_0.2.0     magrittr_1.5     evaluate_0.14    scales_0.5.0    
##  [9] pillar_1.2.3     rlang_0.2.1      stringi_1.2.3    lazyeval_0.2.1  
## [13] rmarkdown_1.15   tools_3.5.0      stringr_1.3.1    munsell_0.5.0   
## [17] xfun_0.9         yaml_2.1.19      compiler_3.5.0   colorspace_1.3-2
## [21] htmltools_0.3.6  knitr_1.25       tibble_1.4.2

More info on GeneWalk