GeneWalk determines for individual genes the functions that are relevant in a particular biological context and experimental condition. GeneWalk quantifies the similarity between vector representations of a gene and annotated GO terms through representation learning with random walks on a condition-specific gene regulatory network. Similarity significance is determined through comparison with node similarities from randomized networks.
1. Prepare a gene list
GeneWalk requires as an input a text file containing a list with genes of interest relevant to the biological context. In this tutorial we use differentially expressed genes that result from the Qki gene deletion (context) in an RNA sequencing experiment on mouse brains.
We prepare a text (.csv) file with MGI gene IDs of all mouse DE genes. The file can be found here. Each line contains a gene identifier.
TIP (optional): order the genes according to differential expression strength so that your top hits are listed first. The GeneWalk output file maintains this gene order for your convenience.
In general, GeneWalk currently supports gene list files containing HGNC human gene symbols, HGNC IDs, or mouse gene IDs (with or without MGI: prefix).
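As an illustration of the gene list preparation and the ordering tip above, here is a minimal sketch that sorts differential expression results and writes one identifier per line. The MGI IDs and adjusted p-values are hypothetical placeholders, not data from the Qki experiment, and the helper names are our own.

```python
import io

# Hypothetical differential expression results: (MGI ID, adjusted p-value).
# These identifiers and values are placeholders for illustration only.
de_results = [
    ("MGI:0000003", 1e-4),
    ("MGI:0000001", 1e-12),
    ("MGI:0000002", 1e-8),
]


def ordered_gene_ids(results):
    """Return gene IDs sorted by ascending adjusted p-value (top hits first)."""
    ranked = sorted(results, key=lambda row: row[1])
    return [gene_id for gene_id, _ in ranked]


def write_gene_list(results, handle):
    """Write one gene identifier per line to an open text handle."""
    for gene_id in ordered_gene_ids(results):
        handle.write(gene_id + "\n")


# An in-memory buffer stands in for the real input file here.
buffer = io.StringIO()
write_gene_list(de_results, buffer)
gene_file_text = buffer.getvalue()
print(gene_file_text)
```

To produce an actual input file, pass a handle from open() with your own path instead of the in-memory buffer.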
2. Install GeneWalk
GeneWalk is built in Python 3. See python.org for instructions on installing Python on your computer. Then install GeneWalk by running in your terminal:
pip install genewalk
3. Run GeneWalk
GeneWalk can be run as a standalone program in your terminal:
genewalk --project qki --genes /home/QKI_forGW.csv --id_type mgi_id
Replace /home/QKI_forGW.csv with your input file path.
TIP: reduce the runtime through parallelization. Many computers have 4 (or more) processors; make use of them by adding --nproc 4 to the above command. For more details on how to customize the command to your needs, check out our GitHub page.
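If you prefer to launch the run from a script, the command can be assembled in Python. This is a minimal sketch that uses only the options shown in this tutorial (--project, --genes, --id_type, --nproc). The helper name build_genewalk_command is our own and is not part of GeneWalk.

```python
import os
import shlex


def build_genewalk_command(project, genes_path, id_type, nproc=None):
    """Assemble the genewalk command line from the options used in this tutorial."""
    if nproc is None:
        # Fall back to the number of processors reported by the operating system.
        nproc = os.cpu_count() or 1
    return [
        "genewalk",
        "--project", project,
        "--genes", genes_path,
        "--id_type", id_type,
        "--nproc", str(nproc),
    ]


command = build_genewalk_command("qki", "/home/QKI_forGW.csv", "mgi_id", nproc=4)
print(shlex.join(command))
# To start the run, pass the list to subprocess.run(command, check=True).
```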
GeneWalk will run and generate a folder named after the chosen project, /home/genewalk/qki/, to save all results files. For example, it provides a log text file, genewalk_all.log, that lets you follow its progress. Below we explain the GeneWalk algorithm in more detail. If you are more interested in the output results, feel free to proceed with step 4.
GeneWalk network assembly
First, the mouse genes are mapped to human orthologs. Not all mouse genes are expected to map. The program issues warnings for these cases to make you aware of them but continues to run as expected. In our case of QKI, no human homolog was found for ~5% of our 1861 input genes, and these genes are not included in further analysis.
Next, all reactions between the input genes are extracted from a knowledge base, for instance INDRA or in this case Pathway Commons, and automatically assembled into a gene network. This key step enables function relevance predictions that are specific to the experimental context. Some input genes have no reactions with other input genes, so there is insufficient context-specific information on them to make relevance predictions, and they are therefore not included in the GeneWalk network. Overall we find for QKI that ~85% of the input genes are present in the network and end up having annotation relevance scores.
Lastly, the GO ontology and annotations are added to the network, resulting in the full GeneWalk network. For QKI, the subnetwork related to DE genes Mal, Pllp and Plp1 and their annotations is visualized to give an impression of the amount of knowledge available on just three genes.
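The assembly steps above can be sketched with plain Python dictionaries. This is a simplified illustration under stated assumptions, not GeneWalk's implementation: the reactions, the gene GeneX and the GO identifiers are made-up placeholders, and only the ontology edges listed here are added.

```python
# Hypothetical inputs: Mal, Pllp and Plp1 are genes named in this tutorial, but
# the reactions, GeneX, OutsideGene and the GO identifiers are placeholders.
input_genes = {"Mal", "Pllp", "Plp1", "GeneX"}
reactions = [("Mal", "Plp1"), ("Pllp", "Plp1"), ("Mal", "OutsideGene")]
annotations = {
    "Mal": ["GO:A"],
    "Plp1": ["GO:A", "GO:B"],
    "Pllp": ["GO:B"],
    "GeneX": ["GO:C"],
}
ontology_edges = [("GO:A", "GO:PARENT"), ("GO:B", "GO:PARENT")]


def add_edge(graph, a, b):
    """Add an undirected edge to an adjacency-set dictionary."""
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def assemble_network(genes, reactions, annotations, ontology_edges):
    """Build a toy gene and GO term network from the given inputs."""
    graph = {}
    # Keep only reactions in which both partners are input genes.
    for a, b in reactions:
        if a in genes and b in genes:
            add_edge(graph, a, b)
    # Genes without any such reaction never enter the network.
    connected_genes = set(graph)
    # Link each connected gene to its GO annotations.
    for gene in sorted(connected_genes):
        for go_term in annotations.get(gene, []):
            add_edge(graph, gene, go_term)
    # Add the relations between GO terms.
    for child, parent in ontology_edges:
        add_edge(graph, child, parent)
    return graph, connected_genes


network, connected = assemble_network(input_genes, reactions, annotations, ontology_edges)
print(sorted(connected))
```

In this toy example GeneX has no reaction with another input gene, so it is left out of the network, mirroring the exclusion described above.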
Network representation learning
To determine how genes and GO terms constituting the GeneWalk network relate to one another, we apply a network representation learning algorithm based on random walks (DeepWalk) that was first developed for social media networks. In essence, it converts network nodes to vectors based on the structure of the GeneWalk network. Groups of interconnected genes and GO terms that are mechanistically or functionally linked to each other occur most frequently as gene–GO term neighboring pairs on the sampled random walks and will look similar in vector representation, as quantified by their cosine similarity. For more details see our bioRxiv paper.
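Two ingredients of this step, sampling random walks and comparing vectors by cosine similarity, can be illustrated with a short sketch. The toy graph and the function names are our own assumptions. The step that learns node vectors from the walks (DeepWalk uses a skip-gram model for this) is deliberately omitted.

```python
import math
import random


def sample_random_walk(graph, start, length, rng):
    """Sample a walk of up to `length` nodes by moving to a random neighbor."""
    walk = [start]
    while len(walk) < length:
        neighbors = sorted(graph[walk[-1]])
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy network: two genes from this tutorial and a placeholder GO term.
toy_graph = {
    "Plp1": {"Mal", "GO:A"},
    "Mal": {"Plp1", "GO:A"},
    "GO:A": {"Plp1", "Mal"},
}
rng = random.Random(0)  # fixed seed so the sampled walks are reproducible
walks = [sample_random_walk(toy_graph, node, 5, rng) for node in sorted(toy_graph)]
for walk in walks:
    print(" -> ".join(walk))
```

Vectors that point in the same direction have a cosine similarity of 1, while orthogonal vectors have a similarity of 0.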
Next, GeneWalk uses a significance test to determine whether the cosine similarity between a gene and a GO term is higher than expected by chance. By comparing the similarity to similarity values between node vectors arising from random networks, we obtain a p-value. Because a gene can have multiple GO annotations, we apply a multiple testing correction (Benjamini-Hochberg false discovery rate (FDR) procedure), yielding a corrected p-adjust value. This procedure is repeated multiple times to retrieve a robust mean estimate (mean p-adjust) and its uncertainty (95% confidence interval) as the final statistical significance results for relevance.
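The two statistical ingredients named above can be sketched as follows. The empirical p-value definition used here (the fraction of null similarities at least as large as the observed one) is an assumption for illustration and may differ in detail from what GeneWalk computes. The Benjamini-Hochberg adjustment follows the standard procedure.

```python
def empirical_p_value(observed, null_values):
    """Fraction of null similarities that are at least as large as `observed`."""
    exceed = sum(1 for value in null_values if value >= observed)
    return exceed / len(null_values)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical similarities from randomized networks and one observed value.
null_similarities = [0.1, 0.2, 0.3, 0.4]
print(empirical_p_value(0.35, null_similarities))

# Hypothetical p-values for four GO annotations of one gene.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```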
4. Interpretation of GeneWalk results
The output of GeneWalk is a comma-separated text file, genewalk_results.csv, that lists for each input mouse gene (mgi_id) its GO annotations (go_name and go_id columns), ranked by statistical significance as measured by the mean FDR-adjusted p-value (mean_padj). The mapped human ortholog (hgnc_symbol, hgnc_id) is also indicated, and the annotations are grouped by the three GO domains.
So from here you can search for your gene of interest and directly see the most relevant GO terms.
To indicate the uncertainty on our FDR adjusted p-value estimates, we provide 95% confidence intervals (cilow_padj and ciupp_padj). An FDR significance level (e.g. mean_padj < 0.05) can be used as a threshold to classify GO annotations as significant or not in this particular experimental context. Also shown is the number of connections the gene and GO term have in the GeneWalk network (ncon_gene and ncon_go respectively).
For completeness (not shown in image above), the results file also provides the above statistics for the mean p-value (not adjusted for multiple annotation testing: mean_pval, cilow_pval, ciupp_pval) and the mean cosine similarity (and standard error: mean_sim and sem_sim).
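To look up the significant annotations of one gene programmatically, the results file can be filtered on mean_padj. The sketch below reads a small in-memory excerpt that reuses column names described above. The rows, gene symbols and GO identifiers are made-up placeholders, and the real file contains more columns.

```python
import csv
import io

# Hypothetical excerpt of a results file; all values are placeholders.
sample_csv = """mgi_id,hgnc_symbol,go_name,go_id,mean_padj,cilow_padj,ciupp_padj
MGI:0000001,GENE1,term one,GO:0000001,0.01,0.005,0.02
MGI:0000001,GENE1,term two,GO:0000002,0.20,0.10,0.40
MGI:0000002,GENE2,term three,GO:0000003,0.03,0.01,0.09
"""


def significant_annotations(handle, gene_id, threshold=0.05):
    """Return the rows for `gene_id` with mean_padj below `threshold`, best first."""
    rows = [
        row
        for row in csv.DictReader(handle)
        if row["mgi_id"] == gene_id and float(row["mean_padj"]) < threshold
    ]
    return sorted(rows, key=lambda row: float(row["mean_padj"]))


hits = significant_annotations(io.StringIO(sample_csv), "MGI:0000001")
for row in hits:
    print(row["go_id"], row["go_name"], row["mean_padj"])
```

For a real analysis, replace the in-memory excerpt with a handle opened on your own genewalk_results.csv.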
TIP (optional): to perform the connectivity analysis described in our bioRxiv paper programmatically, note that the gene connectivity with other genes equals ncon_gene minus the number of GO annotations (number of rows per gene). The GeneWalk network (networkx format) itself is also output as a multi_graph.pkl file (pickle binary format), which can be loaded into Python for further analysis.
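The connectivity calculation from the tip can be sketched as follows, assuming one row per gene and GO annotation pair as described above. The excerpt is hypothetical and keeps only the columns needed here.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a results file; all values are placeholders.
sample_csv = """mgi_id,go_id,ncon_gene
MGI:0000001,GO:0000001,7
MGI:0000001,GO:0000002,7
MGI:0000002,GO:0000003,4
"""


def gene_connectivity(handle):
    """Gene-gene connectivity: ncon_gene minus the number of rows per gene."""
    n_annotations = Counter()
    ncon_gene = {}
    for row in csv.DictReader(handle):
        n_annotations[row["mgi_id"]] += 1
        # int(float(...)) also accepts values written as "7.0".
        ncon_gene[row["mgi_id"]] = int(float(row["ncon_gene"]))
    return {gene: ncon_gene[gene] - n_annotations[gene] for gene in ncon_gene}


connectivity = gene_connectivity(io.StringIO(sample_csv))
print(connectivity)
```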
More details on the output can be found on our GitHub page.
5. Visualization of function relevance ranking
To visualize the relevance ranking for individual genes, we plot a bar chart of GO terms ranked according to -log10(mean_padj). Here we show the results for the Mal and Plp1 genes.
TIP: the error bars correspond to the 95% confidence intervals as described above. The FDR significance level of 0.05 (red dashed line) is also shown. It becomes apparent that multiple annotations can have similar relevance levels. So have a look beyond the top GO annotation in your GeneWalk results list.
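The bar heights and error bars of such a chart can be derived from the mean_padj, cilow_padj and ciupp_padj columns. This is a minimal sketch of the arithmetic only, with hypothetical input values. The resulting numbers can then be passed to the plotting library of your choice.

```python
import math


def bar_with_error(mean_padj, cilow_padj, ciupp_padj):
    """Return (height, lower error, upper error) on the -log10 scale."""
    height = -math.log10(mean_padj)
    # A smaller p-value gives a taller bar, so cilow_padj sets the upper end.
    err_up = -math.log10(cilow_padj) - height
    err_down = height + math.log10(ciupp_padj)
    return height, err_down, err_up


# Position of the FDR significance level of 0.05 on the same scale.
threshold_line = -math.log10(0.05)

# Hypothetical annotation with mean_padj = 0.01 and a 95% CI of [0.001, 0.1].
height, err_down, err_up = bar_with_error(0.01, 0.001, 0.1)
print(height, err_down, err_up, threshold_line)
```

Because the axis is logarithmic, the error bars are generally asymmetric around the bar height.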