This is an example of using my toolkit to do the HPO GSEA.
About the toolkit installation, see the GitHub repo.
In this example, we use the sex-associated genes as input example, it's download from the SAGD database. Data group SAGD_00055(human hypothalamus tissue) is selected as input.
# load required modules
import pandas as pd
import matplotlib.pyplot as plt
from bokeh.plotting import show
from bokeh.io import output_notebook
output_notebook()
from hpoea.enrich import GSEA
from hpoea.plot import LineagePlot, dot_plot
Read input differential gene experission result table.
input_csv = "../data/SAGD_00055.csv"
exp = pd.read_csv(input_csv)
exp.columns = ["ensembl_id"] + list(exp.columns)[1:]
exp.head(3)
Select the significant differential expressed genes with condition: padj <= 0.05
sig = exp[exp.padj <= 0.05]
test_genes = list(sig.ensembl_id)
len(test_genes)
Total 54 genes was selected, here is the gene list:
for i in range(len(test_genes)//5 + 1):
print(" ".join(test_genes[i*5:(i+1)*5]))
As we can see, the gene name is in Ensembl format, but HPO require Entrez format. We convert them firstly.
from hpoea.utils.idconvert import EntrezEnsemblConvert
cvt = EntrezEnsemblConvert()
test_entrez = cvt.ensembl2entrez(test_genes)
Unfortunatlly, 18 genes was lost during the conversion process.
gsea = GSEA()
gsea.enrich(test_entrez)
gsea.multiple_test_corretion(method='fdr_bh')
gsea.enrichment_table.head(1)
gsea.enrichment_table.shape[0]
Filter the enrichment results with condition: padj <= 0.05
gsea.filter(by='padj', threshold=0.05)
gsea.enrichment_table
Using this strict condition, we got 5 significant enriched HPO Terms.
Dot plot:
p = dot_plot(gsea.enrichment_table, size=20, x='pvalue')
show(p)
Ontology lineage plot:
terms = list(gsea.enrichment_table.HPO_term_ID)
lin = LineagePlot()
fig, ax = plt.subplots(figsize=(20, 10))
lin.plot(terms, ax=ax)