Dataset creation description


Collection of Data About Genes (the Instances) and Classes

Age-related gene expression data for the brain (cerebellum, cerebrum, hippocampus and striatum) were collected from GEO and AgeMap, using 118 mouse brain samples. After several filtering and post-processing steps, we obtained three sets of human genes: "overexpressed with age in the brain", "underexpressed with age in the brain", and "no change in expression with age in the brain". A link to the full paper, with more details about the instance creation procedure, will be made available on this page after publication.

In the final dataset, the proportions of human protein-coding genes in the classes "overexpressed", "underexpressed" and "no change in expression" with age in the brain are 2.4%, 0.8% and 96.8%, respectively.


Gene Ontology (GO) Terms-based Features

The instances (genes) are described by features that each represent the presence or absence of a GO term. We use GO term features because they are well known and easy to interpret: they use a controlled vocabulary, curated by experts, so the terms have well-defined biological meanings.

The list of GO terms associated with each gene was retrieved from the NCBI web page (retrieved on the 18th of April 2017), and the Gene Ontology definition (retrieved on the 14th of March 2017) was downloaded from the Gene Ontology web page.

We used the GO hierarchy to expand the set of GO terms associated with each instance (gene) to include all ancestors of those terms, and we eliminated GO terms that annotate fewer than 10 instances. After these steps, we ended up with a dataset of 17,684 genes (instances) and 7,489 GO terms (features). We also added to the dataset a numerical feature whose value is the total number of GO terms annotated for a gene.
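
The steps described above (ancestor expansion, removal of rare GO terms, and the extra count feature) can be illustrated with a minimal Python sketch. This is not the code used to build the dataset: the toy hierarchy, the toy annotations, and the names PARENTS, ANNOTATIONS, build_features and n_go_terms are all hypothetical. The sketch also assumes that the count feature is computed on the expanded term set before rare terms are removed, which the description above does not specify.

```python
from collections import Counter

# Hypothetical toy GO hierarchy: term -> set of direct parent terms.
PARENTS = {
    "GO:C": {"GO:B"},
    "GO:B": {"GO:A"},
    "GO:D": {"GO:A"},
    "GO:A": set(),
}

# Hypothetical direct annotations: gene -> set of GO terms.
ANNOTATIONS = {
    "gene1": {"GO:C"},
    "gene2": {"GO:D"},
    "gene3": {"GO:B", "GO:D"},
}


def ancestors(term, parents):
    """Return all ancestors of a term (the term itself is excluded)."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def expand(annotations, parents):
    """Add every ancestor of each annotated term to each gene's term set."""
    expanded = {}
    for gene, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[gene] = full
    return expanded


def build_features(annotations, parents, min_genes=10):
    """Build binary GO-term features plus a GO-term count per gene."""
    expanded = expand(annotations, parents)
    # Number of genes annotated with each term after expansion.
    counts = Counter(t for terms in expanded.values() for t in terms)
    # Keep terms that annotate at least min_genes genes.
    kept = sorted(t for t, n in counts.items() if n >= min_genes)
    rows = {}
    for gene, terms in expanded.items():
        row = {t: int(t in terms) for t in kept}
        # Assumption: count taken after expansion, before filtering.
        row["n_go_terms"] = len(terms)
        rows[gene] = row
    return kept, rows


# The toy data has only three genes, so a threshold of 2 is used here
# instead of the threshold of 10 used for the real dataset.
kept_terms, feature_rows = build_features(ANNOTATIONS, PARENTS, min_genes=2)
```

In this toy example, GO:C annotates only one gene after expansion and is therefore dropped, while GO:A, GO:B and GO:D are kept as binary features.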

For more details, please see the paper.