The calculation of the Functional-And-Evolutionary Trait Heritability (FAETH) score is base on the genome-wide restricted maximum likelihood (GREML) analysis using multiple Genomic Relationship Matrix (GRM) made from different sets of sequence variants partitioned by external functionally and/or evolutionary data (https://www.biorxiv.org/content/10.1101/601658v2). The construction of GRM and GREML analysis used GCTA (https://cnsgenomics.com/software/gcta/#Overview) and/or MTG2 (https://sites.google.com/site/honglee0707/mtg2). These software are fully open-access and well-documented and thus, this tutorial focuses on the FAETH computation after heritability estimates are obtained for each GRM.
The demon data ‘faeth.demo.tar.gz’ can be downloaded at https://melbourne.figshare.com/s/f42b718e81e63dc488ac . In the unzipped folder of ‘faeth.demo’, there are four text files: demo.allgrm.h2.sum.txt, demo.grm.h2.sum.txt, demo.size.grm.txt and mtg2.grm01.tr01.hsq and a sub-folder ‘snplist/’. These data are needed to calculate FAETH score.
mtg2.grm01.tr01.hsq is an example of GREML output from MTG2 for grm01 (1st GRM) fitted with hd grm (2nd GRM) for trait 01 and its header looks like:
Ve 0.3436 0.0049
Va 0.0050 0.0023
Va 0.6150 0.0422
h2 0.0052 0.0024
h2 0.6382 0.0172
The bottom 2 lines of h2 are the heritability estimates for the first GRM and the HD GRM with their standard errors. 2 GRMs are fitted together for trait 01. Clearly in this case HD GRM sucked up all variants, suggesting that the functional /evolutionary GRM is not so important.
If you use approached described in Xiang et al 2019 to estimate FAETH score, you will have multiple files like this across different GRMs and different traits. You could use some simple shell-loop script to glue all results together, for example:
for grmNum in 01
do for trNum in 01
do paste -d ' ' <(paste -d ' ' <(echo grm${grmNum}) <(echo tr${trNum}) |cat - <(paste -d ' ' <(echo hd) <(echo tr${trNum}))) <(grep mtg2.grm${grmNum}.tr${trNum}.hsq -e 'h2'|awk '{print$2,$3}')
done
done
By looping through REML results across multiple traits and GRMs, you will get a summary file like demo.grm.h2.sum.txt:
hprs tr01 0.0144 0.0053
hd tr01 0.6135 0.0217
hprs tr02 0.0247 0.0063
hd tr02 0.5611 0.0234
hprs tr03 0.005 0.0038
hd tr03 0.4055 0.0268
This file contains summary of heritability for functional/evolutionary GRM that were fitted with the HD GRM. The 2nd column denotes the trait with which the 2 GRMs were fitted together. The 3rd and 4th columns are the heritability estimates and its standard errors, respectively. Data for 10 traits are included. Other functional/evolutionary GRMs fitted together with HD GRM in this demo example included: variants tagged by Chip-Seq peaks (ChipSeq), variants associated with concentrations of metabolites of bovine milk (mqtl), variants associated with RNA splicing (sqtl) and variants that appeared and/or are selected recently (young).
Note 1: for GRMs of hprs, ChipSeq, mqtl, sqtl and young, themselves do not cover the entire genome. For example, the genome can be partitioned into ‘sqtl’ and the ‘rest of qtl’. In order to calculate the FAETH score for all variants, we need to derive the heritability of the ‘the rest genome’ for each one of these GRMs. This derivation is based on the estimation of the heritability of ‘all variants’. Given that we are estimating additive genetic variance, we subtract the heritability of a targeted GRM (e.g., sqtl) from the heritability of ‘all variants’ (described in following) to derive the heritability of the rest variants.
Note 2: in the file ‘demo.grm.h2.sum.txt’ there are also GRMs labeled as: ‘UTR’, ‘intron’, ‘noncoding.related’, ‘splice’, ‘merged.coding’, ‘intergenic_variant’ and ‘geneend’. These 7 GRMs together make up the genome partition of ‘variant annotation’ and cover the whole genome. Therefore, they will not need the heritability of the ‘rest variants’. In the FAETH score calculation, a single vector for the score for the partition ‘annotation’ will be calculated and this single vector will have 7 unique values corresponding to the 7 GRMs made of 7 sets of variants that make up the annotation genome partition.
‘demo.allgrm.h2.sum.txt’ has the same format as ‘demo.grm.h2.sum.txt’, except that instead of functional/evolutionary GRM was fitted together with the HD GRM, all variants GRM (over 17 millions sequence variants in the study of Xiang et al 2019) were fitted with the HD GRM. The header of ‘demo.allgrm.h2.sum.txt’:
allsnp tr01 0.6225 0.0156
hd tr01 0 0.0076
allsnp tr02 0.7076 0.0146
hd tr02 0 0.0085
allsnp tr03 0.6873 0.0134
hd tr03 0 0.0068
In the sub-folder of faeth.demo there are lists of variants corresponding to each GRM in this demo. You may have used these lists to create GRMs with GCTA before heritability estimates were obtained. Here we need them to calculate per-variant heritability which is an essential part of the FAETH score. You could use a shell line to generated the GRM size file (already exist in the demo file) corresponding to each GRM:
wc -l snplist/*.snplist|awk '!/total/{print$2,$1}'|sed "s|snplist/||g" > demo.size.grm.txt
The header of ‘demo.size.grm.txt’:
ChipSeq.snplist 1166795
UTR.snplist 42350
allsnp.snplist 17669372
coding.related.snplist 105969
geneend.snplist 1007214
hprs.snplist 169773
Now we have finished preparing/describing the files, a R script is provided to calculate FAETH score for the demo data. This script will generate the FAETH score for 12 functional/evolutionary GRMs out of 6 genome partitions (see note 1 and 2) across 10 traits for over 17 millions sequence variants. It is recommended to run the job on an HPC. However, I have tested this pipeline on my laptop (DELL latitude E7270, 16G RAM) and it took several minutes. Therefore, I assume it should work for normal PCs.
Set the working directory to the directory of faeth.demo. Two libraries, ‘data.table’ and ‘reshape2’ will need to be loaded. Note that data.table is handy for process large tables.
#use your own path that contain the directory of faeth.demo
setwd("C:.../faeth.demo")
library(data.table)
library(reshape2)
dat <- read.table('demo.grm.h2.sum.txt',header=F)
dat.all <- read.table('demo.allgrm.h2.sum.txt',header=F)
size <- read.table('demo.size.grm.txt',header=F)
#organise data
dat1 <- merge(size,dat[-(grep('hd',dat[,1])),],by='V1')
colnames(dat1) <- c('grm','size','tr','h2','se')
dat2 <- merge(unique(dat1[,1:2]),data.table::dcast(dat1[,c(1,3,4)],grm~tr),by='grm')
colnames(dat2)[grep('tr',colnames(dat2))] <- paste0(colnames(dat2)[grep('tr',colnames(dat2))],'.h2')
> dat2[1:5,1:5]
grm size tr01.h2 tr02.h2 tr03.h2
1 ChipSeq 1166795 0.2170 0.1976 0.1190
2 coding.related 105969 0.0112 0.0130 0.0037
3 geneend 1007214 0.2144 0.1906 0.0934
4 hprs 169773 0.0144 0.0247 0.0050
5 intergenic_variant 11869145 0.2458 0.3254 0.2612
#define grms need rest grm
pt.list <- c('ChipSeq','hprs','mqtl','sqtl','young')
#define grms form a complete genome partition
anno.list <- c('UTR','coding.related','geneend','intergenic_variant','intron','noncoding.related','splice.site')
#make rest grm for for those ones that needed
dat3 <- dat2[dat2[,1] %in% pt.list,]
dat.all1 <- dcast(dat.all[!grepl('hd',dat.all[,1]),],V1~V2,value.var='V3')
colnames(dat.all1)[grep('tr',colnames(dat.all1))] <- paste0(colnames(dat.all1)[grep('tr',colnames(dat.all1))],'.h2')
dat.all2 <- merge(size,dat.all1,by='V1')
colnames(dat.all2)[1:2] <- c('grm','size')
#do a check
colnames(dat.all2)==colnames(dat3)
#process the heritability of 'rest variants' and combine data
dat4 <- do.call(rbind,apply(dat3[,-1], 1, function(x) dat.all2[,-1]-x))
dat5 <- cbind(dat3[,1],dat4)
colnames(dat5)[1] <- 'grm'
dat5[,1] <- paste0(dat5[,1],'.rest')
#combine all data
all.dat <- rbind(dat2,dat5)
#write.table(all.dat,'h2.raw.res',row.names=F,quote=F)
#calculate per-variant h2
all.dat1 <- all.dat[,-(1:2)]/all.dat$size
all.dat2 <- cbind(all.dat[,1,drop=F],all.dat1)
#calculate trait-wise average heritability
all.dat2$ave.h2 <- rowMeans(all.dat2[,-1],na.rm=T)
The last column of ‘all.dat2’ will be the per-variant heritability for each GRM:
> all.dat2[1:4,c(1:3,ncol(all.dat2))]
grm tr01.h2 tr02.h2 ave.h2
1 ChipSeq 1.859795e-07 1.693528e-07 7.782858e-08
2 coding.related 1.056913e-07 1.226774e-07 4.557940e-08
3 geneend 2.128644e-07 1.892349e-07 8.057871e-08
4 hprs 8.481914e-08 1.454884e-07 3.940558e-08
From here we will use the par-variant to calculate FAETT score with a R loop and data.table syntax:
listpath <- 'snplist/'
sllist <- list()
for (i in list.files(path=listpath,pattern='*.snplist'))
{sllist[[i]] <- fread(paste0(listpath,i),header=F)
}
faethd <- sllist[['allsnp.snplist']]
colnames(faethd)[1] <- 'SNP'
#do grms that need rest grm
for (grmtype in pt.list)
{faethd[,paste0(grmtype):=ifelse(SNP %in% unlist(sllist[[which(names(sllist)==paste0(grmtype,'.snplist'))]]$V1),all.dat2[all.dat2$grm==grmtype,]$ave.h2,all.dat2[all.dat2$grm==paste0(grmtype,'.rest'),]$ave.h2)]
}
setkey(faethd,SNP)
#do grms that form a complete partition
fdlist <- list()
#only 1 list as example is given, could have more than 1
for (fd in c('anno.list'))
{grmtypelist <- list()
for (grmtype in get(fd))
{grmtypelist[[grmtype]] <- cbind(sllist[[which(names(sllist)==paste0(grmtype,'.snplist'))]],all.dat2[all.dat2$grm==grmtype,]$ave.h2)
}
tmp <- do.call(rbind,grmtypelist)
colnames(tmp) <- c('SNP',gsub('.list','',fd))
setkey(tmp,SNP)
fdlist[[fd]] <- tmp
}
#merge all lists and calculate FAETH score (rwoMeans)
faethd1 <- Reduce(function(...) merge(...,all=T,by="SNP"),c(list(faethd),fdlist))
faethd1[,faeth.sc:=rowMeans(faethd1[,-1])]
Note that the calculations loop for GRMs that needed or not needed the heritability of ‘rest variants’ are different.
> faethd1[1:6,1:6]
SNP ChipSeq hprs mqtl sqtl young
Chr10:100000072 2.694246e-08 3.021441e-08 3.009566e-08 1.477679e-08 3.101628e-08
Chr10:100000080 2.694246e-08 3.021441e-08 3.009566e-08 1.477679e-08 3.101628e-08
Chr10:100000085 2.694246e-08 3.021441e-08 3.009566e-08 1.477679e-08 3.101628e-08
Chr10:100000144 2.694246e-08 3.021441e-08 3.009566e-08 1.477679e-08 3.101628e-08
Chr10:100000208 2.694246e-08 3.021441e-08 3.009566e-08 1.477679e-08 3.101628e-08
Chr10:100000212 2.694246e-08 3.021441e-08 3.009566e-08 1.477679e-08 3.101628e-08
#=====save calculated FAETH data=====
#write.table(faethd1,gzfile('demo.faeth.txt.gz'),row.names=F,quote=F)
To this end we have calculated the FAETH score for over 17M variants using heritability estimates from 12 GRMs (6 genome partitions) across 10 traits. The idea of FAETH score is to average the par-variant heritability of each variant across all functional categories. By doing this, each variant has a single vector of ranking for importance of functionality, evolution and trait heritability. This single vector variant-ranking can be easily used with other analysis, such as a biological prior for BayesRC (https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2443-6). With more categories of functional/evolutionary data being included, FAETH score will be more accurate.
Reference: Xiang, R. et al. Quantifying the contribution of sequence variants with regulatory and evolutionary significance to 34 bovine complex traits. Proceedings of the National Academy of Sciences 116, 19398-19408, doi:10.1073/pnas.1904159116 (2019).