Calculation of FAETH score

The FAETH score denotes Functional-And-Evolutionary Trait Heritability and it represents a ranking of genome-wide sequence variants for their functionality, evolutionary significance and contribution to heritability of 34 complexes trait in cattle. While originally developed using cattle data, the framework can be applied to other species. The calculation of the FAETH score is based on the genome-wide restricted maximum likelihood (GREML) analysis using multiple Genomic Relationship Matrix (GRM) made from different sets of sequence variants partitioned by external functionally and/or evolutionary data (https://www.biorxiv.org/content/10.1101/601658v2). The construction of GRM and GREML analysis used GCTA (https://cnsgenomics.com/software/gcta/#Overview) and/or MTG2 (https://sites.google.com/site/honglee0707/mtg2). These software are fully open-access and well-documented and thus, this tutorial focuses on the FAETH computation after heritability estimates are obtained for each GRM.

The demon data ‘faeth.demo.tar.gz‘ can be downloaded at https://melbourne.figshare.com/s/f42b718e81e63dc488ac. In the unzipped folder of ‘faeth.demo’, there are four text files: demo.allgrm.h2.sum.txt, demo.grm.h2.sum.txt, demo.size.grm.txt and mtg2.grm01.tr01.hsq and a sub-folder ‘snplist/‘. These data are needed to calculate FAETH score using the R script ‘faeth.demo.R‘ in the folder and will be demonstrated in what follows.

mtg2.grm01.tr01.hsq is an example of GREML output from MTG2 for grm01 (1st GRM) fitted with hd grm (2nd GRM) for trait 01 and its header looks like:

     Ve      0.3436      0.0049

     Va      0.0050      0.0023

     Va      0.6150      0.0422

     h2      0.0052      0.0024

     h2      0.6382      0.0172

The bottom 2 lines of h2 are the heritability estimates for the first GRM and the HD GRM with their standard errors. 2 GRMs are fitted together for trait 01. Clearly in this case HD GRM sucked up all the variance, suggesting that the functional /evolutionary GRM is not so important.

If you use approached described in Xiang et al 2019 to estimate FAETH score, you will have multiple files like this across different GRMs and different traits. You could use some simple shell-loop script to glue all results together, for example:

for grmNum in 01
#if multiple grms: for grmNum in {01..50}
do for trNum in 01
#if multiple traits: for trNum in {01..10}
do paste -d ' ' <(paste -d ' ' <(echo grm${grmNum}) <(echo tr${trNum}) |cat - <(paste -d ' ' <(echo hd) <(echo tr${trNum}))) <(grep mtg2.grm${grmNum}.tr${trNum}.hsq  -e 'h2'|awk '{print$2,$3}')
done
done

By looping through REML results across multiple traits and GRMs, you will get a summary file like demo.grm.h2.sum.txt:

hprs tr01 0.0144 0.0053
hd tr01 0.6135 0.0217
hprs tr02 0.0247 0.0063
hd tr02 0.5611 0.0234
hprs tr03 0.005 0.0038
hd tr03 0.4055 0.0268
...

This file contains summary of heritability for functional/evolutionary GRM that were fitted with the HD GRM. The 2nd column denotes the trait with which the 2 GRMs were fitted together. The 3rd and 4th columns are the heritability estimates and its standard errors, respectively. Data for 10 traits are included. Other functional/evolutionary GRMs fitted together with HD GRM in this demo example included: variants tagged by Chip-Seq peaks (ChipSeq), variants associated with concentrations of metabolites of bovine milk (mqtl), variants associated with RNA splicing (sqtl) and variants that appeared and/or are selected recently (young).

Note 1: for GRMs of hprs, ChipSeq, mqtl, sqtl and young (or any other GRMs that have only 1 category), themselves do not cover the entire genome. For example, the genome can be partitioned into ‘sqtl’ and the ‘rest of sqtl’. In order to calculate the FAETH score for all variants, we need to derive the heritability of the ‘the rest genome’ for each one of these GRMs. This derivation is based on the estimation of the heritability of ‘all variants’. Given that we are estimating additive genetic variance, we subtract the heritability of a targeted GRM (e.g., sqtl) from the heritability of ‘all variants’ (described in following) to derive the heritability of the rest variants.

Note 2: in the file ‘demo.grm.h2.sum.txt’ there are also GRMs labeled as: ‘UTR’, ‘intron’, ‘noncoding.related’, ‘splice’, ‘merged.coding’, ‘intergenic_variant’ and ‘geneend’. These 7 GRMs together make up the genome partition of ‘variant annotation’ and cover the whole genome. Therefore, for this type of GRMs, will not need the heritability of the ‘rest variants’. In the FAETH score calculation, a single vector for the score for the partition ‘annotation’ will be calculated and this single vector will have 7 unique values corresponding to the 7 GRMs made of 7 sets of variants that make up the annotation genome partition.

‘demo.allgrm.h2.sum.txt’ has the same format as ‘demo.grm.h2.sum.txt’, except that instead of functional/evolutionary GRM was fitted together with the HD GRM, all variants GRM (over 17 millions sequence variants in the study of Xiang et al 2019) were fitted with the HD GRM. The header of ‘demo.allgrm.h2.sum.txt’:

allsnp tr01 0.6225 0.0156
hd tr01 0 0.0076
allsnp tr02 0.7076 0.0146
hd tr02 0 0.0085
allsnp tr03 0.6873 0.0134
hd tr03 0 0.0068
...

In the sub-folder of faeth.demo there are lists of variants corresponding to each GRM in this demo. You may have used these lists to create GRMs with GCTA before heritability estimates were obtained. Here we need them to calculate per-variant heritability which is an essential part of the FAETH score. You could use a shell line to generated the GRM size file (already exist in the demo file) corresponding to each GRM:

wc -l snplist/*.snplist|awk '!/total/{print$2,$1}'|sed "s|snplist/||g" > demo.size.grm.txt

The header of ‘demo.size.grm.txt’:

ChipSeq.snplist 1166795
UTR.snplist 42350
allsnp.snplist 17669372
coding.related.snplist 105969
geneend.snplist 1007214
hprs.snplist 169773
...

Now we have finished preparing/describing the files, a R script is provided to calculate FAETH score for the demo data. This script will generate the FAETH score for 12 functional/evolutionary GRMs out of 6 genome partitions (see note 1 and 2) across 10 traits for over 17 millions sequence variants. It is recommended to run the job on an HPC. However, I have tested this pipeline on my laptop (DELL latitude E7270, 16G RAM) and it took several minutes. Therefore, I assume it should work for normal PCs.

Set the working directory to the directory of faeth.demo. Two libraries, ‘data.table’ and ‘reshape2’ will need to be loaded. Note that data.table is handy for process large tables.

#use your own path that contain the directory of faeth.demo
setwd("C:.../faeth.demo")
library(data.table)
library(reshape2)
dat <- read.table('demo.grm.h2.sum.txt',header=F)
dat.all <- read.table('demo.allgrm.h2.sum.txt',header=F)
size <- read.table('demo.size.grm.txt',header=F)
#organise data
dat1 <- merge(size,dat[-(grep('hd',dat[,1])),],by='V1')
colnames(dat1) <- c('grm','size','tr','h2','se')
dat2 <- merge(unique(dat1[,1:2]),data.table::dcast(dat1[,c(1,3,4)],grm~tr),by='grm')
colnames(dat2)[grep('tr',colnames(dat2))] <- paste0(colnames(dat2)[grep('tr',colnames(dat2))],'.h2')

> dat2[1:5,1:5]
                 grm     size tr01.h2 tr02.h2 tr03.h2
1            ChipSeq  1166795  0.2170  0.1976  0.1190
2     coding.related   105969  0.0112  0.0130  0.0037
3            geneend  1007214  0.2144  0.1906  0.0934
4               hprs   169773  0.0144  0.0247  0.0050
5 intergenic_variant 11869145  0.2458  0.3254  0.2612

#define grms need rest grm
pt.list <- c('ChipSeq','hprs','mqtl','sqtl','young')
#define grms form a complete genome partition
anno.list <- c('UTR','coding.related','geneend','intergenic_variant','intron','noncoding.related','splice.site')

#make rest grm for for those ones that needed
dat3 <- dat2[dat2[,1] %in% pt.list,]
dat.all1 <- dcast(dat.all[!grepl('hd',dat.all[,1]),],V1~V2,value.var='V3')
colnames(dat.all1)[grep('tr',colnames(dat.all1))] <- paste0(colnames(dat.all1)[grep('tr',colnames(dat.all1))],'.h2')
dat.all2 <- merge(size,dat.all1,by='V1')
colnames(dat.all2)[1:2] <- c('grm','size')
#do a check
colnames(dat.all2)==colnames(dat3)
#process the heritability of 'rest variants' and combine data
dat4 <- do.call(rbind,apply(dat3[,-1], 1, function(x) dat.all2[,-1]-x))
dat5 <- cbind(dat3[,1],dat4)
colnames(dat5)[1] <- 'grm'
dat5[,1] <- paste0(dat5[,1],'.rest')

#combine all data
all.dat <- rbind(dat2,dat5)
#write.table(all.dat,'h2.raw.res',row.names=F,quote=F)
#calculate per-variant h2
all.dat1 <- all.dat[,-(1:2)]/all.dat$size
all.dat2 <- cbind(all.dat[,1,drop=F],all.dat1)
#calculate trait-wise average heritability
all.dat2$ave.h2 <- rowMeans(all.dat2[,-1],na.rm=T)

The last column of ‘all.dat2’ will be the per-variant heritability for each GRM:

> all.dat2[1:4,c(1:3,ncol(all.dat2))]
             grm      tr01.h2      tr02.h2       ave.h2
1        ChipSeq 1.859795e-07 1.693528e-07 7.782858e-08
2 coding.related 1.056913e-07 1.226774e-07 4.557940e-08
3        geneend 2.128644e-07 1.892349e-07 8.057871e-08
4           hprs 8.481914e-08 1.454884e-07 3.940558e-08
...

From here we will use the par-variant to calculate FAETH score with a R loop and data.table syntax:

listpath <- 'snplist/'
sllist <- list()
for (i in list.files(path=listpath,pattern='*.snplist'))
{sllist[[i]] <- fread(paste0(listpath,i),header=F)
}
faethd <- sllist[['allsnp.snplist']]
colnames(faethd)[1] <- 'SNP'
#do grms that need rest grm
for (grmtype in pt.list)
{faethd[,paste0(grmtype):=ifelse(SNP %in% unlist(sllist[[which(names(sllist)==paste0(grmtype,'.snplist'))]]$V1),all.dat2[all.dat2$grm==grmtype,]$ave.h2,all.dat2[all.dat2$grm==paste0(grmtype,'.rest'),]$ave.h2)]
}
setkey(faethd,SNP)
#do grms that form a complete partition
fdlist <- list()
#only 1 list as example is given, could have more than 1
for (fd in c('anno.list'))
{grmtypelist <- list()
for (grmtype in get(fd))
{grmtypelist[[grmtype]] <- cbind(sllist[[which(names(sllist)==paste0(grmtype,'.snplist'))]],all.dat2[all.dat2$grm==grmtype,]$ave.h2)
}
tmp <- do.call(rbind,grmtypelist)
colnames(tmp) <- c('SNP',gsub('.list','',fd))
setkey(tmp,SNP)
fdlist[[fd]] <- tmp
}

#merge all lists and calculate FAETH score (rwoMeans)
faethd1 <-  Reduce(function(...) merge(...,all=T,by="SNP"),c(list(faethd),fdlist))
faethd1[,faeth.sc:=rowMeans(faethd1[,-1])]

Note that the calculations loop for GRMs that needed or not needed the heritability of ‘rest variants’ are different.

> faethd1[1:6,1:6]
SNP      ChipSeq         hprs         mqtl         sqtl        young
Chr10:100000072 2.694246e-08 3.021441e-08 3.009566e-08 1.477679e-08 3.101628e-08
Chr10:100000080 2.694246e-08 3.021441e-08 3.009566e-08 1.477679e-08 3.101628e-08
Chr10:100000085 2.694246e-08 3.021441e-08 3.009566e-08 1.477679e-08 3.101628e-08
Chr10:100000144 2.694246e-08 3.021441e-08 3.009566e-08 1.477679e-08 3.101628e-08
Chr10:100000208 2.694246e-08 3.021441e-08 3.009566e-08 1.477679e-08 3.101628e-08
Chr10:100000212 2.694246e-08 3.021441e-08 3.009566e-08 1.477679e-08 3.101628e-08
#=====save calculated FAETH data=====
#write.table(faethd1,gzfile('demo.faeth.txt.gz'),row.names=F,quote=F)

To this end we have calculated the FAETH score for over 17M variants using heritability estimates from 12 GRMs (6 genome partitions) across 10 traits. The idea of FAETH score is to average the par-variant heritability of each variant across all functional categories. By doing this, each variant has a single vector of ranking for importance of functionality, evolution and trait heritability. This single vector variant-ranking can be easily used with other analysis, such as a biological prior for BayesRC (https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2443-6). With more categories of functional/evolutionary data being included, FAETH score will be more accurate.

Related

Leave a comment Cancel reply