Peak calling

Introduction

In this tutorial we will use MACS to call peaks.

How-to install MACS

MACS2 runs in python (v2.7, not 3). You will need to download and install python and numpy.

Using the Terminal, navigate to the folder containing the numpy.whl file and then install numpy by entering:

pip install numpy.whl

  • Download the tarfile for MACS, extract it (by double-clicking on Mac)

  • Navigate to the folder containing the extracted MACS files in Terminal and run:

python setup.py install

MACS2 parameters

-t: treatment file
-c: control file
--format: the input file format, e.g.. BAM
--gsize: the size of the genome (or chromosome)
--name: a name to append to the output files
--bw: ‘bandwidth’ the average size of the DNA fragments (after sonication)

You can find out more information about MACS2 parameters on the Github page, or by typing macs2 into the Terminal command line.

Using MACS2 to call peaks

To run MACS you will need to navigate to the folder containing your BAM alignment files. Using the cd command to change directory.

From here you need to call macs2 callpeak with the parameters you wish to use. If you have multiple replicates you can either call peaks with them both at the same time or separately. Peak calling works with and without a control sample.

macs2 callpeak -t treatment_rep1.bam treatment_rep2.bam -c control_rep1.bam control_rep2.bam --format BAM --gsize <genome_size> --name "rep1andrep2" --bw 400 --nomodel

In this example macs 2 call peak function is called on two replicate treatment files BAM files, with 2 control files. The format is specified as BAM, replace <genome_size> with the size of the genome/chromosome, for example: 2961149 for chromosome I of V.cholerae. The bandwidth is set to 400 and have told macs not to use its model function.

The function prints to the console as it is working and when it is complete you will see Done!. It produces an excel sheet (which is actually a .csv file), some .BED files, .narrowPEAK (which is tab-delimited list of peaks) and some logs.

They all contain the same information, but the excel sheet is probably the most user-friendly. BED files can be opened in some genome browsers.

I recommend moving the MACS output into a macs folder within your project structure, rather than keeping the output files with your alignments.

Using peak information into R

Why import peaks?

You can easily import all the information in the excel sheet into R. This can then be used for visualisation, finding the nearest genes, calculating the distance from peak centres to the nearest transcription start site etc..
You can use R to calculate the centre of the peak (based on the start and end, recognised by MACS) or if you wish to use a genome browser to manually select peak centres you could do this by amending the data in Excel.

This can then be used to extract the DNA sequence upstream/downstream of the peak centres in .fasta format to perform de-novo motif finding with MEME.

Reading peak data into R

Since the .xls file is actually a .csv file in disguise it can be easily imported into R. The file also contains ~20 lines of information before the peak data, these lines start with a # so we can tell R to ignore those lines.

peaks_dataFrame <- read.delim("/macs/rep1andrep2.xls",
  comment.char = "#")

Finding genes adjacent to peaks

To find the genes that are adjacent to the peaks we will use the ChIPseeker package.

Install ChIPseeker

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ChIPseeker", version = "3.8")

Create a TxDB object

A TxDB object contains genome annotation information that you will need to annotate peaks with genomic information. It is very simple to make, once you have done you can save and reload it whenever you wish to use it.

  • Download a GFF or GTF annotation file (from NCBI) and keep it in your genomes folder.
library(GenomicFeatures)
library(AnnotationDbi)
<genomeName>TxDB <- makeTxDbFromGFF("genomes/annotation<genomeName>.gff")
saveDb(<genomeName>TxDB, file="genomes/<genomeName>TxDB.satellite")
  • Loading the TxDB file (replacing , with a name for the genome you are using):
<genome>Annotation <- loadDb("genomes/<genomeName>TxDB.sqlite")

Annotate peaks

You will need to convert your data frame containing the peak data into a GRange object, this can easily be done with the makeGRangesFromDataFrame() function from the GenomicRanges package.

peaks_gr <- makeGRangesFromDataFrame(peaks_dataFrame, keep.extra.columns=TRUE)

By default the function removes any extra columns and only keeps chr, start and end. Setting the parameter keep.extra.columns to TRUE will prevent this.

Then you can use the annotatePeak() function. You will need to direct it to your GRange and to a TxDb object that contains the genome annotation for the species you are working with.

library(ChIPseeker)
annotatedPeaks <- annotatePeak(peaks_gr, TxDb = <genome>Annotation)

Making a data frame with your annotated data makes it more useful, and you can export it as .csv file to save and open later in R, Excel or in other software.

annotatedPeaksDF <- as.data.frame(annotatedPeaks)
write.csv(annotatedPeaksDF, file = "annotation/rep1andrep2.csv")
Tom Guest
Tom Guest
Postdoctoral researcher