Getting started with Bioconductor

Introduction

In this tutorial you will learn how to install and use bioconductor.

Bioconductor is an opensource collection of R packages that provides a framework for doing bioinformatics in R.

Install bioconductor with this:

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install()

Afterwards, installing bioconductor (BioC) packages is a little different from other R packages and makes use of the BiocManager::install() function. For example, to install QuasR:

BiocManager::install("QuasR", version = "3.8")

Bioconductor releases contain a number of R packages that have been designed to perform different tasks or provide the data structures required for interacting with genomic data.

Some examples of BioC packages

Function Example packages
Data structures IRanges, GenomicRanges, Biostrings, BSgenome
Input of data ShortRead, Rsamtools, GenomicAlignments, rtracklayer
Annotation GenomicFeatures, BSgenome, TxDb
Alignment Rbowtie, QuasR, Biostrings
ChIP-seq ChIPseeker, chipseq, ChIPseqR, ChIPpeakAnno, DiffBind, BayesPeak
De-novo motif discovery rGADEM, MotifDb, SeqLogo, ChIPpeakAnno
RNA-seq EdgeR

Packages in bold are ones that are used later.

Creating and using GRanges objects

One of the most useful data structures is GRanges. These are essentially a list of genomic intervals that could be anything from genes to transcription factor binding sites.

Creating simple GRanges

library(GenomicRanges)
## Loading required package: stats4
## Loading required package: BiocGenerics
## Warning: package 'BiocGenerics' was built under R version 4.0.5
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
## 
##     expand.grid
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Warning: package 'GenomeInfoDb' was built under R version 4.0.5

To start, here is a GRanges object which has 3 genes, all on chromosome 1. The first gene runs from position 1-3, and are 3 nucleotides long.

gr1 <- GRanges(seqnames = "chr1", strand = c("+", "-", "+"),
              ranges = IRanges(start = c(1,3,5), width = 3))
gr1
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1       1-3      +
##   [2]     chr1       3-5      -
##   [3]     chr1       5-7      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

To create a GRange for a single chromosome.

chrI <- GRanges(seqnames = "chrI",
              ranges = IRanges(start = 1, width = 3000000))
chrI
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chrI 1-3000000      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

To create a GRange with ChIP-seq peaks, all on the same chromosome you could assign the start and end co-ordinates to vectors and create a GRange from that.

peak_start <- c(100,220,450,767,899,1040)
peak_end <- c(140,260,490,800,945,1100)
peaks_gr <- GRanges(seqnames = "chrI",
                    ranges = IRanges(start=peak_start, end=peak_end))
peaks_gr
## GRanges object with 6 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chrI   100-140      *
##   [2]     chrI   220-260      *
##   [3]     chrI   450-490      *
##   [4]     chrI   767-800      *
##   [5]     chrI   899-945      *
##   [6]     chrI 1040-1100      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

GRanges from dataframes

It is more likely that you will want to create GRanges objects from other data structures. For example, if you import peaks from a BED file. Bioconductor packages provide import functions for different filetypes and then this can be coerced into a dataframe and easily converted into a GRanges object with the makeGRangesFromDataFrame function.

Tom Guest
Tom Guest
Postdoctoral researcher