The GenomicRanges package in Bioconductor is designed for this type of operation. Read your data in with, e.g., read.delim so that
con <- textConnection("SNP BP
rs064 12292
rs319 345367
rs285 700042")
snps <- read.delim(con, head=TRUE, sep="")
con <- textConnection("Gene BP_start BP_end
E613 345344 363401
E92 694501 705408
E49 362370 368340") ## missing trailing digit on BP_end??
genes <- read.delim(con, head=TRUE, sep="")
then create ‘IRanges’ out of each
library(IRanges)
isnps <- with(snps, IRanges(BP, width=1, names=SNP))
igenes <- with(genes, IRanges(BP_start, BP_end, names=Gene))
(pay attention to coordinate systems, IRanges expects start and end to be included in the range; also, end >= start expect for 0-width ranges when end = start – 1). Then find the SNPs (‘query’ in IRanges terminology) that overlap the genes (‘subject’)
olaps <- findOverlaps(isnps, igenes)
two of the SNPs overlap
> queryHits(olaps)
[1] 2 3
and they overlap the first and second genes
> subjectHits(olaps)
[1] 1 2
If a query overlapped multiple genes, it would have been repeated in queryHits (and vice versa). You could then merge your data frames as
> cbind(snps[queryHits(olaps),], genes[subjectHits(olaps),])
SNP BP Gene BP_start BP_end
2 rs319 345367 E613 345344 363401
3 rs285 700042 E92 694501 705408
Usually genes and SNPs have chromosome and strand (‘+’, ‘-‘, or ‘*’ to indicate that strand isn’t important) information, and you’d want to do overlaps in the context of these; instead of creating ‘IRanges’ instances, you’d create ‘GRanges’ (genomic ranges) and the subsequent book-keeping would be taken care of for you
library(GenomicRanges)
isnps <-
with(snps, GRanges("chrA", IRanges(BP, width=1, names=SNP), "*")
igenes <-
with(genes, GRanges("chrA", IRanges(BP_start, BP_end, names=Gene), "+"))