R语言实现对基因组SNV进行注释
2024-10-19 03:38:02
很多时候,我们需要对取出的SNV进行注释,这个时候可能会在R上进行注释,通常注释文件都含有Chr(染色体)、Start(开始位点)、End(结束位点)、Description(描述),而我们的SNV文件通常是拥有Position(位置),因此我们可以先定位Chr,再用Postion去定位到Start和End之间,找到相对应的Description。为了加快速度,可以使用二分查找法。
for (value in dt$value){
#df:data.frame, V1 and V2 should be Start and End value: Postition used to find region return:df row number where position locates ,if no region return -
low=
high=nrow(df)
mid=high %/%
if (df[low,] <= value & value <= df[low,]) low
else if (df[high,] <= value & value <= df[high,]) high
else{
while (value > df[mid,] || value < df[mid,]){
if (value > df[mid,]){
low = mid+
} else if (value < df[mid,]) {
high = mid -
}
if(high<low){
mid=-;break
}
mid=(low+high)%/%
}
mid
}
}
在R中使用for循环效率低,因此也可以用data.table包的foverlap函数,改进代码如下,对bed文件进行注释,如果要对snv进行注释,只需要将snv改成相应的start和end相等的bed文件即可。
#!/bin/Rscript library(data.table) arg <- commandArgs(T)
if (length(arg) != ) {
message("[usage]: BedAnnoGene.R bedfile gtffile outputfile")
message(" bedfile format: chr start end information(Arbitrary but can not be lacked)")
message(" GTFfile: gtf file downloaded from GENCODE")
message(" outputfile: file to be writen out")
message(" needed package: data.table 1.10.4")
stop("Please check your arguments!")
} bedfile <- arg[]
annofile <- arg[]
outfile <- arg[] #read file
anno <- fread(annofile,sep="\t",header=F)
bed <- fread(bedfile,sep="\t",header=F)
setnames(anno,c("V1","V2","V3","V4","V5","V9"),c("Chr","Gene","Type","Start","End","Info"))
anno <- anno[Type=="gene",.(Chr,Start,End,Gene=sapply(strsplit(tstrsplit(Info,";")[][[]],"\""),function(x)x[]))]
setkey(anno,Chr,Start,End)
setkey(bed,V1,V2,V3) #find overlaps by Chr
lst <- list()
for (ChrI in intersect(unique(bed$V1),unique(anno$Chr))){
anno_reg <- anno[Chr == ChrI,.(Start,End)]
bed_reg <- bed[V1 == ChrI,.(V2,V3)]
setkey(anno_reg,Start,End)
setkey(bed_reg,V2,V3)
overl <- foverlaps(bed_reg,anno_reg,which=TRUE,nomatch = )
if (nrow(overl) > ){
lst[[ChrI]] <- data.table(Chr=ChrI,bed[V1 == ChrI,][overl[["xid"]],.(V2,V3,V4)],anno[Chr == ChrI][overl[["yid"]],.(Gene)])
}
}
merge_dt <- rbindlist(lst)
setnames(merge_dt,c("V2","V3","V4"),c("Start","End","Name")) #if one region has more than one gene
torm <- list()
for (i in :(nrow(merge_dt)-)){if(merge_dt[i,"Name"]==merge_dt[i+,"Name"]){set(merge_dt,i+1L,ncol(merge_dt),paste(merge_dt[i,"Gene"],merge_dt[i+,"Gene"],sep=";"));torm <- c(torm,list(i))}}
torm <- unlist(torm)
merge_dt <- merge_dt[-torm,] fwrite(merge_dt,file=outfile)
使用帮助可以在我github看到 https://github.com/yiliu4234/BedAnnoGene
最新文章
- java基础知识(三)java关键字
- 推荐一个Android Studio很实用的插件android-butterknife-zelezny
- Entity Framework浅析
- android 虚线
- c语言中各个类型的sizeof长度
- JAVA 99乘法表实例
- ExtJS 提示
- 求一个全排列函数: 如p([1,2,3])输出:[123],[132],[213],[231],[312],[321]. 求一个组合函数 如p([1,2,3])输出:[1],[2],[3],[1,2],[2,3],[1,3],[1,2,3]
- 2014--9=17 软工二班 MyEclipse blue==3
- stm32通用定时器中断问题
- XML新手入门 创建构造良好的XML(2)
- ContentProvider的一些总结
- 【转】如何在CentOS/RHEL中安装基于Web的监控系统 linux-das
- 使用gradle打包jar包
- Python爬虫入门教程 11-100 行行网电子书多线程爬取
- josn的格式化
- POJ 1236 Network Of Schools 【Targan】+【缩点】
- 《企业IT架构转型之道》读书笔记
- Servlet案例6:显示用户的上次访问时间
- java动手动脑3