Acta Entomologica Sinica (昆虫学报), 2020, Vol. 63, Issue (11): 1345-1357. doi: 10.16380/j.kcxb.2020.11.007

• Research Article •

基于蜜蜂球囊菌纳米孔测序数据的基因非翻译区延长、SSR位点发掘及未注释基因和转录本鉴定

杜宇1,#, 付中民1,#, 祝智威1, 王杰1, 冯睿蓉1, 王秀娜2,3, 蒋海宾1, 范元婵1, 范小雪1, 熊翠玲1, 郑燕珍1, 徐国钧1, 陈大福1, 郭睿1,*

  (1. 福建农林大学动物科学学院(蜂学学院), 福州 350002; 2. 福建农林大学生命科学学院, 福州 350002; 3. 福建农林大学, 福建省病原真菌与真菌毒素重点实验室, 福州 350002)
  • 出版日期:2020-11-20 发布日期:2020-12-08

Elongation of genic untranslated regions, exploration of SSR loci and identification of unannotated genes and transcripts based on the nanopore sequencing dataset of Ascosphaera apis

DU Yu1,#, FU Zhong-Min1,#, ZHU Zhi-Wei1, WANG Jie1, FENG Rui-Rong1, WANG Xiu-Na2,3, JIANG Hai-Bin1, FAN Yuan-Chan1, FAN Xiao-Xue1, XIONG Cui-Ling1, ZHENG Yan-Zhen1, XU Guo-Jun1, CHEN Da-Fu1, GUO Rui1,*

  (1. College of Animal Sciences (College of Bee Science), Fujian Agriculture and Forestry University, Fuzhou 350002, China; 2. College of Life Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002, China; 3. Key Laboratory of Pathogenic Fungi and Mycotoxins of Fujian Province, Fujian Agriculture and Forestry University, Fuzhou 350002, China)
  • Online: 2020-11-20 Published: 2020-12-08

摘要: 【目的】利用已获得的纳米孔长读段测序数据完善现有的蜜蜂球囊菌Ascosphaera apis参考基因组注释信息,并对未注释的新基因和新转录本进行鉴定和功能注释。【方法】基于已获得的纳米孔长读段测序数据,采用gffcompare软件将蜜蜂球囊菌全长转录本与参考基因组注释的转录本进行比较,进而对参考基因组注释基因的非翻译区(untranslated region, UTR)进行延长。利用TransDecoder软件对蜜蜂球囊菌基因的开放阅读框(open reading frame, ORF)及相应的氨基酸序列进行预测。通过MISA软件发掘长度在500 bp以上的全长转录本的SSR位点。通过Blast工具将鉴定到的新基因和新转录本比对Nr, KOG, eggNOG, Swiss-Prot, Pfam, GO和KEGG数据库进行功能注释。【结果】共对蜜蜂球囊菌的9 481个基因进行了UTR延长,其中5′UTR和3′UTR延长的基因分别有4 744和4 737个。共预测出10 492个完整ORF,其中编码长度分布在0~100和100~200个氨基酸的ORF最多,分别占ORF总数的38.96%和36.90%。共鉴定到5 286个SSR,其中单核苷酸重复、二核苷酸重复、三核苷酸重复、四核苷酸重复、五核苷酸重复和六核苷酸重复的SSR分别为1 870, 826, 2 398, 138, 43和11个。共鉴定到1 558个新基因,其中有1 556, 731, 330, 592, 1 177, 709和589个新基因可分别被注释到Nr, Swiss-Prot, Pfam, KOG, eggNOG, GO和KEGG数据库。此外,还鉴定到14 403条新转录本,其中有14 376, 8 524, 7 276, 7 405, 12 035, 7 891和6 855条新转录本可分别被注释到上述7个数据库。【结论】本研究利用已获得的纳米孔长读段测序数据对蜜蜂球囊菌的完整ORF进行了预测,对参考基因组的已注释基因进行了UTR延长,对未注释的SSR位点进行了发掘,此外还鉴定到大量未注释的新基因和新转录本,并对它们进行了功能注释。研究结果较好地完善了现有的蜜蜂球囊菌的基因组注释,为其组学和分子生物学研究的深入开展提供了基础。

关键词:  蜜蜂球囊菌, 长读段测序技术, 全长转录组, 基因组, 蜜蜂, 白垩病

Abstract: 【Aim】 This study aims to improve the annotation of the current reference genome of Ascosphaera apis using previously obtained nanopore long-read sequencing data, and to identify and functionally annotate unannotated novel genes and novel transcripts. 【Methods】 Based on the previously obtained nanopore long-read sequencing data, full-length transcripts of A. apis were compared with the transcripts annotated in the reference genome using gffcompare, and the untranslated regions (UTRs) of the annotated genes were extended accordingly. The open reading frames (ORFs) of A. apis genes and their corresponding amino acid sequences were predicted using TransDecoder. MISA was used to search for simple sequence repeat (SSR) loci in full-length transcripts longer than 500 bp. The novel genes and novel transcripts were aligned against the Nr, KOG, eggNOG, Swiss-Prot, Pfam, GO and KEGG databases with the Blast tool to obtain functional annotations. 【Results】 In total, the UTRs of 9 481 A. apis genes were extended, of which 4 744 and 4 737 genes were extended at the 5′UTR and 3′UTR, respectively. In addition, 10 492 complete ORFs were predicted; ORFs encoding proteins of 0-100 aa and 100-200 aa were the most abundant, accounting for 38.96% and 36.90% of all ORFs, respectively. A total of 5 286 SSRs were identified, comprising 1 870 mononucleotide, 826 dinucleotide, 2 398 trinucleotide, 138 tetranucleotide, 43 pentanucleotide and 11 hexanucleotide repeats. Furthermore, 1 558 novel genes were identified, of which 1 556, 731, 330, 592, 1 177, 709 and 589 were annotated in the Nr, Swiss-Prot, Pfam, KOG, eggNOG, GO and KEGG databases, respectively. Additionally, 14 403 novel transcripts were identified, of which 14 376, 8 524, 7 276, 7 405, 12 035, 7 891 and 6 855 were annotated in these seven databases, respectively. 【Conclusion】 Using the previously obtained nanopore long-read sequencing data, the complete ORFs of A. apis genes were predicted, the UTRs of annotated genes in the reference genome were extended, unannotated SSR loci were identified, and a large number of unannotated novel genes and novel transcripts were identified and functionally annotated. These findings improve the current genome annotation of A. apis and provide a basis for further omics and molecular biology studies of this fungus.

Key words: Ascosphaera apis, long-read sequencing technology, full-length transcriptome, genome, honeybee, chalkbrood
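
The SSR survey described in the Methods (perfect repeats of 1 to 6 nt motifs in transcripts longer than 500 bp) can be illustrated with a minimal Python sketch. This is not the MISA program used in the study. The function names (`is_primitive`, `find_ssrs`, `ssr_class_counts`) are hypothetical, and the minimum repeat counts in `MIN_REPEATS` are assumed MISA-style settings, since the abstract does not state the thresholds actually applied.

```python
from collections import Counter

# Assumed minimum repeat counts per motif length (motif length -> repeats).
# The abstract does not give the thresholds used in the study.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}
MIN_TRANSCRIPT_LEN = 500  # abstract: only transcripts longer than 500 bp


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit (e.g. 'ATAT')."""
    return motif not in (motif + motif)[1:-1]


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (motif, repeat_count, start, end) for perfect SSRs in seq.

    Coordinates are 0-based, end-exclusive. Hits for different motif lengths
    are reported independently; compound SSRs are not merged.
    """
    seq = seq.upper()
    hits = []
    for k, n_min in sorted(min_repeats.items()):
        i = 0
        while i + k <= len(seq):
            motif = seq[i:i + k]
            # Skip ambiguous bases and motifs that belong to a shorter class.
            if not set(motif) <= set("ACGT") or not is_primitive(motif):
                i += 1
                continue
            n = 1
            while seq[i + n * k:i + (n + 1) * k] == motif:
                n += 1
            if n >= n_min:
                hits.append((motif, n, i, i + n * k))
                i += n * k  # continue after the repeat tract
            else:
                i += 1
    return hits


def ssr_class_counts(transcripts, min_len=MIN_TRANSCRIPT_LEN):
    """Count SSRs per motif length over transcripts longer than min_len bp."""
    counts = Counter()
    for seq in transcripts.values():
        if len(seq) <= min_len:
            continue
        for motif, _, _, _ in find_ssrs(seq):
            counts[len(motif)] += 1
    return counts


if __name__ == "__main__":
    # Toy sequence, not data from the study.
    for hit in find_ssrs("GG" + "AT" * 7 + "C" + "TTG" * 5):
        print(hit)
```

Tallying hits by motif length, as `ssr_class_counts` does, gives the kind of per-class breakdown (mono- to hexanucleotide) reported in the Results. The actual counts depend on the thresholds and on how overlapping or compound repeats are handled.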