Acta Entomologica Sinica ›› 2024, Vol. 67 ›› Issue (3): 346-357. doi: 10.16380/j.kcxb.2024.03.005

• Research Paper •

Improvement of the sequences and functional annotations of the Apis cerana reference genome with the nanopore long-read data of the gut transcriptome of larval A. cerana cerana workers

LI Kun-Ze1,#, SONG Yu-Xuan1,#, ZANG He1, JING Xin1, FAN Xiao-Xue1, CHEN Ying1, NA Zhi-Hao1, CHEN Da-Fu1,2,3, FU Zhong-Min1,2,3,*, GUO Rui1,2,3,*

  1. College of Bee Science and Biomedicine, Fujian Agriculture and Forestry University, Fuzhou 350002, China; 2. National & Local United Engineering Laboratory of Natural Biotoxin, Fuzhou 350002, China; 3. Apicultural Research Institute of Fujian Province, Fuzhou 350002, China
  • Online: 2024-03-20 Published: 2024-04-17

Abstract: 【Aim】 To improve the sequence and functional annotation of the Apis cerana reference genome by mapping previously obtained nanopore long-read transcriptome data of Apis cerana cerana to it, optimizing the structures of annotated genes, identifying and functionally annotating unannotated novel genes and novel transcripts, and predicting and verifying their SSR loci, complete ORFs, and transcription factor (TF) families and members. 【Methods】 Based on high-quality nanopore transcriptome sequencing data from the guts of 4-, 5- and 6-day-old larvae of A. cerana cerana workers inoculated with Ascosphaera apis, the identified full-length transcripts were mapped to the A. cerana reference genome with gffcompare to optimize the structures of annotated genes. Novel genes and novel transcripts not annotated in the reference genome were identified with gffcompare and then functionally annotated against the Nr, KOG, eggNOG, GO and KEGG databases. MISA, TransDecoder v3.0.0 and AnimalTFDB 2.0 were used to predict SSR loci, complete ORFs, and TF families and members, respectively. 【Results】 The structures of 4 648 annotated genes in the A. cerana reference genome were optimized: both the 5′UTR and the 3′UTR were extended for 1 336 genes, only the 5′UTR for 1 688 genes, and only the 3′UTR for 1 624 genes. A total of 2 148 novel genes were identified, of which 818, 298, 587, 359 and 333 could be annotated to the Nr, KOG, eggNOG, GO and KEGG databases, respectively. A total of 35 432 novel transcripts were identified, of which 30 974, 21 222, 29 025, 19 852 and 9 214 could be annotated to the same five databases, respectively. A total of 22 541 SSR loci were detected, among which mono-, di-, tri- and hexanucleotide repeats numbered 12 078, 7 140, 2 825 and 43, respectively, and mixed SSRs numbered 2 964; the most frequent type was the mononucleotide repeat (153.37/Mb). In addition, 58 TF families with 1 611 members were predicted. A total of 28 775 complete ORFs were predicted, of which those encoding 100 to 200 aa were the most abundant (38.99%). 【Conclusion】 These results optimize the structures of annotated genes in the A. cerana reference genome and supplement it with novel genes, novel transcripts, SSRs, complete ORFs and TFs that were previously unannotated.
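To make the SSR figures above more concrete, the following minimal Python sketch illustrates how perfect mono- to hexanucleotide repeats can be located in a transcript sequence and how a per-Mb frequency is derived from a count. It is not the authors' pipeline and does not reproduce MISA itself: the function names `find_ssrs` and `ssr_density_per_mb` are hypothetical, the minimum repeat numbers in `MIN_REPEATS` are assumed values (the abstract does not state the thresholds used), and mixed SSRs are not handled.

```python
import re

# Assumed thresholds (not given in the abstract): motif length -> minimum repeat count.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, n_repeats) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # One motif copy followed by at least (min_rep - 1) further identical copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT");
            # those runs belong to the shorter motif class.
            reducible = any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            )
            if reducible:
                continue
            n_repeats = (m.end() - m.start()) // unit_len
            hits.append((m.start(), m.end(), motif, n_repeats))
    return sorted(hits)


def ssr_density_per_mb(n_ssrs, total_bp):
    """SSR frequency expressed as loci per megabase of sequence searched."""
    return n_ssrs * 1e6 / total_bp


if __name__ == "__main__":
    # Toy sequence: an A run, an AT run and a CAG run separated by short flanks.
    toy = "GGC" + "A" * 12 + "CGC" + "AT" * 7 + "GCC" + "CAG" * 5 + "T"
    for start, end, motif, n in find_ssrs(toy):
        print(start, end, motif, n)
```

Coordinates are 0-based and half-open. A frequency such as the 153.37/Mb reported for mononucleotide repeats corresponds to the number of loci of that type divided by the total length of the searched sequences in megabases.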

Key words: Apis cerana, A. cerana cerana, 3rd-generation sequencing technology, nanopore sequencing, full-length transcript, transcriptome, genome