An Overview of ac4C-seq: Computational Tools and Databases

With the rise of epitranscriptomics, RNA modification has attracted wide attention as a key layer of gene expression regulation. Among these modifications, N4-acetylcytidine (ac4C) has become a research hotspot because of its roles in maintaining mRNA stability, regulating translation efficiency, and contributing to disease. Characterizing the dynamic modification patterns, functional mechanisms, and clinical significance of ac4C depends on the close integration of high-throughput sequencing technology and computational methods.

In recent years, the maturation of experimental technologies such as ac4C-seq has generated massive amounts of data and driven the rapid development of computational tools and databases. These span the storage and quality control of raw sequencing data, the accurate prediction of modification sites, and the integration and visualization of multi-omics data, together forming a complete analysis system that covers the data life cycle.

These resources not only give researchers technical support for processing ac4C data efficiently, but also provide a standardized framework for sharing and verifying data across laboratories. However, the field still faces challenges such as high data heterogeneity, limited generalization of prediction models, and a lack of tools for the integrated analysis of multiple modifications.

This article provides an overview of computational tools and databases for ac4C-seq and of community standards for data reporting, along with their roles, limitations, and future prospects.

Public ac4C-seq Datasets

Public ac4C-seq datasets are a core resource for building a transcriptome-wide picture of N4-acetylcytidine modification. With advances in high-throughput sequencing, platforms such as GEO and ENCODE have accumulated a large number of cross-species, multi-condition ac4C profiling datasets, which provide a basis for exploring the role of ac4C in gene regulation and disease. Integrating and standardizing these data is key to advancing epitranscriptomics research.

Main Data Resources

At present, public ac4C-seq datasets are mainly stored on well-known high-throughput sequencing data platforms, including GEO (Gene Expression Omnibus), ENCODE (Encyclopedia of DNA Elements), and SRA (Sequence Read Archive).

  • GEO is a public gene expression database maintained by the National Center for Biotechnology Information (NCBI). It holds a large volume of high-throughput sequencing data from different species, tissues, cell types, and experimental conditions, including ac4C-seq data. Researchers can search GEO for ac4C-seq datasets and obtain both the raw data and preprocessed expression data.
  • The ENCODE project aims to comprehensively characterize the functional elements of the human genome. Its database contains abundant epigenomic and transcriptomic data, and ac4C-seq data is included as part of this collection. Because ENCODE data pass strict quality control and standardization, they are highly reliable and comparable.
  • SRA is an NCBI database dedicated to storing raw high-throughput sequencing data. It covers almost all types of high-throughput sequencing data, including ac4C-seq data. Researchers can download raw reads from SRA and reanalyze them to obtain new findings.
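As a rough illustration of how such repositories can be queried programmatically, the sketch below builds search URLs for NCBI's E-utilities `esearch` endpoint, using the `gds` database for GEO DataSets and the `sra` database for SRA. The search terms and the helper name `build_esearch_url` are assumptions chosen for illustration; how a given ac4C-seq dataset is actually annotated varies by submitter, so real searches usually need several term variants. The code only constructs the URLs and does not contact the server.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_records=20):
    """Build an esearch URL for one NCBI database (hypothetical helper)."""
    params = {
        "db": database,         # "gds" = GEO DataSets, "sra" = Sequence Read Archive
        "term": term,           # free-text query; the terms below are assumptions
        "retmax": max_records,  # maximum number of record IDs to return
        "retmode": "json",      # ask for a JSON response
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


geo_url = build_esearch_url("gds", "ac4C-seq")
sra_url = build_esearch_url("sra", "ac4C-seq AND Homo sapiens[Organism]")
print(geo_url)
print(sra_url)
```

The returned URLs can be opened with `urllib.request.urlopen` or any HTTP client; the JSON response lists record identifiers that can then be resolved to accession numbers.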

The workflow of RMBase v3.0 (Xuan et al., 2024)

Quality Control Metrics

To ensure the quality of ac4C-seq data, certain quality control metrics should be checked when data are submitted and used. Common metrics include sequencing depth, mapping rate, base quality score, and duplication rate.

  • Sequencing depth is the ratio of the total number of sequenced bases to the size of the reference (genome or transcriptome). Sufficient depth is essential for detecting low-abundance ac4C modification sites, so a dataset whose depth is too low cannot support reliable site calls for weakly expressed transcripts.
  • Mapping rate is the proportion of sequencing reads that align successfully to the reference genome. A higher mapping rate indicates better specificity and less contamination. A very low mapping rate suggests a problem with the sequencing data that needs further inspection.
  • The base quality score reflects the probability that each base was called correctly during sequencing. The higher the score, the more accurate the base call. During data processing, reads with low base quality scores are usually filtered out to improve data quality.
  • Duplication rate is the proportion of duplicate reads in the sequencing data. An excessively high duplication rate may result from PCR over-amplification; it reduces the accuracy and reliability of the data and should be handled appropriately, for example by marking or removing duplicates.
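The four metrics above can be sketched in a few lines of Python. This is a minimal illustration rather than a replacement for dedicated QC software: it assumes Phred+33 quality encoding, uses the standard Phred relationship Q = -10 * log10(P_error), and approximates duplicates as reads with identical sequences, whereas production tools typically identify duplicates from alignment coordinates. All function names and the toy inputs are hypothetical.

```python
def sequencing_depth(total_bases, reference_size):
    """Average depth: total sequenced bases divided by reference size."""
    return total_bases / reference_size


def mapping_rate(mapped_reads, total_reads):
    """Fraction of reads that aligned to the reference."""
    return mapped_reads / total_reads


def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def mean_base_quality(quality_string, offset=33):
    """Mean Phred score of one FASTQ quality string (assumes Phred+33)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def duplication_rate(read_sequences):
    """Fraction of reads that repeat an already-seen sequence (simplified)."""
    return 1 - len(set(read_sequences)) / len(read_sequences)


# Toy inputs, invented purely for illustration.
depth = sequencing_depth(total_bases=3_000_000, reference_size=100_000)
rate = mapping_rate(mapped_reads=90, total_reads=100)
mean_q = mean_base_quality("IIII")
dup = duplication_rate(["ACGU", "ACGU", "GGCC", "AUAU"])
print(depth, rate, mean_q, dup, phred_to_error_prob(30))
```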

Data distribution in a public data repository by sequencing quality (Ohta et al., 2017)

Prediction Algorithms for ac4C Sites

ac4C is a key RNA modification, and identifying its sites is essential for functional analysis. Experimental sequencing is costly and limited in throughput, which makes it difficult to meet the needs of transcriptome-wide studies. Efficient ac4C site prediction algorithms are therefore needed. By mining sequence features and modification patterns, low-cost, high-throughput site prediction can be achieved, providing precise targets for subsequent functional validation and advancing epitranscriptomics research.

Traditional Machine Learning Models

Before deep learning models appeared, traditional machine learning models played an important role in ac4C site prediction. Common models include the support vector machine (SVM), random forest (RF), and logistic regression (LR). These models usually require features to be extracted from RNA sequences first, such as k-mer frequencies and physicochemical properties; the extracted features are then fed into the model for training and prediction.

  • SVM classifies samples by finding the optimal separating hyperplane and generalizes well.
  • RF consists of multiple decision trees and improves the accuracy and stability of prediction through ensemble learning.
  • LR estimates the probability of an event by passing a linear combination of the features through a logistic function. It is simple, easy to interpret, and computationally efficient.

However, traditional machine learning models have limitations when dealing with high-dimensional, complex biological data. Their feature extraction often relies on manual design, which may miss important information and thus reduce prediction performance.
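To make the feature-extraction-then-classification workflow concrete, the following sketch encodes RNA sequences as normalized k-mer frequencies and fits a small logistic regression by stochastic gradient descent, using only the Python standard library. It is a toy illustration, not any published ac4C predictor: the sequences and labels are invented, the hyperparameters are arbitrary, and a real study would use a curated benchmark and an established library.

```python
import math
from itertools import product

ALPHABET = "ACGU"


def kmer_features(seq, k=2):
    """Normalized k-mer frequencies in a fixed lexicographic order."""
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with N or other symbols
            counts[window] += 1
    total = sum(counts.values()) or 1
    return [counts[km] / total for km in kmers]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Logistic function applied to a linear combination of the features."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            weights = [w - lr * err * x for w, x in zip(weights, xi)]
            bias -= lr * err
    return weights, bias


# Invented toy data: label 1 = "modified", 0 = "unmodified" (not real sites).
train_seqs = ["CCGCCGCCG", "GCCGCCGCC", "AUAUAUAUA", "UAUAUAUAU"]
train_labels = [1, 1, 0, 0]
X = [kmer_features(s) for s in train_seqs]
weights, bias = train_logistic(X, train_labels)

p_pos = predict_proba(weights, bias, kmer_features("CCGCCG"))
p_neg = predict_proba(weights, bias, kmer_features("AUAUAU"))
print(round(p_pos, 3), round(p_neg, 3))
```

The same feature vectors could be passed to an SVM or random forest implementation; only the classifier changes, while the hand-designed encoding step stays the same.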

Deep Learning Models

In recent years, deep learning has been widely applied in bioinformatics, and notable progress has been made in predicting ac4C sites.

  • DeepAc4C is a deep-learning-based ac4C site prediction model. It uses the feature learning ability of a deep neural network to extract features from the primary and secondary structure of RNA sequences in order to predict ac4C sites accurately.
  • The DeepAc4C model consists of an input layer, multiple hidden layers, and an output layer. The input layer receives the encoded RNA sequence, the hidden layers extract and select features through convolution and pooling operations, and the output layer predicts whether a site is an ac4C modification site.
  • Compared with traditional machine learning models, DeepAc4C has stronger nonlinear fitting and generalization ability and can capture more complex feature relationships, which improves prediction accuracy.

Reported benchmark results indicate that DeepAc4C outperforms traditional machine learning models on multiple datasets and identifies ac4C modification sites more accurately, making it a powerful tool for ac4C research.
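The input-convolution-pooling-output pipeline described above can be illustrated with a deliberately tiny forward pass in plain Python. This is a sketch of the general technique only, not the DeepAc4C architecture: it uses a single hand-set filter instead of learned weights, and the motif "CCG", the output weight, and the output bias are all arbitrary assumptions.

```python
import math

ALPHABET = "ACGU"


def one_hot(seq):
    """Encode an RNA sequence as a list of 4-dimensional one-hot vectors."""
    seq = seq.upper().replace("T", "U")
    return [[1.0 if base == a else 0.0 for a in ALPHABET] for base in seq]


def conv1d_relu(x, kernel, bias=0.0):
    """Valid 1D convolution over sequence positions, followed by ReLU."""
    width = len(kernel)
    feature_map = []
    for i in range(len(x) - width + 1):
        s = bias + sum(
            x[i + j][c] * kernel[j][c]
            for j in range(width)
            for c in range(len(ALPHABET))
        )
        feature_map.append(max(0.0, s))
    return feature_map


def global_max_pool(feature_map):
    """Keep only the strongest filter response along the sequence."""
    return max(feature_map)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical filter: responds most strongly to windows equal to "CCG".
kernel = one_hot("CCG")
x = one_hot("AUCCGAU")

feature_map = conv1d_relu(x, kernel)
pooled = global_max_pool(feature_map)
# Output layer with hand-picked weight and bias (assumed, not learned).
probability = sigmoid(2.0 * pooled - 5.0)
print(feature_map, pooled, round(probability, 3))
```

In a trained network the filter values, the number of filters, and the output weights are learned from labeled sites, and several convolution and pooling layers are usually stacked.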

Predictive performance of quality features and machine learning (ML) (Albrecht et al., 2021)

Integrative Epitranscriptome Browsers

Integrative epitranscriptome browsers are core tools for studying RNA modification regulation as a whole. They combine multi-dimensional epitranscriptome data (such as ac4C, m6A, and other modification maps) with genome annotation, and they support locating modification sites, analyzing their correlations, and displaying dynamic changes through a visual interface. This gives researchers an intuitive and efficient platform for uncovering how RNA modifications act together and what roles they play in physiological and pathological processes.

UCSC Genome Track Integration

  • The UCSC Genome Browser is a widely used genome visualization tool that lets researchers view many kinds of annotation and sequencing data at the genome level. In recent years, ac4C-seq data have also been loaded into UCSC genome tracks, which makes it convenient to inspect ac4C modification sites visually.
  • With the UCSC Genome Browser, researchers can overlay ac4C-seq data with other epitranscriptome data (such as m6A-seq and Ψ-seq data) and with genome annotation (such as gene structure and promoter regions), and directly observe how ac4C modification sites are distributed across the genome and how they relate to other genomic features. This helps researchers better understand the biological function and regulatory mechanism of ac4C modification.
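One common way to display site-level data in a genome browser is to export it as a BED custom track. The sketch below converts a list of called sites into six-column BED text with a `track` header line. The site coordinates, scores, and the function name `sites_to_bed` are invented for illustration; the only format facts relied on are that BED start coordinates are 0-based, end coordinates are exclusive, and the score column is an integer from 0 to 1000.

```python
def sites_to_bed(sites, track_name="ac4C_sites"):
    """Render (chrom, 1-based position, strand, score) tuples as BED6 text."""
    lines = [f'track name="{track_name}" description="ac4C sites (illustrative)"']
    for chrom, position, strand, score in sorted(sites):
        start = position - 1  # BED start is 0-based
        end = position        # BED end is exclusive, so a 1-nt site spans [pos-1, pos)
        lines.append(f"{chrom}\t{start}\t{end}\tac4C\t{score}\t{strand}")
    return "\n".join(lines) + "\n"


# Invented example sites: (chromosome, 1-based position, strand, score 0-1000).
example_sites = [
    ("chr1", 1000, "+", 800),
    ("chr1", 500, "-", 650),
]
bed_text = sites_to_bed(example_sites)
print(bed_text)
```

The resulting text can be saved to a file and uploaded as a custom track, where it is drawn alongside gene annotation and any other tracks the user enables.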

Comparison with RMVar and MeT-DB

Beyond the UCSC Genome Browser, there are specialized epitranscriptome databases and browsers, such as RMVar and MeT-DB. They differ from UCSC genome tracks carrying ac4C-seq data in content, function, and application.

  • RMVar is a database focused on variants that affect RNA modifications. It contains variant information for various RNA modifications (including ac4C) and the relationships between these variants and diseases. Researchers can use RMVar to look up variants at specific RNA modification sites and analyze their effects on RNA structure and function.
  • MeT-DB is a comprehensive RNA methylation database. It mainly contains data on m6A and other RNA methylation modifications, including modification sites, modifying enzymes, and related genes, and it provides data query, visualization, and analysis functions.
  • Compared with RMVar and MeT-DB, UCSC genome tracks carrying ac4C-seq data emphasize genome-level visualization and support analysis of how ac4C modification sites relate to other genomic features, whereas RMVar and MeT-DB focus on detailed information and association analysis for specific types of RNA modification. Researchers can choose the tools and databases that fit their research needs.

The selected deep learning model (DL-ac4C) that achieved the best performance on the mRNA validation set (Iqbal et al., 2022)

Community Standards for ac4C Data Reporting

Community standards for ac4C data reporting are the cornerstone of research reproducibility and data integration. As ac4C-seq becomes more widely used, differences in data formats, quality control criteria, and metadata descriptions hinder comparison and mining across studies. Establishing unified norms (such as MIAME-style guidelines) that standardize experimental design, data processing, and the presentation of results is therefore important for advancing mechanistic research and the clinical translation of ac4C modification.

  • A. MIAME-style Guidelines
    • a) To ensure the reproducibility and comparability of ac4C-seq data, a unified data reporting standard is needed. MIAME (Minimum Information About a Microarray Experiment) is the minimum information reporting standard for microarray experiments and provides guidance for reporting microarray data. In epitranscriptomics, researchers have borrowed the MIAME approach and proposed MIAME-style guidelines for epitranscriptomic data, including ac4C-seq data.
    • b) The guidelines specify the minimum information to include when reporting ac4C-seq data, such as the experimental design, sample information, sequencing platform, data processing methods, quality control metrics, and how to access the raw and processed data. Following these guidelines allows other researchers to repeat experiments and verify results, and promotes exchange and collaboration in ac4C research. Specifically:
      • The experimental design section should describe the purpose, grouping, and treatment conditions of the experiment in detail.
      • Sample information includes the source, species, tissue type, cell line, culture conditions, etc.
      • The sequencing platform section should state the sequencing instruments and sequencing strategies used.
      • The data processing section should describe the analysis workflow and software tools used to go from raw data to final results.
      • Quality control metrics should include sequencing depth, mapping rate, base quality score, etc.
      • The raw and processed data should be deposited in accessible locations, with accession numbers from databases such as GEO and SRA.
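A checklist of this kind lends itself to automated validation before submission. The sketch below checks a metadata record against the reporting items listed above. The field names, the nested layout, and the placeholder values are assumptions made for illustration; they do not correspond to an official schema.

```python
# Hypothetical checklist derived from the reporting items listed above.
REQUIRED_FIELDS = {
    "experimental_design": ["purpose", "groups", "treatment_conditions"],
    "sample": ["source", "species", "tissue_or_cell_line", "culture_conditions"],
    "sequencing": ["instrument", "strategy"],
    "data_processing": ["pipeline_description", "software"],
    "quality_control": ["sequencing_depth", "mapping_rate", "base_quality"],
    "data_availability": ["raw_data_accession", "processed_data_accession"],
}


def missing_fields(record):
    """Return 'section.field' names that are absent or empty in the record."""
    missing = []
    for section, fields in REQUIRED_FIELDS.items():
        values = record.get(section, {})
        for field in fields:
            if values.get(field) in (None, "", []):
                missing.append(f"{section}.{field}")
    return missing


# Placeholder record; every value here is invented for the example.
example_record = {
    "experimental_design": {
        "purpose": "profile ac4C sites",
        "groups": ["control", "treated"],
        "treatment_conditions": "described in methods",
    },
    "sample": {
        "source": "cultured cells",
        "species": "Homo sapiens",
        "tissue_or_cell_line": "placeholder cell line",
    },
    "sequencing": {"instrument": "placeholder instrument", "strategy": "single-end"},
    "data_processing": {"pipeline_description": "see methods", "software": ["aligner"]},
    "quality_control": {"sequencing_depth": 30, "mapping_rate": 0.9, "base_quality": 35},
    "data_availability": {"raw_data_accession": "PLACEHOLDER"},
}

problems = missing_fields(example_record)
print(problems)
```

Running such a check as part of a submission workflow makes incomplete records visible early, before they reach a public repository.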

A performance comparison of various machine learning algorithms on independent test datasets (Jia et al., 2023)

Conclusion

In summary, public ac4C-seq datasets, ac4C site prediction algorithms, integrative epitranscriptome browsers, and community standards for ac4C data reporting all play important roles in ac4C research. Public data resources give researchers rich data sources, and quality control metrics help ensure that the data are reliable. Prediction algorithms help researchers identify ac4C modification sites quickly and accurately. Integrative epitranscriptome browsers make it easier to visualize and analyze data, and community standards promote data sharing and exchange.

However, these computational tools and databases still have shortcomings. For example:

  • Public ac4C-seq datasets remain limited in number and coverage, especially across species and disease models.
  • The performance of prediction algorithms needs further improvement, especially for complex RNA structures and for transcripts where multiple modifications coexist.
  • Integrative epitranscriptome browsers could offer more functions to meet researchers' increasingly diverse analysis needs.
  • Data reporting standards need to be applied more consistently to ensure data quality and reproducibility.

The schematic diagram of STM-ac4C (Yi et al., 2024)

As ac4C research deepens, we can expect more high-quality public ac4C-seq datasets, steadily improving prediction algorithms, more intelligent and customizable integrative epitranscriptome browsers, and more complete and unified data reporting standards. These advances will further drive ac4C research and provide stronger support for revealing the biological function of ac4C modification and its applications in disease diagnosis and treatment.

References

  1. Xuan J, Chen L, Chen Z, et al. "RMBase v3.0: decode the landscape, mechanisms and functions of RNA modifications." Nucleic Acids Res. 2024 52(D1): D273-D284.
  2. Ohta T, Nakazato T, Bono H. "Calculating the quality of public high-throughput sequencing data to obtain a suitable subset for reanalysis from the Sequence Read Archive." Gigascience. 2017 6(6): 1-8.
  3. Albrecht S, Sprang M, Andrade-Navarro MA, Fontaine JF. "seqQscorer: automated quality control of next-generation sequencing data using machine learning." Genome Biol. 2021 22(1): 75.
  4. Iqbal MS, Abbasi R, et al. "Recognition of mRNA N4 Acetylcytidine (ac4C) by Using Non-Deep vs. Deep Learning." Applied Sciences. 2022 12(3): 1344.
  5. Jia J, Cao X, Wei Z. "DLC-ac4C: A Prediction Model for N4-acetylcytidine Sites in Human mRNA Based on DenseNet and Bidirectional LSTM Methods." Curr Genomics. 2023 24(3): 171-186.
  6. Yi M, Zhou F, Deng Y. "STM-ac4C: a hybrid model for identification of N4-acetylcytidine (ac4C) in human mRNA based on selective kernel convolution, temporal convolutional network, and multi-head self-attention." Front Genet. 2024 15: 1408688.