An Encoding Scheme Capturing Generic Priors and Properties of Amino Acids Improves Protein Classification
Zhou, Xinrui1; Yin, Rui1; Zheng, Jie2; Kwoh, Chee-Keong1
2019
Source Publication: IEEE ACCESS
ISSN: 2169-3536
Volume: 7; Pages: 7348-7356
Status: Published
DOI: 10.1109/ACCESS.2018.2890096
Abstract: Feature engineering aims to represent non-numeric data with numeric features that preserve the essential information of the underlying problem, and it is a non-trivial step in building a predictive model. In bioinformatics, a vast number of DNA and protein sequences are available but far from fully utilized. Computational models can facilitate the analysis of large-scale data; however, most of them require a numeric representation as input. Expert knowledge can help design features that cast the raw symbolic data effectively, but such features generally vary from case to case and have to be redesigned for each new problem. Automated feature engineering, i.e., an encoding scheme that automates the construction of features, avoids this redesign and allows researchers to try different representations with minimal effort. This is more in line with the explosion of data and the goal of building an intelligent system. In this paper, we introduce an encoding scheme for protein sequences, which encodes the representative sequence dataset into a numeric matrix that can be fed into a downstream learning model. The method, Context-Free Encoding Scheme (CFreeEnS), was previously proposed for a dataset with labels for pairwise sequences. Here, we improve the method by making it applicable to a batch of protein sequences, requiring no sequence alignment beforehand. The improved method is applied to protein classification at the functional level, including identifying antimicrobial peptides, screening tumor homing peptides, and detecting hemolytic peptides and phage virion proteins. Compared with traditional methods that use task-specific designed features, CFreeEnS improves prediction accuracy, with an increase ranging from 5.54% to 14.14%. The results indicate that the improved CFreeEnS, free from dependence on carefully designed features, is promising in capturing generic priors and essential properties of amino acids, thereby serving as an automated feature engineering method for protein sequences.
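As a loose illustration of what "a numeric matrix that can be fed into a downstream learning model" can look like, the sketch below uses plain amino-acid composition. This is a much simpler alignment-free encoding than CFreeEnS and is not the authors' method; the names `AMINO_ACIDS` and `composition_matrix` and the example peptides are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids in a fixed column order (an illustrative choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_matrix(sequences):
    """Return one row per sequence; each row holds 20 residue frequencies."""
    matrix = []
    for seq in sequences:
        counts = Counter(seq.upper())
        # Count only standard residues so ambiguous symbols (e.g. X) are ignored.
        total = sum(counts[aa] for aa in AMINO_ACIDS)
        row = [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]
        matrix.append(row)
    return matrix


# Hypothetical peptides of different lengths; no alignment is needed
# because every sequence maps to a fixed-length row.
peptides = ["ACDA", "KKLLKK", "GG"]
features = composition_matrix(peptides)
```

Each row has the same length regardless of sequence length, which is the property that lets a batch of unaligned sequences be stacked into a single matrix for a classifier.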
Keyword: Encoding scheme ; feature engineering ; information representation ; machine learning
Indexed By: SCI ; EI
Language: English
Funding Project: Singapore Ministry of Education[RG21/15] ; Singapore Ministry of Education[2015-T1-001-169-11]
WOS Research Area: Computer Science ; Engineering ; Telecommunications
WOS Subject: Computer Science, Information Systems ; Engineering, Electrical & Electronic ; Telecommunications
WOS ID: WOS:000457073000001
Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
EI Accession Number: 20190506450952
EI Keywords: Amino acids ; Computation theory ; Computational methods ; Diagnosis ; DNA sequences ; Encoding (symbols) ; Intelligent systems ; Learning systems ; Numerical models ; Peptides
EI Classification Number: Bioengineering and Biology:461 ; Information Theory and Signal Processing:716.1 ; Computer Theory, Includes Formal Logic, Automata Theory, Switching Theory, Programming Theory:721.1 ; Data Processing and Image Processing:723.2 ; Artificial Intelligence:723.4 ; Organic Compounds:804.1 ; Mathematics:921
WOS Keyword: PREDICTING ANTIGENIC VARIANTS ; INFLUENZA-VIRUS ; SEQUENCE ; REPRESENTATION ; FAMILIES
Original Document Type: Article
Citation statistics
Cited Times [WOS]: 0
Document Type: Journal article
Identifier: https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/29883
Collection: School of Information Science and Technology_PI Research Groups_Zheng Jie Group
School of Information Science and Technology
Corresponding Author: Kwoh, Chee-Keong
Affiliation: 1. Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore 639798, Singapore
2. ShanghaiTech Univ, Sch Informat Sci & Technol, Shanghai 201210, Peoples R China
Recommended Citation
GB/T 7714: Zhou, Xinrui, Yin, Rui, Zheng, Jie, et al. An Encoding Scheme Capturing Generic Priors and Properties of Amino Acids Improves Protein Classification[J]. IEEE ACCESS, 2019, 7: 7348-7356.
APA: Zhou, Xinrui, Yin, Rui, Zheng, Jie, & Kwoh, Chee-Keong. (2019). An Encoding Scheme Capturing Generic Priors and Properties of Amino Acids Improves Protein Classification. IEEE ACCESS, 7, 7348-7356.
MLA: Zhou, Xinrui, et al. "An Encoding Scheme Capturing Generic Priors and Properties of Amino Acids Improves Protein Classification". IEEE ACCESS 7 (2019): 7348-7356.
Files in This Item:
File Name/Size | DocType | Version | Access | License
2019_IEEEAccess_Zhou (945KB) | Journal article | Author accepted manuscript | Restricted access | CC BY-NC-SA
File name: 2019_IEEEAccess_Zhou_CFreeEnS.pdf
Format: Adobe PDF