An Encoding Scheme Capturing Generic Priors and Properties of Amino Acids Improves Protein Classification
Xinrui Zhou1; Rui Yin1; Jie Zheng2; Chee-Keong Kwoh1
2019
Journal: IEEE ACCESS
ISSN: 2169-3536
Volume: 7; Pages: 7348-7356
Publication status: Published
DOI: 10.1109/ACCESS.2018.2890096
Abstract: Feature engineering aims to represent non-numeric data with numeric features that retain the essential information of the underlying problem, and it is a non-trivial step in building a predictive model. In bioinformatics, a vast number of DNA and protein sequences are available, but they are far from fully utilized. Computational models can facilitate the analysis of large-scale data; however, most of them require a numeric representation as input. Expert knowledge can help design features that cast the raw symbolic data effectively, but such features generally vary from case to case and have to be redesigned for each problem. Automated feature engineering, i.e., an encoding scheme that automates the construction of features, avoids this redesign and allows researchers to try different representations with minimal effort. This is more in line with the explosion of data and the goal of building an intelligent system. In this paper, we introduce an encoding scheme for protein sequences, which encodes a representative sequence dataset into a numeric matrix that can be fed into a downstream learning model. The method, Context-Free Encoding Scheme (CFreeEnS), was proposed for a dataset with labels for pairwise sequences. Here, we improve the method by making it applicable to a batch of protein sequences, requiring no sequence alignment beforehand. The improved method is applied to protein classification at the functional level, including identifying antimicrobial peptides, screening tumor homing peptides, and detecting hemolytic peptides and phage virion proteins. Compared with traditional methods that use task-specific designed features, CFreeEnS improves prediction accuracy, with an increase ranging from 5.54% to 14.14%. The results indicate that the improved CFreeEnS, free from dependence on carefully designed features, is promising in capturing generic priors and essential properties of amino acids, thereby serving as an automated feature engineering method for protein sequences.
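The abstract does not describe how CFreeEnS constructs its features, so the sketch below is not the authors' method. It only illustrates the general idea stated above: turning a batch of unaligned protein sequences of different lengths into a fixed-width numeric matrix for a downstream learner. The feature choice (amino acid composition plus mean Kyte-Doolittle hydropathy) and the names `encode_sequence` and `encode_batch` are assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code; this fixes the column order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index, used here only as an example property.
# (Illustrative choice; the abstract does not say which properties CFreeEnS uses.)
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_sequence(seq):
    """Encode one sequence as 20 composition fractions plus mean hydropathy."""
    # Keep only standard residues; ambiguous codes such as X or B are skipped.
    residues = [aa for aa in seq.upper() if aa in HYDROPATHY]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    n = len(residues)
    composition = [counts[aa] / n for aa in AMINO_ACIDS]
    mean_hydropathy = sum(HYDROPATHY[aa] for aa in residues) / n
    return composition + [mean_hydropathy]


def encode_batch(sequences):
    """Encode sequences of any length into rows of equal width (21 columns)."""
    return [encode_sequence(seq) for seq in sequences]


# Toy peptides of different lengths; no alignment is needed.
matrix = encode_batch(["ACDKK", "GGAVLLIK"])
```

Each row of `matrix` has the same width regardless of sequence length, so it could be passed directly to a standard classifier. A scheme like the one in the paper would presumably capture richer information than this composition-based stand-in.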
Keywords: Encoding scheme ; feature engineering ; information representation ; machine learning
Indexed by: SCI ; SCIE ; EI
Language: English
Funding: Singapore Ministry of Education[RG21/15] ; Singapore Ministry of Education[2015-T1-001-169-11]
WOS research areas: Computer Science ; Engineering ; Telecommunications
WOS categories: Computer Science, Information Systems ; Engineering, Electrical & Electronic ; Telecommunications
WOS accession number: WOS:000457073000001
Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
EI accession number: 20190506450952
EI main headings: Amino acids ; Computation theory ; Computational methods ; Diagnosis ; DNA sequences ; Encoding (symbols) ; Intelligent systems ; Learning systems ; Numerical models ; Peptides
EI classification codes: Bioengineering and Biology:461 ; Information Theory and Signal Processing:716.1 ; Computer Theory, Includes Formal Logic, Automata Theory, Switching Theory, Programming Theory:721.1 ; Data Processing and Image Processing:723.2 ; Artificial Intelligence:723.4 ; Organic Compounds:804.1 ; Mathematics:921
WOS keywords: PREDICTING ANTIGENIC VARIANTS ; INFLUENZA-VIRUS ; SEQUENCE ; REPRESENTATION ; FAMILIES
Original document type: Article
Source database: IEEE
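Document type: Journal article
Item identifier: https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/29883
Collections: School of Information Science and Technology
School of Information Science and Technology_PI Research Groups_Jie Zheng Group
Author affiliations: 1. School of Computer Science and Engineering, Nanyang Technological University, Singapore
2. School of Information Science and Technology, ShanghaiTech University, Shanghai, China
Recommended citation: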
GB/T 7714
Xinrui Zhou,Rui Yin,Jie Zheng,et al. An Encoding Scheme Capturing Generic Priors and Properties of Amino Acids Improves Protein Classification[J]. IEEE ACCESS,2019,7:7348-7356.
APA Xinrui Zhou,Rui Yin,Jie Zheng,&Chee-Keong Kwoh.(2019).An Encoding Scheme Capturing Generic Priors and Properties of Amino Acids Improves Protein Classification.IEEE ACCESS,7,7348-7356.
MLA Xinrui Zhou,et al."An Encoding Scheme Capturing Generic Priors and Properties of Amino Acids Improves Protein Classification".IEEE ACCESS 7(2019):7348-7356.