



1、BioSunMS: a plug-in-based software for the management of patients information and the analysis of peptide profiles from mass spectrometry

Yuan Cao, Na Wang, Xiaomin Ying, Ailing Li, Hengsha Wang, Xuemin Zhang and Wuju Li

BMC Medical Informatics and Decision Making 2009, 9:13. doi:10.1186/1472-6947-9-13

With the wide application of matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) and surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS), statistical comparison of serum peptide profiles and management of patient information play an important role in clinical studies such as early diagnosis, personalized medicine, and biomarker discovery. However, currently available software tools focus mainly on data analysis rather than providing a flexible platform for both patient information management and mass spectrometry (MS) data analysis.
Here we present a plug-in-based software package, BioSunMS, for both the management of patient information and statistical analysis based on serum peptide profiles. By integrating all functions into a user-friendly desktop application, BioSunMS provides a comprehensive solution for clinical researchers with no programming knowledge, as well as a plug-in architecture that lets developers add or modify functions without recompiling the entire application.
BioSunMS provides a plug-in-based solution for managing, analyzing, and sharing high volumes of MALDI-TOF or SELDI-TOF MS data. The software is freely distributed under the GNU General Public License (GPL) and can be downloaded from http://sourceforge.net/projects/biosunms/

Full Text Download:


2、sRNATarget: a web server for prediction of bacterial sRNA targets

Yuan Cao, Yalin Zhao, Lei Cha, Xiaomin Ying, Ligui Wang, Ningsheng Shao, Wuju Li*

Bioinformation 3(8): 364-366 (2009)

Abstract: In bacteria, there are small non-coding RNAs (sRNAs) 40–500 nucleotides in length. Most of them regulate gene expression post-transcriptionally by binding to their target mRNAs, with the Hfq protein acting as an RNA chaperone. As more sRNA genes are identified in bacteria, prediction of sRNA targets plays an increasingly important role in determining sRNA functions. However, few computational tools for predicting sRNA targets are currently available. Here we introduce a web server, sRNATarget, for genome-scale prediction of bacterial sRNA targets. The server is based on a recently published model that uses the Naive Bayes method as the supervised learning method and takes the RNA secondary structure profile as the feature. Prediction results are returned to users by e-mail.

Full Text Download:


3、BatchGenAna: a batch platform for large-scale genomic analysis of mammalian small RNAs

Xiaomin Ying, You Jung Kim, Yiqing Mao, Ming Liu, Yanyan Hou, Hua Li, Xiaolei Wang, Yalin Zhao, Dongsheng Zhao, Jignesh M. Patel, Wuju Li*

Bioinformation 3(8): 346-348 (2009)

Abstract: An increasing number of small RNAs have been discovered in mammals. However, their primary transcripts and upstream regulatory networks remain largely undetermined. Genomic analysis of small RNAs facilitates identification of their primary transcripts and hence contributes to research on their upstream regulatory networks. Here we report a batch platform, BatchGenAna, specifically designed for large-scale genomic analysis of mammalian small RNAs. It can map and annotate as many as 1000 small RNAs or 10,000 genomic loci of small RNAs at a time. It reports genomic features including RefSeq genes, mRNAs, ESTs, and repeat elements as tabular and graphical results. It also allows extraction of the flanking sequences of submitted queries, specified genomic regions, and host transcripts, which facilitates subsequent analyses such as scanning for transcription factor binding sites in upstream sequences and poly(A) signals in downstream sequences. Beyond small RNAs, BatchGenAna can also be applied to other research fields, e.g. in silico analysis of the target genes of transcription factors.
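The flanking-sequence extraction described in the abstract can be illustrated with a short sketch. This is not BatchGenAna's code: the function names, the 0-based half-open coordinate convention, and the toy sequence are all assumptions made for the example.

```python
# Hypothetical helpers; coordinates are 0-based, half-open, on the forward strand.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def flanking_sequences(chrom_seq: str, start: int, end: int, strand: str, flank: int):
    """Return (upstream, downstream) flanks of the locus chrom_seq[start:end]."""
    left = chrom_seq[max(0, start - flank):start]
    right = chrom_seq[end:end + flank]
    if strand == "+":
        return left, right
    # On the minus strand, upstream lies to the right on the forward strand.
    return reverse_complement(right), reverse_complement(left)


toy_chromosome = "AAAACCCCGGGGTTTT"  # made-up sequence for illustration only
print(flanking_sequences(toy_chromosome, 4, 8, "+", 3))
print(flanking_sequences(toy_chromosome, 4, 8, "-", 3))
```

The upstream flank is what one would then scan for transcription factor binding sites, and the downstream flank for poly(A) signals.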

Full Text Download:





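Abstract: Bacterial sRNAs are a class of non-coding RNAs 40–500 nt in length that exert their biological functions mainly by interacting with the 5′ end of target mRNAs through imperfect base pairing. Because prediction methods can guide the experimental discovery of bacterial sRNAs and their targets, research on predicting bacterial sRNAs and their targets has received wide attention. This article first divides sRNA prediction methods into three categories: methods based on comparative genomics, methods based on transcription units, and methods based on machine learning. It then divides sRNA target prediction methods into two categories: sequence comparison methods and methods based on RNA secondary structure. Finally, it analyzes the principles, core ideas, advantages, and limitations of each category and discusses directions for further development.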

Full Text Download:






Full Text Download:




1.Construction of two mathematical models for prediction of bacterial sRNA targets

Yalin Zhao, Hua Li, Yanyan Hou, Lei Cha, Yuan Cao, Ligui Wang, Xiaomin Ying, Wuju Li

Biochemical and Biophysical Research Communications, 2008, 372:346–350


Abstract: Accurate prediction of sRNA targets plays a key role in determining sRNA functions. Here we introduce two mathematical models, sRNATargetNB and sRNATargetSVM, for prediction of sRNA targets using the Naïve Bayes method and support vector machines (SVM), respectively. The training dataset was composed of 46 positive samples (real sRNA–target interactions) and 86 negative samples (no interaction between sRNA and targets). The leave-one-out cross-validation (LOOCV) classification accuracy was 91.67% for sRNATargetNB and 100.00% for sRNATargetSVM. To evaluate the performance of the models, an independent test dataset was used, which contained 22 positive samples and 1700 randomly generated negative samples. The results showed that the classification accuracy, sensitivity, and specificity were 93.03%, 40.90%, and 93.71% for sRNATargetNB and 80.55%, 72.73%, and 80.65% for sRNATargetSVM, respectively. Therefore, the presented models provide support for experimental identification of sRNA targets. The related software and supplementary materials can be downloaded from the webpage www.biosun.org.cn/srnatarget/.


Full Text Download:




2.Identification and verification of microRNA in wheat


Weibo Jin, Nannan Li, Bin Zhang, Fangli Wu, Wuju Li, Aiguang Guo, Zhiyong Deng

J Plant Res, 2008, 121:351–355


Abstract: MicroRNAs (miRNAs) are small, endogenous RNAs that regulate gene expression in both plants and animals. A large number of miRNAs have been identified from various animals and model plant species such as Arabidopsis thaliana and rice (Oryza sativa); however, the characteristics of wheat (Triticum aestivum) miRNAs are poorly understood. Here, computational identification of miRNAs from wheat EST sequences was performed using the in-house program GenomicSVM, a prediction model for miRNAs. This study resulted in the discovery of 79 miRNA candidates. Nine out of 22 miRNA representatives randomly selected from the 79 candidates were experimentally validated by Northern blotting, indicating that the prediction accuracy is about 40%. For the 9 validated miRNAs, 59 wheat ESTs were predicted as their putative targets.


Full Text Download:









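[Abstract] Objective: To provide bioinformatics support for the experimental identification of sRNA targets and the study of sRNA functions. Methods: First, 132 experimentally verified sRNA–target interaction records were used as the training set, comprising 46 positive and 86 negative samples. Second, 22 experimentally verified positive samples and 1700 randomly generated negative samples were used as the test set. Finally, a mathematical model for sRNA target prediction was built with the support vector machine method, using features such as the RNA secondary structure profile as variables. Results and conclusion: The sensitivity and specificity of the model on the training set were both 100%; on the test set they were 72.73% and 80.65%, respectively. The model provides bioinformatics support for the experimental discovery of sRNA targets.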


Full Text Download:









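[Abstract] Objective: To construct a recognition model for microRNA precursors (pre-miRNAs) with high sensitivity and high specificity. Methods: Using 300 experimentally verified human pre-miRNAs and 300 negative samples randomly selected from 3′UTR fragments that fold into stem-loop structures, we built MiRscreen, a support vector machine classifier that distinguishes pre-miRNAs from pseudo pre-miRNAs. To improve classifier performance, we used a genetic algorithm to search for two important parameters that affect performance, C and γ. Results and conclusion: The classifier achieved a sensitivity of 99.33% and a specificity of 100% on the training set. On the remaining 91 human pre-miRNAs and 91 pseudo pre-miRNAs from 3′UTRs, sensitivity and specificity reached 91.21% (83/91) and 93.41% (85/91), respectively. Among 1,353 pre-miRNAs from 20 animal and viral species other than human, MiRscreen correctly identified 1,192, a sensitivity of 88.10%; sensitivity reached 100% for eight species: Marek's disease virus, rhesus lymphocryptovirus, Epstein-Barr virus, simian virus 40, African clawed frog, dog, sheep, and rhesus macaque. On 556 pseudo pre-miRNAs folded from 100 randomly selected RefSeq genes and 797 randomly selected pseudo pre-miRNAs folded from human chromosome 19 (1,353 mixed negative samples in total), the specificity of MiRscreen reached 85.14% (1,152/1,353). Compared with six other methods of the same kind, MiRscreen performed well in both sensitivity and specificity and had the highest classification accuracy, 86.62%, more than 6% above the other methods; its AUC reached 0.938, also clearly higher than the other methods.
[Keywords] microRNAs; recognition; genetic algorithm; support vector machine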
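As a worked check of the MiRscreen figures quoted above, the sketch below recomputes sensitivity, specificity, and overall accuracy from the reported counts (1,192 of 1,353 positives and 1,152 of 1,353 negatives classified correctly). The function names are illustrative only; the counts are the ones stated in the abstract.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true positives among all real positives."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of true negatives among all real negatives."""
    return tn / (tn + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts taken from the abstract: 1,353 positives and 1,353 negatives.
tp, fn = 1192, 1353 - 1192
tn, fp = 1152, 1353 - 1152

print(f"sensitivity = {sensitivity(tp, fn):.2%}")
print(f"specificity = {specificity(tn, fp):.2%}")
print(f"accuracy    = {accuracy(tp, tn, fp, fn):.2%}")
```

With equal numbers of positive and negative samples, the accuracy is simply the mean of sensitivity and specificity, which reproduces the 86.62% reported in the abstract.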


Full Text Download:










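Abstract: MicroRNAs (miRNAs) are a class of endogenous small non-coding RNAs of about 21 nt, discovered in recent years, that play important and wide-ranging regulatory roles in plants and animals. They are discovered mainly by two routes: cDNA cloning and sequencing, and computational discovery. Because cDNA cloning and sequencing is affected by the temporal and tissue specificity of miRNA expression and by expression level, and computational discovery can compensate for these shortcomings, research on computational miRNA discovery has received wide attention. This article reviews recent progress in the computational discovery of miRNAs. According to their underlying nature, the methods are grouped into five categories: homologous fragment search, prediction based on comparative genomics, prediction based on scoring of sequence and structural features, prediction combined with target information, and prediction based on machine learning. The principles, core ideas, advantages, and limitations of each category are analyzed, and directions for further development are discussed.
Keywords: microRNA; computational discovery; homology search; comparative genomics; targets; machine learning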


Full Text Download:








 癌变 畸变 突变, 2008,20:85-88


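[Abstract] Background and objective: To study, by in silico cloning and other bioinformatics methods, 48 screened DNA sequence fragments of unknown function that are associated with esophageal cancer, in order to guide esophageal cancer research. Materials and methods: With the 48 DNA sequence fragments as the core, a local database was built with BioEdit; the gene fragments of unknown function among the 48 WRU sequences were extended by in silico cloning; and BLAST homology analysis was used to search for non-coding RNAs (ncRNAs) in the introns and the upstream and downstream intergenic regions of the 48 genes. Results: The gene fragments of unknown function among the 48 DNA sequences could be extended by more than 190 bp on average by in silico cloning; fragments highly similar to known ncRNAs were found in the introns and the upstream and downstream intergenic regions of the 48 genes. Conclusion: In silico cloning can markedly extend the sequences of some esophageal cancer-related genes of unknown function. Some chromosomal regions that harbor esophageal cancer-related genes contain fragments highly similar to ncRNAs, which suggests that ncRNAs may participate in the development of esophageal cancer; their specific functions require further study.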


Full Text Download:





 相 磊,陈小玲,李伍举, 梁之昶,章振华,王学文,胡国良,徐福洲,石 岗



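Abstract: To establish a gene chip method for detecting the different serotypes and genotypes of foot-and-mouth disease virus (FMDV), specific probes were designed for eight genotypes of serotype O, three genotypes of serotype A, and serotype Asia 1. A total of 547 VP1 gene sequences of FMDV serotypes O, A, and Asia 1 were downloaded from GenBank (USA) and from the gene bank of the World Reference Laboratory for Foot-and-Mouth Disease (UK). For each serotype, the sequences were aligned with the ClustalW program of the DNAStar software, followed by phylogenetic analysis and genotyping. Genotype databases were built with the biological software BioSun 2.0, and specific probes were designed for each genotype. In total, 104 candidate probes were designed, and 12 specific probes were selected through chip experiments. The target sequence template corresponding to each type-specific probe was serially diluted 10-fold and amplified by PCR, and the amplification products were hybridized with the probes to verify the sensitivity of each probe. The sensitivity of each probe for the four serotype O genotypes SEA, Euro-SA, ME-SA, and WA was tested; these probes could detect positive targets at copy numbers on the order of 10^2.
Keywords: foot-and-mouth disease virus; serotype; genotype; VP1 gene; gene chip; oligonucleotide probe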


Full Text Download:





1、Construction of mathematical model for high-level expression of foreign genes in pPIC9 vector and its verification

Bingli Wu, Lei Cha, Zepeng Du, Xiaomin Ying, Hua Li, Liyan Xu, Xiaofei Zheng, Enmin Li, Wuju Li

Biochemical and Biophysical Research Communications, 2007, 354:498–504

Abstract: In this report, we introduce a mathematical model for high-level expression of foreign genes in the pPIC9 vector. First, we collected 40 heterologous genes expressed in the pPIC9 vector and classified them into a high-level expression group (expression level >100 mg/L, 12 genes) and a low-level expression group (expression level <100 mg/L, 28 genes). Then, the Naive Bayes method was used to construct the model, with the RNA secondary structure profile of the 3'-end of the foreign genes as features. The classification accuracy from leave-one-out cross-validation was 100%. Finally, another five genes collected from the literature were used to test the model; four of them were predicted correctly. In addition, the model was verified by expressing the human neutrophil gelatinase-associated lipocalin (NGAL) gene at an expression level of more than 100 mg/L. We therefore propose that the model can be used to predict the expression level of heterologous genes before experiments and to optimize experimental designs for high-level expression. Furthermore, we have developed a web server for evaluation and design of high-level expression of foreign genes, which is accessible at http://ppic9.med.stu.edu.cn/ppic9
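The combination of a Naive Bayes classifier with leave-one-out cross-validation (LOOCV) mentioned above can be sketched as follows. This is a generic Gaussian Naive Bayes written from scratch, not the authors' model: the feature values are made-up toy numbers standing in for the RNA secondary structure profile, and the Gaussian likelihood is an assumption.

```python
import math


def fit_gaussian_nb(X, y):
    """Estimate the prior, per-feature means, and variances for each class."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A tiny variance floor avoids division by zero for constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(y), means, variances)
    return model


def predict(model, x):
    """Return the class with the highest log posterior for sample x."""
    def log_posterior(params):
        prior, means, variances = params
        total = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=lambda label: log_posterior(model[label]))


def loocv_accuracy(X, y):
    """Hold out each sample once, train on the rest, and score the held-out one."""
    hits = 0
    for i in range(len(X)):
        model = fit_gaussian_nb(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)


# Toy, made-up features for two expression groups (illustration only).
X = [[1.0, 5.0], [1.2, 5.5], [0.9, 4.8], [1.1, 5.2],
     [3.0, 1.0], [3.2, 1.4], [2.9, 0.8], [3.1, 1.1]]
y = ["high"] * 4 + ["low"] * 4
print(loocv_accuracy(X, y))
```

LOOCV suits small datasets like the 40 genes described here because every sample serves as a test case exactly once while the remaining samples are used for training.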

Full Text Download:


2、Predicting siRNA efficiency

W. Li and L. Cha

Cell. Mol. Life Sci., 2007, 64:1785–1792


Abstract: Since the identification of RNA-mediated interference (RNAi) in 1998, RNAi has become an effective tool to inhibit gene expression. The inhibition mechanism is triggered by introducing a short interfering double-stranded RNA (siRNA, 19–27 bp) into the cytoplasm, where the guide strand of the siRNA (usually the antisense strand) binds to its target messenger RNA and the expression of the target gene is blocked. RNAi has been widely applied in gene functional analysis and as a potential therapeutic strategy in viral diseases, drug target discovery, and cancer therapy. Among the factors that may compromise inhibition efficiency, how to design siRNAs with high efficiency and high specificity for their target genes is critical. Although many algorithms have been developed for this purpose, it is still difficult to design such siRNAs. In this review, we briefly discuss prediction methods for siRNA efficiency and the problems of present approaches.

Full Text Download:





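Abstract: MicroRNAs (miRNAs) are a class of endogenous small RNAs, 21–25 nt in length, found in animals and plants. They play a key role in post-transcriptional gene regulation, but low-abundance and tissue-specific miRNAs are often hard to find. To systematically identify new non-homologous miRNAs in the Arabidopsis genome, we first screened 453 possible miRNA precursors genome-wide based on the features of reported Arabidopsis miRNAs. Second, to filter these precursors further, we built a support vector machine model, GenomicSVM, from human miRNA precursor data. Its sensitivity and specificity on the human test set (30 human miRNA precursors and 1,000 negative miRNA precursors) were 86.3% and 98.1%, respectively, and its accuracy on the Arabidopsis test set (78 miRNA precursors) was 93.6%. Finally, GenomicSVM was applied to the 453 precursor sequences and yielded 37 candidate new Arabidopsis miRNA precursors, which provide guidance for further experimental discovery of miRNAs.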

Full Text Download:  






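Abstract: Like mRNAs, ncRNAs are important functional molecules. Using k-tuple (k-word) content as features, we classified and compared the mature sequences of yeast ncRNAs with the coding regions, upstream sequences, and downstream sequences of mRNAs. Based on the 3-tuple content of ncRNA mature sequences and mRNA coding regions, the mean leave-one-out cross-validation (LOOCV) classification accuracy for ncRNA versus mRNA reached 93.93%. Based on the 4-tuple and 5-tuple content of upstream sequences, the classification accuracies were 92.49% and 92.76%, respectively; based on the 4-tuple and 5-tuple content of downstream sequences, they were 91.58% and 90.60%, respectively. Using the 4-tuple and 5-tuple content of both upstream and downstream sequences, the mean classification accuracies were 94.68% and 94.83%, respectively. t-tests identified the k-tuples that differ with statistical significance between the upstream and downstream sequences of ncRNAs and mRNAs. These results indicate that the 3-tuple content of ncRNA mature sequences and mRNA coding regions, and the 4- or 5-tuple content of the upstream and downstream sequences of ncRNAs and mRNAs, can effectively distinguish ncRNAs from mRNAs. The findings not only help to identify ncRNAs and mRNAs accurately but also help to discover ncRNA-specific transcription factor binding sites.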
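The k-tuple content feature used in this study can be computed with a few lines of code. The sketch below is an illustration under assumptions: the function name is hypothetical, overlapping windows are counted, and windows containing characters outside the alphabet are skipped.

```python
from itertools import product


def ktuple_content(seq: str, k: int, alphabet: str = "ACGT") -> dict:
    """Return the relative frequency of every possible k-tuple in seq."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA input alike
    counts = {"".join(word): 0 for word in product(alphabet, repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):  # overlapping windows
        word = seq[i:i + k]
        if word in counts:  # skip windows with ambiguous bases such as N
            counts[word] += 1
            total += 1
    return {word: (c / total if total else 0.0) for word, c in counts.items()}


content = ktuple_content("ACGACG", 3)
print(len(content))                     # number of 3-tuple features (4**3)
print(content["ACG"], content["CGA"])   # relative frequencies
```

Each sequence becomes a fixed-length vector (64 values for 3-tuples, 256 for 4-tuples, 1,024 for 5-tuples), which can then be passed to any classifier.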


Full Text Download:



查磊, 应晓敏, 曹源, 李华, 李伍举


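Abstract: In 2004 we released BioSun 1.0, a software system for computer-aided design of molecular biology experiments, which provided fairly comprehensive data processing and analysis functions. To serve biomedical researchers better, we upgraded the system and released version 2.0. The main new functions are: several forms of BLAST-based sequence comparison, ClustalW-based multiple sequence alignment and phylogenetic tree construction, display of protein three-dimensional structures, RNAfold-based RNA secondary structure prediction, and sequence format conversion. A comparison with the commercial comprehensive bioinformatics software systems DNASIS MAX 2.05, DNAStar 5.0, Vector NTI 9.1, and BioEdit 7.0 shows that BioSun 2.0 is easy to operate, rich in functions, and cost-effective, and can meet the routine needs of biomedical laboratories.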

Full Text Download:


3.Mprobe 2.0: Computer-Aided Probe Design for Oligonucleotide Microarray

 Wuju Li, Xiaomin Ying  

 Applied bioinformatics, 2006, 5:181-186

Abstract: DNA chips have proven to be effective tools in detecting gene expression levels. Compared with DNA chips using complementary DNA as probes, oligonucleotide microarrays using oligonucleotides as probes have attracted great attention because of their well-known advantages. The design of gene-specific probes for each target is essential to the development of oligonucleotide microarrays. We have previously reported the development of a probe design software termed Mprobe 1.0. Here, we present a new version of this software, termed Mprobe 2.0. Several new features are included in Mprobe 2.0. Firstly, a Paradox-based sequence database management system has been developed and integrated into the software, which allows interoperability with sequences in GenBank, EMBL, and FASTA formats. Secondly, in contrast to setting a fixed threshold for the secondary structure of probes, as in Mprobe 1.0 and other related software, Mprobe 2.0 employs a different method. After parameters such as GC type, probe melting temperature, and GC content have been evaluated, candidate probes are sorted by free energy from high to low, followed by specificity analysis. Thirdly, Mprobe 2.0 provides users with substantial parameter options in the visual mode. Mprobe 2.0 offers an easier interface for users to manage sequences annotated in different formats and to design optimal probes for oligonucleotide microarrays and other applications. AVAILABILITY: The program is free for non-commercial users and can be downloaded from the web page
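The filter-then-sort idea described for Mprobe 2.0 (evaluate GC content and melting temperature first, then order the surviving candidates by free energy from high to low) can be sketched as below. This is not Mprobe's implementation: the melting temperature uses a common basic approximation for oligos longer than about 13 nt, the thresholds are arbitrary, and the free energy values are made-up numbers that would normally come from a secondary structure program.

```python
def gc_content(probe: str) -> float:
    """Fraction of G and C bases in the probe."""
    probe = probe.upper()
    return (probe.count("G") + probe.count("C")) / len(probe)


def basic_tm(probe: str) -> float:
    """Basic Tm estimate in degrees C: 64.9 + 41 * (G + C - 16.4) / N (assumed formula)."""
    probe = probe.upper()
    gc = probe.count("G") + probe.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(probe)


def rank_probes(candidates, gc_range, tm_range):
    """Keep probes inside the GC and Tm ranges, then sort by free energy, high to low.

    candidates: list of (sequence, free_energy_kcal_per_mol) pairs.
    """
    kept = [
        (seq, energy) for seq, energy in candidates
        if gc_range[0] <= gc_content(seq) <= gc_range[1]
        and tm_range[0] <= basic_tm(seq) <= tm_range[1]
    ]
    # A higher (less negative) free energy means a less stable secondary structure.
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Made-up candidates with hypothetical folding free energies.
candidates = [
    ("ATGCATGCATGCATGCATGC", -1.5),
    ("GCGCGCGCGCGCGCGCGCGC", -8.0),   # rejected: GC content too high
    ("ATGCATGCATGCATGCATGG", -0.2),
    ("ATATATATATATATATATAT", 0.0),    # rejected: GC content too low
]
for seq, energy in rank_probes(candidates, gc_range=(0.4, 0.6), tm_range=(45.0, 60.0)):
    print(seq, energy)
```

Sorting instead of applying a fixed structure threshold means that a target with only structured candidate probes still gets its least structured candidate passed on to the specificity check.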

Full Text Download:



1.How many genes are needed for early detection of breast cancer, based on gene expression patterns in peripheral blood cells?

Wuju Li

Breast Cancer Research, 2005, vol. 7 (5): E5.  

Abstract: In their recent report [1], Sharma and coworkers explore the early detection of breast cancer. They analyzed a gene expression data set (1368 genes in 62 normal and 40 tumour samples, including sample duplication in different batches) using the nearest shrunken centroid method. They identified a panel of 37 genes that permitted early detection, with a classification accuracy of about 82%. This is a typical problem in sample classification based on gene expression profiling. The objective is to achieve high prediction accuracy with as few genes as possible, and so feature selection plays an important role; examining a large number of genes increases the dimensionality, computational complexity, and clinical cost. According to our previous study of data sets from patients with colon cancer, leukaemia and breast cancer [2], we estimated that five or six genes, rather than 37, would be sufficient for the early detection of breast cancer [1]. So how many genes are indeed needed? To address this question, we evaluated the data presented by Sharma and coworkers using the Tclass system [2].

In the Tclass system, Fisher's linear discriminant analysis and a step-wise optimization procedure for feature selection are used to analyze a batch-adjusted data set [1] in two ways. The first takes the prediction accuracy on the training set as the objective function. The second takes the classification accuracy from leave-one-out cross-validation as the objective function. For the former, the selected optimal feature sets are evaluated by randomly dividing all tissue samples into a training set (e.g. 50%, 67%, or 85% of samples) and a test set 200 times. The relationship between the prediction accuracy and the number of genes is illustrated in Fig. 1, which shows that the greatest prediction accuracy was achieved using six genes (Fig. 1a); other peaks in accuracy occurred when 10, 13, or 15 genes were used (Fig. 1b). Furthermore, two genes, the 481st (BC009696) and the 801st (BC000514), permitted classification accuracy as high as 86%, which is greater than the 82% achieved by Sharma and coworkers [1] with the selected 37 genes.
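The evaluation step described above, in which samples are randomly divided into a training set and a test set 200 times, can be sketched as follows. To keep the example short, a nearest-centroid classifier stands in for Fisher's linear discriminant analysis; that substitution, the stratified splitting, the function names, and the toy expression values are all assumptions and not part of Tclass.

```python
import random


def fit_centroids(X, y):
    """Compute the mean expression vector of each class."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Assign x to the class with the nearest centroid (squared Euclidean distance)."""
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))


def repeated_holdout(X, y, train_fraction=0.67, repeats=200, seed=0):
    """Average test accuracy over repeated stratified random train/test splits."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        train_idx = []
        for label in sorted(set(y)):
            idx = [i for i, t in enumerate(y) if t == label]
            rng.shuffle(idx)
            # Keep at least one sample of each class on both sides of the split.
            n_train = max(1, min(len(idx) - 1, round(train_fraction * len(idx))))
            train_idx.extend(idx[:n_train])
        train_set = set(train_idx)
        test_idx = [i for i in range(len(y)) if i not in train_set]
        model = fit_centroids([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict_centroid(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / repeats


# Made-up expression values for two selected genes (illustration only).
X = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.2, 2.2],
     [3.0, 0.5], [3.1, 0.4], [2.9, 0.6], [3.2, 0.3]]
y = ["normal"] * 4 + ["tumour"] * 4
print(repeated_holdout(X, y, train_fraction=0.67, repeats=200))
```

Running the same procedure for each candidate gene subset and plotting the averaged accuracy against subset size gives the kind of accuracy-versus-gene-count curve the text refers to.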


Full Text Download:



2.An approach to studying lung cancer-related proteins in human blood

Ting Xiao, Wantao Ying, Lei Li, Zhi Hu, Ying Ma, Liyan Jiao, Jinfang Ma, Yun Cai, Dongmei Lin, Suping Guo, Naijun Han, Xuebing Di, Min Li, Dechao Zhang, Kai Su, Jinsong Yuan, Hongwei Zheng, Meixia Gao, Jie He, Susheng Shi, Wuju Li, Ningzhi Xu, Husheng Zhang, Yan Liu, Kaitai Zhang, Yanning Gao, Xiaohong Qian, and Shujun Cheng

Molecular & Cellular Proteomics, 2005, published online.

Abstract: Early-stage lung cancer detection is the first step towards successful clinical therapy and increased patient survival. Clinicians monitor cancer progression by profiling tumor cell proteins in the blood plasma of afflicted patients. Blood plasma, however, is a difficult medium for cancer protein assessment, because it is rich in albumins and heterogeneous protein species. We report herein a method to detect the proteins released into the circulatory system by tumor cells. Initially, we analyzed the protein components in the conditioned medium (CM) of lung cancer primary cell or organ cultures, and in the adjacent normal bronchus, using 1-D PAGE and nano-ESI-MS/MS. We identified 299 proteins involved in key cellular processes such as cell growth, organogenesis and signal transduction. We selected 13 interesting proteins from this list and analyzed them in 628 blood plasma samples using ELISA. We detected 11 of these 13 proteins in the plasma of lung cancer patients and non-patient controls. Our results showed that plasma MMP1 levels were elevated significantly in late-stage lung cancer patients, and that the plasma levels of 14-3-3 sigma, beta and eta in the lung cancer patients were significantly lower than those in the control subjects. To our knowledge, this is the first time that fascin, ezrin, CD98, annexin A4, 14-3-3 sigma, 14-3-3 beta and 14-3-3 eta proteins have been detected in human plasma by ELISA. The preliminary results showed that a combination of CD98, fascin, PIGR/SC and 14-3-3 eta had a higher sensitivity and specificity than any single marker. In conclusion, we report a method to detect proteins released into blood by lung cancer. This pilot approach may lead to the identification of novel protein markers in blood and provide a new method of identifying tumor biomarker profiles for guiding both early detection and therapy of human cancer.

Full Text Download:



1.Genome Class Prediction Based on Amino Acid Composition (AAC) from Proteomes

Wuju Li, Tao Liu, Xiaomin Ying, and Ming Fa

Molecular & Cellular Proteomics, 2004, vol.3 (10): S79.

Abstract: With genomic sequences from the three domains of life becoming increasingly available, the relationships between AAC and genome classes (organisms' phenotypes) have been widely studied in two respects. The first concentrates on differences in the AAC of particular types of proteins or of whole proteomes across genome classes. The second studies genome class prediction based on AAC. The purpose of both is to explain why certain organisms can live in extreme conditions of temperature, salinity, or pressure. Here we ask whether genome classes can be predicted as accurately as possible using small subsets of amino acids. To investigate this systematically, Fisher linear discriminant analysis (FLDA) was applied to four data sets: DOMAIN, LIFE, HTHAB, and ARCHAEA. DOMAIN covers the three domains of life (16 archaea, 75 bacteria, and 6 eukaryotic genomes). LIFE covers the three lifestyles (13 HTH, 4 TH, and 79 MES). HTHAB includes 10 HTH in archaea and 3 HTH in bacteria. ARCHAEA covers the three lifestyles in archaea (10 HTH, 3 TH, and 3 MES). By evaluating all possible combinations of features (amino acids), we found that the cross-validation accuracies for the four data sets could reach 94.8%, 97.9%, 100.0%, and 100.0% using the compositions of only four (A, I, K, and Q), five (I, K, P, V, and Y), two (E and Q), and two (M and Q) amino acids, respectively. The average cross-validation accuracy reaches 98.2%. Therefore, AAC from proteomes provides an alternative way to determine genome classes such as the lifestyle or the domain of life.
To our knowledge, correspondence analysis, principal component analysis (PCA), and hierarchical cluster analysis have been applied to distinguish genome classes using AAC, but classification methods have not been used. Our work therefore represents a first attempt in this field.
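The two ingredients described in this abstract, amino acid composition of a proteome and exhaustive enumeration of small amino acid subsets as candidate feature sets, can be sketched as follows. The function names and the toy proteins are illustrative assumptions; the classifier itself (FLDA with cross-validation) is left out.

```python
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(proteins):
    """Return the fraction of each standard amino acid over a whole proteome."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    total = 0
    for protein in proteins:
        for residue in protein.upper():
            if residue in counts:  # ignore non-standard symbols such as X or *
                counts[residue] += 1
                total += 1
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def feature_subsets(size):
    """Yield every subset of amino acids of the given size."""
    return combinations(AMINO_ACIDS, size)


toy_proteome = ["MKQ", "AAK"]  # made-up sequences for illustration only
aac = amino_acid_composition(toy_proteome)
print(aac["A"], aac["K"])

# Each subset would be scored by cross-validated classification accuracy.
print(sum(1 for _ in feature_subsets(4)))  # number of four-residue subsets
```

Because there are only 20 amino acids, an exhaustive search over subsets of two to five residues stays in the range of a few thousand to a few tens of thousands of candidates, which is why every combination can be tested rather than relying on a greedy search.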

PDF Abstract Download:


2.RDfolder: a web server for prediction of RNA secondary structure

Xiaomin Ying, Hong Luo, Jingchu Luo and Wuju Li

Nucleic Acids Research, 2004, vol.32: W150-W153.

Abstract: Prediction of RNA secondary structure is important in the functional analysis of RNA molecules. The RDfolder web server described in this paper provides two methods for prediction of RNA secondary structure: random stacking of helical regions and helical regions distribution. The random stacking method predicts secondary structure by Monte Carlo simulations. The method of helical regions distribution predicts secondary structure based on the helices that appear most frequently in the set of structures, which are generated by the random stacking method. The RDfolder web server can be accessed at http://rna.cbi.pku.edu.cn.
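A rough sketch of the two ideas named in the abstract: helices are stacked at random to form one candidate structure (Monte Carlo), and the frequency of each helix across many such structures gives a helical regions distribution. This is a loose illustration and not RDfolder's algorithm; the helix definition, minimum lengths, compatibility rule (no shared bases, no pseudoknots), and the toy sequence are all assumptions.

```python
import random
from collections import Counter

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def find_helices(seq, min_len=3, min_loop=3):
    """Return maximal helices as (i, j, length): base i+k pairs with base j-k."""
    n = len(seq)
    helices = []
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            if (seq[i], seq[j]) not in PAIRS:
                continue
            # Skip pairs that merely continue a helix starting further out.
            if i > 0 and j + 1 < n and (seq[i - 1], seq[j + 1]) in PAIRS:
                continue
            length = 0
            while (i + length < j - length - min_loop
                   and (seq[i + length], seq[j - length]) in PAIRS):
                length += 1
            if length >= min_len:
                helices.append((i, j, length))
    return helices


def pairs_of(helix):
    i, j, length = helix
    return [(i + k, j - k) for k in range(length)]


def compatible(h1, h2):
    """Two helices are compatible if they share no base and do not cross."""
    for a1, b1 in pairs_of(h1):
        for a2, b2 in pairs_of(h2):
            if len({a1, b1, a2, b2}) < 4:
                return False
            if a1 < a2 < b1 < b2 or a2 < a1 < b2 < b1:
                return False
    return True


def random_stacking(helices, rng):
    """Build one structure by adding helices in random order when compatible."""
    order = helices[:]
    rng.shuffle(order)
    chosen = []
    for helix in order:
        if all(compatible(helix, other) for other in chosen):
            chosen.append(helix)
    return chosen


def helix_distribution(seq, runs=500, seed=0):
    """Count how often each helix appears across many random structures."""
    rng = random.Random(seed)
    helices = find_helices(seq)
    counts = Counter()
    for _ in range(runs):
        counts.update(random_stacking(helices, rng))
    return counts


print(helix_distribution("GGGGAAAACCCC", runs=200).most_common())
```

In this simplified form, a final structure could be assembled from the most frequent helices; how RDfolder actually weights and combines helices is not stated in the abstract.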

Full Text Download:




李伍举, 应晓敏

军事医学科学院院刊, 2004 vol. 28(5): 401-404

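Abstract: This paper describes BioSun, a bioinformatics software system that we researched and developed ourselves for computer-aided design of molecular biology experiments. It runs under Windows. Its main functions are: visual sequence editing; a database management system that accepts several sequence formats (EMBL, GenBank, and FASTA); several modes of sequence comparison; several modes of antigenic epitope prediction; RNA secondary structure prediction based on several algorithms; restriction site analysis and restriction map construction; computer-aided design of PCR experiments; computer-aided probe design for oligonucleotide microarrays; computer-aided primer design for cDNA microarrays; and design of high-level expression of foreign genes in prokaryotic systems. BioSun uses a graphical user interface and allows flexible management of graphics and text files. It is flexible to operate and offers diverse functions, can be used for computer-aided design of molecular biology experiments, and is of considerable value for speeding up experiments and raising their success rate.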




1.SamCluster: An integrated scheme for automatic discovery of sample classes using gene expression profile

Wuju Li, Ming Fan and Momiao Xiong

Bioinformatics, 2003, vol.19: 811-817

Motivation: Feature (gene) selection can dramatically improve the accuracy of gene expression profile based sample class prediction. Many statistical methods for feature (gene) selection such as stepwise optimization and Monte Carlo simulation have been developed for tissue sample classification. In contrast to class prediction, few statistical and computational methods for feature selection have been applied to clustering algorithms for pattern discovery.
Results: An integrated scheme and corresponding program, SamCluster, for automatic discovery of sample classes based on gene expression profiles is presented in this report. The scheme incorporates feature selection algorithms based on the coefficient of variation (CV) and the t-test into hierarchical clustering and proceeds as follows. First, the genes with a CV greater than a pre-specified threshold are selected for cluster analysis, which results in two putative sample classes. Then, genes that are significantly differentially expressed between the two putative sample classes, with p-values of 0.01, 0.05, or 0.1 from the t-test, are selected for further cluster analysis. These steps are iterated until two stable sample classes are found. Finally, the consensus sample classes are constructed from the putative classes derived from the different CV thresholds, and the best putative sample classes, those with the minimum distance between the consensus classes and the putative classes, are identified. To evaluate the performance of the feature selection for cluster analysis, the proposed scheme was applied to four expression datasets: COLON, LEUKEMIA72, LEUKEMIA38, and OVARIAN. The results show that only 5, 1, 0, and 0 samples were misclassified, respectively. We conclude that the proposed scheme, SamCluster, is an efficient method for discovery of sample classes using gene expression profiles.
Availability: The related program SamCluster is available upon request or from the web page http://www.sph.uth.tmc.edu:8052/hgc/Downloads.asp
or http://www.biosun.com.cn/softwares/samcluater.html
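The first step of the SamCluster scheme, keeping only genes whose coefficient of variation exceeds a pre-specified threshold, can be sketched as follows. This is an illustration rather than the SamCluster code: the function names, the use of the sample standard deviation, and the toy expression values are assumptions.

```python
import statistics


def coefficient_of_variation(values):
    """CV = standard deviation / |mean| (sample standard deviation assumed)."""
    mean = statistics.mean(values)
    if mean == 0:
        return 0.0
    return statistics.stdev(values) / abs(mean)


def select_genes_by_cv(expression, threshold):
    """Return the genes whose CV across samples exceeds the threshold.

    expression: dict mapping gene name -> list of expression values, one per sample.
    """
    return [gene for gene, values in expression.items()
            if coefficient_of_variation(values) > threshold]


# Made-up expression matrix: three genes measured in four samples.
expression = {
    "geneA": [10.0, 10.0, 10.0, 10.0],  # constant, CV = 0
    "geneB": [1.0, 9.0, 1.0, 9.0],      # highly variable
    "geneC": [5.0, 6.0, 5.0, 6.0],      # mildly variable
}
print(select_genes_by_cv(expression, threshold=0.5))
```

The selected genes would then feed the hierarchical clustering step; the later t-test filter and the iteration to stable classes are not shown here.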


Full Text Download:




李伍举. 刘涛.

解放军医学杂志 2003 vol.28(6):S9-S10

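Abstract: [Objective] To predict the antigenic epitopes of the two SARS virus membrane proteins S and M, using an integrated epitope prediction method that combines Hopp & Woods hydrophilicity, Janin surface accessibility, Karplus-Schulz backbone flexibility, and charge distribution, together with protein secondary structure prediction, in order to provide a basis for SARS vaccine design. [Results] Analysis of the epitopes of the two SARS virus membrane proteins S and M with Goldkey and other software yielded 14 and 7 possible epitopes, respectively.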
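One component of the integrated method named above, the Hopp & Woods hydrophilicity profile, can be sketched as a sliding-window average. This covers only that single parameter and is not the Goldkey implementation; the scale values are the commonly cited Hopp & Woods hydrophilicity values, and the window length of 6 and the toy peptide are assumptions.

```python
# Commonly cited Hopp & Woods hydrophilicity values (higher = more hydrophilic).
HOPP_WOODS = {
    "R": 3.0, "D": 3.0, "E": 3.0, "K": 3.0, "S": 0.3, "N": 0.2, "Q": 0.2,
    "G": 0.0, "P": 0.0, "T": -0.4, "A": -0.5, "H": -0.5, "C": -1.0, "M": -1.3,
    "V": -1.5, "I": -1.8, "L": -1.8, "Y": -2.3, "F": -2.5, "W": -3.4,
}


def hydrophilicity_profile(protein: str, window: int = 6):
    """Average hydrophilicity of each window; raises KeyError on non-standard residues."""
    scores = [HOPP_WOODS[aa] for aa in protein.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def most_hydrophilic_window(protein: str, window: int = 6):
    """Return (start index, peptide, score) of the highest-scoring window."""
    profile = hydrophilicity_profile(protein, window)
    best = max(range(len(profile)), key=profile.__getitem__)
    return best, protein[best:best + window], profile[best]


toy_peptide = "LLLLKKDEKRLLLL"  # made-up sequence for illustration only
print(most_hydrophilic_window(toy_peptide))
```

In an integrated method of the kind described, peaks in this profile would be combined with accessibility, flexibility, and charge information before a region is proposed as a candidate epitope.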


Full Text Download:    



刘涛. 李伍举. 范明.

解放军医学杂志 2003 vol.28(6):S1-S5

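Abstract: Since March 2003, a novel coronavirus (SARS-CoV) has been preliminarily identified as the pathogen of severe acute respiratory syndrome (SARS), a lethal infectious disease that broke out at the end of 2002. The virus has the genome organization typical of other known coronaviruses. Phylogenetic analysis of the virus can guide further experimental research. We first constructed a phylogenetic tree of SARS-CoV at the whole-genome level to establish its evolutionary position, and then analyzed, at both the nucleic acid and protein levels, the phylogenetic trees of five major homologous proteins of SARS-CoV: replicase, S protein, E protein, M protein, and N protein. The results show that SARS-CoV is homologous to the currently known coronaviruses but differs clearly from them: the evolutionary histories of the homologous genes differ from one another, and the evolutionary history of the structural protein genes differs from that of the genome. SARS-CoV is relatively closely related to IBV and TGV, especially IBV; the particularly close relationship at the level of the E and M proteins deserves attention and consideration in further experimental research.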

Full Text Download:  



张玉梅. 孙长凯. 范明. 李伍举. 刘淑红. 赵杰. 韩大跃. 王嘉玺.

中国生物化学与分子生物学报 2003 vol.19(5):588-593

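Abstract: A target fragment of the M3-M4 loop of the main subunit of the human N-methyl-D-aspartate (NMDA) receptor was obtained by genetic engineering to serve as an immunogen for further studies of immunogenicity and related applications. Total RNA was extracted from human glioma tissue, and the gene fragment encoding the M3-M4 loop of the NMDA receptor main subunit was amplified by RT-PCR. The fragment was then optimized and modified according to the prediction method of the mathematical model for computer-aided high-level expression of foreign genes in the prokaryotic expression vector pBV220. The target gene was cloned into pBV220 and transformed into Escherichia coli DH5α, and expression was induced by a temperature shift. Expression of the recombinant in E. coli was examined at the protein level; the product was purified by preparative SDS-PAGE and characterized by relative molecular mass, immunoreactivity, and peptide mass fingerprinting. The results show that a prokaryotic expression vector for the M3-M4 loop of the human NMDA receptor main subunit (named pBV NR1L3) was successfully constructed and that high-level expression was achieved through gene optimization. Gel scanning analysis showed that the expressed product accounted for about 29% of total bacterial protein, and the purity of the recombinant peptide exceeded 95%.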




1.Tclass: Tumor Classification System Based on Gene Expression Profile

Li Wuju and Xiong Momiao

Bioinformatics 2002, vol.18: 325-326

Summary: A method that incorporates feature selection into Fisher’s linear discriminant analysis for gene expression based tumor classification and a corresponding program Tclass were developed. The proposed method was applied to a public gene expression data set for colon cancer that consists of 22 normal and 40 tumor colon tissue samples to evaluate its performance for classification. Preliminary results demonstrated that using only a subset of genes ranging from 3 to 10 can achieve high classification accuracy.
Availability: The program is written in Matlab and is being rewritten in the Java language. The source code is available upon request.
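Fisher's linear discriminant analysis for two classes, the classifier named in the Tclass summary, can be sketched in plain Python. This is not the Tclass source (which is described as Matlab): the function names, the midpoint threshold, and the toy expression values for two already-selected genes are assumptions, and the feature selection step is omitted.

```python
def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def scatter_matrix(rows, mean):
    """Sum of outer products of deviations from the class mean."""
    d = len(mean)
    S = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [a - b for a, b in zip(row, mean)]
        for i in range(d):
            for j in range(d):
                S[i][j] += diff[i] * diff[j]
    return S


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [value] for row, value in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit_fisher_lda(class0, class1):
    """Return (w, threshold) with w = Sw^-1 (m1 - m0) and a midpoint threshold."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    S0, S1 = scatter_matrix(class0, m0), scatter_matrix(class1, m1)
    Sw = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(S0, S1)]
    w = solve(Sw, [b - a for a, b in zip(m0, m1)])
    threshold = sum(wi * (a + b) / 2 for wi, a, b in zip(w, m0, m1))
    return w, threshold


def classify(w, threshold, x):
    """Return 1 if the projection of x exceeds the threshold, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > threshold else 0


# Made-up expression values for two genes (illustration only).
normal = [[1.0, 2.0], [1.2, 2.1], [0.9, 2.3], [1.1, 1.8]]
tumour = [[3.0, 0.5], [3.2, 0.7], [2.9, 0.3], [3.1, 0.9]]
w, threshold = fit_fisher_lda(normal, tumour)
print([classify(w, threshold, x) for x in normal + tumour])
```

Restricting the input to a handful of selected genes keeps the within-class scatter matrix small and invertible, which is one practical reason to combine feature selection with this classifier when genes far outnumber samples.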

Full Text Download:



2.MProbe: computer aided probe design for oligonucleotide microarrays

Wuju Li, Jian Huang, Ming Fan, Shengqi Wang

Applied Bioinformatics 2002, 1(3):163-166.

Abstract: The present work describes a complete probe design software system for oligonucleotide microarrays based on Kane’s research on probe sensitivity and specificity (Kane’s rule). Combining Kane’s rule and traditional criteria for probe design we constructed MProbe, the software system for oligonucleotide microarrays using Java. The general criteria for probe design are: (1) probes may have different lengths that range from 20 to 100 bases; (2) they should have a similar melting temperature (Tm) or GC content; (3) they should not contain stable secondary structures; and (4) they abide by Kane’s rule.





军事医学科学院院刊,2002 vol.26(1):73-76.

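Abstract: DNA microarray technology is another major biotechnology following recombinant DNA technology and PCR amplification. Microarray experiments make it possible to observe simultaneously the dynamic expression levels of thousands of genes in a given biological phenomenon. Compared with the earlier research paradigm of studying the expression of single genes, this will greatly change the outlook of molecular biologists and allow life phenomena and their nature to be studied from a systematic, global perspective at the genome level. Microarray technology has already been applied to tumor subtyping, tumor classification, gene function studies, construction of gene regulatory networks, drug target identification, and many other areas. In essence, however, what a microarray experiment yields directly is a gene expression profile (that is, a gene expression matrix in which rows represent genes and columns represent experimental samples), and the practical applications of microarrays are realized through bioinformatic processing of this matrix. Bioinformatics is therefore an extremely important part of microarray-based molecular biology research. This article reviews the bioinformatics methods related to gene expression profiles.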




孙长凯. 赵杰. 李伍举. 冯健男. 刘淑红.等

中华医学杂志 2002 vol.82(1):50-53

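Abstract. Objective: To analyze the antigenicity and physicochemical properties of two receptor activation-related polypeptides, P1 and P2, on NR1a, the main subunit of the human N-methyl-D-aspartate receptor (NMDAR). Methods: The amino acid sequence of human NR1a was retrieved from a protein database with the GOLDKEY software. Polypeptide fragments P1 (151 amino acids, taken backward from before the first transmembrane domain) and P2 (144 amino acids, taken forward from after the third transmembrane domain) were extracted. Multi-parameter analysis was performed using Hopp & Woods and Kyte hydrophilicity, Janin surface accessibility, Karplus-Schulz backbone flexibility, and Welling antigenicity, and the amino acid sites and secondary structure features were compared using the Prosite program and the Chou-Fasman method. On this basis, the antigenic sites of the P1 and P2 fragments were determined comprehensively and compared with existing experimental results. Results: P1 and P2 may contain 6 and 7 sequences of 8–15 aa, respectively, with good antigenicity. The relevant sequences of P1 are located mainly at its amino terminus, relatively far from the key amino acid residues for ligand binding. Those of P2 are distributed more evenly and either contain sites important for receptor activation or lie close to key ligand-binding residues. The overall antigenicity, hydrophilicity, and accessibility of P2 are all stronger than those of P1, most markedly in its 15 membrane-proximal residues. Both fragments contain a certain number of β-turns, but P1 contains more cysteine residues and random coil, whereas P2 contains more aromatic residues and is predominantly α-helical. Conclusion: Both receptor activation-related polypeptides P1 and P2 on the human NMDAR main subunit NR1a have a certain number of antigenic sites; compared with P1, P2 may more readily serve as a molecular target for immune intervention against NMDAR.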




1.Feature (gene) selection in gene expression-based tumor classification

Xiong M, Li W, Zhao J, Jin L, Boerwinkle E.

Mol Genet Metab. 2001 vol.73(3):239-47.

Abstract: There is increasing interest in changing the emphasis of tumor classification from morphologic to molecular. Gene expression profiles may offer more information than morphology and provide an alternative to morphology-based tumor classification systems. Gene selection involves a search for gene subsets that are able to discriminate tumor tissue from normal tissue, and may have either clear biological interpretation or some implication in the molecular mechanism of the tumorigenesis. Gene selection is a fundamental issue in gene expression-based tumor classification. In the formation of a discriminant rule, the number of genes is large relative to the number of tissue samples. Too many genes can harm the performance of the tumor classification system and increase the cost as well. In this report, we discuss criteria and illustrate techniques for reducing the number of genes and selecting an optimal (or near optimal) subset of genes from an initial set of genes for tumor classification. The practical advantages of gene selection over other methods of reducing the dimensionality (e.g., principal components), include its simplicity, future cost savings, and higher likelihood of being adopted in a clinical setting. We analyze the expression profiles of 2000 genes in 22 normal and 40 colon tumor tissues, 5776 sequences in 14 human mammary epithelial cells and 13 breast tumors, and 6817 genes in 47 acute lymphoblastic leukemia and 25 acute myeloid leukemia samples. Through these three examples, we show that using 2 or 3 genes can achieve more than 90% accuracy of classification. This result implies that after initial investigation of tumor classification using microarrays, a small number of selected genes may be used as biomarkers for tumor classification, or may have some relevance in tumor development and serve as a potential drug target. 
In this report we also show that stepwise Fisher's linear discriminant function is a practicable method for gene expression-based tumor classification.




1.Computational methods for gene expression-based tumor classification

Xiong M, Jin L, Li W, Boerwinkle E.

Biotechniques. 2000 vol.29(6):1264-8,1270.

Abstract: Gene expression profiles may offer more or additional information than classic morphologic- and histologic-based tumor classification systems. Because the number of tissue samples examined is usually much smaller than the number of genes examined, efficient data reduction and analysis methods are critical. In this report, we propose a principal component and discriminant analysis method of tumor classification using gene expression profile data. Expression of 2000 genes in 40 tumor and 22 normal colon tissue samples is used to examine the feasibility of gene expression-based tumor classification systems. Using this method, the percentage of correctly classified normal and tumor tissue was 87.0%. The combined approach using principal components and discriminant analysis provided superior sensitivity and specificity compared to an approach using simple differences in the expression levels of individual genes.
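
The combination of principal components and discriminant analysis described above can be sketched roughly as follows. This is an illustrative simplification, not the published method: it assumes a single principal component (found by power iteration) and a one-dimensional discriminant whose cut-off is the midpoint of the two class means, which is what a linear discriminant with equal priors and pooled variance reduces to in one dimension. All names and the toy values are hypothetical.

```python
import math

# Sketch under stated assumptions: project expression profiles onto the first
# principal component, then classify with a 1-D linear discriminant.

def center(rows):
    """Subtract the per-gene mean; return centered rows and the mean vector."""
    n = len(rows)
    mu = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(r, mu)] for r in rows], mu

def first_pc(centered, iters=200):
    """Leading principal direction by power iteration on X'X (never formed)."""
    p = len(centered[0])
    v = [1.0 / math.sqrt(p)] * p  # assumed start vector; must not be orthogonal to PC1
    for _ in range(iters):
        scores = [sum(x * vi for x, vi in zip(r, v)) for r in centered]
        v = [sum(s * r[j] for s, r in zip(scores, centered)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

def project(sample, mu, v):
    return sum((x - m) * vi for x, m, vi in zip(sample, mu, v))

def fit(class0, class1):
    """Return (mean, direction, cut-off, whether class 1 scores higher)."""
    centered, mu = center(class0 + class1)
    v = first_pc(centered)
    s0 = [project(r, mu, v) for r in class0]
    s1 = [project(r, mu, v) for r in class1]
    m0, m1 = sum(s0) / len(s0), sum(s1) / len(s1)
    return mu, v, (m0 + m1) / 2, m1 > m0

def predict(sample, model):
    mu, v, cut, one_is_high = model
    return int((project(sample, mu, v) > cut) == one_is_high)

# Hypothetical expression values: rows are tissues, columns are genes.
normal = [[1.0, 2.0, 5.0], [1.2, 2.1, 5.1], [0.9, 1.9, 4.9], [1.1, 2.0, 5.0]]
tumor = [[3.0, 4.0, 5.0], [3.1, 4.2, 5.1], [2.9, 3.9, 4.9], [3.2, 4.1, 5.0]]

model = fit(normal, tumor)
print([predict(s, model) for s in normal + tumor])
```

With thousands of genes and only tens of tissues, several components would normally be retained; one component is used here only to keep the sketch short.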




1.GeneDn: for high-level expression design of heterologous genes in a prokaryotic system

Li Wu Ju, Lei Hong Xing, Pei Wu Hong and Wu Jia Jin

Bioinformatics 1998, vol.14: 884-885.

RESULTS: Based on the mathematical model of high-level expression of heterologous genes in prokaryotic vector pBV220, we developed a program GeneDn for high-level expression design of natural and synthetic genes. AVAILABILITY: The program is written in Turbo Pascal 7.0. The source code and related material are available upon request.




2.Prediction of RNA secondary structure based on helical regions distribution

Li Wuju and Wu Jiajin

Bioinformatics 1998, vol.14: 700-706.

MOTIVATION: RNAs play an important role in many biological processes and knowing their structure is important in understanding their function. Due to difficulties in the experimental determination of RNA secondary structure, the methods of theoretical prediction for known sequences are often used. Although many different algorithms for such predictions have been developed, this problem has not yet been solved. It is thus necessary to develop new methods for predicting RNA secondary structure. The most-used at present is Zuker's algorithm which can be used to determine the minimum free energy secondary structure. However many RNA secondary structures verified by experiments are not consistent with the minimum free energy secondary structures. In order to solve this problem, a method used to search a group of secondary structures whose free energy is close to the global minimum free energy was developed by Zuker in 1989. When considering a group of secondary structures, if there is no experimental data, we cannot tell which one is better than the others. This case also occurs in combinatorial and heuristic methods. These two kinds of methods have several weaknesses. Here we show how the central limit theorem can be used to solve these problems.

RESULTS: An algorithm for predicting RNA secondary structure based on helical regions distribution is presented, which can be used to find the most probable secondary structure for a given RNA sequence. It consists of three steps. First, list all possible helical regions. Second, according to central limit theorem, estimate the occurrence probability of every helical region based on the Monte Carlo simulation. Third, add the helical region with the biggest probability to the current structure and eliminate the helical regions incompatible with the current structure. The above processes can be repeated until no more helical regions can be added. Take the current structure as the final RNA secondary structure. In order to demonstrate the confidence of the program, a test on three RNA sequences: tRNAPhe, Pre-tRNATyr, and Tetrahymena ribosomal RNA intervening sequence, is performed.
The program is written in Turbo Pascal 7.0. The source code is available upon request.
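
The three steps above can be illustrated with a short Python sketch. It is not the authors' Turbo Pascal program, and several details are assumptions because the abstract does not specify them: helical regions are taken to be maximal stacks of at least three Watson-Crick or G-U pairs, pseudoknots are disallowed, and the probability of a helical region is estimated as the plain frequency with which it appears when regions are stacked in random order (the central limit theorem based estimate of the paper is replaced by this simpler frequency). Probabilities are also estimated once rather than after every addition.

```python
import random

# Sketch under stated assumptions; all names and parameters are illustrative.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def helical_regions(seq, min_len=3, min_loop=3):
    """Step 1: list maximal stacks of consecutive base pairs as (i, j, length)."""
    n = len(seq)

    def pairs(a, b):
        return 0 <= a and b < n and b - a > min_loop and (seq[a], seq[b]) in PAIRS

    regions = []
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            if not pairs(i, j) or pairs(i - 1, j + 1):
                continue  # not pairable, or not the outermost pair of its stack
            k = 1
            while pairs(i + k, j - k):
                k += 1
            if k >= min_len:
                regions.append((i, j, k))
    return regions

def compatible(h1, h2):
    """True if two helices share no base and do not cross (no pseudoknots)."""
    (i1, j1, k1), (i2, j2, k2) = h1, h2
    a1, b1 = i1 + k1 - 1, j1 - k1 + 1  # inner ends of the two strands of h1
    a2, b2 = i2 + k2 - 1, j2 - k2 + 1
    return (j2 < i1 or j1 < i2              # side by side
            or (a1 < i2 and j2 < b1)        # h2 inside the loop of h1
            or (a2 < i1 and j1 < b2))       # h1 inside the loop of h2

def random_stacking(regions, rng):
    """Build one random structure by adding helices in shuffled order."""
    order = regions[:]
    rng.shuffle(order)
    chosen = []
    for h in order:
        if all(compatible(h, c) for c in chosen):
            chosen.append(h)
    return chosen

def estimate_probabilities(regions, trials=500, seed=0):
    """Step 2 (assumed scheme): occurrence frequency over random stackings."""
    rng = random.Random(seed)
    counts = {h: 0 for h in regions}
    for _ in range(trials):
        for h in random_stacking(regions, rng):
            counts[h] += 1
    return {h: c / trials for h, c in counts.items()}

def predict(seq, trials=500, seed=0):
    """Step 3: repeatedly add the most probable helix, drop incompatible ones."""
    candidates = helical_regions(seq)
    prob = estimate_probabilities(candidates, trials, seed)
    structure = []
    while candidates:
        best = max(candidates, key=lambda h: (prob[h], h[2]))
        structure.append(best)
        candidates = [h for h in candidates if h != best and compatible(h, best)]
    return sorted(structure)

print(predict("GGGAAAACCC"))
```

Each returned tuple (i, j, length) uses 0-based positions: base i+t pairs with base j-t for t from 0 to length-1. The random stacking here ignores free energy entirely, so it should be read as a structural outline of the procedure rather than a usable predictor.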







Chinese Journal of Virology (病毒学报), 1997, vol.13: 126-133.

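Abstract: Using sequence analysis techniques, including RNA secondary structure prediction based on random stacking of helical regions and codon bias calculation, we analyzed the expression levels of 22 heterologous genes carried in the vector pBV220, among them human interleukin-2 and human interleukin-4. The results show that the secondary structure free energy of the -30 to 39 region at the 5' end and of the 30 to -39 region at the 3' end is statistically significantly associated with expression level, followed by the local codon bias of the 9 bp at the 3' end. The number of bases between the SD sequence and the start codon ATG, when within the range 8±3, shows no significant relationship with expression level. In addition, a discriminant function was constructed by discriminant analysis, and its classification agreement rate was as high as 95.5%.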