
Recent Publications

 

2007

1.

Construction of mathematical model for high-level expression of foreign genes in pPIC9 vector and its verification

 

Bingli Wu, Lei Cha, Zepeng Du, Xiaomin Ying, Hua Li, Liyan Xu, Xiaofei Zheng, Enmin Li, Wuju Li

 

Biochemical and Biophysical Research Communications, 2007, 354:498–504
Abstract: In this report, we introduce a mathematical model for high-level expression of foreign genes in the pPIC9 vector. First, we collected 40 heterologous genes expressed in the pPIC9 vector and classified them into a high-level expression group (expression level >100 mg/L, 12 genes) and a low-level expression group (expression level <100 mg/L, 28 genes). Then, the Naive Bayes method was used to construct the model, with the RNA secondary structure profile of the 3'-end of the foreign genes as features. The classification accuracy from leave-one-out cross-validation was 100%. Finally, another five genes collected from the literature were used to test the model, and four of them were correctly predicted. In addition, the model was verified by expressing the human neutrophil gelatinase-associated lipocalin (NGAL) gene, with an expression level of more than 100 mg/L. Therefore, we propose that the model can be used to predict the expression level of heterologous genes before experiments and to optimize experimental designs to obtain high-level expression. Furthermore, we have developed a web server for the evaluation and design of high-level expression of foreign genes, which is accessible at http://ppic9.med.stu.edu.cn/ppic9

 


2.

Predicting siRNA efficiency

 

W. Li and L. Cha

 

Cell. Mol. Life Sci., 2007, 64:1785 – 1792

 

Abstract: Since the identification of RNA-mediated interference (RNAi) in 1998, RNAi has become an effective tool to inhibit gene expression. The inhibition mechanism is triggered by introducing a short interfering double-stranded RNA (siRNA, 19 to 27 bp) into the cytoplasm, where the guide strand of the siRNA (usually the antisense strand) binds to its target messenger RNA and the expression of the target gene is blocked. RNAi has been widely applied in gene functional analysis, and as a potential therapeutic strategy in viral diseases, drug target discovery, and cancer therapy. Among the factors that may compromise inhibition efficiency, how to design siRNAs with high efficiency and high specificity for their target genes is critical. Although many algorithms have been developed for this purpose, it is still difficult to design such siRNAs. In this review, we briefly discuss prediction methods for siRNA efficiency and the problems of present approaches.

 


3.

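Prediction and analysis of novel microRNAs in the Arabidopsis thaliana genome

Weibo Jin, Dong Kong, Xiaomin Ying, Aiguang Guo, Wuju Li

Acta Biophysica Sinica, 2007, 23:389-396

Abstract: MicroRNAs (miRNAs) are a class of endogenous small RNAs, 21 to 25 nt in length, that are found in animals and plants and play a key role in post-transcriptional gene regulation. However, low-abundance and tissue-specific miRNAs are often difficult to discover. To systematically identify novel non-homologous miRNAs in the Arabidopsis genome, we first screened 453 putative miRNA precursors from the whole genome based on the features of reported Arabidopsis miRNAs. Second, to filter these precursors further, we built a support vector machine model, GenomicSVM, from human miRNA precursor data. The model achieved a sensitivity of 86.3% and a specificity of 98.1% on the human test set (30 human miRNA precursors and 1,000 negative miRNA precursors), and an accuracy of 93.6% on the Arabidopsis test set (78 miRNA precursors). Finally, we applied GenomicSVM to the 453 putative precursor sequences and obtained 37 novel candidate Arabidopsis miRNA precursors, which provide guidance for further experimental discovery of miRNAs.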

 

 

2006

1.

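Comparative study of yeast ncRNAs and mRNAs based on k-tuple composition

Hua Li, Xiaomin Ying, Lei Cha, Wuju Li

Acta Biophysica Sinica, 2006, 22:110-116

Abstract: Like mRNAs, ncRNAs are important functional molecules. Using k-tuple (k-word) content as features, we classified and compared yeast ncRNA mature sequences with the coding regions, upstream sequences, and downstream sequences of mRNAs. The results show that, based on the 3-tuple content of ncRNA mature sequences and mRNA coding regions, the average leave-one-out cross-validation (LOOCV) classification accuracy for ncRNAs versus mRNAs reached 93.93%. Based on the 4-tuple and 5-tuple content of upstream sequences, the classification accuracies were 92.49% and 92.76%, respectively; based on the 4-tuple and 5-tuple content of downstream sequences, they were 91.58% and 90.60%, respectively. Using the 4-tuple and 5-tuple content of upstream and downstream sequences together, the average classification accuracies were 94.68% and 94.83%, respectively. A t-test identified the k-tuples whose content differs significantly between ncRNAs and mRNAs in the upstream and downstream sequences. These results indicate that the 3-tuple content of ncRNA mature sequences and mRNA coding regions, and the 4- or 5-tuple content of ncRNA and mRNA upstream and downstream sequences, can effectively distinguish ncRNAs from mRNAs. The findings not only help to identify ncRNAs and mRNAs accurately but also help to discover ncRNA-specific transcription factor binding sites.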
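The k-tuple content feature used in the abstract above can be illustrated with a minimal Python sketch. The function name `ktuple_content`, the ACGU alphabet, and the decision to skip windows that contain other characters are assumptions made for illustration; they are not details taken from the paper.

```python
from itertools import product


def ktuple_content(seq, k):
    """Return the relative frequency of every k-tuple over the alphabet ACGU.

    Hypothetical helper: DNA input is accepted by mapping T to U, and
    windows containing any other character are ignored.
    """
    seq = seq.upper().replace("T", "U")
    # Start every possible k-tuple at zero so all feature vectors share one layout.
    counts = {"".join(p): 0 for p in product("ACGU", repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
            total += 1
    return {w: (c / total if total else 0.0) for w, c in counts.items()}
```

For example, `ktuple_content("ACGACG", 3)["ACG"]` is 0.5, because two of the four overlapping 3-tuples in that sequence are ACG. The resulting dictionaries (64 entries for k = 3) could then be fed to any classifier.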

 


2.

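BioSun 2.0: a comprehensive software package for computer-aided design of molecular biology experiments

Lei Cha, Xiaomin Ying, Yuan Cao, Hua Li, Wuju Li

Bulletin of the Academy of Military Medical Sciences, 2006, 30:461-464

Abstract: In 2004 we released BioSun 1.0, a software system for computer-aided design of molecular biology experiments that provides fairly comprehensive data processing and analysis functions. To better serve biomedical researchers, we have upgraded the system and released version 2.0. The main new functions are: several forms of Blast-based sequence alignment, ClustalW-based multiple sequence alignment and phylogenetic tree construction, display of three-dimensional protein structures, RNAfold-based RNA secondary structure prediction, and sequence format conversion. A comparison with the commercial comprehensive bioinformatics packages DNASIS MAX 2.05, DNAStar 5.0, Vector NTI 9.1, and BioEdit 7.0 shows that BioSun 2.0 is easy to operate, rich in functions, and cost-effective, and that it can meet the routine needs of biomedical laboratories.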

 


3.

Mprobe 2.0: Computer-Aided Probe Design for Oligonucleotide Microarray
Wuju Li, Xiaomin Ying
Applied Bioinformatics, 2006, 5:181-186

 

Abstract: DNA chips have proven to be effective tools in detecting gene expression levels. Compared with DNA chips using complementary DNA as probes, oligonucleotide microarrays using oligonucleotides as probes have attracted great attention because of their well known advantages. The design of gene-specific probes for each target is essential to the development of oligonucleotide microarrays. We have previously reported the development of a probe design software termed Mprobe 1.0. Here, we present a new version of this software, termed Mprobe 2.0. Several new features are included in Mprobe 2.0. Firstly, a paradox-based sequence database management system has been developed and integrated into the software, which consequently allows interoperability with sequences in GenBank, EMBL, and FASTA formats. Secondly, in contrast to setting a fixed threshold for the secondary structure of probes in Mprobe 1.0 and other related software, Mprobe 2.0 employs a different method. After parameters such as GC type, probe melting temperature and GC contents have been evaluated, candidate probes are sorted by the free energy from high to low value, followed by specificity analysis. Thirdly, Mprobe 2.0 provides users with substantial parameter options in the visual mode. Mprobe 2.0 possesses an easier interface for users to manage sequences annotated in different formats and design the optimal probes for oligonucleotide microarrays and other applications. AVAILABILITY: The program is free for non-commercial users and can be downloaded from the web page

 


 

2005

1.

How many genes are needed for early detection of breast cancer, based on gene expression patterns in peripheral blood cells?

 

Wuju Li

 

Breast Cancer Research, 2005, vol. 7 (5): E5.
 

Abstract: In their recent report [1], Sharma and coworkers explore the early detection of breast cancer. They analyzed a gene expression data set (1368 genes in 62 normal and 40 tumour samples, including sample duplication in different batches) using the nearest shrunken centroid method. They identified a panel of 37 genes that permitted early detection, with a classification accuracy of about 82%. This is a typical problem in sample classification based on gene expression profiling. The objective is to achieve high prediction accuracy with as few genes as possible, so feature selection plays an important role; examining a large number of genes increases the dimensionality, computational complexity, and clinical cost. Based on our previous study of data sets from patients with colon cancer, leukaemia and breast cancer [2], we estimated that five or six genes, rather than 37, would be sufficient for the early detection of breast cancer [1]. So how many genes are actually needed? To address this question, we evaluated the data presented by Sharma and coworkers using the Tclass system [2].

In the Tclass system, Fisher's linear discriminant analysis and a step-wise optimization procedure for feature selection are used to analyze a batch-adjusted data set [1] in two ways. The first takes the prediction accuracy on the training set as the objective function. The second takes the classification accuracy from leave-one-out cross-validation as the objective function. For the former, the selected optimal feature sets are evaluated by randomly dividing all tissue samples into a training set (e.g. 50%, 67%, or 85% of samples) and a test set 200 times. The relationship between the prediction accuracy and the number of genes is illustrated in Fig. 1, which shows that the greatest prediction accuracy was achieved using six genes (Fig. 1a); other peaks in accuracy occurred when 10, 13, or 15 genes were used (Fig. 1b). Furthermore, two genes, the 481st (BC009696) and the 801st (BC000514), permitted classification accuracy as high as 86%, which is greater than the 82% achieved by Sharma and coworkers [1] with the selected 37 genes.

 


2.

An approach to studying lung cancer-related proteins in human blood

 

Ting Xiao, Wantao Ying, Lei Li, Zhi Hu, Ying Ma, Liyan Jiao, Jinfang Ma, Yun Cai, Dongmei Lin, Suping Guo, Naijun Han, Xuebing Di, Min Li, Dechao Zhang, Kai Su, Jinsong Yuan, Hongwei Zheng, Meixia Gao, Jie He, Susheng Shi, Wuju Li, Ningzhi Xu, Husheng Zhang, Yan Liu, Kaitai Zhang, Yanning Gao, Xiaohong Qian, and Shujun Cheng

 

Molecular & Cellular Proteomics, 2005, published online.

 

Abstract: Early-stage lung cancer detection is the first step towards successful clinical therapy and increased patient survival. Clinicians monitor cancer progression by profiling tumor cell proteins in the blood plasma of afflicted patients. Blood plasma, however, is a difficult medium for assessing cancer proteins, because it is rich in albumins and heterogeneous protein species. We report herein a method to detect the proteins released into the circulatory system by tumor cells. Initially, we analyzed the protein components in the conditioned medium (CM) of lung cancer primary cell or organ cultures, and in the adjacent normal bronchus, using 1-D PAGE and nano-ESI-MS/MS. We identified 299 proteins involved in key cellular processes such as cell growth, organogenesis and signal transduction. We selected 13 interesting proteins from this list and analyzed them in 628 blood plasma samples using ELISA. We detected 11 of these 13 proteins in the plasma of lung cancer patients and non-patient controls. Our results showed that plasma MMP1 levels were elevated significantly in late-stage lung cancer patients, and that the plasma levels of 14-3-3 sigma, beta and eta in the lung cancer patients were significantly lower than those in the control subjects. To our knowledge, this is the first time that fascin, ezrin, CD98, annexin A4, 14-3-3 sigma, 14-3-3 beta and 14-3-3 eta proteins have been detected in human plasma by ELISA. The preliminary results showed that a combination of CD98, fascin, PIGR/SC and 14-3-3 eta had a higher sensitivity and specificity than any single marker. In conclusion, we report a method to detect proteins released into blood by lung cancer. This pilot approach may lead to the identification of novel protein markers in blood and provide a new method of identifying tumor biomarker profiles for guiding both early detection and therapy of human cancer.

 


 

 

2004

1.

Genome Class Prediction Based on Amino Acid Composition (AAC) from Proteomes

 

Wuju Li, Tao Liu, Xiaomin Ying, and Ming Fan

 

Molecular & Cellular Proteomics, 2004, vol.3 (10): S79.

 

Abstract: With genomic sequences from the three domains of life becoming increasingly available, the relationships between the AAC and genome classes (organisms' phenotypes) have been widely studied in two respects. The first concentrates on differences in the AAC of particular types of proteins, or of whole proteomes, among genome classes. The second studies genome class prediction based on the AAC. The purpose of both is to explain why certain organisms can live in extreme conditions of temperature, salinity, or pressure. Here we ask whether genome classes can be predicted as accurately as possible using small subsets of amino acids. To investigate the issue systematically, Fisher linear discriminant analysis (FLDA) was applied to the following four data sets: DOMAIN, LIFE, HTHAB, and ARCHAEA. DOMAIN covers the three domains of life (16 archaeal, 75 bacterial, and 6 eukaryotic genomes). LIFE covers three lifestyles (13 HTH, 4 TH, and 79 MES). HTHAB includes 10 HTH in archaea and 3 HTH in bacteria. ARCHAEA covers the three lifestyles in archaea (10 HTH, 3 TH, and 3 MES). Using feature selection over all possible combinations of features (amino acids), we found that the cross-validation accuracies for the four data sets could reach 94.8%, 97.9%, 100.0%, and 100.0% using the compositions of only four (A, I, K, and Q), five (I, K, P, V, and Y), two (E and Q), and two (M and Q) amino acids, respectively. The average cross-validation accuracy reaches 98.2%. Therefore, the AAC of proteomes provides an alternative way to determine genome classes such as lifestyle or domain of life.
To our knowledge, correspondence analysis, principal component analysis (PCA), and hierarchical cluster analysis have been applied to study the distinction between genome classes using the AAC, but classification methods have not. Our work therefore represents a first attempt in this direction.
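As a minimal sketch of the feature this abstract relies on, the amino acid composition (AAC) of a proteome can be computed as the fraction of each of the 20 standard residues over all proteins. The function name `amino_acid_composition` and the choice to ignore non-standard letters are illustrative assumptions; the abstract does not say how such characters were handled.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(proteins):
    """Return the fraction of each standard amino acid over a list of protein sequences.

    Hypothetical helper: letters outside the 20 standard codes are skipped.
    """
    counts = Counter()
    for protein in proteins:
        counts.update(c for c in protein.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
```

A small feature subset such as (A, I, K, Q) can then be read out of the returned dictionary and passed to a discriminant classifier.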

 


   

2.

RDfolder: a web server for prediction of RNA secondary structure

 

Xiaomin Ying, Hong Luo, Jingchu Luo and Wuju Li

 

Nucleic Acids Research, 2004, vol.32: W150-W153.

 

Abstract: Prediction of RNA secondary structure is important in the functional analysis of RNA molecules. The RDfolder web server described in this paper provides two methods for prediction of RNA secondary structure: random stacking of helical regions and helical regions distribution. The random stacking method predicts secondary structure by Monte Carlo simulations. The method of helical regions distribution predicts secondary structure based on the helices that appear most frequently in the set of structures, which are generated by the random stacking method. The RDfolder web server can be accessed at http://rna.cbi.pku.edu.cn.

 


 

 

3.

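BioSun: a software system for computer-aided design of molecular biology experiments

Wuju Li, Xiaomin Ying

Bulletin of the Academy of Military Medical Sciences, 2004, vol. 28(5): 401-404

Abstract: We describe BioSun, a bioinformatics software system that we researched and developed in-house for computer-aided design of molecular biology experiments; it runs under Windows. Its main functions include: visual sequence editing; a database management system that accepts multiple sequence formats (EMBL, GenBank, and FASTA); several modes of sequence comparison; several methods of epitope prediction; RNA secondary structure prediction based on multiple algorithms; restriction site analysis and restriction map construction; computer-aided PCR experiment design; probe design for oligonucleotide microarrays; primer design for cDNA microarrays; and design for high-level expression of foreign genes in prokaryotic systems. BioSun uses a graphical user interface, allows flexible management of graphics and text files, and is flexible to operate and versatile. It can be used for computer-aided design of molecular biology experiments and is of considerable value for speeding up experiments and raising their success rate.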

 

 

2003

1.

SamCluster: An integrated scheme for automatic discovery of sample classes using gene expression profile

 

Wuju Li, Ming Fan and Momiao Xiong

 

Bioinformatics, 2003, vol.19: 811-817

 

Motivation: Feature (gene) selection can dramatically improve the accuracy of gene expression profile based sample class prediction. Many statistical methods for feature (gene) selection such as stepwise optimization and Monte Carlo simulation have been developed for tissue sample classification. In contrast to class prediction, few statistical and computational methods for feature selection have been applied to clustering algorithms for pattern discovery.
Results: An integrated scheme and a corresponding program, SamCluster, for automatic discovery of sample classes based on gene expression profile are presented in this report. The scheme incorporates feature selection algorithms based on the CV (coefficient of variation) and the t-test into hierarchical clustering and proceeds as follows. First, the genes whose CV is greater than a pre-specified threshold are selected for cluster analysis, which results in two putative sample classes. Then, genes that are significantly differentially expressed between the two putative sample classes, with p-values of 0.01, 0.05, or 0.1 from the t-test, are selected for further cluster analysis. These steps are iterated until two stable sample classes are found. Finally, the consensus sample classes are constructed from the putative classes derived from the different CV thresholds, and the best putative sample classes, those with the minimum distance between the consensus classes and the putative classes, are identified. To evaluate the performance of feature selection for cluster analysis, the proposed scheme was applied to four expression datasets: COLON, LEUKEMIA72, LEUKEMIA38, and OVARIAN. The results show that only 5, 1, 0, and 0 samples were misclassified, respectively. We conclude that the proposed scheme, SamCluster, is an efficient method for discovery of sample classes using gene expression profile.
Availability: The related program SamCluster is available upon request or from the web page http://www.sph.uth.tmc.edu:8052/hgc/Downloads.asp
or http://www.biosun.com.cn/softwares/samcluater.html
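The first step of the SamCluster scheme described above, keeping genes whose coefficient of variation (CV) exceeds a threshold, might look roughly like the Python sketch below. The function name `select_by_cv`, the dictionary input layout, and the use of the sample standard deviation are assumptions for illustration; the abstract does not specify these details, and the later t-test and clustering steps are not shown.

```python
import statistics


def select_by_cv(expression, threshold):
    """Return the genes whose coefficient of variation exceeds `threshold`.

    expression: dict mapping gene name -> list of expression values across samples.
    CV is computed here as sample standard deviation / |mean| (an assumption).
    """
    selected = []
    for gene, values in expression.items():
        mean = statistics.mean(values)
        if mean == 0:
            continue  # CV is undefined for a zero mean; skip such genes
        cv = statistics.stdev(values) / abs(mean)
        if cv > threshold:
            selected.append(gene)
    return selected
```

With `{"g1": [10, 10, 10, 10], "g2": [1, 5, 9, 13]}` and a threshold of 0.5, only `g2` is kept: `g1` has a CV of 0, while `g2` has a CV of about 0.74.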

 


 

 

2.

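Prediction of antigenic epitopes of the SARS virus

Wuju Li, Tao Liu

Medical Journal of Chinese People's Liberation Army, 2003, vol. 28(6): S9-S10

Abstract: [Objective] To predict the antigenic epitopes of two SARS virus membrane proteins, S and M, using a comprehensive epitope prediction method that combines Hopp & Woods hydrophilicity, Janin surface accessibility, Karplus-Schulz backbone flexibility, and charge distribution, together with protein secondary structure prediction, in order to provide a basis for SARS vaccine design. [Results] Analysis of the S and M membrane proteins of the SARS virus with Goldkey and other software yielded 14 and 7 putative epitopes, respectively.

Note: The relevant functions of Goldkey have been integrated into our latest software, BioSun.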
   

3.

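Phylogenetic analysis of a novel coronavirus, the probable pathogen of infectious atypical pneumonia

Tao Liu, Wuju Li, Ming Fan

Medical Journal of Chinese People's Liberation Army, 2003, vol. 28(6): S1-S5

Abstract: Since March 2003, a novel coronavirus (SARS-CoV) has been preliminarily identified as the pathogen of severe acute respiratory syndrome (SARS), a lethal infectious disease that broke out at the end of 2002. The virus has the typical genome organization of other known coronaviruses. Phylogenetic analysis of the virus can guide further experimental research. We first constructed a phylogenetic tree of SARS-CoV at the whole-genome level to establish its evolutionary position, and then analyzed, at both the nucleic acid and protein levels, the phylogenetic trees of five major homologous proteins of SARS-CoV: the replicase and the S, E, M, and N proteins. The results show that SARS-CoV is homologous to the currently known coronaviruses but has a feature that clearly distinguishes it from them: the evolutionary histories of its homologous genes differ from one another, and the evolutionary history of the structural protein genes differs from that of the genome. SARS-CoV is relatively closely related to IBV and TGV, especially IBV; the particularly close relationship at the level of the E and M proteins deserves attention and reference in further experimental research.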

   

4.

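High-level expression, purification, and identification of the M3-M4 loop gene fragment of the human NMDA receptor main subunit

Yumei Zhang, Changkai Sun, Ming Fan, Wuju Li, Shuhong Liu, Jie Zhao, Dayue Han, Jiaxi Wang

Chinese Journal of Biochemistry and Molecular Biology, 2003, vol. 19(5): 588-593

Abstract: The target fragment of the M3-M4 loop of the human N-methyl-D-aspartate (NMDA) receptor main subunit was obtained by genetic engineering for use as an immunogen in further studies of immunogenicity and related applications. Total RNA was extracted from human glioma tissue, and the gene fragment of the M3-M4 loop of the human NMDA receptor main subunit was amplified by RT-PCR and then optimized and modified according to the computer-aided prediction method based on the mathematical model of high-level expression of foreign genes in the prokaryotic expression vector pBV220. The target gene was cloned into pBV220 and transformed into Escherichia coli DH5α, and expression was induced by a temperature shift. Expression of the recombinant in E. coli was examined at the protein level; the product was purified by preparative SDS-PAGE and identified by relative molecular mass, immunoreactivity, and peptide mass fingerprinting. The results show that a prokaryotic expression vector for the M3-M4 loop of the human NMDA receptor main subunit (named pBV-NR1L3) was successfully constructed and that high-level expression was achieved through gene optimization. Gel scanning analysis showed that the expressed product accounted for about 29% of total bacterial protein, and the purity of the recombinant peptide was above 95%.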

 

 

2002

1.

Tclass: Tumor Classification System Based on Gene Expression Profile

 

Li Wuju and Xiong Momiao

 

Bioinformatics 2002, vol.18: 325-326

 

Summary: A method that incorporates feature selection into Fisher’s linear discriminant analysis for gene expression based tumor classification and a corresponding program Tclass were developed. The proposed method was applied to a public gene expression data set for colon cancer that consists of 22 normal and 40 tumor colon tissue samples to evaluate its performance for classification. Preliminary results demonstrated that using only a subset of genes ranging from 3 to 10 can achieve high classification accuracy.
Availability: The program is written in Matlab and is being rewritten in the Java language. The source code is available upon request.

 


 

 

2.

MProbe: computer aided probe design for oligonucleotide microarrays

 

Wuju Li, Jian Huang, Ming Fan, Shengqi Wang

 

Applied Bioinformatics, 2002, 1(3):163-166.

 

Abstract: The present work describes a complete probe design software system for oligonucleotide microarrays based on Kane’s research on probe sensitivity and specificity (Kane’s rule). Combining Kane’s rule and traditional criteria for probe design we constructed MProbe, the software system for oligonucleotide microarrays using Java. The general criteria for probe design are: (1) probes may have different lengths that range from 20 to 100 bases; (2) they should have a similar melting temperature (Tm) or GC content; (3) they should not contain stable secondary structures; and (4) they abide by Kane’s rule.
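Two of the general criteria listed above, probe length between 20 and 100 bases and similar GC content, can be sketched in a few lines of Python. The function names and the default `target_gc` and `tolerance` values are hypothetical choices for illustration; the melting temperature, secondary structure, and Kane's rule checks that MProbe also applies are not modeled here.

```python
def gc_content(probe):
    """Return the fraction of G and C bases in a non-empty probe sequence."""
    probe = probe.upper()
    return sum(base in "GC" for base in probe) / len(probe)


def filter_probes(candidates, target_gc=0.5, tolerance=0.1, min_len=20, max_len=100):
    """Keep candidates of allowed length whose GC content is near `target_gc`.

    target_gc and tolerance are illustrative defaults, not values from the paper.
    """
    return [
        p for p in candidates
        if min_len <= len(p) <= max_len
        and abs(gc_content(p) - target_gc) <= tolerance
    ]
```

For instance, given a 20-base probe with 50% GC, a 20-base poly-A sequence, and a 10-base probe, only the first passes the default filter.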

 

 

3.

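Bioinformatics of gene expression profiles

Wuju Li

Bulletin of the Academy of Military Medical Sciences, 2002, vol. 26(1): 73-76.

Abstract: DNA microarray technology is another major biotechnology following recombinant DNA technology and PCR amplification. Microarray experiments make it possible to observe simultaneously the dynamic expression levels of thousands of genes in a given biological phenomenon. Compared with the earlier research paradigm of studying the expression of single genes, this will profoundly change the thinking of molecular biologists, allowing life phenomena and their nature to be studied systematically and globally at the genome level. Microarray technology has already been applied to tumor subtyping, tumor classification, gene function studies, construction of gene regulatory networks, drug target identification, and many other areas. In essence, however, what a microarray experiment yields directly is a gene expression profile (that is, a gene expression matrix in which rows represent genes and columns represent experimental samples), and practical applications of microarrays are realized through bioinformatics processing of this matrix. Bioinformatics is therefore an extremely important part of microarray-based molecular biology research. This article reviews the bioinformatics methods related to gene expression profiles.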

 

 

4.

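Physicochemical properties and antigenicity of receptor activation-related polypeptides of the human N-methyl-D-aspartate receptor main subunit

Changkai Sun, Jie Zhao, Wuju Li, Jiannan Feng, Shuhong Liu, et al.

National Medical Journal of China, 2002, vol. 82(1): 50-53

Abstract. Objective: To analyze the antigenicity and physicochemical properties of two receptor activation-related polypeptides, P1 and P2, on NR1a, the main subunit of the human N-methyl-D-aspartate receptor (NMDAR). Methods: The amino acid sequence of human NR1a was retrieved from a protein database with the GOLDKEY software. Polypeptide fragments P1 (151 amino acids, taken backward from the first transmembrane domain) and P2 (144 amino acids, taken forward from the third transmembrane domain) were extracted. Multi-parameter analysis was performed using Hopp & Woods and Kyte hydrophilicity, Janin surface accessibility, Karplus-Schulz backbone flexibility, and Welling antigenicity. Amino acid sites and secondary structure features were compared using the Prosite program and the Chou-Fasman method. On this basis, the antigenic sites of P1 and P2 were determined and compared with existing experimental results. Results: P1 and P2 may contain 6 and 7 sequences, respectively, of 8 to 15 aa with good antigenicity. In P1 these sequences are located mainly at the amino terminus, far from the amino acid residues critical for ligand binding. In P2 they are distributed more evenly and either contain sites important for receptor activation or lie close to residues critical for ligand binding. The overall antigenicity, hydrophilicity, and accessibility of P2 are stronger than those of P1, most notably in its 15 membrane-proximal residues. Both P1 and P2 contain a number of β-turns, but P1 contains more cysteine residues and random coil, whereas P2 contains more aromatic residues and is predominantly α-helical. Conclusion: Both receptor activation-related polypeptides on the human NMDAR main subunit NR1a, P1 and P2, have a number of antigenic sites; compared with P1, P2 may be a more suitable molecular target for immune intervention against NMDAR.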

 

 

2001

1.

Feature (gene) selection in gene expression-based tumor classification

 

Xiong M, Li W, Zhao J, Jin L, Boerwinkle E.

 

Mol Genet Metab. 2001 vol.73(3):239-47.

 

Abstract: There is increasing interest in changing the emphasis of tumor classification from morphologic to molecular. Gene expression profiles may offer more information than morphology and provide an alternative to morphology-based tumor classification systems. Gene selection involves a search for gene subsets that are able to discriminate tumor tissue from normal tissue, and may have either clear biological interpretation or some implication in the molecular mechanism of the tumorigenesis. Gene selection is a fundamental issue in gene expression-based tumor classification. In the formation of a discriminant rule, the number of genes is large relative to the number of tissue samples. Too many genes can harm the performance of the tumor classification system and increase the cost as well. In this report, we discuss criteria and illustrate techniques for reducing the number of genes and selecting an optimal (or near optimal) subset of genes from an initial set of genes for tumor classification. The practical advantages of gene selection over other methods of reducing the dimensionality (e.g., principal components), include its simplicity, future cost savings, and higher likelihood of being adopted in a clinical setting. We analyze the expression profiles of 2000 genes in 22 normal and 40 colon tumor tissues, 5776 sequences in 14 human mammary epithelial cells and 13 breast tumors, and 6817 genes in 47 acute lymphoblastic leukemia and 25 acute myeloid leukemia samples. Through these three examples, we show that using 2 or 3 genes can achieve more than 90% accuracy of classification. This result implies that after initial investigation of tumor classification using microarrays, a small number of selected genes may be used as biomarkers for tumor classification, or may have some relevance in tumor development and serve as a potential drug target. 
In this report we also show that stepwise Fisher's linear discriminant function is a practicable method for gene expression-based tumor classification.

 

 

2000

1.

Computational methods for gene expression-based tumor classification

 

Xiong M, Jin L, Li W, Boerwinkle E.

 

Biotechniques. 2000 vol.29(6):1264-8,1270.

 

Abstract: Gene expression profiles may offer more or additional information than classic morphologic- and histologic-based tumor classification systems. Because the number of tissue samples examined is usually much smaller than the number of genes examined, efficient data reduction and analysis methods are critical. In this report, we propose a principal component and discriminant analysis method of tumor classification using gene expression profile data. Expression of 2000 genes in 40 tumor and 22 normal colon tissue samples is used to examine the feasibility of gene expression-based tumor classification systems. Using this method, the percentage of correctly classified normal and tumor tissue was 87.0%. The combined approach using principal components and discriminant analysis provided superior sensitivity and specificity compared to an approach using simple differences in the expression levels of individual genes.

 

 

1998

1.

GeneDn: for high-level expression design of heterologous genes in a prokaryotic system

 

Li Wu Ju, Lei Hong Xing, Pei Wu Hong and Wu Jia Jin

 

Bioinformatics 1998, vol.14: 884-885.

 

RESULTS: Based on the mathematical model of high-level expression of heterologous genes in the prokaryotic vector pBV220, we developed a program, GeneDn, for high-level expression design of natural and synthetic genes. AVAILABILITY: The program is written in Turbo Pascal 7.0. The source code and related material are available upon request.

 


 

 

2.

Prediction of RNA secondary structure based on helical regions distribution

 

Li Wuju and Wu Jiajin

 

Bioinformatics 1998, vol.14: 700-706.

 

MOTIVATION: RNAs play an important role in many biological processes and knowing their structure is important in understanding their function. Due to difficulties in the experimental determination of RNA secondary structure, the methods of theoretical prediction for known sequences are often used. Although many different algorithms for such predictions have been developed, this problem has not yet been solved. It is thus necessary to develop new methods for predicting RNA secondary structure. The most-used at present is Zuker's algorithm which can be used to determine the minimum free energy secondary structure. However many RNA secondary structures verified by experiments are not consistent with the minimum free energy secondary structures. In order to solve this problem, a method used to search a group of secondary structures whose free energy is close to the global minimum free energy was developed by Zuker in 1989. When considering a group of secondary structures, if there is no experimental data, we cannot tell which one is better than the others. This case also occurs in combinatorial and heuristic methods. These two kinds of methods have several weaknesses. Here we show how the central limit theorem can be used to solve these problems.

RESULTS: An algorithm for predicting RNA secondary structure based on helical regions distribution is presented, which can be used to find the most probable secondary structure for a given RNA sequence. It consists of three steps. First, list all possible helical regions. Second, according to central limit theorem, estimate the occurrence probability of every helical region based on the Monte Carlo simulation. Third, add the helical region with the biggest probability to the current structure and eliminate the helical regions incompatible with the current structure. The above processes can be repeated until no more helical regions can be added. Take the current structure as the final RNA secondary structure. In order to demonstrate the confidence of the program, a test on three RNA sequences: tRNAPhe, Pre-tRNATyr, and Tetrahymena ribosomal RNA intervening sequence, is performed.
AVAILABILITY:
The program is written in Turbo Pascal 7.0. The source code is available upon request.
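The first step of the algorithm described above, listing all possible helical regions, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' Turbo Pascal implementation: the function name, the minimum helix length of 3 base pairs, the minimum hairpin loop of 3 unpaired bases, and the inclusion of G-U wobble pairs are all choices made here. The Monte Carlo probability estimate and the greedy helix selection are not shown.

```python
# Watson-Crick pairs plus G-U wobble pairs (including wobble is an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def helical_regions(seq, min_len=3, min_loop=3):
    """List maximal helices as (i, j, length), 0-based.

    Bases i, i+1, ..., i+length-1 pair with j, j-1, ..., j-length+1.
    """
    seq = seq.upper().replace("T", "U")
    n = len(seq)

    def can_pair(i, j):
        # Require at least `min_loop` unpaired bases between the two positions.
        return j - i - 1 >= min_loop and (seq[i], seq[j]) in PAIRS

    regions = []
    for i in range(n):
        for j in range(i + 1, n):
            if not can_pair(i, j):
                continue
            # Start only at the outermost pair so each stack is reported once.
            if i > 0 and j < n - 1 and can_pair(i - 1, j + 1):
                continue
            length = 1
            while can_pair(i + length, j - length):
                length += 1
            if length >= min_len:
                regions.append((i, j, length))
    return regions
```

For the toy hairpin `GGGAAAACCC`, the only helix of at least three stacked pairs is the one that pairs the three G bases with the three C bases, reported as `(0, 9, 3)`.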

 


 

 

1997

1.

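Quantitative analysis of expression levels of foreign genes in the pBV220 vector

Wuju Li, Jiajin Wu

Chinese Journal of Virology, 1997, vol. 13: 126-133.

Abstract: Using sequence analysis techniques including RNA secondary structure prediction based on random stacking of helical regions and codon bias calculation, we analyzed the expression levels of 22 foreign genes carried in the pBV220 vector, including human interleukin-2 and human interleukin-4. The results show that the secondary-structure free energies of the -30 to 39 region at the 5' end and the 30 to -39 region at the 3' end are statistically significantly related to expression level, followed by the local codon bias of the 9 bp at the 3' end; the number of bases between the SD sequence and the start codon ATG, within the range 8±3, has no significant relationship with expression level. In addition, a discriminant function was constructed by discriminant analysis, with a concordance rate as high as 95.5%.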

 



2005 Center of Computational Biology, Beijing Institute of Basic Medical Sciences. All rights reserved.

Last revised on December 28, 2007