Background: We developed a cell-cycle scoring approach that uses expression data to compute, for every cell, an index that scores the cell according to its expression of cell-cycle genes. In brief, our approach proceeded through four steps. (A) We reduced the dimensionality of the dataset to the cell-cycle-relevant genes. (B) In this subspace we performed, as a first approximation, a simple K-means clustering to separate non-cycling from cycling cells. (C) We used this clustering as a reference to learn a function that takes the gene expression as input and returns a cell-cycle score as output. (D) We used this function to calculate a score for each single cell.








We started by selecting a broad set of genes related to cell cycle and proliferation. We used the PANTHER GO database and selected all the genes that were described by one of the following terms: DNA metabolic process, DNA replication, mitosis, regulation of cell cycle, cell cycle, cytokinesis, histone, DNA-directed DNA polymerase, DNA polymerase processivity factor, centromere DNA-binding protein. We restricted our features to those genes. Genes that were detected at fewer than 10 molecules in the dataset were removed. We calculated the pairwise correlation coefficient matrix and selected the genes that were strongly correlated (99th percentile of the matrix) with at least 12 other genes.

The genes passing the filters described above were used for clustering cells using K-means (Python scikit-learn implementation, on log-centered data, default parameters), with the rationale that the main axis of variation would be expected to span dividing and non-dividing cells.

We then fitted a linear regression model with L1-norm regularization that takes the expression data of a cell as input and is trained against two classes: 1 when the cell belongs to the cycling cluster and 0 when it does not. Importantly, to avoid overfitting the score to the first-approximation clusters and to obtain a more generalizable model, we used a strong regularization (5 times the value determined by cross-validation; alpha = 0.01).
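The gene filter and the first-pass clustering described above can be sketched roughly as follows. This is only an illustration on synthetic counts, not the authors' code: the function names, the use of raw counts for the correlation matrix, `log2(count + 1)` as the log transform, `n_clusters=2`, and the rule "the cluster with the highest mean expression is the cycling one" are all assumptions that the text does not specify.

```python
import numpy as np
from sklearn.cluster import KMeans


def select_correlated_genes(counts, min_molecules=10, pct=99, min_partners=12):
    """Return column indices of genes passing the detection and correlation filters.

    counts: cells x genes array of molecule counts (already restricted to GO-selected genes).
    """
    counts = np.asarray(counts, dtype=float)
    # Drop genes detected at fewer than `min_molecules` molecules in the whole dataset.
    idx = np.flatnonzero(counts.sum(axis=0) >= min_molecules)
    # Constant genes have no defined correlation, so drop them as well.
    idx = idx[counts[:, idx].std(axis=0) > 0]
    corr = np.corrcoef(counts[:, idx], rowvar=False)
    np.fill_diagonal(corr, np.nan)        # ignore self-correlation
    cutoff = np.nanpercentile(corr, pct)  # e.g. 99th percentile of the matrix
    with np.errstate(invalid="ignore"):
        partners = (corr >= cutoff).sum(axis=0)
    return idx[partners >= min_partners]


def first_pass_labels(counts, n_clusters=2, seed=0):
    """K-means on log-transformed, gene-centered data; 1 marks the assumed cycling cluster."""
    x = np.log2(np.asarray(counts, dtype=float) + 1.0)
    x -= x.mean(axis=0)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(x)
    # Assumption: the cluster with the highest mean expression is the cycling one.
    means = [x[clusters == k].mean() for k in range(n_clusters)]
    return (clusters == int(np.argmax(means))).astype(int)


# Synthetic example: 15 co-expressed "cycling" genes, 185 noise genes, 5 rarely detected genes.
rng = np.random.default_rng(0)
n_cells = 300
is_cycling = rng.random(n_cells) < 0.3
module = rng.poisson(np.where(is_cycling[:, None], 20.0, 1.0), size=(n_cells, 15))
noise = rng.poisson(1.0, size=(n_cells, 185))
rare = rng.poisson(0.005, size=(n_cells, 5))
counts = np.hstack([module, noise, rare])

kept = select_correlated_genes(counts)
labels = first_pass_labels(counts[:, kept])
print(len(kept), "genes kept;", int(labels.sum()), "cells labelled as cycling")
```

On real data the percentile and the minimum number of partners interact with the size of the gene list, so the thresholds quoted in the text may need re-checking for a different dataset.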

This procedure was used for both the mouse and human embryonic datasets. The function learnt on the human embryonic dataset was also used to determine the proliferation index of the hPSCs.


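1. First, select the cell-cycle-related genes from the PANTHER GO database;

2. Compute the pairwise correlations between genes and drop the genes that stand alone (those not strongly correlated with enough other genes);

3. Run K-means clustering into three clusters to obtain the training data;

4. Fit a linear regression model with L1-norm regularization; to prevent overfitting, the regularization parameter is set fairly strictly.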
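Step 4 can be sketched with scikit-learn's `LassoCV` and `Lasso`. This is a minimal illustration under assumptions: the helper name `fit_cycle_score`, the 5-fold cross-validation, and the synthetic data (with about 5% of the labels flipped to mimic imperfect first-pass clusters) are mine; only the idea of multiplying the cross-validated alpha by 5 comes from the quoted method.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV


def fit_cycle_score(x, labels, strength=5.0):
    """Fit an L1-regularized linear regression on approximate 0/1 cluster labels.

    The final alpha is `strength` times the alpha chosen by cross-validation.
    """
    cv_alpha = LassoCV(cv=5).fit(x, labels).alpha_
    model = Lasso(alpha=strength * cv_alpha).fit(x, labels)
    return model, cv_alpha


# Synthetic log-centered expression: 15 informative genes, 25 noise genes.
rng = np.random.default_rng(1)
n_cells = 300
is_cycling = rng.random(n_cells) < 0.3
module = rng.poisson(np.where(is_cycling[:, None], 20.0, 1.0), size=(n_cells, 15))
noise = rng.poisson(1.0, size=(n_cells, 25))
x = np.log2(np.hstack([module, noise]) + 1.0)
x -= x.mean(axis=0)

# Mimic imperfect first-pass cluster labels by flipping about 5% of them.
flip = rng.random(n_cells) < 0.05
labels = np.where(flip, ~is_cycling, is_cycling).astype(float)

model, cv_alpha = fit_cycle_score(x, labels)
scores = model.predict(x)  # continuous cell-cycle score per cell
print("genes with non-zero weight:", np.count_nonzero(model.coef_))
```

Because the score is the output of a linear model, it is continuous rather than 0/1, and the heavier penalty tends to leave fewer genes with non-zero weights than the cross-validated alpha would.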


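If you want ground truth, you need an experimentally more rigorous data source, such as gene expression data from highly proliferative cells and from cells that do not proliferate at all.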





The core problem is how to choose a suitable gene list! For some indices it is hard to pick a suitable gene list.


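1. Why reuse the data? Why not do everything in one pass? K-means can already separate the cells, so why use lasso as well?

2. K-means can also produce a score, based on the distance to the cluster centers.

3. The core problem of this method is that there is no labelled training set. If there were one, a single lasso fit would solve the problem directly.

4. Actually, I think this is a workable approach: K-means can overfit, so I use K-means only to obtain a rough training set, and then use lasso for fine-grained scoring.

5. The strengths of lasso are preventing overfitting and selecting variables.

6. There are also details, such as the choice of thresholds, that you have to test yourself!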
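For comparison, the distance-based K-means score from point 2 might look like the sketch below. It is an assumption-laden illustration, not part of the quoted method: it uses two clusters, takes the centroid with the higher mean expression as the cycling one, and defines the score as the difference between the distances to the two centroids, so that a threshold of 0 corresponds to the K-means assignment itself.

```python
import numpy as np
from sklearn.cluster import KMeans


def kmeans_distance_score(x, seed=0):
    """Score = distance to the non-cycling centroid minus distance to the cycling one."""
    km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(x)
    # Assumption: the centroid with the higher mean expression is the cycling cluster.
    cycling = int(np.argmax(km.cluster_centers_.mean(axis=1)))
    dist = km.transform(x)  # shape (n_cells, 2): distance of each cell to each centroid
    return dist[:, 1 - cycling] - dist[:, cycling]


# Synthetic log-centered expression: 15 informative genes, 25 noise genes.
rng = np.random.default_rng(2)
n_cells = 300
is_cycling = rng.random(n_cells) < 0.3
module = rng.poisson(np.where(is_cycling[:, None], 20.0, 1.0), size=(n_cells, 15))
noise = rng.poisson(1.0, size=(n_cells, 25))
x = np.log2(np.hstack([module, noise]) + 1.0)
x -= x.mean(axis=0)

score = kmeans_distance_score(x)
called_cycling = score > 0  # threshold 0 reproduces the K-means assignment
print("cells called cycling:", int(called_cycling.sum()))
```

Unlike the lasso score, this one weights every gene equally and does no variable selection, which is one way to read the argument in points 4 and 5.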



