导图社区 R资料

R资料

学习R的有用笔记，包括创建数据集、图形初阶、基本数据管理、高级数据管理、基本图形、基本统计分析等等。

编辑于2022-09-14 16:56:09 重庆

编程
计算机生…
图形初阶
R资料

水无月空

他的近期作品查看更多>>

缺氧性脑损伤
有关缺血性脑损伤的PPT介绍，缺氧性脑损伤的常见病因包括窒息、严重心律失常、心搏停止、呼吸衰竭、一氧化碳中毒等，这些病因都会导致脑氧供不足，进而引发脑损伤。
血脑屏障
血脑屏障思维导图，血脑屏障（Blood-Brain Barrier，BBB）是指脑毛细血管壁与神经胶质细胞形成的血浆与脑细胞之间的屏障，以及由脉络丛形成的血浆和脑脊液之间的屏障。
选修必修一物理简谐运动
高二物理选修3-4第一章简谐运动全部知识点总结,必备复习资料分享,适用于考前复习,也可以综合其他资料使用。

R资料

社区模板帮助中心，点此进入>>

水无月空

他的近期作品查看更多>>

相似推荐
大纲

互联网9大思维
- 34.2k
- 925
- 2.4k
- 393
MindMaster
组织架构-单商户商城webAPP 思维导图。
- 14.7k
- 3
- 185
- 9
Kacyun
域控上线
- 1.6k
- 163
- 11
- 4
jackrao
python思维导图
- 5.4k
- 531
- 242
- 7
(*^▽^*)
css
- 1.2k
- 1
- 43
- 3
A张舫
CSS
- 3.3k
- 262
- 188
- 33
journey
计算机操作系统思维导图
- 4.2k
- 339
- 204
- 18
journey
计算机组成原理
- 1.5k
- 98
- 70
- 8
journey
IMX6UL(A7)
- 519
- 41
- 5
- 0
Handler XU
考试学情分析系统
- 690
- 50
- 10
- 1
蒋龙

2019.8.24 联川生物培训

R 基础

基础

WINDOWS批量改文件名

方法：https://zhuanlan.zhihu.com/p/502013912?utm_id=0 https://zhuanlan.zhihu.com/p/449741548(TXT要改成ANSI编码才有用）

关联Endnote到Word（Endnote文件夹下configure endnote→一路点，add to word即可

设置环境

快捷键

ctrl+L 清屏

alt+ - 快捷输入<-

函数

getwd()

setwd()

runif() 生成随机数

x <- runif(10000000,min = 1,max = 100)

官网浏览

Taskviews

包

install.packages()

require(vcd) -返回1表示调用包成功，编程开发用

update.packages("vcd")

remove.packages()

detach("packages")

(.packages())

显示目前有哪些包

.libPaths()

包地址

帮助

help

？

help( package="ggplot2" )

Bioconductor

work flows

包含了生物学需要的各个工作流程

install.packages("BiocManager")

BiocManager::install()

chooseBioCmirror()

选择镜像

包排名

http://bioconductor.org/packages/stats/

文件操作

R适合处理结构化数据，不适合处理非结构化数据

CSV格式

，分隔

tsv格式

Tab分隔

读取文件

路径，文件名，分隔符

“”

“，”，“\t”（TAB）

header=T,F

row.names,数字，设置哪一列为行名

read.table(file=,sep="",header=,rownames=)，header默认F，rownames默认0

read.csv(na.string=""

规定na数据显示格式

skip=numeric,（跳过前几行）

nrow=,读取文件部分行

ncol

head(x)

显示前几行

head(x,numeric)

提取前若干行

table,csv,txt

csv默认有分隔符，可以不加sep

csv2

欧洲国家用；分隔，处理适用

openxlsx,xlsx包

library(xlsx) Error: package or namespace load failed for ‘xlsx’: loadNamespace()里算'rJava'时.onLoad失败了，详细内容：调用: fun(libname, pkgname) 错误: JAVA_HOME cannot be determined from the Registry

待解决

z<-read.xlsx("",sheet=numeric)

""打了后可快速补齐文件名

学习笔记

https://bookdown.org/zyf19940501/Rbook/data-openxlsx.html

xlsx

读取复杂表格

https://paul.rbind.io/2019/02/01/tidying-multi-header-excel-data-with-r/

read_excel读有多个ROW表头的表格

https://readxl.tidyverse.org/articles/multiple-header-rows.html

https://www.w3.org/WAI/tutorials/tables/multi-level/

重命名文件

# Define working directory working_directory <- "working/dic/" # get list of current file # names as character vector current_files current_files <- list.files(working_directory) # get file names after renaming # as character vector new_files new_files <- c("geeks1.txt","geeks2.txt") # Use file.rename() function to rename files file.rename(paste0(working_directory, current_files), paste0(working_directory, new_files))

https://blog.csdn.net/sinat_26917383/article/details/51100736

写文件

ls(）

列出环境中有的文件

write.table

write.table(x,file="mmm.csv",sep=",")

读写R文件

.RDS./RData

RDS-single database

RData-Multiple database

处理大数据时，R文件较小

saveRDS(iris,file="iris.RDS")

R配置

.Rprofile

file.edit("~/.Rprofile")

options(CRANS=)

library(,quietly=T)

每次就会自动加载

quietly禁止显示加载信息

#options(prompt="$")

改变分隔符

#options(prompt="$") #options (continue="+") #.libPaths("C:/Users/Think/Documents/R/win-library/3.5/") options(CRANs="https://mirror.lzu.edu.cn/CRAN/") options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/") .First <- function() { # library(BioCManager,quietly = TRUE) cat ("\nWelcome at",date(),"\n") } .Last <- function() { cat ("\nGoodbye at ",date(),"\n") }

平时可以用记事本编辑.Rprofile

Rstudio

https://blog.csdn.net/weixin_55500638/article/details/121635557

数据结构

类型

数字

字符串

逻辑值

判断数据类型

class(variable）

去掉某些数据

rivers[c(-2,-6,-100)]

选取数据

rivers[seq(1,141,by=2)]

rivers[(c(T,F)]

匹配选取

高效

rivers[rivers>500]

rivers[order(rivers)]

缺失值

处理数据包vim

范例数据集sleep

概要

用斗地主抽牌练习提取数

type <- c("red","spades","cube","plum")

amount <- c("A",2:10,"J","Q","K")

a<-expand.grid(amount,type)

m= paste(a$Var1,a$Var2,sep="-")

m<-c(m,c("black joker","red-joker"))

shuffe<-sample(m,54,replace=F)

dipai<-shuffe[c(52,53,54)]

shuffe<-shuffe[-c(52,53,54)]

A<-shuffe[c(T,F,F)] B<-shuffe[c(F,T,F)] C<-shuffe[c(F,F,T)]

数据索引与操作

加入信息

cbind

rbind

数学检验

vcd包

Arthritis数据包，类风湿

RNASEQ

表达矩阵+样本信息

读入R

转换DESeq类

过滤掉未表达

评估组内组间差别

DESeq()

筛选

可视化

基因功能注释

WGCNA

加权重共表达

计算相关性矩阵-邻接矩阵-拓扑矩阵

R in action

Chapter2 创建数据集

数据结构

向量

创建

a<-c(1,2,3,6）

访问

a[c(2,4)]

a[3]

a[2:6]

矩阵

创建

matrix<-matrix(vector, nrow=, ncol=,byrow=logical, dimnames=list(char_rownames，char_colnames)

byrow=T 按行填充，默认为按列填充

y<-matrix(1:20,nrow=5,ncol=4)

y<-matrix(cells,nrow=2,ncol=2,byrow=T,dinmaes=list(rnames,cnames))

访问

y[2,]

y[,2]

y[1,4]

y[1,c(2,3)]

state.x77矩阵数据

rownames,colnames()

t(x)

转置矩阵

数组

创建

z<-array(vector,dimensions,dimnames)

z<-array(1:24,c(2,3,4),dimnames=list(dim1,dim2,dim3))

访问

类似矩阵

数据框

创建

mydata<-data.frame(col1,col2....)

col1等列向量可为任何类型，名称可由names函数指定

访问

使用类似矩阵的下标

直接指定列名

$选取数据框中某个特定变量

attach（）

detach()

添加路径索引

r attach(what, pos = 2L, name = deparse(substitute(what)), backtick = FALSE, warn.conflicts = TRUE) what：要添加的数据框或列表。 pos：添加的路径存储的位置，一般默认即可。在对多个数据添加索引时，此位置会变成3L、4L、5L等。 name：通常不需要手动设置，由函数自动生成。 backtick：固定为FALSE，调用索引时会用到。 warn.conflicts：是否打印警告‌ detach()函数的作用 detach()函数用于撤销索引路径，即撤销对应位置的索引储存‌

with()

with括号内赋值仅在with内起效，除非用特殊赋值<<-

within()

不同于with，可以修改数据框

row.names=

指定数据框实例标识符

数据mtcars

取数据学习

x[x$N3==max(x$N3),]$N1

条件判定取数

因子

概念

特殊向量，以整数向量存储类别值

diabetes<-c("t1","t2"...)

diabetes<-factor(diabetes)

存为1,2,1,1

ordered=TRUE

有序排列

levels=vector

自定义排序

df$class <- factor(df$class,levels=c(2,4),labels=c("benign","malignant"))

格式

https://blog.csdn.net/weixin_46587777/article/details/124985611

factor(x,levels,labels,ordered=T/F)

计算因子水平数

nlevels()

计算因子数值水平

summary()

列表

创建

mylist<-list(object1,object2....)

解除列表为向量

unlist()

处理列表数据

https://blog.csdn.net/sinat_26917383/article/details/51123214

https://blog.51cto.com/u_16175473/11672760

类（set）

本质是列表

测序数据会遇到

Chapter.3 图形初阶

dev

dev.new()

dev.next()

dev.prev()

dev.set()

dev.off()

dev.cur

图形参数

符号，线条

pch,cex,lty,lwd

对pch(21-25)可指定col=,bg=（颜色，填充色）

颜色

文本

图形尺寸，边界尺寸

文本，自定义坐标轴，图例

标题

坐标轴

参考线

图例

文本标注

数学标注

图形组合

par()

mfrow=c(nrows,ncols)

nfcol=c(nrows,ncols)

按列填充矩阵

layout()

layout(matrix)

layout(matrix(c(1,1,2,3),2,2,byrow=TRUE))

laytou(...,widths=c(3,1),height=c(1,2))

精准调控大小

图形布局精细控制

par(fig=c(0,0-.8,0,0.8))

x1,x2,y1,y2

Chapter.4基本数据管理

创建新变量

算术运算符

^或**

求幂

x%%y

求模（x mod y)

x%/%y

整数除法

mydata$sumx<-mydata$x1+mydata$x2

attach mydata$sumx=x1+x2 detach(mydata)

mydata<-transform(mydata,sum=x1+x2)

变量的重编码

根据现有值创建新值，改变类型

连续变量---类别值

误编---正确值

分数线---创建表示及格，不及格

逻辑运算符

<,<=,>,>= ==,!=,!x(非x），x|y 或，x&y,isTRUE(x)

leadership$agecat[leadership$age > 75] <- "Elder" leadership$agecat[leadership$age >= 55 & leadership$age <= 75] <- "Middle Aged" leadership$agecat[leadership$age < 55] <- "Young"

leadership <- within(leadership,{ agecat <- NA agecat[age > 75] <- "Elder" agecat[age >= 55 & age <= 75] <- "Middle Aged" agecat[age < 55] <- "Young" })

car包

recode()函数方便编码数据类型

doby包

recodevar()

R中自带的cut（）

将一个数值型变量按值域切割为多个区间，返回一个因子

变量的重命名

fix()

开交互式窗口更改

reshape包

rename()

rename(dataframe,c(oldname="newname")

names()

缺失值

不能比较，只能被找出

is.na()

分析中排除缺失值

na.omit()

删除所有带NA的行

日期值

通常以字符串输入，然后转化为数值形式存储

as.date(x,"input format")实现转化

%d 日期

01-31

%a 星期名

Mon

Monday

%m 月份

00-12

%b 缩写月份

Jan

%B 非缩写

January

两位数年份

四位数年份

2007

默认输出格式为yyyy-mm-dd

strDates <- c("01/05/1965", "08/16/1975") dates <- as.Date(strDates, "%m/%d/%Y")

Sys.date（）返回当天日期

date()返回当前日期和实践

format()

提取指定格式日期

today <- Sys.Date() format(today, format="%B %d %Y") format(today, format="%A")

更多知识

help(as.Date)

help(strftime)

help(ISOdatetime)

关于日期格式

lubricate包

包含日期简化处理函数

fCalendar包

复杂处理函数

类型转换

转换函数

判断

is.numeric,charcater,vector,matrix,data.frame,factor,logical

变换

as.numeric,charcater,vector,matrix,data.frame,factor,logical

数据排序

order()

数据合并

添加列

merge()

total<-merge(dataframeA,dataframeB,by="ID")

根据ID进行合并

纵向

cbind()

相同行数，且要以相同顺序排列

添加行

横向

rbind()

相同变量，但顺序不一定要相同

有多余变量的话

删除

在另一数据框中追加变量，设为NA

通常用于添加观测

取子集

选入变量

dataframe[row indices, column indices]

剔除变量

myvars <- names(leadership) %in% c("q3", "q4") leadership[!myvars]

names()生成含所有变量名的字符变量，q3,q4是否在其中形成逻辑值，然后！反转

leadership[c(-8,-9)]

leadership$q3<-NULL

选入观测

newdata <- leadership[1:3,] newdata <- leadership[leadership$gender=="M" & leadership$age > 30,] attach(leadership) newdata <- leadership[which(gender=='M' & age > 30),] detach(leadership)(加不加which都一样）

日期数据选入

startdate <- as.Date("2009-01-01") enddate <- as.Date("2009-10-31") newdata <- leadership[which(leadership$date >= startdate & leadership$date <= enddate),]

subset函数

newdata<-subset(leadership,age>=35|age<24,select=c(q1,q2,q3,q4)

newdata <- subset(leadership, gender=="M" & age > 25, select=gender:q4)

:冒号表示从...到

随机抽样

mysample<-leadership[sample(1:nrow(leadership),3,replace=FALSE)

抽取大小为3的样本

第一个值为抽取范围的向量，第二个值为抽取数目

replace=F

放回

replace=T

无放回

抽样种子

set.seed()

每个数字绑定一个sample

sample()

set.seed(1) sample(10,5)

更完备的抽样工具

sampling包

抽取，校正

survey包

分析复杂调查数据

Chapter.5 高级数据管理

数值和字符处理函数

数学函数

abs(x)

绝对值

sqrt(x)

平方根

ceiling(x)

不小于X的最小整数

floor(x)

不大于X的最小整数

trunc(x)

向0的方向截取X中整数部分

round(x,digits=n)

舍入为指定n位的小数，默认为0

signif(x,digits=n)

舍入为指定位数的数，signif(3.45,digits=2)---3.5

cos,sin,tan(x)

acos,asin,atan(x)

cosh(x),sinh(x),tanh(x)

双曲三角函数

acosh,asinh,atanh(x)

反双曲余弦、反双曲正弦和反双曲正切

log(x,base=n)

默认自然对数

log10(x)

exp(x)

指数函数

统计函数

mean(x)

median(x)

sd(x)

标准差

var(x)

方差

mad(x)

绝对中位差

用原数据减去中位数后得到的新数据的绝对值的中位数

quantile(x,probs）

分位数，x为数值型向量，probs为【0,1】之间的概率值组成的数值向量

y<-quantile(x,c(0.3,0.84))

30%,84%分位点

range(x)

值域

range(c(1,2,3,4))=c(1,4)

diff(range(c(1,2,3,4))=3

min(x)

max(x)

scale(x,center=TRUE,scale=TRUE)

为数据对象x按列中心化（center=TRUE)或标准化（center=TRUE,scale=TRUE)

默认对指定列进行均值为0，标准差为1的标准化

对每一列进行任意均值和标准差的标准化，可用newdata<-scale(x)*SD+M,SD为标准差，M为均值

若只对指定列而不是整个矩阵或数据框标准化（有非数值列等情况），可用newdata<-transform(mydata,myvar=scale(myvar)*10+50,(均值50，标准差10）

diff(x,lag=n,difference=n

滞后差分，lag用以指定滞后几项，difference指执行几次

概率函数

格式（dpqr+_分布缩写）

密度函数density

x<-pretty(c(-3,3),30) y<-dnorm(x) plot(x,y,type="l",xlab="NomralDeviate",ylab="Density",yaxs="i")

dnorm

yaxs=i

i=internal，独立y轴区间

yaxs=r

r=regular

分布函数distribution function

位于z=1.96左侧标准正态曲线下方面积？

pnorm(1.96)

0.975

分位数函数quantile function

均值500，标准差100的正态分布0.9分位点是？

qnorm(0.9,mean=500,sd=100)

628.16

生成随机数random

生成50个均值为50，标准差为10的正态随机数

rnorm(50,mean=50,sd=10)

常用概率函数

Beta分布

beta

二项分布

binom

柯西分布

cauchy

（非中心）卡方分布

chisq

指数分布

exp

正态分布

norm

泊松分布

pois

t分布

F分布

均匀分布

unif

Gamma分布

gamma

几何分布

geom

超几何分布

hyper

对数正态分布

lnorm

Logistics分布

logis

多项分布

multinom

负二项分布

nbinom

Wilcoxon符号秩分布

signrank

Weibull分布

weibull

Wilcoxon秩和分布

wilcox

生成随机数种子

runif()

生成0-1区间服从均匀分布的伪随机数

runif(5)→set.seed(1234）后就可以用set.seed记录这个随机序列，然后重现（reproducible）结果了

生成多元正态数据

模拟研究和蒙特卡洛方法中，需获取来自给定均值向量和协方差阵的多元正态分布的数据，MASS包中mvrnorm（）函数可以解决该问题

mvrnorm(n,mean,sigma）

sigma是方差-协方差矩阵

字符处理函数

nchar(x)

计算x中字符数

substr(x,start,stop)

提取或替换子串

x<-"abcd" substr(x,1,3)--"abc" substr(x,1,3)<-"22222----"222d"

grep(pattern,x,ignore.case=FALSE,fixed=FALSE)

在x中搜索某模式，fixed=FALSE时，pattern为正则表达式，否则为字符串，返回匹配的下标

表达式写法

https://www.jianshu.com/p/11bbfa8e98c5?utm_campaign=maleskine&utm_content=note&utm_medium=seo_notes&utm_source=recommendation

sub(pattern,replacement,x,ignore.case=FALSE,fixed=FALSE)

在x中搜索pattern，并以文本replacement替换

strsplit(x,split,fixed=FALSE)

在字符split处分割x

或者可以unlist(x[下标]）

paste(...,sep="")

连接字符串，分隔符为sep

toupper(x)

tolower(x)

大小写转换

其他实用函数

seq(from,to,by)

生产一个序列

rep(x,n)

重复x n次

cut(x,n）

将连续变量x分割为有n个水平的因子

pretty(x,n)

创建完美分割点。将x分割为n个区间

cat(...,file="myfile",append=FALSE

连接对象并输出到屏幕或文件中

toString(）

从一串字母连成一个字符串

结果会用,隔开

stringr包下的函数

str_replaceall()

替换字符串里所有符合条件字符

str_replace

apply函数组（应用到数据框，矩阵）

apply(x,margin,fun,...)

x为对象，margin=1 行， margin=2 列，fun=所用函数

apply(mydata, 2, mean, trim=.2)

舍去后尾20%数据后计算中间60%的平均值

5.3 数据处理问题示例

options(digits=2)限定输出小数点后数字位数=2

5.4 控制流

概念

语句(statement）

一条单独的R语句或一组符合语句（在一个{}中，用分号分隔

条件(cond）

一条解析为T或F的表达式

表达式（expr）

一条数值或字符串的求值语句。

序列(seq）

一个数值或字符串序列

重复和循环

for

for （i in 1:10） {statement}

while

while{cond} statement

条件执行

if-else/ifelse/switch

if-else

if {cond} statement

if {cond} statement else statement2

ifelse

ifelse{cond},statement1,statement2)

switch(expr...)

feelings <- c("sad", "afraid") for (i in feelings) print( switch(i, happy = "I am glad you are happy", afraid = "There is nothing to fear", sad = "Cheer up", angry = "Calm down now" ) )

自编函数

myfunction<-function(arg1,arg2,...){statements return(object) }

整合与重构

转置

t()

转置矩阵或数据框

整合数据

aggregate(x,by,fun)

by是一个由变量名组成的列表，这些变量将被去掉，形成新的观测

例子

options(digits=3) attach(mtcars) aggdata <-aggregate(mtcars, by=list(cyl,gear), FUN=mean, na.rm=TRUE) aggdata

reshape2

融合

md <- melt(mydata, id=c("ID", "Time"))

每个测量变量占一行，行中带有唯一确定这个测量所需的标识符变量

重铸

融合之后，cast()读取已融合的数据，并使用提供的公式和一个（可选的）用于整合数据的函数将其重塑。调用格式为newdata<-cast(md,formula,FUN)

md为融合后数据

formula描述想要的结果

FUN是可选函数

代码

md <- melt(mydata, id=c("ID", "Time")) # reshaping with aggregation dcast(md, ID~variable, mean) dcast(md, Time~variable, mean) dcast(md, ID~Time, mean) # reshaping without aggregation dcast(md, ID+Time~variable) dcast(md, ID+variable~Time) dcast(md, ID~variable+Time)

体会整合/不整合差别

数据归一化

https://blog.csdn.net/guyuealian/article/details/78845031

Chapter.6 基本图形

6.1 条形图

简单条形图

barplot（data,main=,xlab,ylab,horiz=T/F)

如果需绘制的类别向量是因子或有序因子，可以用函数plot（）快速创建垂直条形图

6.1.2 堆砌条形图/分组条形图

数据来源如果不是矩阵而是向量

beside=F(默认）

堆砌条形图

beside=T

分组条形图

6.1.3 均值条形图

数据整合后算出

lines（）

将各个条形用线段连接

gplots包中的barplot2（）函数创建叠加有置信区间的均值条形图

6.1.4 条形图微调

标签字体大小，方向等

cex.names

字体大小

names.arg

标签

6.1.5 棘状图

6.2 饼图

pie(x,labels）

扇形图

fan.plot（x,labels）

6.3 直方图

hist(x)

breaks（组数）

6.4 核密度图

向已经有的图叠加密度曲线

lines()

产生新的密度图

plot(density(x))

x 数值向量

用核密度图和sm包比较组间差异

sm.density.compare()可向图形叠加两组或更多核密度图

sm.density.compare(x,factor）

x 数值型向量

factor 分组向量

6.5箱线图

boxplot

数值

boxplot.stats（）

使用并列箱线图跨组比较

boxplot(formula,data=dataframe）

formula是公式，如y~A,代表类别型变量A的每个值并列生成数值型变量y的箱线图 y~A*B 代表类别型变量A和B所有水平的两两组合生成数值型变量y的箱线图

参数varwidth=T

使箱线图宽度与样本大小平方根成正比

参数notch=T

有凹槽的箱线图

凹槽不重叠，表明中位数有差异

小提琴图

viopolot(x1,x2,...,names=,col=)

=核密度图叠加箱线图

6.6点图

dotchart(x,labels=)

x 数值向量

labels 点的标签组成的向量

group=分组因子

gcolor=控制不同分组标签的颜色

Chapter 7 基本统计分析

7.1 描述性统计分析

summary（）

向量化计算

https://blog.csdn.net/qq_36026791/article/details/77482543

apply()

sapply()

sapply(x,FUN,options)

偏度峰度

mystats <- function(x, na.omit=FALSE){ if (na.omit) x <- x[!is.na(x)] m <- mean(x) n <- length(x) s <- sd(x) skew <- sum((x-m)^3/s^3)/n kurt <- sum((x-m)^4/s^4)/n - 3 return(c(n=n, mean=m, stdev=s, skew=skew, kurtosis=kurt)) }

文献

https://www.guru99.com/r-apply-sapply-tapply.html

sapply() function sapply() function takes list, vector or data frame as input and gives output in vector or matrix. It is useful for operations on list objects and returns a list object of same length of original set. Sapply function in R does the same job as lapply() function but returns a vector.

fivenum()

包括最小最大值，上下四分位数，中位数

其他包

psych包

library(psych) myvars <- c("mpg", "hp", "wt") describe(mtcars[myvars])

提供关于vars n mean sd median trimmed mad min max range skew kurtosis se的数据

descibe（）

Hmisc

library(Hmisc) myvars <- c("mpg", "hp", "wt") describe(mtcars[myvars])

pastecs

7.1.2 分组计算描述性统计量

myvars <- c("mpg", "hp", "wt") aggregate(mtcars[myvars], by=list(am=mtcars$am), mean) aggregate(mtcars[myvars], by=list(am=mtcars$am), sd)

注意list(am=ctcars$am)

多个分组变量时

by=list(name1=groupvar1,name2=groupvar2，。。。)

用by（）分组计算描述性统计量

aggregate一次只能算一个

by(data,indices,fun)

indices--由一个或多个因子组成的列表，定义分组

doby包和psych包也提供分组计算描述性统计量的函数

library(doBy) summaryBy(mpg+hp+wt~am, data=mtcars, FUN=mystats)

library(psych) myvars <- c("mpg", "hp", "wt") describeBy(mtcars[myvars], list(am=mtcars$am))

reshape包按组导出描述性统计量

dfm<-data.frame,measure.vars=y,id.vars=g

y 向量，指明需概述的数值型变量（默认使用所有变量）

g 由一个或多个分组变量组成的向量

cast(dfm,groupvar1,groupvar2+...+varable~.，FUN）

7.1.3 结果可视化

直方图

密度图

箱线图

点图

7.2 频数表和列联表

7.2.1 生成频数表

函数

table(var1,var2...）

使用N个类别型变量（因子）创建N维列联表

xtabs（formula,data）

根据公式和一个矩阵或数据框创建N维列联表

prop.table(table,margins)

依margins定义的边际列表，将表中条目表示为分数形式

margin.table(table,margins)

按margins定义的边际列表计算表中条目的和

Addmargins(table,margins）

将概述边margins（默认是求和结果）放入表中

ftable(table）

创建紧凑的“平铺”式列联表

输出列联表

1.一维列联表

# frequency tables library(vcd) head(Arthritis) # one way table mytable <- with(Arthritis, table(Improved)) mytable # frequencies prop.table(mytable) # proportions prop.table(mytable)*100 # percentages

2.二维列联表

mytable<-table(A,B)

mytable<-xtabs(~A+B,data=mydata)

mydata是一个矩阵或者数据框，要进行交叉分类的变量在右侧，左侧作为频数向量

用margin.table()和prop.table()分别生成边际频数和比例

margin.table(mytable,1) #row sums margin.table(mytable, 2) # column sums prop.table(mytable) # cell proportions prop.table(mytable, 1) # row proportions prop.table(mytable, 2) # column proportions addmargins(mytable) # add row and column sums to table

addmargins(prop.table(mytable)) addmargins(prop.table(mytable, 1), 2) addmargins(prop.table(mytable, 2), 1)

3.多维列联表

上述函数均可

mytable <- xtabs(~ Treatment+Sex+Improved, data=Arthritis) mytable ftable(mytable)

margin.table(mytable, 1) margin.table(mytable, 2) margin.table(mytable, 2) margin.table(mytable, c(1,3)) ftable(prop.table(mytable, c(1,2))) ftable(addmargins(prop.table(mytable, c(1, 2)), 3))

Improved None Some Marked Treatment Sex Placebo Female 0.59375000 0.21875000 0.18750000 Male 0.90909091 0.00000000 0.09090909 Treated Female 0.22222222 0.18518519 0.59259259 Male 0.50000000 0.14285714 0.35714286

Improved None Some Marked Sum Treatment Sex Placebo Female 0.59375000 0.21875000 0.18750000 1.00000000 Male 0.90909091 0.00000000 0.09090909 1.00000000 Treated Female 0.22222222 0.18518519 0.59259259 1.00000000 Male 0.50000000 0.14285714 0.35714286 1.00000000

注意1,2,3对应treatment, sex, improved和输出结果

7.2.2 独立性检验

检验类别型变量独立性

1.卡方独立性检验

library(vcd) mytable <- xtabs(~Treatment+Improved, data=Arthritis) chisq.test(mytable) mytable <- xtabs(~Improved+Sex, data=Arthritis) chisq.test(mytable)

p值表示相对独立的概率，>0.05表示独立

2.Fisher精确检验

mytable <- xtabs(~Treatment+Improved, data=Arthritis) fisher.test(mytable)

fisher.test（）可以在任意行列数大于2的二维列联表上使用，但不能用于2X2列联表

3.Cochran-Mantel-Haenszel检验

原假设：两个名义变量在第三个变量的每一层中都是条件独立的，检验假设中不存在三阶交互作用（治疗X改善X性别）

library(vcd) mytable <- xtabs(~Treatment+Improved, data=Arthritis) assocstats(mytable)

7.2.3 相关性度量

assocstats（）

计算二维列联表phi系数，列联洗漱和Cramer's V系数，较大值代表较强相关性

7.2.4 结果可视化

条形图

马赛克图和关联图

集合表示

Nenadic^Greenacre,2007

7.2.5 将表转换为扁平格式

table.flat

7.3 相关

描述定量变量之间的关系，相关系数符号±表明相关方向，值的大小表示相关程度（0-1）

定量变量也就是通常所说的连续量，如长度、重量、产量、人口、速度和温度等，它们是由测量或计数、统计所得到的量，这些变量具有数值特征，称为定量变量

定性变量

7.3.1 相关类型

1.Pearson/Spearman/Kendall相关

系数

Pearson积差相关系数-两个定量变量之间的线性相关

Spearman等级相关系数

衡量分级定序变量之间相关程度

Kendall's Tau相关系数

非参数的等级相关度量

图

cor()

计算三种相关系数

cor(x,use=,method=)

矩阵或数据框

use

指定缺失数据的处理方式

all.obs(假设不存在缺失数据，遇到缺失数据报错）

everything（遇到缺失数据时，相关系数的计算结果将设为missing）

complete.obs（行删除）

pairwise.complete.obs（成对删除）

method

pearson

spearman

Kendall

cov()

计算协方差

代码

states<- state.x77[,1:6]

cov(states)

cor(states)

cor(states, method="spearman")

以spearman相关系数显示（默认pearson）

不以方阵形式显示

x <- states[,c("Population", "Income", "Illiteracy", "HS Grad")]

y <- states[,c("Life Exp", "Murder")]

cor(x,y)

上述结果没有知名相关系数是否显著不为0（即能否下相关的假设）

需要相关性的显著性检验

相关图corrlogram

比较大量的相关系数

见11章

7.4 t验证

独立样本t检验

调用格式

t.test(y~x,data)

数值型变量

二分型变量

t.test(y1,y2)

y1,y2为数值型向量（各组结果变量）

注意点

默认方差不相等，并使用welsh？修正自由度，可添加参数var.equal=TRUE假定方差相等，并使用合并方差估计

默认备择假设是双侧的，即均值不相等，但大小方向不定可添加参数alternative=less或greater进行有方向检验

代码

library(MASS) t.test(Prob ~ So, data=UScrime)

非独立样本t检验

格式

t.test(y1,y2,paired=TRUE)

sapply(UScrime[c("U1","U2")], function(x)(c(mean=mean(x),sd=sd(x)))) with(UScrime, t.test(U1, U2, paired=TRUE))

7.4.3

多余两组→方差分析

第九章

7.5 组间差异的非参数检验

不满足t检验或ANOVA的假设，用非参数方法

结果变量本质上严重偏倚或呈现有序关系

7.5.1 两组的比较

格式

wilcox.test(y~x,data)

data的取值为一个包含这些变量的矩阵或数据框

默认双侧，可添加参数exact进行精确检验，指定alternative=less或greater

wilcox.test(y1,y2)

代码

Mann-Whitney U 检验

with(UScrime, by(Prob, So, median)) wilcox.test(Prob ~ So, data=UScrime)

Wilcoxon符号秩检验

适用于两组成对数据和无法保证正态性假设时，格式同上，可添加参数paired=TRUE

sapply(UScrime[c("U1", "U2")], median) with(UScrime, wilcox.test(U1, U2, paired=TRUE))

7.5.2 多于两组的比较

Kruskal-Wallis检验

适用于各组独立

格式

kruskal.test(y~A,data）

数值型结果变量

一个拥有两个或更多水平的分组变量（grouping variables）

Friedman检验

适用于各组不独立（重复测量

格式

friedman.test(y~A|B,data)

数值型结果变量

A是一个分组变量，B是一个用以认定匹配观测的去租变量(blocking variable）

代码

states <- data.frame(state.region, state.x77) kruskal.test(Illiteracy ~ state.region, data=states)

组与组之间成对比较

npmc包

chapter8 回归

指用一个或多个预测变量（也称自变量或解释变量）预测响应变量（因变量，效标变量或结果变量）

8.1 回归的多面性

http://cran.r-project.org/doc/contrib/Ricci-refcardregression.pdf

普通最小二乘（OLS）回归

适用情境

通过预测变量的加权和来预测量化的因变量

权重

通过数据估计得出的参数

参考书

Applied regerssion analysis and Generalized Linear Models（偏理论）

An R and S-Plus Companion to APPlied Regression(偏应用）

非技术性综述

参考Licht（1995）

8.2 OLS回归

形式

n为观测的数目

k为观测变量的数目

第i次观测对应的因变量的预测值（已知预测变量值的条件下，对Y分布估计的均值）

第i次观测对应的第j个预测变量值

截距项（当所有的预测变量都为0时，Y的预测值）

预测变量j的回归系数（斜率表示改变一个单位引起的Y的该变量

目标

通过减少响应变量的真实值与预测值的差值，获得模型参数（截距项和斜率），使得残差平方和最小

为能恰当解释OLS模型的系数，数据必须满足下列假设

正态性

对于固定的自变量值，因变量值成正态分布

独立性

值之间相互独立

线性

因变量与自变量之间为线性相关

同方差性

因变量的方差不随自变量水平不同而变化

8.2.1 用lm(）拟合回归模型

R中，拟合线性模型的基本函数lm（）

格式

myfit<-lm(formula,data)

myfit为结果，格式为List,储存你和模型的大量信息

formula格式

Y~X1+X2+...+Xk

~左边为响应变量，右边为预测变量

一些常用函数

修改表达式的公式

分隔符号，左为响应变量，右为预测变量，如通过x,z,w预测y，代码y~x+z+w

分隔预测变量

：

表示预测变量的交互项，如通过x,z及x,z的交互项预测y，代码y~x+z+x:z

表示所有交互项，y~x*z*w，也可复杂写为y~x+z+w+x:z+z:w+...

表示交互项达到某个次数，y~(x+z+w)^2可展开为y~x+z+w+x:z+z:w+x:w

意思是只到2次交互？

可表示包含除因变量外所有变量

y~.=y~x+y+w

减号，从等式中移除某个变量

-l

删除截距项

y~x-l，拟合y对x,且强制直线通过原点

I（）

从算数角度解释括号中的元素，如y~x+(z+w)^2会展开，而y~x+I((z+w)^2)会将后者视为平方

function

可在表达式中用的数学函数，如log(y)~x+z+w

对拟合线性模型有用的其他函数

summary（）

显示拟合模型的详细结果

coefficients（）

列出拟合模型的模型参数（截距项和斜率）

confint()

提供模型参数的置信区间（默认95%）

fitted（）

列出拟合模型的预测值

residuals()

列出拟合模型的残差值

anova（）

生成拟合模型的方差分析表，或比较两个或更多拟合模型的方差分析表

vcov（）

列出模型参数的协方差矩阵

AIC（）

输出赤池信息统计量

plot（）

生成评价拟合模型的诊断图

predict（）

用拟合模型对新数据集预测相应变量值

8.2.2 简单线性回归

fit <- lm(weight ~ height, data=women) summary(fit) women$weight fitted(fit) residuals(fit) plot(women$height,women$weight, main="Women Age 30-39", xlab="Height (in inches)", ylab="Weight (in pounds)") # add the line of best fit abline(fit)

8.2.3 多项式回归（允许X,X平方等）

fit2 <- lm(weight ~ height + I(height^2), data=women) summary(fit2) plot(women$height,women$weight, main="Women Age 30-39", xlab="Height (in inches)", ylab="Weight (in lbs)") lines(women$height,fitted(fit2))

多项式等式仍是线性回归模型，因为等式仍是预测变量的加权和形式，比如log(X1)+sin(X2)仍然可视为线性拟合，除非是用指数函数等，非线性模型可用nls函数拟合nls（）

用car包中的scatterplot()函数绘制二元关系图

library(car) library(car) scatterplot(weight ~ height, data=women, spread=FALSE, smoother.args=list(lty=2), pch=19, main="Women Age 30-39", xlab="Height (inches)", ylab="Weight (lbs.)")

spread=FALSE删除了残差正负均方根在平滑曲线上的展开和非对称信息

lty.smooth=2设置loess拟合曲线为虚线

pch=19设置点位实心圆

参考资料

https://blog.csdn.net/qq_36523839/article/details/82924804

用计算方法估算多少次方的多项式最适合拟合，计算MSE？

8.2.4 多元线性回归

当预测变量不止一个，简单线性回归就变为多元线性回归

多项式回归可算多元线性回归特例，如二次回归有X，X^2两个预测变量等

因为lm()需要数据框，故先将数据矩阵进行处理

states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])

多元相关分析中，第一步最好检查变量间相关性

采用cor（）函数获取两变量之间相关性

cor(states)

car包中scatterplotMatrix（）可生成散点图矩阵

library(car) scatterplotMatrix(states, spread=FALSE, smoother.args=list(lty=2), main="Scatter Plot Matrix")

scatterplotMatrix默认在非对角线绘制变量间的散点图，并添加平滑(loess）和线性拟合曲线

对角线区域绘制每个变量的曲度图和轴须图

多元相关分析

fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states) summary(fit)

预测变量不止一个时，回归系数代表一个预测变量增加一个单位，其他预测变量保持不变因变量将要增加的数量

8.2.5 有交互项的多元线性回归

fit <- lm(mpg ~ hp + wt + hp:wt, data=mtcars) summary(fit)

通过effects包中的effects()函数绘制交互项结果

library(effects) plot(effect("hp:wt", fit,, list(wt=c(2.2, 3.2, 4.2))), multiline=TRUE)

8.3 回归诊断

判断模型在多大程度上满足统计假设

fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states) confint(fit)

2.5 % 97.5 % (Intercept) -6.552191e+00 9.0213182149 Population 4.136397e-05 0.0004059867 Illiteracy 2.381799e+00 5.9038743192 Income -1.312611e-03 0.0014414600 Frost -1.966781e-02 0.0208304170

标准方法

常见方法为使用plot函数，生成评价模型拟合情况的四副图形

opar <- par(no.readonly=TRUE) fit <- lm(weight ~ height, data=women) par(mfrow=c(2,2)) plot(fit) par(opar)

正态性

预测变量固定时，因变量正态分布，残差值应该是均值为0的正态分布。

正态QQ图正态分布对应的值下，标准化残差的概率图

若满足正态假设，图上的点落在呈45度角的直线上

独立性

只能从收集的数据去验证

线性

若因变量与自变量线性相关，那残差值与预测（拟合）值就没有任何系统关联换句话说，除了白噪声，模型应该包含数据中所有的系统方差

residuals vs fitted图明显看到有曲线关系，暗示需要加上一个二次项

同方差性（方差齐性）

在位置尺度图中scale-location graph

满足假设的话水平线周围的点应随机分布

残差与杠杆图 residuals vs leverage

提供单个观测点信息，可鉴别离群点，高杠杆值点和强影响点

离群点

表示拟合模型对其预测不佳

杠杆值

一个点该值高表明异常预测变量值

8.3.2更好的方法

car包

qqplot()

分位数比较图

durbinWastonTest()

对误差自相关性做Durbin-Watson检验

crPlots()

成分与残差图

ncrTest（）

对非恒定的误差方差做得分检验

spreadLevelPlot()

分散水平检验

outlierTest

Bonferroni离群点检验

avPlots()

添加的变量图形

infulencePlot()

回归影响图

scatterplot()

增强的散点图

scatterplotMatrix()

增强的散点图矩阵

vif（）

方差膨胀因子

gvlma包

提供对所有线性模型假设进行检验的方法

1.正态性

library(car) states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")]) fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states) qqPlot(fit, labels=row.names(states), id.method="identify", simulate=TRUE, main="Q-Q Plot")

id.method=“identity”醒购交互式绘图，（感觉没实现），鼠标点上显示函数中label选项的设定值

qqplot()提供更为精确的正态假设检验

画出在n-p-1个自由度的t分布下的学生化残差（studentized residual），也称血生化删除残差或折叠化残差图形，n是样本大小，p是回归参数的数目（包括截距项）

residplot()函数可生成学生化残差柱状图（直方图），并添加正态曲线，he密度曲线和轴须图

residplot <- function(fit, nbreaks=10) { z <- rstudent(fit) hist(z, breaks=nbreaks, freq=FALSE, xlab="Studentized Residual", main="Distribution of Errors") rug(jitter(z), col="brown") curve(dnorm(x, mean=mean(z), sd=sd(z)), add=TRUE, col="blue", lwd=2) lines(density(z)$x, density(z)$y, col="red", lwd=2, lty=2) legend("topright", legend = c( "Normal Curve", "Kernel Density Curve"), lty=1:2, col=c("blue","red"), cex=.7) }

比QQ图更直观

2.误差的独立性

1.根据先验知识判断

2.时间序列数据的自相关性，相隔时间近的观测相关性大于相隔远的

car包的DurbinWatson检验函数，能检测误差的序列相关性

durbinWatsonTest(fit)

3.线性

成分残差图(component plusresidual plot) 也叫偏残差图（partial residual plot）

library(car) crPlots(fit）

结果显示因变量与自变量是否存在非线性关系

不同于已设定线性模型的系统偏差

提示可能需要添加曲线成分，如多项式，或对一个，多个变量变换，不用线性回归

4.同方差性

car包

判断方差是否恒定

library(car) ncvTest(fit) spreadLevelPlot(fit)

ncvTest（）

生成计分检验，零假设为误差方差不变

备择假设为误差方差随拟合值水平变化而变化，检验显著说明方差不恒定

spreadLevelePlot()

创建一个添加了最佳拟合曲线的散点图，显示标准化残差绝对值与拟合值的关系

方差齐时，曲线水平

8.3.3 线性模型假设的综合验证

gvlma（）

能对线性模型假设综合验证，做偏斜度，峰度，异方差性等评价

library(gvlma) gvmodel <- gvlma(fit) summary(gvmodel)

8.3.4 多重共线性

比如自变量有出生日期和年龄，做Y对该两数值的回归

回归系数测量的是其他预测变量不变时，某预测变量对响应变量的影响，这种问题就称为多重共线性multicollinearity

会导致模型参数的置信区间过大使单个系数解释起来很困难

WHY？

用统计量VIF（variance inflation factor，方差膨胀因子）进行检测，平方根（vif）>2表明存在多重共线性问题

library(car) vif(fit) sqrt(vif(fit)) > 2 # problem?

参考

https://baike.baidu.com/item/%E5%A4%9A%E9%87%8D%E5%85%B1%E7%BA%BF%E6%80%A7/10201978?fr=aladdin

8.4 异常观测值

8.4.1 离群点

模型预测效果不佳的观测点，通常有很大，或正或负的残差

正残差说明模型低估了响应值，负说明高估了

方法1

QQ图，落在置信区间带歪为离群点

方法2

标准化残差值绝对值大于2 可能是离群点

方法3

car包outlierTest（）函数

library(car) outlierTest(fit)

8.4.2 高杠杆值点

与其他预测变量有关的离群点，即许多一场的预测变量值组合，与响应变量值没有关系

通过帽子统计量判断，帽子均值为p/n，p是模型估计的参数数目（包括截距项），n是样本量。若观测点的帽子值大于帽子均值的2或3倍，即可认定为高杠杆值点

hat.plot <- function(fit) { p <- length(coefficients(fit)) n <- length(fitted(fit)) plot(hatvalues(fit), main="Index Plot of Hat Values") abline(h=c(2,3)*p/n, col="red", lty=2) identify(1:n, hatvalues(fit), names(hatvalues(fit))) } hat.plot(fit)

高杠杆值点可能会是强影响点，也可能不是，要先看是否是离群点

8.4.3 强影响点

即对模型参数估计值影响有些比例失衡的点。例如，移除某点对模型造成巨大改变。两种方法可以检测强影响点。

Cook距离，或称D统计量/变量添加图（added variable plot）

Cook's D值大于4/（n-k-1)表明是强影响点

样本量大小

预测变量数目

cutoff <- 4/(nrow(states)-length(fit$coefficients)-2) plot(fit, which=4, cook.levels=cutoff) abline(h=cutoff, lty=2, col="red")

Cook图只鉴别强影响点，但不提供关于点如何影响模型的信息。

library(car) avPlots(fit, ask=FALSE, id.method="identify")

变量添加图，对每个预测变量Xk，绘制Xk在其他k-1个预测变量上回归的残差值相对于响应变量在其他k-1个预测变量上回归的残差值的关系图

library(car) influencePlot(fit, id.method="identify", main="Influence Plot", sub="Circle size is proportial to Cook's Distance" )

infulencePlot（）

将离群点，杠杆值和强影响点整合到一幅图中

8.5 改进措施

8.5.1 删除观测点

谨慎

8.5.2 变量变换

当模型不符合正态性，用

library(car) summary(powerTransform(states$Murder))

谨慎对待

8.5.3 增删变量

删除存在多重共线性的变量

岭回归——多元回归的变体，用于处理多重共线性

8.6 选择最佳的回归模型

绘图

barplot

names.arg(axis label)

barplot(x$length,names.arg=x$chr)

条形图下方标签

beside=TRUE

堆叠还是放旁边

horiz=T

水平条形图

density=numeric（绘图线密度），可给向量，多种密度

angle=numeric（线方向）

boxplot

boxplot(len~supp,data=ToothGrowth)

notch=T

展示中位数

hist

breaks=

直方图分块数

nclass=

rug()

地毯，下方显示密度

freq=T/F

纵坐标是频数还是频率

heatmap

需要用矩阵，不行就as.matrix，默认读入是数据框

参考线

abline

在图中绘制直线

abline(h=yvalues, x=values)

子主题

https://www.cnblogs.com/xudongliang/p/6755727.html

图例

legend

location

x,y

locator(1)

然后鼠标单击给出

关键字（bottom,bottomleft,topleft,top,topright,right,center)等

加inset=numeric，指定图例向图形内侧移动大小

以绘图区域大小分数表示

title

图例标题

legend

图例标签字符串向量

plot

X-Y绘图

绘制函数

curve(x^2, from=1, to=50, xlab="x", ylab="y")

You can also use curve when you have a predfined function

eq = function(x){x*x} curve(eq, from=1, to=50, xlab="x", ylab="y")

eq = function(x){tan(x*pi)} curve(eq, from=-10, to=10, xlab="x", ylab="y")

eq = function(x){tanpi(x)}

ggplot

library("ggplot2") eq = function(x){x*x} ggplot(data.frame(x=c(1, 50)), aes(x=x)) + stat_function(fun=eq)

使用 ggTimeSeries 包构建日历图

https://zhuanlan.zhihu.com/p/614766481

绘制多个图

http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/

ggplot中添加直线

https://www.sthda.com/english/wiki/ggplot2-add-straight-lines-to-a-plot-horizontal-vertical-and-regression-lines

lattice

library(lattice) eq<-function(x) {x*x} X<-1:1000 xyplot(eq(X)~X,type="l")

https://stackoverflow.com/questions/26091323/how-to-plot-a-function-curve-in-r

type

图形形式

opar<-par(no.readonly=TRUE) par(opar)

保持图形参数不变

符号和线条

pch

指定符号

cex

符号大小

cex.lab/cex.main/cex.axis/cex.sub

lty

线条类型

lwd

线条宽度

颜色

colors()

颜色总览

RColorBrewer包

调色板

display.brewer.all()

brewer.pal

选色板（打上述查询后的名字）

col

其他类似参数

col.axis

col.lab

col.main

col.sub

fg/bg

前景，背景

表达形式

"white"

"FFFFFF"

RGB(1,1,1)

hsv(0,0,1)

连续颜色函数

rainbow(),heat.colors(),terrain.colors(),topo.colors(),cm.colors()

ggplot2中颜色

https://www.zhihu.com/question/29085362

文本

font

整数

1=常规，2=粗体，3=斜体，4=粗斜体，5=符号字体（Adobe符号编码表示）

font.axis

font.lab

font.main

font.sub

字体磅值

最终大小为ps*cex

family

绘制字体所用的字体组，标准为serif（衬线）、sans(无衬线）、mono（等宽）

标题main副标题sub坐标轴标签xlab/ylab坐标轴范围xlim,ylim

通过ann=FALSE移除图中默认标签

子主题

标题

title()

坐标轴

axis(side,at=,labels=,pos=,lty=,col=,las=,tck=)

side(1,2,3,4)(下左上右）

at(数值向量，刻度位置）

labels（字符向量，文字标签，为null将使用at的值）

pos(坐标轴线绘制位置的坐标，即与另一轴相交位置值）

lty,col

las(标签与坐标轴关系）（0-3）

tck（刻度线长度）

负值表示在图外侧，正在内，0为禁用刻度，1表示绘制网格，默认0.01

次要刻度线

Hmisc包

minor.tick()函数

minor.tick(nx=n,ny=n,tick.ratio=n)x,y区间个数，大小相对主要刻度比例

当前主要刻度线长度可用par("tck“）获取

文本标注

text(location,"text",pos,...)

绘图区域内部

mtext("text",side,line=n,...)

向图形边界添加文本

数字标注

plotmath()

图形尺寸，边界尺寸

英寸表示的宽，高

mai

数值向量表示的边界大小

下左上右，单位英寸

mar

数值向量表示的边界大小

下左上右，单位英分

默认c(5,4,4,2)+0.1

par(pin=c(4,3),mai=c(1,.5,1,.2)

图排布】

https://blog.51cto.com/u_16213445/11871462

patchwork

用layout函数

用grid

ggplot2

混合箱图和density

data <- data.frame( Group = rep(c("Group A", "Group B"), each = 100), Value = c(rnorm(100, mean = 10, sd = 2), rnorm(100, mean = 8, sd = 3)) ) library(ggplot2) p1 <- ggplot(data, aes(x = Value, fill = Group)) + geom_density(alpha = 0.5) + labs(x = "Value", y = "Density") + theme_minimal() library(ggpubr) p2 <- ggplot(data, aes(x = Group, y = Value, fill = Group)) + geom_boxplot(alpha = 0.5) + stat_compare_means(label.y = 0)+ #位置换数据要调 labs(x = "Group", y = "Value") + theme_minimal()+ theme(legend.position = "none")+ coord_flip() library(patchwork) p1/p2 + plot_layout(heights = c(2,1),guides = "collect")& scale_fill_brewer(palette = "Set1")

patchwork只用于ggplot

彩虹图

https://jtr13.github.io/cc21fall2/raincloud-plot-101-density-plot-or-boxplotwhy-not-do-both.html

函数

数字

取整

round(x,digits=numeric)

四舍五入

floor

ceiling

trunk

生成序列

seq(1,100,by=2）

rep(char,numeric)

rep(each/times=numeric)

paste(a,b,sep="")

把两个集合粘到一起

只要数目是倍数关系就可以匹配

paste0

区别和用法

https://www.r-bloggers.com/2016/10/difference-between-paste-and-paste0/

expand.grid

按所有可能组合粘贴

排序

sort()

decreasing=T

排数据

=rivers[order(rivers)]

scorelist$name<-scorelist$Name[order(scorelist$Name)]

按某元素排序，并新建元素保存

order()

按索引排序号

可按多个参数排列，前后顺序是有意义的

mtcars[order(mtcars$cyl,mtcars$mpg),]

加-号表示按递减排列

a <- seq(1,50,by=1) a<-rep(a,times=2) b<-rep(c("A","B"),each=50) paste(b,a,sep="")

Apply

教程

https://www.datacamp.com/community/tutorials/r-tutorial-apply-family?utm_source=adwords_ppc&utm_campaignid=1565261270&utm_adgroupid=67750485268&utm_device=c&utm_keyword=&utm_matchtype=b&utm_network=g&utm_adpostion=1t1&utm_creative=332661264374&utm_targetid=aud-392016246653:dsa-473406586995&utm_loc_interest_ms=&utm_loc_physical_ms=9002000&gclid=EAIaIQobChMI-YXy5fuf5AIVDKmWCh2Dowi6EAAYASAAEgJnNfD_BwE

https://www.zhihu.com/search?type=content&q=apply%20sapply%20lapply

apply函数是这三个函数的精髓，lapply是它的变种，sapply是lapply的精华简洁版本。apply函数的基本命令格式是：apply(数据，向量，函数)。这个“数据、向量、函数”是什么意思呢？数据好理解，就是我们要计算的数值，在apply函数中数据通常是矩阵；函数也好理解，是我们计算的方法，那么向量指什么呢？用数学语言来说，向量指的是矩阵的行或列，在apply函数中，向量值为1指的是将函数应用到矩阵的“每一行”上，向量值为2指的是将函数应用到矩阵的“每一列”上。通俗的说，向量为1的时候横着算，向量为2的时候竖着算。apply函数整体就是一个填字游戏，把每一个字符放到合适的地方，函数会自己运算出整体结果 sapply函数运算方法与lapply完全相同，唯一的区别是sapply输出形式并不是列表（list），而是“数值”。

作者：R语言与医学生链接：https://zhuanlan.zhihu.com/p/368438746 来源：知乎著作权归作者所有。商业转载请联系作者获得授权，非商业转载请注明出处。

https://blog.csdn.net/wltom1985/article/details/50403720

apply

sapply

tapply

按组计算

X‌：要进行操作的向量。 ‌INDEX‌：一个或多个因子的列表，每个因子的长度与X相同。 ‌FUN‌：一个函数，用于对X进行计算，默认为mean。 ‌simplify‌：逻辑值，默认为TRUE。如果为TRUE且FUN的计算结果总是标量，则返回一个数组；如果为FALSE，则返回一个list对象。

整洁数据

tidyr

library(tidyr)

生成整洁数据

tdata <- mtcars[1:10,1:3] tdata <- data.frame(name=rownames(tdata),tdata) gather(tdata,key = "Key",value = "Value",cyl,disp,mpg)

数据框长宽格式转换

pivot_longer

pivot_wider

概要

https://tidyr.tidyverse.org/reference/pivot_longer.html

controldata<--controldata %>% pivot_longer(cols=starts_with("Group-"),names_to ="Group",values_to = "Value") expdata<--expdata %>% pivot_longer(cols=starts_with("Group-"),names_to ="Group",values_to = "Value")

dplyr

数据框处理

https://zhuanlan.zhihu.com/p/346322562

% in %

dplyr包

filter函数

filter(birthwt,age>35)

filter(birthwt,bwt<2500|bwt>4000)

filter(birthwt,age>35,bwt<2500|bwt>4000)

slice函数

等于取子集

slice(birthwt,2:5)

arrange函数（排序）

arrange(birthwt,bwt,age)

arrange(birthwt,desc(bwt）=arrange(birthwt,-bwt)

按数值反向排序

select函数

select(birthwt,age,race,smoke)

选子集，且按某顺序，也可用于排序

mutate函数（创建新列）

mutate(birthwt,lwt=lwt*0.4536)

mutate(birthwt,lwt.kg=lwt*0.4536

group_by函数

生成tibble表格

group_by(birthwt,race)

as_tibble(birthwt)

summarise函数

birthwt.group<-group_by(birthwt1,race) summarise(birthwt.group,mean(bwt))

或者用传递函数，可以省掉中间变量

birthwt %>% mutate(race=factor(race,labels=c("white","black","other"))) %>% group_by(race)

数据框合并转换

纵向

bind_rows()

横向

bind_cols()

按某个变量

fulljoin(df1,df2)

inner_join()

left_join()

right_join()

语法

apply(X, MARGIN, FUN) Here: -x: an array or matrix -MARGIN: take a value or range between 1 and 2 to define where to apply the function: -MARGIN=1`: the manipulation is performed on rows -MARGIN=2`: the manipulation is performed on columns -MARGIN=c(1,2)` the manipulation is performed on rows and columns -FUN: tells which function to apply. Built functions like mean, median, sum, min, max and even user-defined functions can be applied>

unique(）去重复值

做拟合

只适合做符合正态分布数据

glm

广义线性

可做二项分布

fit <- glm(class~.,data=df.train,family=binomial())

family处可写入其他分布

class与其他的关联

数据平滑

https://blog.csdn.net/shiteater/article/details/132621434

https://blog.51cto.com/u_16213304/7440660

sample

随机抽样

replace=F

放回

replace=T

无放回

抽样种子

set.seed()

每个数字绑定一个sample

sample()

set.seed(1) sample(10,5)

清理操作界面

CTRL+L

数据读取

list.files 读取文件夹里文件名

用法一

list.files(path, pattern, all.files, full.names)

读取一个路径下的文件名

用法2

list.files(path=".", pattern=".pdf", all.files=TRUE, full.names=TRUE)

只读取pdf

fread

https://www.rdocumentation.org/packages/data.table/versions/1.14.10/topics/fread

https://zhuanlan.zhihu.com/p/563631324

包括list.files,dir,list.dirs

变量名和调用相关

根据变量名分配数据

#分每个实验小组，计算每次的分数，生成多个图，另外还可以分出专业进行计算 for (i in seq(1, length(unique(scorelist$Group)))){ var <- paste0('heban', i) ## 变量名 data <- scorelist[scorelist$Group==i,] ## 变量值 assign(var, data) ## 把变量值分配给变量名 }

根据变量名提取数据

方法1

https://blog.csdn.net/weixin_41578567/article/details/103328382·

解释

https://skume.net/entry/2021/12/12/124649

用eval(paste(var_name））方法，主要读取单变量

方法2 可读取数据框

# 创建一个示例数据框 df <- data.frame(a = 1:5, b = 6:10, c = 11:15) # 变量名存储在字符串中 var_name <- "b" # 使用get()函数读取数据 variable_data <- get(var_name, envir = df) # 输出结果 print(variable_data) 在这个例子中，get(var_name, envir = df)将会返回df中名为"b"的变量的值。注意，envir = df参数指定了查找变量的环境是df数据框。如果你的变量在全局环境中，则不需要指定envir参数。

data_frame_name <- "my_data" # 使用assign或get函数来获取数据框 data_frame <- get(data_frame_name) # 查看数据框 print(data_frame)

获取文件名字符

tools::file_path_sans_ext("ABCD.csv") [1] "ABCD" > basename("ABC.csv") [1] "ABC.csv"

https://stackoverflow.com/questions/29113973/get-filename-without-extension-in-r

查看源代码方法

https://zhuanlan.zhihu.com/p/216217782

数据整理

tidyr

https://r4ds.hadley.nz/data-visualize

融合列

https://www.r-bloggers.com/2024/10/how-to-combine-rows-with-same-column-values-in-r/

统计检验

数学检验

卡方检验

chisq.test()

chisq.test(Arthritis$Improved,Arthritis$Treatment)

chisq.test(table(Arthritis$Improved,Arthritis$Treatment))

t检验

t.test()

apply(x[1:4],1,function(i){t.test(i[1:4],i[5:8],paired =T)})

自定义函数格式

apply函数

apply(x,1,function(i){t.test(i[1:4],i[5:8],paired =T)$p.value})

只要Pvalue

p值矫正

p.adjust()

qvalue

相关性检验

round(cor(state.x77),digits=2)

psych包

corr.test()

统计

频数

xtabs(~ Improved+Sex,data=Arthritis)

~表示相关

同table()

回归

线性回归

lm()

lm(height~weight,data=women)

lm(Murder~.data=state)

.代表数据框中除了Murder的所有数据

子主题

fit2 <- lm(Murder~Population+Income+Illiteracy+Life Exp,data=state)

调整模型

AIC（fit,fit2)

AIC评估，值越大越好

summary(lim())

获得残差值

R2--（0-1）模型解释度

coef()

系数项

fitted()

risid()

残差值

predict()

用公式预测新值

购物篮分析

关联分析

arules包

data Groceries

特殊格式

Inspect(Groceries)

查看数据

函数模型

fit <- apriori(Groceries,parameter=list(support=0.01,confidence=0.5))

support参数

confidence参数

图像化

arulesviz包

inspectDT()

x<-read.csv("prok_representative.csv",header=T) plot(x, method = NULL, measure = "support", shading = NA, interactive = NULL, engine = "default", data = NULL, control = NULL)

机器学习

70%学习，30%验证

1.读文件，2.

文件

head(x)

显示前几行

head(x,numeric)

提取前若干行

批量产生文件名

for (i in seq(1, 10)){ + var <- paste0('var', i) ## 变量名 + data <- i + 10 ## 变量值 + assign(var, data) ## 把变量值分配给变量名 + }

保存图

原始方法

1. 原生的方法分为3步：（1）创建画布（2）绘图（3）关闭画布 # 1. 创建画布 png( filename = "name.png", # 文件名称 width = 10, # 宽 height = 10, # 高 units = "in", # 单位 bg = "white", # 背景颜色 res = 300) # 分辨率 # 2. 绘图 plot(1:5) # 3. 关闭画布 dev.off() 此外类似的方法还有jpeg(), bmp(), tiff(), pdf(), svg()

ggsave\

ggsave只能保存基于ggplot2绘图的图片。只要是使用ggplot2绘图的都推荐使用ggsave保存图片 library(ggplot2) p <- ggplot(mtcars, aes(wt, mpg)) + geom_point() # ggsave 会默认保存上一个ggplot对象 ggsave( filename = "name.png", # 保存的文件名称。通过后缀来决定生成什么格式的图片 width = 7, # 宽 height = 7, # 高 units = "in", # 单位 dpi = 300 # 分辨率DPI ) )

cairo

library(Cairo) # Cairo.capabilities() # 检查当前电脑所支持的格式 # 1. 创建画布 Cairo::CairoPNG( filename = "name.png", # 文件名称 width = 7, # 宽 height = 7, # 高 units = "in", # 单位 dpi = 300) # 分辨率 # 2. 绘图 plot(1:5) # 3. 关闭画布 dev.off() # 此外类似的方法还有CairoJPEG(), CairoTIFF(), CairoPDF(), CairoSVG()等

list.files()等函数应用

https://zhuanlan.zhihu.com/p/563631324

数据结构

str(x)

查看数据结构

str(x) 'data.frame': 200 obs. of 8 variables: $ N1: num 62051 2863 1383 770 631 ... $ N2: num 55642 3258 1988 872 569 ... $ N3: num 95407 5896 3148 1986 1332 ... $ N4: num 52992 3103 1893 830 542 ... $ T1: num 0.0001 56.6184 0.0001 193.951 8.0459 ... $ T2: num 0.0001 6.1565 0.0001 213.2456 0.9109 ... $ T3: num 0.0001 5.3311 0.0001 201.5717 0.5686 ... $ T4: num 0.0001 8.918 0.0001 165.8947 2.0619 ..

class(x)

查看数据类别

class(x) [1] "data.frame"

typeof(x)

is. as.

查看有哪些方法

methods（is）

NA缺失值

na.omit()

删除包含NA的行，列

is.na()

判断是否有NA

安装包的问题

#解决联网下载问题（对安装包很重要—）！ options(download.file.method = 'libcurl') options(url.method='libcurl')

Installation paths not writeable, unable to update packages

install.packages("zoo", lib="C:/software/Rpackages")

原神配色方案

https://github.com/RestlessTail/ggGenshin

library(Genshinpalette) #采用原神配色（该包内三个函数） char_name=get_character_name(Country = "All")#Country可以是原神中各国家 head(char_name,10)#若出现乱码请检查系统编码格式是否为UTF-8 ALBEDO ALOY AMBER BARBARA BENNETT DIONA DILUC FISCHL KAEYA KLEE "阿贝多" "埃洛伊" "安柏" "芭芭拉" "班尼特" "迪奥娜" "迪卢克" "菲谢尔" "凯亚" "可莉" Genshinpalette('BARBARA',#角色名称 6)#所需颜色个数，默认为6 [1] "#2C3350" "#727782" "#9C9997" "#BEB8BA" "#CDC2B9" "#F4EFEA" display_colors('FISCHL')

https://stack.xieguigang.me/2023/genshin-impact-color-palettes/

word

更新域

https://www.pdftodoc.cn/news-1420.html

逻辑

&和 |或！否

which

which.max(x)

自建函数

函数内定义全局变量方法

https://blog.csdn.net/weixin_43677260/article/details/86615923

生存函数实例

https://blog.csdn.net/weixin_48093827/article/details/126886707?utm_medium=distribute.pc_relevant.none-task-blog-2~default~baidujs_baidulandingword~default-0-126886707-blog-73740250.235^v43^pc_blog_bottom_relevance_base4&spm=1001.2101.3001.4242.1&utm_relevant_index=3

人卫读取数据函数

#创建固定长度空列表，加入判断，如果原列表有元素不是逻辑数据，跳过创建下一个（避免反复读写），只有第一次需要list(），后续只运行crlist(number）即可 listrw<-list() crlist<-function(x){ i<-1 print(i) print(x) while (i<x+1){ if (x>length(listrw)){ if (length(listrw)==0 ){ listrw<<-append(listrw,list(T),after=1) i<-i+1 } else if (!i==1) {if (is.logical(listrw[[i-1]])){ listrw<<-append(listrw,list(T),after=1) i<-i+1 print(i) } } else {i<-i+1} } else {break} } }

#注意数据存储方法，用print显示中间变量。 #添加了else if用来排除反复用这个函数已经读入了数据，又读一次的现象 #注意<-，和<<- <<-影响全局变量 listRW<-list(c(1,1),c(2,2),c(3,3),c(4,4),c(5,5)) ReadRW<-function(x){ if (x<0){warning ("wrong input") } else {for (i in 1:x) { if (is.null(ncol(listRW[[i]]))){var<- paste0(i,(".xlsx")) FILE<-file.choose(var) s1ad<-dirname(FILE) filead <- file.path(s1ad,var) print(var) listRW[[i]]<<-read.xlsx(filead) print(head(listRW[[i]])) print(paste0("read,",i)) print(filead) print(colnames(listRW[[i]])[1:13]) colnames(listRW[[i]])[1:13]<<-c("班级","用户名","姓名","学号","状态","参考时间","交卷时间","作答题数","答题时长","ip地址","终端","切屏次数","首次作答成绩") listRW[[i]]<<-listRW[[i]][-1,] print(colnames(listRW[[i]])) } else if (ncol(listRW[[i]])>5){next } } } }

浮动主题

单细胞测序

教程来源

https://zhuanlan.zhihu.com/p/544443428

https://mp.weixin.qq.com/s/YHJXm17qPrxuCVsZivHHNw

逆时分析

https://blog.csdn.net/m0_45210226/article/details/127759572

原理讲的比较查清楚的整体教程

https://zhuanlan.zhihu.com/p/325902055

测序数学原理

https://www.zhihu.com/column/c_1431016570223468544

Seurat源代码分析

https://swbioinf.github.io/scRNAseqInR_Doco/de2.html

https://blog.csdn.net/zengwanqin/article/details/114655570

https://cloud.tencent.com/developer/article/2317522

https://blog.csdn.net/qq_52813185/article/details/134647852

https://cloud.tencent.com/developer/article/2394167

https://cloud.tencent.com/developer/article/2425291

https://swbioinf.github.io/scRNAseqInR_Doco/de2.html

https://www.frontiersin.org/journals/neuroscience/articles/10.3389/fnins.2021.779125/full

https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html

B站教程

https://www.bilibili.com/video/BV1144y1F7rF?spm_id_from=333.788.videopod.sections&vd_source=07b3ae20355d18fee7c57b4801b7d648

https://www.bilibili.com/opus/982788190648664084

可视化教学

单细胞基础流程

https://satijalab.org/seurat/articles/essential_commands.html

https://zhuanlan.zhihu.com/p/599440147

单细胞与脑

https://zhuanlan.zhihu.com/p/437591346

Dimheatmap解读、

几种单细胞和其他技术的联合的中文简述

https://mp.weixin.qq.com/s/n_B4pdsHgifE8hKwACVBXw

Vizdimloadings函数

可视化主成分分析中影响大的基因，参考上述PCA分析

VizDimLoadings( object, dims = 1:5, nfeatures = 30, col = "blue", reduction = "pca", projected = FALSE, balanced = FALSE, ncol = NULL, combine = TRUE )

Jackstrawplot

JackStrawPlot函数提供了一个可视化工具，用于用均匀分布(虚线)比较每个PC的p值分布。“显著的”PCs将显示出一个低p值(虚线以上的实线)的强富集特性。在这种情况下，在最初的10-12个pc之后，重要性似乎急剧下降。

优化DOTPLOT

http://www.360doc.com/content/23/0711/09/65403234_1088112409.shtml

https://ggplot2.tidyverse.org/reference/theme.html

p1 + theme(panel.grid.major = element_line(colour = "black"))

https://ggplot2.tidyverse.org/reference/theme.html

cell ranger

https://blog.csdn.net/weixin_47890536/article/details/137816530

https://www.jianshu.com/p/f44c736cbfff

Python计算单细胞

https://www.sc-best-practices.org/conditions/differential_gene_expression.html

https://www.sc-best-practices.org/introduction/analysis_tools.html

科普

https://zhida.zhihu.com/search/3649541384815884056?zhida_source=entity

细胞命运投射图参考该文献

https://www.nature.com/articles/s41592-024-02493-2

Nature method

https://www.nature.com/nmeth/

细胞注释

https://mp.weixin.qq.com/s?__biz=MzAxOTM1ODgyNA==&mid=2652134437&idx=1&sn=b5d1f0c8800f22e85d4ec6ad15fca7bc&chksm=8028c922b75f40346bbbd342ba23b3acf684e69f4a73472fb926fafc092b042732b8e93a95f3&scene=27

多个样本数据读入

https://blog.csdn.net/zfyyzhys/article/details/141970110

findcluster,findneigbour（modularity模块度计算）

https://en.wikipedia.org/wiki/Louvain_method

数学方法1

Seurat描述

Cluster the cells Seurat applies a graph-based clustering approach, building upon initial strategies in (Macosko et al). Importantly, the distance metric which drives the clustering analysis (based on previously identified PCs) remains the same. However, our approach to partitioning the cellular distance matrix into clusters has dramatically improved. Our approach was heavily inspired by recent manuscripts which applied graph-based clustering approaches to scRNA-seq data [SNN-Cliq, Xu and Su, Bioinformatics, 2015] and CyTOF data [PhenoGraph, Levine et al., Cell, 2015]. Briefly, these methods embed cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’. As in PhenoGraph, we first construct a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors() function, and takes as input the previously defined dimensionality of the dataset (first 10 PCs). To cluster the cells, we next apply modularity optimization techniques such as the Louvain algorithm (default) or SLM [SLM, Blondel et al., Journal of Statistical Mechanics], to iteratively group cells together, with the goal of optimizing the standard modularity function. The FindClusters() function implements this procedure, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters. We find that setting this parameter between 0.4-1.2 typically returns good results for single-cell datasets of around 3K cells. Optimal resolution often increases for larger datasets. The clusters can be found using the Idents() function.

https://zhuanlan.zhihu.com/p/706006172

https://blog.csdn.net/qq_39451578/article/details/119708909

包含原文献

Label transfer/Label

https://blog.csdn.net/wangjunliang/article/details/130650931

Ref

https://satijalab.org/seurat/articles/multimodal_reference_mapping.html https://satijalab.org/seurat/articles/covid_sctmapping

函数

FindTransferAnchors and TransferData.

https://blog.csdn.net/weixin_53637133/article/details/138021006

函数原理/空间转录组单细胞的理论来源>?

http://www.360doc.com/content/23/0704/09/65403234_1087255220.shtml

空间转录组数据单细胞测序分析代码

https://github.com/satijalab/seurat/issues/8323

机器学习术语

https://zhuanlan.zhihu.com/p/152408012

简单流程

https://www.bilibili.com/video/BV1m34y1i7Vf/?vd_source=07b3ae20355d18fee7c57b4801b7d648

数据准备

Pre-processing

以10X cellranger的数据为例（商业软件给出矩阵）

FASTAQ文件开始

Quality control

去除双细胞

常用方法

Python

DoubletDelection

DoubletFinder

文献来源

Xi,Nan Mies Cell system 2021

Normalization

对10X，先归一化到10000，然后log1p

文献来源

Tian Luyi et al Nature methods 2019

Data correction and integration

correction

文献来源

A systematic evaluation of single-cell RNA-sequencing imputation methods

较新的Imputation工具？

WEDGE:imputation of gene expression values from single-cell RNA-seq datasets using biased matrix decomposition

主要目的可能是展示基因，而不会提升后续分析质量

integration

文献来源

Luecken Malte D Nature methods 2022

工具

scANVI, Scanorama, scVI

A benchmark of batch-effect correction methods for single-cell RNA sequencing data

工具

Harmony

LIGER

Seurat3

Feature selection

文献

Evaluation of tools for highly variable gene discovery from single-cell RNA-seq data

主流

Seurat3

dimensionality reduction and visualization

降维

提取主要信息，展示

PCA,Umap,tsne,lsi

文献

Assessing single-cell transcriptomic variablity through density-preserving data visualization

Visualizing structure and transitions in high-dimensional biological data

有很多新的改进方法

Cluster analysis&annotation

Cluster

scanpy

Seurat

kiselev Nature Revies Genetics2019

Annotation

建议手动注释

Abdelaal,et al Genome biology

分析

Compositional analysis

Trajectory analysis

RNA velocity of single cells

Generalizaing RNA velocity to transient cell states through dynamical modeling

Gene-level analysis(DEGs, Enrichment, GRN

差异分析

Squair Jorand W nature communications 2021

Pratapa Aditya Nature methods 2020

基因调控网络分析

GENIE3?

富集分析

ClusterProfile

包括

GO-Term,GSEA,KEGG

其他分析

如肿瘤中

Deconlution

metacell

基因模块

等等

多组学

详细流程

1.读取数据部分

#设置工作区间位置 setwd('J:/RNAfil/M-GSGC0383944-report_result/result/02.cellranger/data/NR-mOD069') setwd('F:/RNAfil/M-GSGC0383951-report_result/02.cellranger/data/NR-mOD075') Sys.setenv('R_MAX_VSIZE'=32000000000) #2024.6.8 data analysis #读入主要包 library(dplyr) library(Seurat)

分别读gene,barcode,mtx三个数据并整理成Seurat对象 dir.10x = './filtered_feature_bc_matrix/' genesctr = read.table(paste0(dir.10x, 'features.tsv.gz'), stringsAsFactors=F, header=F)$V2 genesctr = make.unique(genesctr, sep = '.') barcodesctr = readLines(paste0(dir.10x, 'barcodes.tsv.gz')) mtxctr = Matrix::readMM(paste0(dir.10x, 'matrix.mtx.gz')) mtx = as(mtxctr, 'CsparseMatrix') #'dgcmatrix��ʱ�ˣ��뻻��seurat' colnames(mtxctr) = barcodesctr rownames(mtxctr) = genesctr pbmcctr = CreateSeuratObject(counts = mtxctr, project = "pbmcctr3k",min.cells = 3, min.features = 200) head(pbmcctr@meta.data,5) #或者 pbmcctr<-Read10X(dir.10x)

barcodes.tsv

features.tsv

matrix.mtx

关于稀疏矩阵重构的知识介绍

https://blog.csdn.net/Nh_code/article/details/125341918

https://blog.csdn.net/jeffery0207/article/details/122507934

跟read10x结果对比下

colnames(mtxctr) = barcodesctr rownames(mtxctr) = genesctr

install.packages("Matrix") ; library(Matrix)

参考资料 Working with a sparse matrix in R - Kamil Slowikowski (slowkow.com) Instructions for using Python SciPy sparse matrix | Develop Paper R convert matrix or data frame to sparseMatrix - Stack Overflow

想问如何使用R语言读取mtx文件？？？其实非常简单，使用Matrix包的readMM函数就行。 matrix_data <- Matrix::readMM("C:\\Users\\yuanz\\Documents\\kugay\\HW_R_3_data\\HW_R_3_data\\filtered_gene_bc_matrices\\matrix.mtx") 上面matrix_data是稀疏矩阵，要转换为dataframe也非常简单 my_summary <- summary(matrix_data) ———————————————— 版权声明：本文为博主原创文章，遵循 CC 4.0 BY-SA 版权协议，转载请附上原文出处链接和本声明。原文链接：https://blog.csdn.net/yuanzhoulvpi/article/details/106623638

https://huaweicloud.csdn.net/638f0f9fdacf622b8df8e1f5.html

读取矩阵错误处理办法

Seurat对象里具体有什么？

https://blog.csdn.net/m0_45210226/article/details/127759572

https://www.jianshu.com/p/3401ec1f3d18

https://zhuanlan.zhihu.com/p/703033246

源代码分析

https://rdrr.io/

网站查询最快

getAnywhere(Read10X())（查看源代码方法）

https://cloud.tencent.com/developer/article/2115895

源代码

ReadMM

https://rdrr.io/cran/Matrix/src/R/HBMM.R

Read10X

https://zhuanlan.zhihu.com/p/464317172

1. 掉包侠代码 cellranger 处理fastq后的输出文件是3个稀疏矩阵文件。 # 调用的代码 # load data pbmc.data <- Read10X(data.dir = "~/data/scScripts/backup/data/pbmc3k/filtered_gene_bc_matrices/hg19/") class(pbmc.data) dim(pbmc.data) #32738 270

结果

CreateSeuratObject

https://github.com/satijalab/seurat-object/blob/main/man/CreateSeuratObject.Rd

2.数据预处理部分

数据筛选

#数据处理 #处理多细胞和线粒体基因 pbmcctr[["percent.mt"]] <- PercentageFeatureSet(pbmcctr, pattern = "^mt-") VlnPlot(pbmcctr, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) #FeatureScatter通常用于可视化 feature-feature 相关性， #nCount_RNA与nFeature_RNA的相关性 plot1 <- FeatureScatter(pbmcctr, feature1 = "nCount_RNA", feature2 = "percent.mt") plot2 <- FeatureScatter(pbmcctr, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") plot1 + plot2 #我设计的筛选出数字的方法 plot.new() plot(pbmcctr$nCount_RNA,pbmcctr$nFeature_RNA,col="red") #添加直线 abline(h=quantile(pbmcctr$nFeature_RNA,0.99),col="red") abline(v=quantile(pbmcctr$nCount_RNA,0.99),col="red") text(quantile(pbmcctr$nCount_RNA,0.99),quantile(pbmcctr$nFeature_RNA,0.99),label=quantile(pbmcctr$nCount_RNA,0.99)) text(quantile(pbmcctr$nCount_RNA,0.99)+7000,quantile(pbmcctr$nFeature_RNA,0.99),label=quantile(pbmcctr$nFeature_RNA,0.99)) pbmcctr <- subset(pbmcctr, subset = nFeature_RNA > 200 & nFeature_RNA <6365 & percent.mt < 5) （根据标准去筛选，具体数字可变）

结果

VlnPlot函数解读

https://satijalab.org/seurat/reference/vlnplot

代码

https://github.com/satijalab/seurat/blob/72a0c7b61adcf1d6630db9e952bf8b0d4f979295/R/visualization.R#L551

单细胞小提琴图原理分析

https://cloud.tencent.com/developer/article/1677923

堆叠小提琴图

https://cloud.tencent.com/developer/article/2317524

需要提前处理数据类型

对比ggplot

Left vln_df = data.frame(FOXP3 = obj[["RNA"]]@data["FOXP3",], cluster = obj$seurat_clusters) ggplot(vln_df, aes(x = cluster, y = FOXP3)) + geom_violin(aes(fill = cluster), trim=TRUE, scale = "width") + geom_jitter() Right VlnPlot(obj, features = "FOXP3")

https://github.com/satijalab/seurat/issues/3322

nFeature,nCount,percent.mt解读

https://www.biostars.org/p/407036/

nFeature_RNA代表每个细胞测到的基因数目。 nCount_RNA代表每个细胞测到所有基因的表达量之和。 percent.mt代表测到的线粒体基因的比例。

代码

结果

PercentageFeatureSet函数

Calculate the percentage of all counts that belong to a given set of features Source: R/utilities.R This function enables you to easily calculate the percentage of all the counts belonging to a subset of the possible features for each cell. This is useful when trying to compute the percentage of transcripts that map to mitochondrial genes for example. The calculation here is simply the column sum of the matrix present in the counts slot for features belonging to the set divided by the column sum for all features times 100. PercentageFeatureSet( object, pattern = NULL, features = NULL, col.name = NULL, assay = NULL ) Arguments object A Seurat object pattern A regex pattern to match features against features A defined feature set. If features provided, will ignore the pattern matching col.name Name in meta.data column to assign. If this is not null, returns a Seurat object with the proportion of the feature set stored in metadata. assay Assay to use Value Returns a vector with the proportion of the feature set or if md.name is set, returns a Seurat object with the proportion of the feature set stored in metadata. Examples data("pbmc_small") # Calculate the proportion of transcripts mapping to mitochondrial genes # NOTE: The pattern provided works for human gene names. You may need to adjust depending on your # system of interest pbmc_small[["percent.mt"]] <- PercentageFeatureSet(object = pbmc_small, pattern = "^MT-")

FeatureScatter函数

https://bioinformatics.stackexchange.com/questions/13327/how-to-specifically-select-cluster-on-featurescatter-on-seurat

另外的应用

https://samuel-marsh.github.io/scCustomize/reference/Split_FeatureScatter.html

我设计的能显示筛选数的方法

数据正态化处理

#��һ�� pbmcctr <- NormalizeData(pbmcctr, normalization.method = "LogNormalize", scale.factor = 10000) # Checkpoint

NormalizeData函数

Arguments object An object ... Arguments passed to other methods normalization.method Method for normalization. “LogNormalize”: Feature counts for each cell are divided by the total counts for that cell and multiplied by the scale.factor. This is then natural-log transformed using log1p “CLR”: Applies a centered log ratio transformation “RC”: Relative counts. Feature counts for each cell are divided by the total counts for that cell and multiplied by the scale.factor. No log-transformation is applied. For counts per million (CPM) set scale.factor = 1e6 scale.factor Sets the scale factor for cell-level normalization margin If performing CLR normalization, normalize across features (1) or cells (2) block.size How many cells should be run in each chunk, will try to split evenly across threads verbose display progress bar for normalization procedure assay Name of assay to use Value

UMI含义

https://mp.weixin.qq.com/s?__biz=Mzg3MjE5NjM0NA==&mid=2247487703&idx=1&sn=612e265368bae748983b199548091785&chksm=cef3a140f98428567205c88f1a424b730521d0ff62631241a5ed9d843bbe20ff2c1b69dc448e&scene=27

特异性分子标签（Unique Molecular Indentifier，UMI）是一段随机化或特定的核苷酸短序列，通常设计为完全随机的核苷酸链（如NNNNNN），部分简并核苷酸链（如NNNRNYN）或者固定核苷酸链（在模板分子有限的情况下）。

scale.factor与UMI的关系

https://zhuanlan.zhihu.com/p/539906426

scale.factor=10000的原因

https://blog.csdn.net/m0_45210226/article/details/127759572

新的转换方案ScCustom

https://blog.csdn.net/jiangshandaiyou/article/details/130360370

pbmcctr <- FindVariableFeatures(pbmcctr, selection.method = "vst", nfeatures = 2000) #这一步可以提取出所有基因名字，如果对该次分析的基因名字不清楚，可以导出all.genes查看 top10 <- head(VariableFeatures(pbmcctr), 10) #��ǰʮ�ǣ� all.genes <- rownames(pbmcctr)

分析含义

https://www.jianshu.com/p/0fbff6853305 高变异基因就是highly variable features（HVGs），就是在细胞与细胞间进行比较，选择表达量差别最大的基因，Seurat使用FindVariableFeatures函数鉴定高可变基因，这些基因在不同细胞之间的表达量差异很大（在一些细胞中高表达，在另一些细胞中低表达）。默认情况下，会返回2,000个高可变基因用于下游的分析，如PCA等。利用FindVariableFeatures函数，会计算一个mean-variance结果，也就是给出表达量均值和方差的关系并且得到top variable features,这一步的目的是鉴定出细胞与细胞之间表达量相差很大的基因，用于后续鉴定细胞类型。 2 FindMarkers()–寻找差异表达基因: Seurat使用FindMarkers和FindAllMarkers函数进行差异表达基因的筛选两篇很好的单细胞分析问答贴Confusion about FindMarkers(), FindVariableFeatures(),RunTSNE(), and RunUMAP() in seurat package： https://www.biostars.org/p/406388/ https://www.jianshu.com/p/5a06ebfba7bd ———————————————— 版权声明：本文为博主原创文章，遵循 CC 4.0 BY-SA 版权协议，转载请附上原文出处链接和本声明。原文链接：https://blog.csdn.net/watermel__/article/details/121274775

https://www.jianshu.com/p/0fbff6853305

单细胞转录组/FindMarkers

https://blog.csdn.net/prublue/article/details/122196661

pbmcctr <- ScaleData(pbmcctr, features = all.genes)

harmony与ScaleData的影响

https://so.csdn.net/so/search?spm=1001.2101.3001.4498&q=ScaleData&t=&u=

为什么要做这一步

1 after NormalizeData() function, why ScaleData() function is needed ? NormalizeData() only accounts for the depth of sequencing in each cell (reads*10000 divide by total reads, and then log). ScaleData() zero-centres and scales it (See ?ScaleData). Scaling (mean/sd) is done to bring the gene expressions in same range otherwise, the huge difference in ranges of gene-expression will not allow comparing the expression across the genes. Scaling is a routine thing to do for enhancing clustering or other analyses. (You may also like to see scale() function in R) In the recent versions of Seurat, the ScaleData function is also used to regress out unwanted variables.

3 is ScaleData() absolutely needed in the scRNA-seq analysis ? Scaling is not inherent to scRNA-Seq. It is an important aspect of many machine learning / dimensional reduction algorithms where the distance between the features is compared. If you don't scale, the feature which has large range of variation might dominate/bias your analysis (because they will get large distances). Scaling "normalizes" this large variations among the features.

https://github.com/satijalab/seurat/issues/950

源代码分析

https://zhuanlan.zhihu.com/p/485224034

数据降维

降维原理

UMAP,TSNE

https://blog.csdn.net/weixin_53637133/article/details/138210252?utm_medium=distribute.pc_relevant.none-task-blog-2~default~baidujs_baidulandingword~default-5-138210252-blog-138020914.235^v43^pc_blog_bottom_relevance_base4&spm=1001.2101.3001.4242.4&utm_relevant_index=8

比较这两种方法

主成分分析

https://blog.csdn.net/weixin_53637133/article/details/138020914

原理

主成分分析中，第一主成分和第二主成分是通过数据降维过程中的方差最大化原则来区分的。第一主成分（PC1）是数据中方差最大的方向，它解释了数据集中最大的变异；而第二主成分（PC2）是与第一主成分正交（即垂直）的方向上方差次大的方向，它解释了除去第一主成分所解释的变异后剩余变异中的最大部分。要详细区分第一主成分和第二主成分，可以从以下几个方面进行讲解： 1. 方差的解释：主成分分析的目标是通过转换原始变量为新的正交变量（即主成分），使得这些新变量的方差最大化。第一主成分（PC1）是方差最大的方向，它反映了数据集中最大的变异；而第二主成分（PC2）是在与第一主成分垂直的平面上方差最大的方向，它解释了除去第一主成分所解释的变异后剩余变异中的最大部分。 2. 几何意义：在二维空间中，第一主成分和第二主成分可以分别看作是数据点的“最长”和“次长”的投影方向。第一主成分（PC1）是数据点投影后方差最大的方向，而第二主成分（PC2）是与第一主成分垂直的方向上投影后方差次大的方向。

3. 计算过程：在计算主成分时，首先计算协方差矩阵，然后求该矩阵的特征值和特征向量。特征值的大小反映了各个主成分的方差大小，而特征向量则代表了各个主成分的方向。通常，将特征值按从大到小的顺序排列，对应的特征向量就是各主成分的方向。第一主成分对应的特征值最大，第二主成分对应的特征值次大，以此类推。 4. 实际应用：在实际应用中，第一主成分和第二主成分通常用于数据的可视化和降维。例如，在二维散点图中，可以用第一主成分和第二主成分作为坐标轴，将高维数据投影到二维平面上进行可视化。同时，在数据降维时，可以选择保留方差贡献率较高的前几个主成分，以达到减少数据维度、保留主要信息的目的。通过以上几个方面的讲解，我们可以清晰地理解主成分分析中第一主成分和第二主成分的区别和联系。在实际应用中，我们可以根据具体需求选择合适的主成分进行分析和处理。

机器学习之线性判别分析（LDA) 主成分分析(PCA)

https://blog.csdn.net/linyanqing21/article/details/50938948

Logistic factor analysis(LFA)

https://learn.gencore.bio.nyu.edu/single-cell-rnaseq/seurat-part-3-data-normalization/

数据降维

DimPlot函数

https://blog.csdn.net/qq_45759229/article/details/139078539?utm_medium=distribute.pc_relevant.none-task-blog-2~default~baidujs_baidulandingword~default-1-139078539-blog-117712052.235^v43^pc_blog_bottom_relevance_base2&spm=1001.2101.3001.4242.2&utm_relevant_index=4

优化图样式

http://www.360doc.com/content/24/0430/20/82995563_1122020662.shtml

函数部分

Seurat

函数

PercentageFeatureSet: Calculate the percentage of all counts that belong to a given set of features

https://www.rdocumentation.org/packages/Seurat/versions/5.0.1/topics/PercentageFeatureSet

PercentageFeatureSet( object, pattern = NULL, features = NULL, col.name = NULL, assay = NULL )

data("pbmc_small") # Calculate the proportion of transcripts mapping to mitochondrial genes # NOTE: The pattern provided works for human gene names. You may need to adjust depending on your # system of interest pbmc_small[["percent.mt"]] <- PercentageFeatureSet(object = pbmc_small, pattern = "^MT-")

FeatureScatter: Scatter plot of single cell data

FeatureScatter( object, feature1, feature2, cells = NULL, shuffle = FALSE, seed = 1, group.by = NULL, cols = NULL, pt.size = 1, shape.by = NULL, span = NULL, smooth = FALSE, combine = TRUE, slot = "data", plot.cor = TRUE, raster = NULL, raster.dpi = c(512, 512), jitter = FALSE )

data("pbmc_small") FeatureScatter(object = pbmc_small, feature1 = 'CD9', feature2 = 'CD3E')

NormalizeData: Normalize Data

Normalize the count data present in a given assay.

NormalizeData(object, ...) # S3 method for default NormalizeData( object, normalization.method = "LogNormalize", scale.factor = 10000, margin = 1, block.size = NULL, verbose = TRUE, ... ) # S3 method for Assay NormalizeData( object, normalization.method = "LogNormalize", scale.factor = 10000, margin = 1, verbose = TRUE, ... ) # S3 method for Seurat NormalizeData( object, assay = NULL, normalization.method = "LogNormalize", scale.factor = 10000, margin = 1, verbose = TRUE, ... )

文章

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6865822/ AD单细胞测序 OL相关

官网教程

https://satijalab.org/seurat/articles/pbmc3k_tutorial

多组的整合分析

https://cran.r-project.org/web/packages/harmony/vignettes/Seurat.html

https://satijalab.org/seurat/articles/integration_introduction.html

Seurat官网整合流程

The following tutorial is designed to give you an overview of the kinds of comparative analyses on complex cell types that are possible using the Seurat integration procedure. Here, we address a few key goals: Identify cell subpopulations that are present in both datasets Obtain cell type markers that are conserved in both control and stimulated cells Compare the datasets to find cell-type specific responses to stimulatio

参考数据Transferdata

https://cloud.tencent.com/developer/article/1931217

Anchor方法

http://www.bio-info-trainee.com/9486.html

https://blog.csdn.net/lijianpeng0302/article/details/133690940

cca和harmony

https://blog.csdn.net/zfyyzhys/article/details/142314175

https://blog.csdn.net/lijianpeng0302/article/details/133690940

整合方法比较（代码比较详细，这个可以看下）

http://www.360doc.com/content/21/0819/22/76149697_991789987.shtml#google_vignette

https://cloud.tencent.com/developer/article/1931217

这个例子用的panc8，装了azimouth可以试一下、

多样本整合新方法

https://zhuanlan.zhihu.com/p/459759699

直接用azimuth数据比对的话

https://satijalab.github.io/azimuth/articles/run_azimuth_tutorial.html#scrna-seq-queries

Azimouth

https://satijalab.github.io/azimuth/articles/run_azimuth_tutorial.html

解决 “无法在貯藏處https://bioconductor.org/packages/3.16/bioc/src/contrib中读写索引: cannot open URL ”

梅子青时雨711 于 2023-06-10 16:31:09 发布阅读量1w 收藏 15 点赞数 27 文章标签： r语言版权 GitCode 开源社区文章已被社区收录加入社区 R中安装包时出现如下报错： Warning: 无法在貯藏處https://bioconductor.org/packages/3.16/bioc/src/contrib中读写索引: cannot open URL 'https://bioconductor.org/packages/3.16/bioc/src/contrib/PACKAGES' 解决方法：在文末添加如下语句或在R/RStudio终端下键入： options(BioC_mirror="https://mirrors.tuna.tsinghua.edu.cn/bioconductor") ———————————————— 版权声明：本文为博主原创文章，遵循 CC 4.0 BY-SA 版权协议，转载请附上原文出处链接和本声明。原文链接：https://blog.csdn.net/weixin_62940906/article/details/131143554

Azimouth data cell marker

整合的报错问题

https://github.com/satijalab/seurat/issues/8718

之前对比不同对象遇到过这个问题

SCTransform

https://satijalab.org/seurat/articles/sctransform_vignette.html

统计cluster中某基因细胞数量

https://github.com/satijalab/seurat/issues/371

统计每个CLUSTER数量的另外一个办法

https://www.biostars.org/p/399234/

提取现在的子集进行下一步分析

https://github.com/satijalab/seurat/issues/752

GEO数据下载

https://www.bilibili.com/video/BV1WC4y1j75X/?vd_source=07b3ae20355d18fee7c57b4801b7d648

讲了有哪些数据类型

https://www.bilibili.com/video/BV1Ky42167JA/?vd_source=07b3ae20355d18fee7c57b4801b7d648

比较简单的教程

有代码

SRA原始数据下载

https://blog.csdn.net/u010608296/article/details/121878169

https://www.jianshu.com/p/7022f368554b

子主题

好的应用例子和文章

相关技术

10X visium

https://www.10xgenomics.com/products/spatial-gene-expression

10X Visium（55um）进行高分辨率降维分析

https://blog.csdn.net/weixin_53637133/article/details/142459997

文献

Spatiotemporal transcriptomic maps of whole mouse embryos at the onset of organogenesis

https://www.nature.com/articles/s41588-023-01435-6

Profiling spatiotemporal gene expression of the developing human spinal cord and implications for ependymoma origin

https://www.nature.com/articles/s41593-023-01312-9

统计检验

置换检验

https://mp.weixin.qq.com/s?__biz=MzIzMDgzNzk4NQ==&mid=2247486046&idx=1&sn=448eda8b25c8d5746dc06ee3746d913a&chksm=e8ac11d0dfdb98c6d10cccfedab2cd7039a52b84953f2112723d26038df56a47fdf6dc8d8c42&scene=27

https://blog.csdn.net/weixin_41880581/article/details/110385826

https://www.plob.org/article/24942.html

文献

https://pmc.ncbi.nlm.nih.gov/articles/PMC6865822/

临时空间不足问题

https://blog.csdn.net/fangshan66/article/details/128664453

更改工作路径，Libpath

https://blog.csdn.net/qq_45642410/article/details/116858793

UMAP原理

1.流形

https://blog.csdn.net/liangwqi/article/details/142861214?utm_medium=distribute.pc_relevant.none-task-blog-2~default~baidujs_baidulandingword~default-0-142861214-blog-127318582.235^v43^pc_blog_bottom_relevance_base2&spm=1001.2101.3001.4242.1&utm_relevant_index=3

流形（manifold）流形是几何中的一个概念，它是高维空间中的几何结构，即空间中的点构成的集合。可以简单的将流形理解成二维空间的曲线，三维空间的曲面在更高维空间的推广。下图是三维空间中的一个流形，这是一个卷曲面，像一个瑞士卷一样，这个图就表示观察到的数据是三维的，但其本质是一个二维流形，因为曲面是2维的。我们可以想象成输入的数据是三维的，但真正表征这个数据的核心特征就是一个二维的，其余的都是维度都是冗余的，所以这里的二维流形也就是表征这个数据的核心特征！所以深度学习的本质就是说某些高维数据，实际是一种低维的流形结构嵌入在高维空间中，这个低维流型结构就是我们提取得到的重要特征。图上所标注的两个圈圈，在流形（把卷展开）上本距离非常远，但是用三维空间的欧氏距离来计算则它们的距离要近得多。： 2维空间中的曲线，3维空间中的曲线可以看做是2维和3维空间中的1维流形，因为曲线是1维的。而3维空间中的曲面可以看做是2维的流形，因为曲面是2维的。n维空间中的m维流形就是具有m维几何形状的一个子集，在这里，m小于n。流形学习（manifold learning）是机器学习、模式识别中的一种方法，在维数约简方面具有广泛的应用。它的主要思想是将高维的数据映射到低维，使该低维的数据能够反映原高维数据的某些本质结构特征。流形学习的前提是有一种假设，即某些高维数据，实际是一种低维的流形结构嵌入在高维空间中。流形学习的目的是将其映射回低维空间中，揭示其本质。 ———————————————— 版权声明：本文为博主原创文章，遵循 CC 4.0 BY-SA 版权协议，转载请附上原文出处链接和本声明。原文链接：https://blog.csdn.net/Rolandxxx/article/details/127318582

中文简单介绍

https://blog.csdn.net/deephub/article/details/121302082?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522817bca8cccf737f70cd68b5913865368%2522%252C%2522scm%2522%253A%252220140713.130102334..%2522%257D&request_id=817bca8cccf737f70cd68b5913865368&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduend~default-2-121302082-null-null.142^v100^pc_search_result_base8&utm_term=umap%E5%8E%9F%E7%90%86&spm=1018.2226.3001.4187

黎曼度量

https://zhuanlan.zhihu.com/p/338853339

k-means算法

https://zhuanlan.zhihu.com/p/362596454

python版算法实现

多组学整合

scRNA-seq与scATAC-seq整合

https://zhuanlan.zhihu.com/p/694315872

https://www.sohu.com/a/805068063_121123706

研究思路流程，这个图做的很好，可以吸纳

https://zhuanlan.zhihu.com/p/8456250574

https://blog.csdn.net/weixin_53637133/article/details/138602444

scJoint 新方法

https://www.zhihu.com/question/420274230/answer/3477131054

ATAC代码

https://zhuanlan.zhihu.com/p/348713300

讲述具体方法和优劣

scATAC-seq

分析流程，有详细代码

scATAC相关研究方法

https://www.163.com/dy/article/JP3CHPIK0532HZIO.html

综上所述，scAGDE模型通过深度学习和图嵌入技术，创新性地解决了单细胞ATAC-seq数据分析中的多个挑战，特别是在数据降维、细胞聚类和转录调控机制解析方面。它能够高效处理数据的稀疏性，精准识别细胞类型特异性的染色质调控元素，并揭示细胞功能的微妙差异。scAGDE通过自编码器和图神经网络结合，捕捉细胞间的拓扑关系，自动化聚类并减少对先验知识的依赖。在实际应用中，scAGDE展示了其在神经和免疫细胞调控机制中的强大分析能力，尤其在区分功能上具有细微差异的细胞群体时表现突出。

整合数据方法

GTAD

https://academic.oup.com/bib/article/25/1/bbad469/7484598

SPACIALSCOPE

https://www.nature.com/articles/s41467-023-43629-w

Signac分析ATAC workflow

https://stuartlab.org/signac/articles/pbmc_vignette.html

子主题https://stuartlab.org/signac/articles/mouse_brain_vignette

https://cloud.tencent.com/developer/article/2417241

整合单细胞ATAC与基因组序列，机器学习分析ATAC新方法SANGO

通过利用伪时间空间（pseudo-time-space，PSTS）的方法（https://www.nature.com/articles/s41467-023-43120-6）结合时间轴发育和空间分布，明确核心转录因子对表达的时空特异性。

介绍整合的review

https://pmc.ncbi.nlm.nih.gov/articles/PMC11811497/pdf/VJGB-28-8-24104.pdf

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02706-x

scdart

一种整合方法

介绍scATAC比bulk-ATAC的优势的文章

https://pmc.ncbi.nlm.nih.gov/articles/PMC11779887/pdf/41598_2025_Article_87351.pdf

参考比对数据来源

https://pmc.ncbi.nlm.nih.gov/articles/PMC10719113/#notes3

工作流程

1.利用Genomic Ranges将ATAC峰值数据标注到genomic ranges中

2.结合JASPAR数据库中的motif，准备PFM数据库

PFM概念

https://blog.csdn.net/weixin_46128755/article/details/125649496

Sequence Logo 如下图所示(图片来自JASPAR)，这是我们在文章中或motif分析中经常是使用的图，用来表征序列的一致性和多样性。字母越大，代表在该位置出现该核苷酸或者氨基酸的概率越大，常用Bits或者百分比表示。 Dof2

3.利用motifmatchr包找出ATAC中有哪些motif

需要参考数据来源

Single-cell DNA methylome and 3D multi-omic atlas of the adult mouse brain

https://pmc.ncbi.nlm.nih.gov/articles/PMC10719113/#notes3

4.按照共同占据同一ATAC峰值区域为标准，挖掘出TFBS数据中各类细胞内的转录因子对。利用ATAC基因上的共表达分析和以超几何分布为基础的共富集分析找出共表达于同一基因特定区域，共富集分析中明显差异的各共表达转录因子对。

5.利用Seurat分出神经元，OL，星形胶质细胞，小胶质细胞及室管膜细胞等细胞类型，结合③中整理的转录因子对，利用深度学习方法建立共表达转录因子对-靶基因的基因调控网络，利用Seurat的FindMarkers函数进一步确认不同年龄组间的差异基因，确定各细胞类型中受转录因子的调控的靶基因。

6.用annotatr包将scATAC-seq中的峰值数据分类为启动子和增强子。

使用方法

https://blog.csdn.net/weixin_46605479/article/details/130867433

7.利用Python中的Pytorch,利用SHAP中的SI值计算，找出对TG影响最明显的co-TF组合

https://zhuanlan.zhihu.com/p/663286762

https://christophm.github.io/interpretable-ml-book/shap.html

GITHUB教程

https://www.zhihu.com/question/404524150

SHAP的代码

理论文章

http://proceedings.mlr.press/v119/sundararajan20a/sundararajan20a.pdf

eQTL概念

‌eQTL（Expression Quantitative Trait Loci）‌是指染色体上一些能特定调控mRNA和蛋白质表达水平的区域，其mRNA/蛋白质的表达水平量与数量性状成比例关系。eQTL分析是将基因表达水平的变化和基因型连接起来，研究遗传突变与基因表达的相关性‌ 1 2 。 eQTL的基本概念和分类 eQTL可以分为两类： ‌顺式eQTL（cis-eQTL）‌：主要指与所调控基因相距较近的eQTL，通常位于所调控基因的上下游1Mb区域内。 ‌反式eQTL（trans-eQTL）‌：与顺式eQTL相反，指距离所调控基因位置较远的eQTL，有时距离甚至超过5Mb‌ 2 4 。 eQTL的应用和研究方法 eQTL分析通常使用线性回归模型，将SNP位点与基因表达量进行相关性分析。具体步骤包括： ‌数据准备‌：需要样本信息文件、基因表达量文件和基因型数据。 ‌数据分析‌：使用线性回归模型分析SNP和基因表达量的关系，例如：gene1 \~ snp1 + sex + age + error_term，其中gene1是因变量，表示一个基因的表达量，snp1是自变量，表示一个SNP位点‌ 2 。 eQTL在复杂疾病研究中的应用案例在帕金森病（PD）的研究中，通过整合多种组织（如血液、脑脊液和大脑）的eQTL和pQTL数据集，应用孟德尔随机化、共定位分析等方法，鉴定出与PD相关的潜在致病基因。例如，大脑背外侧前额叶皮层的pQTL分析鉴定出多个与PD风险相关的蛋白质，进一步揭示了遗传因素在PD发病中的作用机制‌ 5 。

详见

https://blog.csdn.net/watermel__/article/details/130778911

qTL结合GWAS分析案例

https://zhuanlan.zhihu.com/p/661636845

8.根据筛选出的co-TF对，进一步筛选出其靶基因，重新构建核心基因调控网络。

原文

a. We chose one cooperative TF pair for each of the six key TFs based on the top interaction scores. b. We built a gene regulatory network that has linkages for cooperative TF pairs to TGs. c. We selected TGs that are co-regulated by cooperative TF pairs. d. We generated a network plot using Cytoscape56

9.TF做层次分析

https://blog.51cto.com/u_16213371/11954330

调控网络分析

https://inarwhal.github.io/NetworkAnalysisR-book/ch9-Network-Centrality-Hierarchy-R.html

10.eQTL验证

https://www.ebi.ac.uk/eqtl/

线上资源

延伸研究

https://pmc.ncbi.nlm.nih.gov/articles/PMC11587609/

scRNA-seq与空间转录组整合

时空组学2025年趋势

https://mp.weixin.qq.com/s/ue3EVzVdhYAGLf1TVhjYvw

https://it.sohu.com/a/810093857_121123706

研究思路流程，这个图做的很好，可以吸纳

https://cloud.tencent.com/developer/article/1970755

https://mp.weixin.qq.com/s?__biz=MzI2NjM4MjM3NA==&mid=2247509082&idx=2&sn=a9a4f44eedb6074af85a682f728e21e2&chksm=ea8c1913ddfb9005f5d3f4c49e8410643bf4f4f09eb0402b4925f545c9d77e3b5d5aeb7009d5&scene=27

联合空间转录组的优势

https://zhuanlan.zhihu.com/p/676902091

其中有几张图的做法可以思考下，比如多个细胞cluster的GO 分析？

学习课程流程表格

https://mp.weixin.qq.com/s/BAzZ_h88iYfV2YcDUdLzyw

https://www.zhihu.com/search?type=content&q=%E5%A4%9A%E7%BB%84%E5%AD%A6%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90%E7%8E%B0%E7%8A%B6%20%E5%8D%95%E7%BB%86%E8%83%9E%20%E7%A9%BA%E9%97%B4%E7%BB%84%E5%AD%A6

文献

https://pmc.ncbi.nlm.nih.gov/articles/PMC11725393/

https://pmc.ncbi.nlm.nih.gov/articles/PMC11399094/

SpatialGlue

整合方法

机器学习Convnet 2021.8

https://zenodo.org/records/3948711

详细代码

CoSTA：用于空间转录组分析的无监督卷积神经网络学习方法

https://cloud.tencent.com/developer/article/1970848

利用visium整合单细胞，空间转录组代码，详细

https://www.10xgenomics.com/jp/analysis-guides/integrating-10x-visium-and-chromium-data-with-r

另一种相关代码

https://satijalab.org/seurat/articles/spatial_vignette

Seurat提示的两种

https://satijalab.org/seurat/articles/spatial_vignette

https://satijalab.org/seurat/articles/visiumhd_analysis_vignette

Spatioscope

https://www.nature.com/articles/s41467-023-43629-w

Spatioglue

分析方法，·空间组学

TrimNN

https://pmc.ncbi.nlm.nih.gov/articles/PMC11774463/#S31

STAIG

STAIG: Spatial transcriptomics analysis via image-aided graph contrastive learning for domain exploration and alignment-free integration

https://pmc.ncbi.nlm.nih.gov/articles/PMC11772580/

工作流程

1.利用spatialGLUE整合st-ATAC-RNA数据

https://www.biorxiv.org/content/10.1101/2023.04.26.538404v2.full.pdf

时间轴上的一些方案

Visualizing gene expression changes in time, space, and single cells with expressyouRcell

https://pmc.ncbi.nlm.nih.gov/articles/PMC10220493/

拟时序分析

Monocle3

‌拟时序分析（Pseudo-temporal Analysis）是一种在生物学研究中常用的方法，特别是在‌单细胞RNA测序（‌scRNA-seq）数据分析中。这种方法旨在通过模拟细胞发育或分化过程中的时间顺序，来理解和解析细胞状态的变化。尽管实际的细胞样本可能是在不同时间点或条件下收集的，但拟时序分析能够基于细胞的基因表达模式，推断出一个假定的时间轨迹，从而揭示细胞状态随时间变化的动态过程。

结合图像的PTS分析！

转录组

ST-genome

免疫荧光染色信息！

https://www.nature.com/articles/s41467-023-43120-6

深度学习相关

网格组织形式

https://cloud.tencent.com/developer/article/1082134

公众号方法总结

https://mp.weixin.qq.com/s/kKIJHNKVdW5jcL0gWGrdMA

mixOmics

https://mp.weixin.qq.com/s/zOaxMQvnxn7LWuKZpu2-ew

https://www.bioconductor.org/packages/release/bioc/vignettes/mixOmics/inst/doc/vignette.html

所有组学的分析方法

https://mp.weixin.qq.com/s/9uHHV5TGYqlkBtA1CyvXGg

集成多组学数据的机器学习在生物医学中的应用

https://zhuanlan.zhihu.com/p/531132967

多组学大数据与医学发展

https://mp.weixin.qq.com/s?__biz=MzA3MDM4MDkyMA==&mid=2651055372&idx=2&sn=5486bbdfc8d913a44a371d250fe4386f&chksm=85d695f538e2416c530f51ac64a9f438a86b06ecdf524d80a3bca92cdc5441707fa361f27563&scene=27

https://zhuanlan.zhihu.com/p/661944027

WNN整合多组学，有代码流程

https://blog.csdn.net/wangjunliang/article/details/143192137

什么是WNN？

有代码的整合ATAC和scRNA-seq流程

https://blog.csdn.net/weixin_53637133/article/details/138268441

细胞间相互作用，细胞通讯

https://www.nature.com/articles/s41576-020-00292-x

: 09 November 2020

https://pmc.ncbi.nlm.nih.gov/articles/PMC1590034/

？

https://www.zhihu.com/question/404524150

SHAP方法

https://zhida.zhihu.com/search/3651621399482845421?zhida_source=entity

SI值意义

https://frontlinegenomics.com/methods-to-study-cell-cell-communication/

介绍方法的综述

时间轴上的细胞分析

https://academic.oup.com/nar/article/49/W1/W641/6298618

通路分析

GEOdensity

https://www.nature.com/articles/s41467-023-44206-x

SCPA

https://pmc.ncbi.nlm.nih.gov/articles/PMC10704209/#SD4

介绍文章

Python

https://www.sc-best-practices.org/conditions/gsea_pathway.html

富集分析

？

OPC NETWORK ANALYSIS·

https://pmc.ncbi.nlm.nih.gov/articles/PMC5550273/

https://pmc.ncbi.nlm.nih.gov/articles/PMC10946295/

The epigenetic landscape of oligodendrocyte progenitors changes with time

https://cloud.tencent.com/developer/article/2417241

整合单细胞ATAC与基因组序列，机器学习分析ATAC新方法SANGO

https://zhuanlan.zhihu.com/p/445130322

利用clusterfiles R包做富集分析，有详细代码

GO和KEGG的原理

https://zhuanlan.zhihu.com/p/50863682

利用统计学超几何分布做富集

RNAseq

GEO数据

解读含义

https://www.ncbi.nlm.nih.gov/geo/info/overview.html

https://www.jianshu.com/p/e0c72ca39ab7

Alignment

https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/paired-end-vs-single-read.html

advantage of pair end read

注意下面提到chip-seq用的single-end更好，为什么？

https://www.biostars.org/p/68631/

https://academic.oup.com/bib/article/18/2/279/2453282

https://www.youtube.com/watch?v=dJdZk-duMZQ

介绍Alignment软件，方法

使用DESq2分析RNA-SEQ数据

数据格式

样本信息

分组信息

表达矩阵

原始count值

两者是分开的文件的话

https://blog.csdn.net/zrc_xiaoguo/article/details/135099340?ops_request_misc=%257B%2522request%255Fid%2522%253A%25226008bcbbd0046499743910695baef8ca%2522%252C%2522scm%2522%253A%252220140713.130102334..%2522%257D&request_id=6008bcbbd0046499743910695baef8ca&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~all~top_click~default-2-135099340-null-null.142^v101^pc_search_result_base8&utm_term=deseq2&spm=1018.2226.3001.4187

count_matrix.csv: 包含基因或转录本的表达矩阵，行代表基因或转录本，列代表样本。 sample_info.csv: 包含每个样本的信息，例如条件、分组等。

解释和注意事项： DESeqDataSetFromMatrix(): 用于创建 DESeq2 数据对象，其中 countData 是表达矩阵，sampleInfo 包含样本信息，design 参数指定实验设计。 DESeq(): 对数据进行归一化和标准化，准备进行差异表达分析。 results(): 提取差异表达分析的结果，包括基因表达差异统计信息。结果包括基因表达水平的差异统计指标，如 fold change、调整的 p 值（padj）等。 plotCounts(): 用于绘制基因表达水平的差异示意图，以更直观地展示不同条件下基因的表达情况。 ———————————————— 版权声明：本文为博主原创文章，遵循 CC 4.0 BY-SA 版权协议，转载请附上原文出处链接和本声明。原文链接：https://blog.csdn.net/zrc_xiaoguo/article/details/135099340

（1）用DESeq2 R包做差异分析。KO的表达矩阵里NG5是对照组，KO是实验组；用Word总结代码学习过程的背景、R包基本原理、RNAseq表达矩阵的相关背景解释；

代码

读取数据

#读取数据 getwd() list.files() [5] "kogenes.read.txt" [6] "oegenes.read.txt" #可以用正则规则读取批量，如果是分开读直接按上述观察文件情况读取即可 kogene<-read.table(paste0(getwd(),"/",list.files()[6]),row.names=1,header=T)

生成DEG对象

library(openxlsx) coldata<-read.xlsx(list.files()[2]) coldata$Condition <- factor(coldata$Condition, levels = c("control","knock-out")) table(coldata$Condition) dds <- DESeqDataSetFromMatrix(countData = kogene, colData = colddata, design = ~Condition) dds如下图所示 https://zhuanlan.zhihu.com/p/24803546705 design = ~group：这定义了实验的分组情况 colData = colData：这是一个数据框，包含样本的列信息（例如，样本名、组别等）。 assays = dat：count表达矩阵，行代表基因，列代表样本。先筛选掉低表达的基因（为了尽量避免低表达基因导致的假阳性结果），之后估计每个样本的大小因子（大小因子是用于校正不同样本之间测序深度差异的一种方法，以确保在进行基因表达比较时能够公平地比较不同样本）。 dds <- dds[rownames(counts(dds)) > 1, ] # 过滤掉在任何样本中计数都小于或等于1的基因。 dds <- estimateSizeFactors(dds) # 估计每个样本的大小因子 estimateSizeFactors函数会根据每个样本的测序数据（通常是基因表达计数数据）来计算一个系数，这个系数反映了该样本相对于其他样本的测序深度，这个系数就是大小因子。具体来说，DESeq2首先会对原始的表达量矩阵进行log转换，然后计算每个基因在所有样本中的均值。接下来，对于每个样本，它会将该样本中每个基因的表达量减去对应的所有样本中的均值，然后取中位数。最后，通过对这些中位数进行指数运算，得到每个样本的大小因子。一旦计算出了每个样本的大小因子，就可以将其用于归一化原始的表达量数据。归一化的目的：是消除不同样本之间由于测序深度差异而导致的偏差，使得不同样本之间的基因表达数据可以更加公平地进行比较和分析之后通过DESeq函数用归一化后的数据进行差异表达分析 dds <- DESeq(dds) 最后提取差异表达分析的结果，并将其转换成数据框 res <- results(dds, contrast = c("Condition","knock-out","control")) DEG 如下图所示： baseMean: 这一列显示的是每个基因在所有样本中的平均表达量，经过大小因子校正后的值。它反映了基因的总体表达水平。 log2FoldChange: 这一列展示的是基因表达的对数2倍变化值（log2 fold change），正值表示在disease中表达上调，负值表示在disease中表达下调。 lfcSE: 这一列是log2FoldChange的标准误差估计值，它提供了log2FoldChange估计的不确定性度量 - stat: 这一列是基因表达变化的统计检验值，用于衡量基因表达变化的显著性。一个大的正值或负值表示基因表达的变化是否显著。 pvalue: 这一列显示的是基因表达变化的p值，用于评估观察到的效应（如基因表达的变化）是否可能是由于随机误差引起的，通常认为P < 0.05则结果是显著的。 padj: 这一列是调整后的p值，用于控制假阳性率（简单说就是矫正后的P值）给差异分析结果添加上下调标签 DEG$change <- as.factor( ifelse(DEG$pvalue < 0.05 & abs(DEG$log2FoldChange) >= 0.5, ifelse(DEG$log2FoldChange > 0.5,'Up','Down'),'Not')) table(DEG$change) DEG_write <- cbind(symbol = rownames(DEG), DEG) 处理好的 DEG_write 如下图所示，除了新添加的symbol和change，其余与DEG一样，change列中: （筛选条件：|log2FoldChange| >= 0.5 & p.value < 0.05） Up 表示上调基因 Down 表示下调基因 Not 表示未通过筛选的基因筛选出表达具有差异的基因（筛选条件：|log2FoldChange| >= 0.5 & p.value < 0.05） sig_diff <- subset(DEG, DEG$pvalue < 0.05 & abs(DEG$log2FoldChange) >= 0.5) sig_diff_write <- cbind(symbol = rownames(sig_diff), sig_diff) sig_diff_write 与DEG_write的差别就是少了那些没有差异的基因。最后，保存差异分析的结果： write.csv(DEG_write, file = paste0('DEG_all.csv')) write.csv(sig_diff_write, file = paste0('DEG_sig.csv'))

参考资料

https://zhuanlan.zhihu.com/p/24803546705

（2）计算FPKM和TPKM标准化：计算标准化；并用Word总结整理标准化的计数基本原理和标准化的意义

RNA-seq 数据中常用的三种基因表达量标准化指标。它们的主要区别在于标准化过程和适用的测序类型。

RPKM（Reads Per Kilobase of transcript per Million mapped reads）

定义

RPKM 是一种用于单端测序数据的表达量指标。

计算方法

读数标准化：将每个基因的原始读数除以基因的长度（以 kb 为单位），得到每 kb 的读数。深度标准化：将每 kb 的读数除以总的百万比对读数。

用途

用于单端测序的数据分析。

FPKM（Fragments Per Kilobase of transcript per Million mapped reads）

定义

FPKM 是一种用于双端测序数据的表达量指标，与 RPKM 类似，但考虑了双端片段。

计算方法

与 RPKM 类似，但使用片段数（fragments）而不是读数（reads）来进行标准化。

用途

用于双端测序的数据分析。

TPM（Transcripts Per Million）

定义

TPM 是一种表达量指标，常用于比较不同样本间的基因表达。

计算方法

1.计算 RPK（Reads Per Kilobase）：先计算每个基因的 RPK。 2.总标准化：将所有基因的 RPK 相加以获得总 RPK，将每个基因的 RPK 除以总 RPK，然后乘以一百万，得到 TPM。

用途

由于 TPM 对样本总测序量的标准化是在所有基因之和后进行的，因此更适合用于跨样本的比较。

主要区别

计算方式： TPM 的计算顺序不同，首先进行基因长度校正然后进行样本总量校正，而 RPKM 和 FPKM 首先进行样本总量校正后进行基因长度校正。

适用场景： RPKM 适用于单端测序，FPKM 适用于双端测序，而 TPM 则适用于比较不同样本的基因表达水平。

结果一致性：对于同一个样本，TPM 的总和是一致的（百万），而 RPKM 和 FPKM 则没有这样的特性。这使得 TPM 更便于跨样本比较。

总的来说，TPM 在 RNA-seq 数据的跨样本比较中更常用，因为其标准化步骤更适合不同样本的对比。

RPK（每千碱基的读数）

RPK 是计算 RPKM 和 TPM 的基础步骤。它表示一个基因每千碱基的读数数目。计算方法如下：

计算每个基因的读数：统计特定基因的测序读数数目，这被称为“原始读数”。按基因长度标准化：将原始读数除以基因的长度（以 kb 为单位）。

参考材料

https://www.dxy.cn/bbs/newweb/pc/post/37452654

内含这几种值的计算举例

https://zhuanlan.zhihu.com/p/590555572

https://zhuanlan.zhihu.com/p/485344865

代码

apply(kogene,2,sum) rpkmTOtpm <- function(rpkm){ exp(log(rpkm) - log(sum(rpkm)) + log(1e6)) } #自行做一个转换函数 #RPKM转TPM： tpm <- apply(kogene, 2, rpkmTOtpm) 、 apply(tpm, 2, sum) #这里后续要做什么？

代码

library(ggplot2) library(ggrepel) library(RColorBrewer) library(tidyverse) # 设定差异基因阈值 sig_diff<-sig_diff %>% mutate(logFC=log2FoldChange,adj.P.Val=padj,label=row.names(sig_diff)) sig_diff$status <- 'stable' sig_diff$status[sig_diff$logFC>1 & sig_diff$adj.P.Val<0.05] <- 'up' sig_diff$status[sig_diff$logFC< -1 & sig_diff$adj.P.Val<0.05] <- 'down' # 挑选出差异基因子集，用于后续添加label #按需求调整这里的参数，倍数和P值是否按此标准？ labeldata <- subset(sig_diff, abs(logFC)>1 & adj.P.Val<0.05) id <- order(-log10(labeldata$adj.P.Val),decreasing = T) labeldata <- labeldata[id,]

# (1)基础版------------------------------------------- ggplot(sig_diff,aes(logFC,-log10(adj.P.Val)))+ # 绘制散点 geom_point(aes(color=status), size=1.5)+ # 绘制垂直线 geom_vline(xintercept = c(log2(1/2),log2(2)), linetype = "dashed")+ # 绘制水平线 geom_hline(yintercept = -log10(0.05), linetype = "dashed")+ theme_bw()+ # 设置点的颜色 scale_color_manual(values = c('red','grey','blue'))+ # 设置标签注释 geom_text_repel(data = labeldata[1:10,], aes(label = label,color=status), size=2.5)+ # 横轴标题 xlab('Log2FC')+ # 纵轴标题 ylab('-Log10(adjPvalue)') ggsave('plot1.pdf',width = 7,height = 5.5)

# (2)进阶版------------------------------------------- ggplot(data = sig_diff, mapping = aes( x=logFC, y=-log10(adj.P.Val)))+ # 绘制散点 geom_point(aes(color=status, size= -log10(adj.P.Val)), alpha=1)+ # 绘制垂直线 geom_vline(xintercept = c(log2(1/2),log2(2)), linetype = "dashed")+ # 绘制水平线 geom_hline(yintercept = -log10(0.05), linetype = "dashed")+ theme_bw()+ # 设置点的颜色 scale_color_manual(values = c('#ffcc00','grey','#0066ff'))+ # 设置点的大小范围 scale_size_continuous(range = c(0.3,3)) + # 设置标签注释 geom_text_repel(data = labeldata[1:10,], mapping = aes(label = label,color=status), size=2)+ # 横轴标题 xlab('Log2FC')+ # 纵轴标题 ylab('-Log10(adjPvalue)') ggsave('plot2.pdf',width = 7,height = 5.5)

# (3)彩色渐变版------------------------------------------- ggplot(data = degdata, mapping = aes( x=logFC, y=-log10(adj.P.Val)))+ # 绘制散点 geom_point(aes(color=-log10(adj.P.Val), size= -log10(adj.P.Val)), alpha=1)+ # 绘制垂直线 geom_vline(xintercept = c(log2(1/2),log2(2)), linetype = "dashed", color='#363636')+ # 绘制水平线 geom_hline(yintercept = -log10(0.05), linetype = "dashed", color='#363636')+ theme_bw()+ # 设置点的颜色 scale_color_gradientn(colours = brewer.pal(11,'RdYlBu') %>% rev())+ # 设置点的大小范围 scale_size_continuous(range = c(0.3,3)) + # 设置标签注释 geom_text_repel(data = labeldata[1:10,], mapping = aes(label = label, color=-log10(adj.P.Val)), size=2.5)+ # 横轴标题 xlab('Log2FC')+ # 纵轴标题 ylab('-Log10(adjPvalue)')+ # 去除一个图例，设置另一个图例的标题 guides(size=FALSE, color=guide_colorbar(title = '-Log10(adjPvalue)')) # guide_colorbar是设置连续型变量图例的 ggsave('plot3.pdf',width = 7,height = 5.5)

4.富集分析部分

使用网站教程

https://zhuanlan.zhihu.com/p/75235912

https://blog.csdn.net/Specally/article/details/145685683

使用R代码

https://zhuanlan.zhihu.com/p/597338272

https://zhuanlan.zhihu.com/p/518143539

https://www.zhihu.com/question/53616723/answer/2555192191

教程

https://www.bilibili.com/video/BV1HU4y157Cq/?spm_id_from=333.337.search-card.all.click&vd_source=07b3ae20355d18fee7c57b4801b7d648

https://www.bilibili.com/video/BV1ojCFY5EeU?spm_id_from=333.788.videopod.episodes&vd_source=07b3ae20355d18fee7c57b4801b7d648&p=2

零基础入门转录组下游分析——DESeq2差异分析

https://zhuanlan.zhihu.com/p/24803546705

https://zhuanlan.zhihu.com/p/534616959

Python教程

https://www.zhihu.com/search?type=content&q=deseq%E5%88%86%E6%9E%90

1.3 在R中对于转录组数据都有哪些差异分析的方法？

包括但不局限于DESeq2、limma和edgeR等。 DESeq2：基于负二项分布模型进行差异分析，考虑了测序深度和基因长度对count数据的影响，因此在处理高度变异的基因时表现较好。DESeq2还提供了多种标准化方法，可以帮助减少批次效应和其他技术噪声。 limma：最初是为微阵列数据设计的，后来也扩展到RNA-Seq数据分析。它使用线性模型进行差异分析，并且可以轻松处理多个因素和协变量。limma的优势在于其速度和灵活性，可以处理大量的基因和样本。 edgeR：也是基于负二项分布模型进行差异分析，但与DESeq2不同的是，它使用了经验贝叶斯方法来估计分散参数，从而提高了差异检验的准确性和稳定性。edgeR在处理大型转录组数据集时表现出色，尤其是在识别低表达水平的差异基因方面。

1.4 三种方法对于输入数据的要求？

DESeq2：需要原始的count矩阵（如通过htseq-count工具获得），并且只支持重复样品。 limma：可以接受原始的count矩阵，但需要用户自行进行标准化（通常是log转换），并且也支持重复样品。 edgeR：要求输入原始的count矩阵，既支持单个样品也支持重复样品。

1.5 在做转录组分析的时候该选用哪种方法？（关键）

GEO芯片数据——用limma差异分析，因为GEO芯片数据底层已经对数据做了标准化处理。

https://mp.weixin.qq.com/s/9lOvTM9zqFUtgdZKV87hlQ

GEO高通量数据——用DESeq2差异分析，因为可以获取到count数据。

https://mp.weixin.qq.com/s/fEDSeN_pWAWWybE578lVpg

TCGA_count数据——用DESeq2差异分析，因为可以获取到count数据

https://mp.weixin.qq.com/s/nkQdivpDxENkra6snRGuXA

自测序数据——用DESeq2差异分析，因为可以获取到count数据。

https://mp.weixin.qq.com/s/OKUbIwi6y_hNPz2DWkOqvw

数学教程

https://www.zhihu.com/search?type=content&q=deseq%E5%88%86%E6%9E%90

GEO

https://www.dxy.cn/bbs/newweb/pc/post/46972686

机器学习

文献

Brain-inspired learning in artificial neural networks: A review

https://pubs.aip.org/aip/aml/article/2/2/021501/3291446/Brain-inspired-learning-in-artificial-neural?gad_source=1&gclid=Cj0KCQiA4fi7BhC5ARIsAEV1YibV0cscFWPastq-ljPCSpxPQ6MFapk8mRuxVO9yID2tt3pn1uMyMi0aAsgTEALw_wcB

多组学数据研究

https://biodatamining.biomedcentral.com/articles/10.1186/s13040-024-00391-z

UMAP相关

https://blog.csdn.net/qq_36810544/article/details/81094469

https://medium.com/@jonas.nordst/decoding-the-brain-d0bf4e1fd67d

https://www.sciencedirect.com/science/article/pii/S2589004222018028

Unet

https://blog.csdn.net/qq_40844444/article/details/143852979

https://zhuanlan.zhihu.com/p/717610325

深度神经网络

https://blog.csdn.net/weixin_73749601/article/details/144667163

基础讲解

理论讲解

https://zhuanlan.zhihu.com/p/693940030

基于R的机器语言学习

第一章什么是模型

第二章监督学习与无监督机器学习

分类

监督模型

回归 Regression

Regression These models are very common, and it’s likely that you encountered one in high school math classes. They are primarily used for looking at how data evolves with respect to another variable (e.g., time) and examining what you can do to predict values in the future.

回归这些模型非常常见，你很可能在高中数学课上遇到过。它们主要用于查看数据如何随另一个变量（例如，时间）演变，并检查您可以做些什么来预测未来的值。

连续的数值作为输入，输出连续数值

建立模型

评估回归模型预测准确度

多重R方值

校正R方值

建立训练集-测试集

首先，我们使用lm（）在训练数据上计算一个新的线性模型。接下来，我们根据测试集的disp数据列组建一个数据框。在那之后，我们预测测试集输出并将其存储在测试数据的新列中。最后，我们计算一个均方根误差（RMSE）项。我们通过取模型输出和已知的MPG效率差值的平方，求和这些平方，然后除以数据集中条目的总数。这就给出了残差标准误差。新值与我们之前看到的不同，对于理解我们的模型是如何运行的是一个重要的值

分类 Classification

分类这些模型用于将数据组织成可分类的方案有意义。例如，考虑前面提到的商店标签示例——商店每周销量超过10部的游戏可以被归类为表现良好的游戏，而那些销量低于这一数字的公司将被归类为糟糕的公司。

逻辑回归不同于线性回归，因为我们得到离散输出而不是连续的。以前，我们可以得到任何数字作为我们回归的结果 Sion模型，但是在我们的逻辑模型中，我们应该期待一个二元结果传输类型;它要么是自动变速器，要么不是。方法这里也不一样。

首先，你需要加载cattools库：

这个库包含许多函数，但最重要的是，它有一个用于逻辑的函数 tic回归：LogitBoost。首先，您需要给模型一个标签，来决定预测以及你想要用来训练模型的数据：

Logicboost算法来源

https://blog.csdn.net/u014568921/article/details/44860779

Adaboost

Adaboost是一种迭代算法，其核心思想是针对同一个训练集训练不同的分类器(弱分类器)，然后把这些弱分类器集合起来，构成一个更强的最终分类器（强分类器）。AdaBoost是一种具有一般性的分类器提升算法，它使用的分类器并不局限某一特定算法。其算法本身是通过改变数据分布来实现的，它根据每次训练集之中每个样本的分类是否正确，以及上次的总体分类的准确率，来确定每个样本的权值。将修改过权值的新数据集送给下层分类器进行训练，最后将每次训练得到的分类器最后融合起来，作为最后的决策分类器。整个过程如下所示： 1.先通过对N个训练样本的学习得到第一个弱分类器； 2.将分错的样本和其他的新数据一起构成一个新的N个的训练样本，通过对这个样本的学习得到第二个弱分类器； 3.将1和2都分错了的样本加上其他的新样本构成另一个新的N个的训练样本，通过对这个样本的学习得到第三个弱分类器； 4.如此反复，最终得到经过提升的强分类器。 ———————————————— 版权声明：本文为博主原创文章，遵循 CC 4.0 BY-SA 版权协议，转载请附上原文出处链接和本声明。原文链接：https://blog.csdn.net/u014568921/article/details/44860779

https://blog.csdn.net/u014568921/article/details/49474293

Logicboost单独算法介绍

https://blog.csdn.net/weixin_41368414/article/details/134323872

https://zhuanlan.zhihu.com/p/75731049

最大似然估计

连续的数值作为输入，输出离散的数值，反之亦然

混合 Mixed

混合这些模型通常可以依赖于回归的一部分来告知如何进行分类，有时则先分类后回归。一个例子可能是查看销售数据随时间的推移和某一时间内直线的斜率是否有快速变化期

杂志

https://home.liebertpub.com/publications/ai-in-neuroscience/681

杂志点（可以参考学习有哪些能写）

Aims & Scope AI in Neuroscience is a multidisciplinary peer-reviewed research journal dedicated to exploring the relationship between artificial intelligence (AI) and neuroscience. Broad in scope, the journal aims to publish high-quality peer-reviewed content concerning the application of artificial intelligence (AI) methods to all areas of neuroscience, including clinical, cognitive and behavioral, computational, systems, and molecular and cellular neuroscience. By fostering collaboration and knowledge exchange, the journal seeks to advance understanding, develop new technologies, and address current challenges and opportunities. Topics of Interest: Computational Neuroscience: Investigating neural dynamics, models and simulations of the nervous system, information processing, and brain function using computational models and AI techniques. Neuroimaging and AI: Developing AI-driven methods for analysis, interpretation, and visualization of neuroimaging data from techniques such as MRI, EEG, fMRI, and MEG. Neuroinformatics: Integrating AI tools and techniques for managing, analyzing, and sharing large-scale neuroscience data, including genomic, connectomic, and behavioral data; Explore insights into gene expression, protein function, and cellular signaling in the nervous system. Brain-Computer Interfaces (BCIs): Advancing the design and application of BCIs for communication, control, and neurorehabilitation through AI algorithms. Machine Learning in Neuroscience: Exploring machine learning algorithms for pattern recognition, classification, prediction, and feature extraction in neuroscience datasets. AI-driven Neuroscience Therapeutics: Investigating the potential of AI for drug discovery, treatment optimization, and personalized medicine in neurological and psychiatric disorders. Cognitive Neuroscience and AI: Exploring AI-driven approaches to understanding cognitive processes, decision-making, learning, memory, and consciousness. Ethical and Societal Implications: Examining ethical, legal, and societal implications of AI applications in neuroscience, including issues of privacy, bias, and autonomy. AIN publishes original research articles, reviews, descriptions of new published datasets and code, commentaries, and perspectives across the spectrum of laboratory investigations, translational research, and clinical practice. Submissions employing diverse methodologies, including computational modeling, experimental neuroscience, data science, and translational research, are encouraged. Our goal is to facilitate interdisciplinary dialogue and collaboration to harness the potential of AI in revealing the complexities of the brain and advancing both neuroscience and AI fields. Audience: Neuroscientists, neuroimaging specialists, neurologists, neurobiologists, neurosurgeons, neuropsychologists, bioengineers, AI researchers, and industry professionals in pharmaceuticals, biotechnology, healthcare technology, and AI.

机器学习鉴定单细胞

https://mp.weixin.qq.com/s/ItZxGFEn2B1lujOg0b6LHg

机器学习原理

卷积CNN

https://zhuanlan.zhihu.com/p/693588702

https://zhuanlan.zhihu.com/p/65472471

https://zhuanlan.zhihu.com/p/632712454

需要学习哪些知识？机器学习包含哪些方面？

【深度学习-第2篇】CNN卷积神经网络30分钟入门！足够通俗易懂了吧（图解）

https://zhuanlan.zhihu.com/p/635438713

引申

什么是卷积

https://zhuanlan.zhihu.com/p/526705694

再引申

https://www.zhihu.com/question/29877892/answer/150456142

https://zhuanlan.zhihu.com/p/523565858

https://www.zhihu.com/question/22298352

数学角度

https://vinodsblog.com/2021/10/21/powerful-math-behind-convolutional-neural-networks-cnns/#:~:text=Convolution%20Operations%20%E2%80%93%20Convolution%20is%20a,as%20a%20smaller%202D%20array.

U-net（CNN）的一种，应用与切割医学图像

https://blog.csdn.net/qq_40306845/article/details/134078180

https://zhuanlan.zhihu.com/p/11269446228

https://blog.csdn.net/z0n1l2/article/details/83515904 机器学习中的求导

常用成本函数

https://zhuanlan.zhihu.com/p/163215464