LearningR: Data Processing

Published: 2021-03-14 22:18:21 | Category: Big Data | Source: compiled from the web

  1. Built-in R functions

  2. reshape2
    data restructuring

  3. dplyr
    data aggregation

  4. tidyr
    to be written

  5. String processing


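1. Built-in R functions

1.1 Transpose

The t() function transposes a matrix or a data frame. For a data frame, the row names become the variable (column) names.

cars <- mtcars[1:5,1:4]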
cars
t(cars)

For an array, use aperm() to permute the dimensions.

x <- array(1:24,2:4)
xt <- aperm(x,c(2,1,3))
dim(x)
dim(xt)
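
As a small check of what the permutation does, the sketch below (the names a and at are illustrative only) confirms that with perm = c(2,1,3) the element at [i,j,k] moves to [j,i,k], so each slice along the third dimension is simply transposed.

```r
a  <- array(1:24, dim = 2:4)        # dimensions 2 x 3 x 4
at <- aperm(a, c(2, 1, 3))          # swap the first two dimensions

dim(at)                             # 3 2 4

# The element at [i, j, k] of a is found at [j, i, k] of at
same_element <- a[2, 3, 4] == at[3, 2, 4]

# Each slice along the third dimension is the transpose of the original slice
same_slice <- identical(at[, , 1], t(a[, , 1]))
```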

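1.2 Aggregating data: aggregate

In R, data can be collapsed using one or more by variables and a predefined function. The call format is:

aggregate(x,by,FUN)

Here x is the data object to be collapsed, by is a list of variables that are crossed to form the new observations (the groups), and FUN is a scalar function that computes the summary statistic used as the values of the new observations.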

options(digits=2)
attach(mtcars)
mydata <- aggregate(mtcars,by=list(cyl,gear),FUN=mean,na.rm=TRUE)
mydata

The by variables must be supplied in a list, even if there is only one. You can also give the groups custom names inside the list, for example by=list(Group.cyl=cyl,Group.gears=gear).
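
A minimal sketch of named grouping variables, using mtcars$ prefixes instead of attach() so it runs on its own. The result names by_cyl_gear and by_cyl, and the choice of the mpg and hp columns, are illustrative assumptions.

```r
# Named entries in the by list become the grouping column names in the result
by_cyl_gear <- aggregate(mtcars[c("mpg", "hp")],
                         by = list(Group.cyl = mtcars$cyl,
                                   Group.gears = mtcars$gear),
                         FUN = mean)
names(by_cyl_gear)

# A single grouping variable still has to be wrapped in list()
by_cyl <- aggregate(mtcars["mpg"], by = list(cyl = mtcars$cyl), FUN = mean)
by_cyl
```

Without names, the grouping columns are labelled Group.1, Group.2, and so on.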

## example with character variables and NAs
testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
                     v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99))
# by1 and by2 must have one entry per row of testDF (12 here)
by1 <- c("red","blue",1,2,NA,"big",1,2,"red",1,NA,12)
by2 <- c("wet","dry",99,95,NA,"damp",95,99,"red",99,NA,NA)
aggregate(x = testDF,by = list(by1,by2),FUN = "mean")

# and if you want to treat NAs as a group
fby1 <- factor(by1,exclude = "")
fby2 <- factor(by2,exclude = "")
aggregate(x = testDF,by = list(fby1,fby2),FUN = "mean")

## Formulas,one ~ one,one ~ many,many ~ one,and many ~ many:
aggregate(weight ~ feed,data = chickwts,mean)
aggregate(breaks ~ wool + tension,data = warpbreaks,mean)
aggregate(cbind(Ozone,Temp) ~ Month,data = airquality,mean)
aggregate(cbind(ncases,ncontrols) ~ alcgp + tobgp,data = esoph,sum)

## Dot notation:
aggregate(. ~ Species,data = iris,mean)
aggregate(len ~ .,data = ToothGrowth,mean)

## Often followed by xtabs():
ag <- aggregate(len ~ .,data = ToothGrowth,mean)
xtabs(len ~ .,data = ag)

## Compute the average annual approval ratings for American presidents.
aggregate(presidents,nfrequency = 1,FUN = mean)
## Give the summer less weight.
aggregate(presidents,nfrequency = 1,FUN = weighted.mean,w = c(1,1,0.5,1))

1.3 apply

To be written.

1.4 union and intersect

x <- c(sort(sample(1:20,9)),NA)
y <- c(sort(sample(3:23,7)),NA)
union(x,y)
intersect(x,y)
setdiff(x,y)
setdiff(y,x)
setequal(x,y)
#%in%
(1:10) %in% c(3,12)
"%w/o%" <- function(x,y) x[!x %in% y]
(1:10) %w/o% c(3,12)
sstr <- c("c","ab","B","bba","c","@","bla","a","Ba","%")
sstr %in% c(letters,LETTERS)

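1.5 Combining: cbind and rbind

Vertical combination is typically used to add observations to a data frame.

  • rbind(): combines two data frames (datasets) vertically

  • cbind(): combines two data frames (datasets) horizontally

Note: for cbind() the two data frames must have the same number of rows, and for rbind() they must have the same variables (columns). If dataframeA has variables that dataframeB lacks, do one of the following before combining them:

(1) Delete the extra variables from dataframeA; or

(2) Create the additional variables in dataframeB and set their values to NA (missing).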
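
Both options can be sketched as follows. The data frames dfA and dfB and their columns are made up for illustration; they stand in for dataframeA and dataframeB above.

```r
# dfA has a variable (grade) that dfB lacks
dfA <- data.frame(id = 1:2, score = c(90, 85), grade = c("A", "B"))
dfB <- data.frame(id = 3:4, score = c(70, 60))

# Option 1: drop the extra variable from dfA, then stack
stacked_common <- rbind(dfA[names(dfB)], dfB)

# Option 2: add the missing variable to dfB as NA, then stack
dfB$grade <- NA
stacked_all <- rbind(dfA, dfB)
stacked_all
```

Calling rbind(dfA, dfB) directly, before either fix, fails because the column sets differ.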

x1 <- c(1:5)
x2 <- c(21:25)
x3 <- c(31:35)
r1 <- cbind(x1,x2)
r2 <- rbind(x1,x2)
r31 <- cbind(r1,x3)
r32 <- rbind(r2,x3)

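1.6 Matched merging: merge

merge() produces the same results as the dplyr join functions; the dplyr joins are generally more efficient.

  • inner_join is equivalent to merge(all=FALSE)

  • left_join is equivalent to merge(all.x=TRUE,all.y=FALSE)

  • right_join is equivalent to merge(all.x=FALSE,all.y=TRUE)

  • full_join is equivalent to merge(all=TRUE)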

# authors and books
authors <- data.frame(
    surname = I(c("Tukey","Venables","Tierney","Ripley","McNeil")),
    nationality = c("US","Australia","US","UK","Australia"),
    deceased = c("yes",rep("no",4)))
books <- data.frame(
    name = I(c("Tukey","Venables","Tierney","Ripley","Ripley","McNeil","R Core")),
    title = c("Exploratory Data Analysis","Modern Applied Statistics ...","LISP-STAT",
              "Spatial Statistics","Stochastic Simulation","Interactive Data Analysis",
              "An Introduction to R"),
    other.author = c(NA,"Ripley",NA,NA,NA,NA,"Venables & Smith"))

m1 <- merge(authors,books,by.x = "surname",by.y = "name")
m2 <- merge(books,authors,by.x = "name",by.y = "surname")
# m1 and m2 contain the same result; only the column names differ.
# left_join
m3 <- merge(authors,books,by.x = "surname",by.y = "name",all.x = TRUE,all.y = FALSE)
# right_join
m4 <- merge(authors,books,by.x = "surname",by.y = "name",all.x = FALSE,all.y = TRUE)
# full_join
m5 <- merge(authors,books,by.x = "surname",by.y = "name",all = TRUE)

# The *_join() functions come from the dplyr package
library(dplyr)
m11 <- inner_join(authors,books,by = c("surname" = "name"))
m22 <- inner_join(books,authors,by = c("name" = "surname"))
m33 <- left_join(authors,books,by = c("surname" = "name"))
m44 <- right_join(authors,books,by = c("surname" = "name"))
m55 <- full_join(authors,books,by = c("surname" = "name"))

1.7 Removing duplicates: unique

The unique() function removes duplicated elements from a vector, a data frame, or an array-like object.

x <- c(9:20,1:5,3:7,0:8)
y <- unique(x)
# The following also works, but unique() is more efficient.
# duplicated() returns a logical vector indicating which elements are repeats.
y1 <- x[!duplicated(x)]
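
The same pair of functions works row-wise on data frames. A small sketch with a made-up data frame df:

```r
df <- data.frame(id  = c(1, 2, 2, 3, 1),
                 grp = c("a", "b", "b", "c", "a"))

# duplicated() flags a row when an identical row appeared earlier
dup_flags <- duplicated(df)        # FALSE FALSE TRUE FALSE TRUE

u1 <- unique(df)                   # keeps the first occurrence of each row
u2 <- df[!duplicated(df), ]        # same rows, selected by hand
```

The original row names (1, 2, 4) are kept in both results, which makes it easy to see which rows survived.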

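2. The reshape2 package

First "melt" the data so that each row is a unique identifier-variable combination.
Then "cast" the melted data into any shape you want, using any function to aggregate it along the way.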
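
A minimal melt-then-cast sketch, assuming the reshape2 package is installed. The data frame mydata and its columns (ID, Time, X1, X2) are made up for illustration.

```r
library(reshape2)

mydata <- data.frame(ID   = c(1, 1, 2, 2),
                     Time = c(1, 2, 1, 2),
                     X1   = c(5, 3, 6, 2),
                     X2   = c(6, 5, 1, 4))

# Melt: one row per (ID, Time, variable) combination, with the measurement in "value"
md <- melt(mydata, id.vars = c("ID", "Time"))

# Cast without aggregation: back to one column per variable
wide <- dcast(md, ID + Time ~ variable)

# Cast with aggregation: mean of each variable per ID, averaging over Time
means <- dcast(md, ID ~ variable, mean)
means
```

The formula's left side names the identifiers that define the rows, and its right side names what is spread into columns. When several values fall into one cell, the aggregating function (mean here) decides how they are combined.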
