Notes on String Operations in R

Designer Yoyo

2018-01-02

This article draws on: https://www.cnblogs.com/Richardzhu/archive/2013/12/03/3455806.html

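Text length: nchar is a simple function that counts the characters in each element of a vector. Note how it differs from length: nchar returns the number of characters in each element, while length returns the length of the vector (the number of elements).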

> x <- c("Hellow", "World", "!") 
> nchar(x) 
[1] 6 5 1 
> length(''); nchar('') 
[1] 1 
[1] 0

Case conversion and character substitution

> DNA <- "AtGCtttACC" 
> tolower(DNA) 
[1] "atgctttacc" 
> toupper(DNA) 
[1] "ATGCTTTACC" 
> chartr("Tt", "Uu", DNA) 
[1] "AuGCuuuACC" 
> chartr("Tt", "UU", DNA) 
[1] "AUGCUUUACC"

Common operations in detail

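Getting string length: nchar() returns the length of a string and also works on character vectors. Note that its result differs from that of length().
Concatenating strings: paste() joins several strings and returns the result as a single string. Its advantage is that arguments that are not of character type are converted to character automatically.
Splitting strings: strsplit() splits a string at a given separator; in effect it is the inverse of paste().
Extracting substrings: substr() extracts a substring of a given string; its arguments are the start and stop positions of the substring.
Replacing in strings: gsub() searches a string for a given pattern and replaces every match with new content. sub() is similar, but replaces only the first match.
Matching strings: grep() searches a character vector for a given pattern and returns the indices of the matching elements. grepl() is similar, but the trailing "l" means that it returns logical values.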
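
The operations listed above can be tried together on one small vector. This is a minimal sketch; the data and the variable names (words, joined, and so on) are made up for illustration and do not come from the referenced articles.

```r
# Illustrative data; all variable names here are hypothetical.
words <- c("alpha", "beta", "gamma")

n_chars <- nchar(words)                              # characters per element
n_elems <- length(words)                             # number of elements
joined  <- paste(words, collapse = "-")              # one string: "alpha-beta-gamma"
pieces  <- strsplit(joined, "-", fixed = TRUE)[[1]]  # split back into the words
head5   <- substr(joined, 1, 5)                      # characters 1 to 5: "alpha"
first_a <- sub("a", "A", joined)                     # replaces only the first "a"
every_a <- gsub("a", "A", joined)                    # replaces every "a"
idx     <- grep("mm", words)                         # index of the matching element
flags   <- grepl("mm", words)                        # one logical value per element
```

Note that strsplit() returns a list with one component per input string, which is why the sketch takes [[1]].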

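Formatted (customized) output of characters and strings: R can output characters or strings according to a given format and set of requirements.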

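String splitting: strsplit()
String concatenation: paste() and paste0()
String length: nchar() and length()
Substring extraction: substr() and substring()
String replacement: chartr(), sub() and gsub()
String matching: grep() and grepl()
Case conversion: toupper(), tolower() and casefold()
Formatted (customized) output of characters and strings: sprintf(), sink(), cat(), print(), strtrim(), strwrap()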
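
As a small illustration of the output functions in the last entry, the sketch below builds two aligned lines with sprintf() and shows how cat() and print() display them differently. The names and numbers are invented for the example.

```r
# Hypothetical data for illustration.
name  <- c("a", "bb")
value <- c(1.5, 22.25)

# sprintf() is vectorized: one formatted string per element.
# %-3s left-justifies in width 3; %5.2f uses width 5 with 2 decimals.
rows <- sprintf("%-3s|%5.2f", name, value)

# cat() writes the raw text; print() adds the index and the quotes.
cat_out   <- capture.output(cat(rows, sep = "\n"))
print_out <- capture.output(print(rows))
```

capture.output() is used here only so that the printed text can be inspected as a character vector; in interactive use one would call cat() or print() directly.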

For details see: https://www.cnblogs.com/awishfullyway/p/6601539.html

I was pleasantly surprised to find http://blog.sina.com.cn/s/blog_72ef7bea0101cgrp.html, which is very well organized.

Character processing

Encoding(x)
Encoding(x) <- value
enc2native(x)
enc2utf8(x)
Read or set the declared encoding of a character vector. The result of the first Encoding(x) call below depends on the session locale.

> ## x is intended to be in latin1
> x <- "fa\xE7ile"
> Encoding(x)
[1] "latin1"
> Encoding(x) <- "latin1"
> xx <- iconv(x, "latin1", "UTF-8")
> Encoding(c(x, xx))
[1] "latin1" "UTF-8"
> Encoding(xx) <- "bytes" # will be encoded in hex
> cat("xx = ", xx, "\n", sep = "")
xx = fa\xc3\xa7ile
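
nchar(x, type = "chars", allowNA = FALSE)
Returns the number of characters in each element. In my tests the allowNA argument seemed to have no effect; it only matters for strings that are invalid in the current encoding, where it makes nchar return NA instead of signalling an error.
nzchar(x)
Tests whether each element is a non-empty string.
In the R version used for the transcript below, nchar and nzchar treat a missing value NA as a string of two characters. Current versions of R return NA from nchar for a missing string by default (see the keepNA argument), while nzchar still returns TRUE. Either way, it is best to check with is.na() before measuring a string. NULL contributes nothing: c() drops it before nchar or nzchar sees the vector, and nchar(NULL) and nzchar(NULL) return zero-length results.

> nchar(c("em","yqu","",NA))
[1] 2 3 0 2
> nzchar(c("em","yqu","",NA))
[1] TRUE TRUE FALSE TRUE
> nzchar(c("em","yqu",NULL,"",NA))
[1] TRUE TRUE FALSE TRUE
> nchar(c("em","yqu",NULL,"",NA))
[1] 2 3 0 2
> nchar(NULL)
integer(0)
> nzchar(NULL)
logical(0)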
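
The advice to check for NA first can be sketched as follows. This is one possible pattern, not the only one; the variable names are hypothetical. It measures only the non-missing elements, so the result does not depend on how a given R version treats NA inside nchar.

```r
x <- c("em", "yqu", "", NA)

# Measure only the non-missing elements; leave NA where x is NA.
ok  <- !is.na(x)
len <- rep(NA_integer_, length(x))
len[ok] <- nchar(x[ok])

# TRUE only for strings that are present and non-empty.
non_empty <- ok & nzchar(x)
```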

substr(x, start, stop)
substring(text, first, last = 1000000L)
substr(x, start, stop) <- value
substring(text, first, last = 1000000L) <- value
Extract or replace substrings of a character vector. substring does the same job as substr and is kept for compatibility with S. When start is greater than stop, extraction returns "" and replacement does nothing. If x contains NA, the corresponding result is NA.

> substr("abcdef", 2, 4)
[1] "bcd"
> substr("abcdef", -3, 9)
[1] "abcdef"
> substring("abcdef", 1:6, 1:6)
[1] "a" "b" "c" "d" "e" "f"
> x <- c("asfef", "qwerty", "yuiop[", "b", "stuff.blah.yech")
> substring(x, 2, 4:5)
[1] "sfe" "wert" "uio" "" "tuf"
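
The transcript above only shows extraction. The replacement form can be sketched like this; the strings and variable names are made up for illustration. The replaced string always keeps its original length.

```r
s1 <- "abcdef"
substr(s1, 2, 4) <- "XYZ"     # positions 2 to 4 are overwritten

s2 <- "abcdef"
substr(s2, 2, 4) <- "12345"   # value too long: only "123" is used

s3 <- "abcdef"
substr(s3, 2, 4) <- "Q"       # value too short: only position 2 changes

s4 <- "abcdef"
substr(s4, 4, 2) <- "ZZ"      # start > stop: nothing is replaced
```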

strtrim(x, width)
Truncates strings to a given display width.

> x <- c("abcdef",NA,"66")
> strtrim(x,c(2,1,3))
[1] "ab" NA "66"

paste(..., sep = " ", collapse = NULL)
paste0(..., collapse = NULL)
Joins the arguments element by element, separated by sep, and returns a character vector. If collapse is set, the elements of that vector are then joined into a single string, separated by collapse. paste0(..., collapse) is equivalent to paste(..., sep = "", collapse).

> paste(1:6) # same as as.character(1:6)
[1] "1" "2" "3" "4" "5" "6"
> paste("A", 1:6, sep = "=")
[1] "A=1" "A=2" "A=3" "A=4" "A=5" "A=6"
> paste("A", 1:6, sep = "=", collapse=";")
[1] "A=1;A=2;A=3;A=4;A=5;A=6"
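
paste0() has no example in the transcript above, so here is a short sketch of it, together with a round trip through strsplit(). The variable names are hypothetical.

```r
ids      <- paste0("A", 1:3)                   # no separator between the parts
same     <- paste("A", 1:3, sep = "")          # equivalent call with paste()
one      <- paste0("A", 1:3, collapse = "+")   # a single string
restored <- strsplit(one, "+", fixed = TRUE)[[1]]  # split on a literal "+"
```

fixed = TRUE is needed in the last line because "+" is a regular-expression metacharacter.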

strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
Splits the elements of the character vector x at the matches of split. If fixed is TRUE, split is matched literally; otherwise it is treated as a regular expression. Use split = NULL to split into individual characters.

> x <- c(as = "mfe", qu = "qwerty", "70", "yes")
> strsplit(x, "e")
$as
[1] "mf"

$qu
[1] "qw" "rty"

[[3]]
[1] "70"

[[4]]
[1] "y" "s"

> strsplit("Hello world!", NULL)
[[1]]
[1] "H" "e" "l" "l" "o" " " "w" "o" "r" "l" "d" "!"

> ## Note that 'split' is a regexp!
> unlist(strsplit("a.b.c", "."))
[1] "" "" "" "" ""
> ## If you really want to split on '.', use
> unlist(strsplit("a.b.c", "[.]"))
[1] "a" "b" "c"
> unlist(strsplit("a.b.c", ".", TRUE))
[1] "a" "b" "c"
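
Character translation and case conversion

chartr(old, new, x)
Translates each character of old that occurs in x into the corresponding character of new.

> x <- "MiXeD cAsE 123"
> chartr("iXs", "why", x)
[1] "MwheD cAyE 123"
> chartr("a-cX", "D-Fw", x)
[1] "MiweD FAsE 123"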

tolower(x)
toupper(x)
casefold(x, upper = FALSE)
casefold is a wrapper around tolower and toupper, provided for compatibility with S-PLUS.

> x <- "MiXeD cAsE 123"
> tolower(x)
[1] "mixed case 123"
> toupper(x)
[1] "MIXED CASE 123"
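
casefold() is not exercised in the transcript above. A minimal sketch, reusing the same sample string, shows that it gives the same results as the two functions it wraps:

```r
x <- "MiXeD cAsE 123"

lower <- casefold(x)                 # upper = FALSE is the default
upper <- casefold(x, upper = TRUE)
```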

Formatted output

sprintf(fmt, ...)
A wrapper for the C library function sprintf.

> sprintf("%s is %f feet tall\n", "Sven", 7.1)
[1] "Sven is 7.100000 feet tall\n"

format
Formats an object for output.
formatC
Formats output in the style of C.
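
Neither function has an example in the source, so the following is only a sketch of a few common calls; the values are arbitrary. format() pads the elements of a vector to a common width, while formatC() takes C-style width, flag and format arguments.

```r
# format(): elements are padded to a common width.
nums   <- format(c(1, 10, 100))
padded <- format("a", width = 4)      # strings are left-justified by default

# formatC(): C-style control over digits, width and flags.
fixed2 <- formatC(3.14159, digits = 2, format = "f")
zeros  <- formatC(42L, width = 6, flag = "0")
spaces <- formatC(42L, width = 6)
```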

strwrap(x, width = 0.9 * getOption("width"), indent = 0, exdent = 0, prefix = "", simplify = TRUE, initial = prefix)
Wraps character strings into formatted paragraphs.

> str <- "Now is the time "
> strwrap(str, width=60,indent=1)
[1] " Now is the time"
> strwrap(str, width=60,indent=2)
[1] "  Now is the time"
> strwrap(str, width=60,indent=3)
[1] "   Now is the time"
> strwrap(str, prefix="kx>")
[1] "kx>Now is the time"
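
String matching

pmatch(x, table, nomatch = NA_integer_, duplicates.ok = FALSE)
Partial string matching; returns the indices of the matches. The behavior of pmatch depends on duplicates.ok. When duplicates.ok is TRUE, an exact match returns the index of the first exact match; otherwise a unique partial match returns the index of that match; if there is no match, the value of nomatch is returned. An empty string matches nothing, not even another empty string. When duplicates.ok is FALSE, each value in table is excluded from later matching once it has been matched; empty strings are the exception. NA is treated as the string constant "NA".

> pmatch(c("", "ab", "ab"), c("abc", "ab"), dup = FALSE)
[1] NA 2 1
> pmatch(c("", "ab", "ab"), c("abc", "ab"), dup = TRUE)
[1] NA 2 2
> pmatch("m", c("mean", "median", "mode")) # returns NA
[1] NA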
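
charmatch(x, table, nomatch = NA_integer_)
Partial string matching; returns the indices of the matches. charmatch is similar to pmatch with duplicates.ok = TRUE: a single exact match returns the index of that match; otherwise a unique partial match returns the index of that match; multiple exact matches or multiple partial matches return 0; if there is no match, the value of nomatch is returned. charmatch allows the empty string to match. NA is treated as the string constant "NA".

> charmatch(c("", "ab", "ab"), c("abc","ab"))
[1] 0 2 2
> charmatch("m", c("mean", "median", "mode")) # returns 0
[1] 0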

match(x, table, nomatch = NA_integer_, incomparables = NULL)
x %in% table
Value matching; not limited to strings.

> sstr <- c("e","ab","M",NA,"@","bla","P","%")
> sstr[sstr %in% c(letters, LETTERS)]
[1] "e" "M" "P"
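
To see how the three matching functions of this section differ, the sketch below runs them on the same small table. The table is the one used in the pmatch and charmatch transcripts; the variable names are hypothetical.

```r
tbl <- c("mean", "median", "mode")

exact_me   <- match("me", tbl)       # exact matching only: no match
partial_me <- pmatch("me", tbl)      # ambiguous partial match
char_me    <- charmatch("me", tbl)   # ambiguous partial match is reported as 0
partial_mo <- pmatch("mo", tbl)      # unique partial match
exact_mode <- match("mode", tbl)     # exact match
is_member  <- "mode" %in% tbl        # membership test built on match()
```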

Pattern matching and replacement

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
Returns the indices of the matching elements.
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Returns a logical vector indicating which elements match.
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Replaces the first match in each string.
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Replaces all matches in each string.
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Returns the position and length of the first match.
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Returns the positions and lengths of all matches.
regexec(pattern, text, ignore.case = FALSE, fixed = FALSE, useBytes = FALSE)
Returns the position and length of the first match, together with those of any parenthesized subexpressions.

These functions can work in three modes (regexec had no perl argument in the R version described here; current versions of R list one):
fixed = TRUE: exact (literal) matching
perl = TRUE: Perl-style regular expressions
fixed = FALSE and perl = FALSE: POSIX 1003.2 extended regular expressions

With useBytes = TRUE matching is done byte by byte, otherwise character by character. Its main effect is to avoid errors and warnings about invalid inputs and spurious matches in multibyte locales, but for regexpr it changes the interpretation of the output. It inhibits the conversion of inputs with marked encodings, and it is forced if any input is marked as "bytes".

> str <- c("Now is ","the"," time  ")
> grep(" +", str)
[1] 1 3
> grepl(" +", str)
[1] TRUE FALSE TRUE
> sub(" +", "", str)
[1] "Nowis " "the" "time  "
> sub("[[:space:]]+", "", str) ## white space, POSIX-style
[1] "Nowis " "the" "time  "
> sub("\\s+", "", str, perl = TRUE) ## Perl-style white space
[1] "Nowis " "the" "time  "
> gsub(" +", "", str)
[1] "Nowis" "the" "time"
> regexpr(" +", str)
[1] 4 -1 1
attr(,"match.length")
[1] 1 -1 1
attr(,"useBytes")
[1] TRUE
> gregexpr(" +", str)
[[1]]
[1] 4 7
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE

[[2]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE

[[3]]
[1] 1 6
attr(,"match.length")
[1] 1 2
attr(,"useBytes")
[1] TRUE

> regexec(" +", str)
[[1]]
[1] 4
attr(,"match.length")
[1] 1

[[2]]
[1] -1
attr(,"match.length")
[1] -1

[[3]]
[1] 1
attr(,"match.length")
[1] 1
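
The three matching modes can be compared on a pattern that contains a regular-expression metacharacter. This is a minimal sketch with made-up strings and hypothetical variable names.

```r
# "." means "any character" as a regex, but a plain dot when fixed = TRUE.
dot_regex <- grepl(".", "abc")
dot_fixed <- grepl(".", "abc", fixed = TRUE)

# The same replacement written with a POSIX class and with a Perl class.
posix_out <- gsub("[[:digit:]]+", "#", "a1b22")
perl_out  <- gsub("\\d+", "#", "a1b22", perl = TRUE)

# With fixed = TRUE the pattern "1+" is the two literal characters "1+".
literal_out <- sub("1+", "#", "a1+b11", fixed = TRUE)
```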

regmatches(x, m, invert = FALSE)
regmatches(x, m, invert = FALSE) <- value
Extracts or replaces the substrings matched by a regular expression. With invert = TRUE, the non-matched substrings are extracted or replaced instead.

> str <- c("Now is ","the"," time ")
> m <- regexpr(" +",str)
> regmatches(str,m) <- "kx"
> str
[1] "Nowkxis " "the" "kxtime "
>
> str <- c("Now is ","the"," time ")
> m <- gregexpr(" +",str)
> regmatches(str,m, invert=TRUE) <- "kx"
> str
[1] "kx kx kx" "kx" "kx kx kx"
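
regmatches() also accepts the match data produced by regexec(), which makes it a convenient way to pull out parenthesized groups. The sketch below assumes a made-up key=value format; the pattern and the variable names are illustrative only.

```r
x <- c("key=value", "nokey", "a=1")

# Two capture groups: the key before "=" and the value after it.
m     <- regexec("^([a-z]+)=(.*)$", x)
parts <- regmatches(x, m)

# Each component holds the whole match followed by the groups;
# strings that do not match give character(0).
keys <- vapply(
  parts,
  function(p) if (length(p) == 3) p[2] else NA_character_,
  character(1)
)
```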

agrep(pattern, x, max.distance = 0.1, costs = NULL, ignore.case = FALSE, value = FALSE, fixed = TRUE, useBytes = FALSE)
agrepl(pattern, x, max.distance = 0.1, costs = NULL, ignore.case = FALSE, fixed = TRUE, useBytes = FALSE)
Approximate string matching using the generalized Levenshtein edit distance. This needs further study.

> str <- c("1 lazy", "1", "1 LAZY")
> agrep("laysy", str, max = 2)
[1] 1
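
As a starting point for that further study, the sketch below varies the arguments of the call in the transcript above, using the same three strings. The variable names are hypothetical.

```r
str <- c("1 lazy", "1", "1 LAZY")

idx      <- agrep("laysy", str, max.distance = 2)                      # indices
matched  <- agrep("laysy", str, max.distance = 2, value = TRUE)        # the strings themselves
any_case <- agrep("laysy", str, max.distance = 2, ignore.case = TRUE)  # case-insensitive
flags    <- agrepl("laysy", str, max.distance = 2)                     # logical, like grepl
```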

grepRaw(pattern, x, offset = 1L, ignore.case = FALSE, value = FALSE, fixed = FALSE, all = FALSE, invert = FALSE)
Pattern matching on raw vectors.

> raws <- charToRaw("Now is the time ")
> raws
[1] 4e 6f 77 20 69 73 20 74 68 65 20 74 69 6d 65 20
> grepRaw(charToRaw(" +"),raws)
[1] 4

glob2rx(pattern, trim.head = FALSE, trim.tail = TRUE)
Converts a wildcard (glob) pattern into a regular expression.

> glob2rx("abc.*")
[1] "^abc\\."
> glob2rx("a?b.*")
[1] "^a.b\\."
> glob2rx("a?b.*", trim.tail = FALSE)
[1] "^a.b\\..*$"
> glob2rx("*.doc")
[1] "^.*\\.doc$"
> glob2rx("*.doc", trim.head = TRUE)
[1] "\\.doc$"
> glob2rx("*.t*")
[1] "^.*\\.t"
> glob2rx("*.t??")
[1] "^.*\\.t..$"
> glob2rx("*[*")
[1] "^.*\\["
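
A typical use of glob2rx() is to turn a wildcard into a pattern for grep(). The file names below are invented for the example.

```r
# Hypothetical file names.
files <- c("report.doc", "notes.txt", "draft.doc.bak")

rx   <- glob2rx("*.doc")               # wildcard to anchored regex
hits <- grep(rx, files, value = TRUE)  # only names that end in ".doc"
```

Because the generated expression is anchored at both ends, "draft.doc.bak" is not matched.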