UTF8字符串在lua的截取和字数统计【转载】

转载自：GitHub:pangliang/pangliang.github.com

需求

按字面个数来截取

函数(字符串, 开始位置, 截取长度)

utf8sub("你好1世界哈哈",,)    =    好1世界哈

utf8sub("1你好1世界哈哈",,)    =    你好1世界

utf8sub("你好世界1哈哈",,)    =    你好世界1

utf8sub("",,)    =

utf8sub("øpø你好pix",,)    =    pø你好p

错误方法

网上找了一些算法, 都不太正确; 要么就是乱码, 要么就是只考虑了4 byte 中文的情况, 不够全面

1. string.sub(s,1,截取长度*4)

　　网上很多直接使用"`""string.sub(s,1,截取长度*4)`"是肯定不对的, 因为如果中英文混合的字符串, 例如`你好1世界`的字符长度分别是`4,4,1,4,4`, 如果截取4个字, 4*4=4+4+1+4+3, 那`世界`的`界`字将会被取前3个byte, 就会出现乱码

2. if byte>128 then index = index + 4

问题关键

1. utf8字符是变长字符

2. 字符长度有规律

UTF-8字符规律

字符串的首个byte表示了该utf8字符的长度

0xxxxxxx - 1 byte

110yxxxx - 192, 2 byte

1110yyyy - 225, 3 byte

11110zzz - 240, 4 byte

正确算法

 --

 -- lua

 -- 判断utf8字符byte长度

 -- 0xxxxxxx - 1 byte

 -- 110yxxxx - 192, 2 byte

 -- 1110yyyy - 225, 3 byte

 -- 11110zzz - 240, 4 byte

 local function chsize(char)

     if not char then

         print("not char")

         return

     elseif char >  then

         return

     elseif char >  then

         return

     elseif char >  then

         return

     else

         return

     end

 end

 -- 计算utf8字符串字符数, 各种字符都按一个字符计算

 -- 例如utf8len("1你好") => 3

 function utf8len(str)

     local len =

     local currentIndex =

     while currentIndex <= #str do

         local char = string.byte(str, currentIndex)

         currentIndex = currentIndex + chsize(char)

         len = len +

     end

     return len

 end

 -- 截取utf8 字符串

 -- str:            要截取的字符串

 -- startChar:    开始字符下标,从1开始

 -- numChars:    要截取的字符长度

 function utf8sub(str, startChar, numChars)

     local startIndex =

     while startChar >  do

         local char = string.byte(str, startIndex)

         startIndex = startIndex + chsize(char)

         startChar = startChar -

     end

     local currentIndex = startIndex

     while numChars >  and currentIndex <= #str do

         local char = string.byte(str, currentIndex)

         currentIndex = currentIndex + chsize(char)

         numChars = numChars -

     end

     return str:sub(startIndex, currentIndex - )

 end

 -- 自测

 function test()

     -- test utf8len

     assert(utf8len("你好1世界哈哈") == )

     assert(utf8len("你好世界1哈哈 ") == )

     assert(utf8len(" 你好世 界1哈哈") == )

     assert(utf8len("") == )

     assert(utf8len("øpø你好pix") == )

     -- test utf8sub

     assert(utf8sub("你好1世界哈哈",,) == "好1世界哈")

     assert(utf8sub("1你好1世界哈哈",,) == "你好1世界")

     assert(utf8sub(" 你好1世界 哈哈",,) == "你好1世界 ")

     assert(utf8sub("你好世界1哈哈",,) == "你好世界1")

     assert(utf8sub("",,) == "")

     assert(utf8sub("øpø你好pix",,) == "pø你好p")

     print("all test succ")

 end

 test()

巴特西

UTF8字符串在lua的截取和字数统计【转载】

需求

按字面个数来截取

错误方法

问题关键

正确算法

最新文章

热门文章