In Java 7 and later, you can write:

source.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "*");

In both Java 6 and Java 7, you can write:

source.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "*");

Matching characters in astral planes (code points U+10000 to U+10FFFF) has been an under-documented feature in Java regex.

This answer mainly deals with Oracle's implementation (reference implementation, which is also used in OpenJDK) for Java version 6 and above.

Please test the code yourself if you happen to use GNU Classpath or Android, since they use their own implementations.

Behind the scenes

Assuming that you are running your regex on Oracle's implementation, your regex

"([\ud800-\udbff\udc00-\udfff])"

is compiled as follows:

StartS. Start unanchored match (minLength=1)
java.util.regex.Pattern$GroupHead
Pattern.union. A ∪ B:
  Pattern.union. A ∪ B:
    Pattern.rangeFor. U+D800 <= codePoint <= U+10FC00.
    BitClass. Match any of these 1 character(s):
      [U+002D]
  SingleS. Match code point: U+DFFF LOW SURROGATES DFFF
java.util.regex.Pattern$GroupTail
java.util.regex.Pattern$LastNode
Node. Accept match

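The character class is parsed as the range \ud800-\udbff\udc00, followed by a literal hyphen (-) and the single character \udfff. Since \udbff\udc00 forms a valid surrogate pair, it represents the code point U+10FC00, so the range runs from U+D800 to U+10FC00.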
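The parse described above can be checked with a small sketch. The class name SurrogateRangeDemo and its methods are made up for illustration, and the sketch assumes a JDK whose java.util.regex works on code points as described here; other implementations may behave differently.

```java
import java.util.regex.Pattern;

class SurrogateRangeDemo {
    // The character class from the question, written with string-literal escapes,
    // so the compiled pattern receives the raw surrogate characters.
    static final Pattern CLASS = Pattern.compile("[\ud800-\udbff\udc00-\udfff]");

    // The high/low surrogate pair in the middle of the class denotes one code point,
    // which becomes the upper bound of the range.
    static int upperBound() {
        return Character.toCodePoint('\udbff', '\udc00');
    }

    // True if the whole string is matched by the character class.
    static boolean matches(String s) {
        return CLASS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.printf("U+%X%n", upperBound()); // prints U+10FC00
        System.out.println(matches("-"));          // the hyphen is a literal member of the class
        System.out.println(matches("\udfff"));     // a lone low surrogate is a member too
        System.out.println(matches("a"));          // U+0061 lies outside the class
    }
}
```

If the hyphen is reported as a match, that is the unanchored "-" member from the compiled structure, not a bug in the matcher.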

Wrong solution

There is no point in writing:

"[\ud800-\udbff][\udc00-\udfff]"

Oracle's implementation matches by code point, and a valid surrogate pair is converted to a code point before matching. The regex above therefore can't match anything: it searches for 2 consecutive lone surrogates that form a valid pair, and such a pair is never seen as two separate surrogates.
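A minimal sketch of why the two-class pattern finds nothing in a valid pair. The class name PairPatternDemo and the sample code point U+1F600 are illustrative choices, and the result assumes the code point based matching described above.

```java
import java.util.regex.Pattern;

class PairPatternDemo {
    // Two character classes in a row: a high-surrogate range, then a low-surrogate range.
    static final Pattern TWO_CLASSES = Pattern.compile("[\ud800-\udbff][\udc00-\udfff]");

    // True if the pattern is found anywhere in the input.
    static boolean found(String s) {
        return TWO_CLASSES.matcher(s).find();
    }

    public static void main(String[] args) {
        // One astral code point, stored as two chars (a valid surrogate pair).
        String astral = new String(Character.toChars(0x1F600));
        System.out.println(astral.length());                           // 2 chars
        System.out.println(astral.codePointCount(0, astral.length())); // 1 code point
        // The matcher reads the pair as a single code point, so neither class
        // sees a lone surrogate and the search is expected to fail.
        System.out.println(found(astral));
    }
}
```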

Solution

If you want to match and remove all code points above U+FFFF in the astral planes (formed by a valid surrogate pair), plus the lone surrogates (which can't form a valid surrogate pair), you should write:

input.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "");

This solution has been tested to work in Java 6 and 7 (Oracle implementation).

The regex above compiles to:

StartS. Start unanchored match (minLength=1)
Pattern.union. A ∪ B:
  Pattern.rangeFor. U+10000 <= codePoint <= U+10FFFF.
  Pattern.rangeFor. U+D800 <= codePoint <= U+DFFF.
java.util.regex.Pattern$LastNode
Node. Accept match
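As a rough usage sketch, the replacement can be wrapped in a helper and tried on a string that mixes BMP characters, an astral code point, and a lone surrogate. The class name AstralStripper, the method strip, and the sample input are assumptions made for this example.

```java
class AstralStripper {
    // Astral code points (first range, written as surrogate pairs)
    // plus lone surrogates (second range).
    private static final String ASTRAL_OR_LONE =
            "[\ud800\udc00-\udbff\udfff\ud800-\udfff]";

    // Removes every astral code point and every lone surrogate from the input.
    static String strip(String input) {
        return input.replaceAll(ASTRAL_OR_LONE, "");
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F600)); // valid surrogate pair
        String input = "a" + emoji + "b\ud800c";               // plus a lone high surrogate
        System.out.println(strip(input)); // prints abc
    }
}
```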

Note that I am specifying the characters with Unicode escape sequences in the string literal, not with escape sequences in regex syntax.

// Requires Java 7 or later (fails in Java 6)
input.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "")

Java 6 doesn't recognize surrogate pairs when they are specified with regex syntax, so the regex engine treats \\ud800 as a single character and then tries to compile the range \\udc00-\\udbff, which fails because the start of the range is greater than its end. We are lucky that it throws an exception for this input; otherwise, the error would go undetected. Java 7 parses this regex correctly and compiles it to the same structure as above.


Since Java 7, the \x{h..h} syntax is available for specifying characters beyond the BMP (Basic Multilingual Plane), and it is the recommended way to specify characters in the astral planes.

input.replaceAll("[\\x{10000}-\\x{10ffff}\ud800-\udfff]", "");

This regex also compiles to the same structure as above.
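The three spellings discussed in this answer can be compared side by side. This is only a sketch for Java 7 or later; the class name EscapeStyles, the constant names, and the sample input are made up for illustration.

```java
import java.util.regex.Pattern;

class EscapeStyles {
    // String-literal escapes: the Java compiler places the raw surrogates in the pattern.
    static final Pattern LITERAL =
            Pattern.compile("[\ud800\udc00-\udbff\udfff\ud800-\udfff]");
    // Regex-syntax escapes: the regex engine itself pairs up the surrogates (Java 7+).
    static final Pattern REGEX_U =
            Pattern.compile("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]");
    // Code point escapes for the astral range, raw surrogates for the rest (Java 7+).
    static final Pattern REGEX_X =
            Pattern.compile("[\\x{10000}-\\x{10ffff}\ud800-\udfff]");

    // Removes everything the given pattern matches.
    static String strip(Pattern p, String input) {
        return p.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // 'a', one astral code point, 'b', a lone low surrogate, 'c'
        String input = "a" + new String(Character.toChars(0x1F600)) + "b\udc00c";
        System.out.println(strip(LITERAL, input));
        System.out.println(strip(REGEX_U, input));
        System.out.println(strip(REGEX_X, input));
    }
}
```

All three lines of output should be identical if the patterns compile to the same structure, which is what the text above claims for Java 7.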

Reposted from: http://stackoverflow.com/questions/27820971/why-a-surrogate-java-regexp-finds-hypen-minus
