Problem(Abstract)

When text is converted from a file or string on WebSphere Application Server, numbers might be converted to their word equivalents, especially when PDFBox is used to extract the text. sun.io.MalformedInputException errors might also be reported.

Symptom

Text extracted from UTF-8 sources, such as PDFs, is displayed incorrectly.

For example, when "123 Hello Motto" is extracted from a PDF, the text "onetwothreespaceHellospaceMottospace" is output.

Diagnosing the problem

When extracting text from a source that uses UTF-8, you might find that numbers and non-alphabetic characters are transformed into their word equivalents. This problem is seen on Linux; however, MalformedInputExceptions are likely to be seen on other operating systems.

A stand-alone test case is available to confirm whether the text transformations are occurring. Unzip the contents of PDF_Test_Case.zip into a temporary location and run the following command with the Java™ executable that is bundled with your WebSphere Application Server.

[JAVA_EXECUTABLE] -jar pdfproblem.jar "123_Hello_Motto.pdf"

If the test fails, you will see output similar to the following:
onetwothreespaceHellospaceMottospace

Resolving the problem

These transformation issues occur because the IBM SDK uses Java IO for text and font conversion. The solution is to force the Java Virtual Machine (JVM) to use the Java NIO libraries for extracting text. Add this JVM argument to resolve the problem:

-Dibm.stream.nio=true
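
To confirm that the argument actually reached the JVM that runs your code, you can read the system property back at run time. The following is a minimal sketch; the class name NioFlagCheck and the method nioForced are hypothetical names chosen for illustration. It only reports whether the property is set and does not verify how the IBM SDK acts on it.

```java
class NioFlagCheck {
    // Returns true only when the ibm.stream.nio system property is set to "true".
    // Boolean.parseBoolean(null) is false, so an unset property reports false.
    static boolean nioForced() {
        return Boolean.parseBoolean(System.getProperty("ibm.stream.nio"));
    }

    public static void main(String[] args) {
        System.out.println("ibm.stream.nio forced: " + nioForced());
    }
}
```

Run the class with and without -Dibm.stream.nio=true on the command line to compare the two results.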

I am getting a MalformedInputException. How can I resolve this?

This exception does not alter the resulting string, which is output after the exception. Java IO is designed to throw exceptions when errors are reported. After you switch to NIO, these exceptions are caught and are not reported to the log.

You can resolve these errors by forcing NIO, but there is an alternative. Check the environment variable LANG to see if it is set to UTF-8. It may read something like this:

# echo $LANG
en_US.UTF-8

Alter the variable to remove the .UTF-8 suffix from the end of the string. From the command prompt on UNIX and Linux, you can type the following:

# export LANG=en_US

Alternatively, you can add this environment variable from the administration console.
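
To see which LANG value and default character set the JVM actually picked up after the change, a small check such as the following can help. This is a sketch only; the class name LocaleEncodingCheck and the helper encodingSuffix are hypothetical names used for illustration.

```java
import java.nio.charset.Charset;

class LocaleEncodingCheck {
    // Returns the encoding part of a LANG-style value, for example "UTF-8"
    // for "en_US.UTF-8", or an empty string when no encoding is present.
    static String encodingSuffix(String lang) {
        if (lang == null) {
            return "";
        }
        int dot = lang.indexOf('.');
        if (dot < 0) {
            return "";
        }
        // A modifier such as "@euro" can follow the encoding; leave it out.
        int at = lang.indexOf('@', dot);
        return lang.substring(dot + 1, at < 0 ? lang.length() : at);
    }

    public static void main(String[] args) {
        String lang = System.getenv("LANG");
        System.out.println("LANG = " + lang);
        System.out.println("encoding suffix = " + encodingSuffix(lang));
        System.out.println("default charset = " + Charset.defaultCharset());
    }
}
```

An empty encoding suffix indicates that the .UTF-8 portion has been removed from LANG in the environment where the JVM was started.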

A MalformedInputException can also occur when your application runs on WebSphere Application Server; in that case it is written to standard error.

Why is Java IO used for converting text?
Java IO, rather than NIO (New IO), is retained in the IBM SDK for performance reasons. By design, Java IO throws exceptions when errors are encountered, such as the MalformedInputException error, while NIO does not.

The JVM can be forced to use NIO if the JVM argument is used as stated above.
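
The difference between reporting and suppressing a malformed byte sequence can be illustrated with the standard java.nio.charset API. The sketch below does not reproduce the IBM SDK's internal converters; it only shows, under that assumption, how the same invalid UTF-8 input can either raise an exception or be silently replaced. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class MalformedInputDemo {
    // Strict decoding: malformed bytes raise a CharacterCodingException
    // (java.nio.charset.MalformedInputException is a subclass).
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Lenient decoding: the String constructor substitutes the replacement
    // character U+FFFD for malformed bytes and never throws.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "123" followed by 0xFF, which is never valid in UTF-8.
        byte[] bad = {'1', '2', '3', (byte) 0xFF};
        System.out.println("lenient: " + decodeLenient(bad));
        try {
            System.out.println("strict: " + decodeStrict(bad));
        } catch (CharacterCodingException e) {
            System.out.println("strict decoding failed: " + e);
        }
    }
}
```

In both cases the valid portion of the input ("123") is unaffected; only the handling of the invalid byte differs.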

Does the Oracle JDK suffer similar problems?
Since the Oracle JDK uses NIO by default, this issue does not occur when running WebSphere Application Server on Solaris and HP-UX.
