ITLife365 Life and Interactive Learning Space - The Infuriating XML Encoding Problem

The Infuriating XML Encoding Problem (2012-03-07 19:16:31)

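The XML is clearly well-formed, yet the program reports an error and cannot parse it. This comes up especially in Java and J2EE projects, where differences between XML parsers bring the problem to the surface.
Under DOS the file looks normal (although viewing it with type also shows the anomaly), while on Linux it is displayed as: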
<feff><?xml version="1.0" encoding="UTF-8"?>
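Here <feff> is the byte order mark (BOM) sitting in front of the XML declaration. The details are as follows: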

[itlife365@linux tempConf]$ file student.xml
student.xml: UTF-8 Unicode text, with CRLF line terminators
[itlife365@linux tempConf]$ file student.xml.bak_38
student.xml.bak_38: UTF-8 Unicode text, with CRLF, LF line terminators
[student.xml@linux96181 tempConf]$ file student.xml.bak_ok_zh_bj_
student.xml.bak_ok_zh_bj_tj: XML 1.0 document text
[itlife365@linux96181 tempConf]$
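
The <feff> shown on Linux is consistent with a UTF-8 byte order mark at the very start of the file, which some parsers refuse to accept before the XML declaration. As a minimal sketch (the helper name strip_utf8_bom and the sample bytes are made up for illustration, not taken from the original post), the BOM can be removed from the raw bytes before they are handed to a parser:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical sample: a BOM followed by an XML declaration.
sample = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="UTF-8"?><student/>'
cleaned = strip_utf8_bom(sample)
print(cleaned[:5])
```

Working on bytes rather than on decoded text keeps the sketch independent of whatever encoding the rest of the file uses.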

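If an XML document's actual encoding, external encoding, and internal encoding (BOM or XML declaration) do not agree, the document is unreadable. One exception is when the external encoding is Unicode (for example, a Java String, which uses UTF-16): any internal encoding is then ignored. A common problem occurs when a process that is not XML-aware transcodes the document (that is, changes its actual encoding) or modifies the document without respecting its internal encoding. Certain string handling in Java, CLI, and embedded SQL applications can transcode a document without changing its internal encoding.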
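
The mismatch described above can be reproduced with a few lines of Python. This is only an illustrative sketch using the standard xml.etree.ElementTree module. The sample text and the helper name parses are assumptions made for the example. The same Latin-1 bytes parse when the declaration says ISO-8859-1 and fail when it claims UTF-8:

```python
import xml.etree.ElementTree as ET


def parses(data: bytes) -> bool:
    """Return True if the bytes parse as XML, False on a parse error."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


text = "café"  # contains one non-ASCII character

# Actual encoding (Latin-1) matches the internal declaration.
consistent = (
    f'<?xml version="1.0" encoding="ISO-8859-1"?><name>{text}</name>'
).encode("latin-1")

# Actual encoding is still Latin-1, but the declaration claims UTF-8.
mismatched = (
    f'<?xml version="1.0" encoding="UTF-8"?><name>{text}</name>'
).encode("latin-1")

print(parses(consistent), parses(mismatched))
```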
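Why Unicode encodings exist
Every legacy encoding is limited because it can represent text in only a few languages. Managing multiple encodings is a headache, not only because most applications and databases can handle just one encoding, but for many other reasons as well. Unicode was invented to solve this problem. It is a character set meant to represent every character of every language in use, with room to grow.
Table 1. Byte Order Marks of the Unicode encodings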

BOM type	BOM value	Encoding
UTF-8	X'EFBBBF'	UTF-8
UTF-16 Big Endian	X'FEFF'	UTF-16
UTF-16 Little Endian	X'FFFE'	UTF-16
UTF-32 Big Endian	X'0000FEFF'	UTF-32
UTF-32 Little Endian	X'FFFE0000'	UTF-32

The table above comes from:

http://www.ibm.com/developerworks/cn/education/data/db2-cert7333/section4.html
DB2 9 Application Development (Exam 733) certification guide, Part 3: XML data manipulation, section "XML encoding"
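
The BOM values in Table 1 can be turned into a small lookup. This is a rough Python sketch built on the BOM constants in the standard codecs module. The names BOM_TABLE and bom_type are invented for the example. The UTF-32 entries are checked first because the UTF-32 Little Endian BOM (FF FE 00 00) begins with the UTF-16 Little Endian BOM (FF FE):

```python
import codecs

# Longest BOMs first, so FF FE 00 00 is not mistaken for UTF-16 LE.
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "UTF-32 Big Endian"),
    (codecs.BOM_UTF32_LE, "UTF-32 Little Endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 Big Endian"),
    (codecs.BOM_UTF16_LE, "UTF-16 Little Endian"),
]


def bom_type(data: bytes):
    """Return the BOM type from Table 1 that data starts with, or None."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name
    return None


print(bom_type(b"\xef\xbb\xbf<?xml version='1.0'?>"))
```

A result of None only means that no BOM was found. The file may still be in any encoding, so the XML declaration or external information has to decide.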
The following content comes from: http://social.msdn.microsoft.com/Forums/zh-CN/Vsexpressvb/thread/6fbe8086-7950-43f4-a703-19491cb1d9f6

This is all about the differences in the encoding with different files, as discussed in the earlier thread. I've looked into it a little more and believe the following is correct.
If there is no BOM then the reader assumes it is UTF-8.
Your original file doesn't have a BOM but it uses UTF-16.
Depending upon how you edit the file it may get saved as UTF-8 (VS) or UTF-16 (Notepad).
It's all a bit of a mess really.

' Requires "Imports System.IO" at the top of the file.
' Guesses a file's encoding from its first two bytes: a BOM if present,
' otherwise the position of a zero byte.
Private Function GetEncoding(ByVal Filename As String) As System.Text.Encoding
    Dim Coding As System.Text.Encoding
    Dim Header(1) As Byte
    ' Using closes the stream even if Read throws.
    Using FS As New FileStream(Filename, FileMode.Open, FileAccess.Read)
        FS.Read(Header, 0, 2)
    End Using
    ' "X2" pads each byte to two hex digits; Hex() would drop a leading zero.
    Dim HeaderString As String = Header(0).ToString("X2") & Header(1).ToString("X2")
    Select Case HeaderString
        Case "FFFE"
            ' Unicode UTF-16, little endian
            Coding = System.Text.Encoding.Unicode
        Case "FEFF"
            ' Unicode UTF-16, big endian
            Coding = System.Text.Encoding.BigEndianUnicode
        Case "EFBB"
            ' probably UTF-8
            Coding = System.Text.Encoding.UTF8
        Case Else ' No BOM (maybe)
            If Header(0) = 0 Then
                ' probably big endian
                Coding = System.Text.Encoding.BigEndianUnicode
            ElseIf Header(1) = 0 Then
                ' probably little endian
                Coding = System.Text.Encoding.Unicode
            Else
                ' probably UTF-8 but I wouldn't put money on it
                Coding = System.Text.Encoding.UTF8
            End If
    End Select
    Return Coding
End Function
