Wednesday, August 24, 2011

Java / XML UTF-8 marshalling and validating pains

This post covers the little pains I am sure most people have endured while working with Java and XML from different sources that contain 'funny' (non-ASCII) characters.

If you use a FileWriter instead of wrapping a FileOutputStream yourself, it will use your platform's default encoding (so possibly cp1252, ISO-8859-1 or US-ASCII, and not necessarily UTF-8).
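
To make the difference concrete, here is a minimal sketch of what the same text becomes under different encodings. The class and method names (DefaultEncodingDemo, encode) are made up for illustration, and the result for the platform default depends on the machine it runs on.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultEncodingDemo {

    /** Encodes text the way a writer configured with the given charset would. */
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", written with an escape so the source file encoding does not matter

        // This is the charset a writer falls back to when none is specified.
        System.out.println("Platform default: " + Charset.defaultCharset());

        // ISO-8859-1 uses one byte per character here; UTF-8 needs two bytes for the accented letter.
        System.out.println("ISO-8859-1 bytes: " + encode(text, StandardCharsets.ISO_8859_1).length);
        System.out.println("UTF-8 bytes:      " + encode(text, StandardCharsets.UTF_8).length);

        // US-ASCII cannot represent the accented letter, so it is silently replaced with '?'.
        byte[] ascii = encode(text, StandardCharsets.US_ASCII);
        System.out.println("US-ASCII last byte: " + (char) ascii[ascii.length - 1]);
    }
}
```

The silent replacement in the US-ASCII case is the nasty part: nothing fails at write time, and the damage only shows up when someone reads the file back.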

A way to ensure that you are using UTF-8 is to wrap a FileOutputStream in an OutputStreamWriter and name the encoding explicitly.
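
A minimal sketch of that approach follows. The helper name writeUtf8 and the class name are invented for this example; StandardCharsets requires Java 7 or later, and on older JVMs the string "UTF-8" can be passed to OutputStreamWriter instead.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf8FileWriterDemo {

    /** Writes text to a file as UTF-8, regardless of the platform default encoding. */
    static void writeUtf8(File file, String text) throws IOException {
        // The OutputStreamWriter is told which charset to use, so nothing is left to the OS.
        try (Writer out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("utf8-demo", ".xml");
        writeUtf8(file, "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>Jos\u00e9</name>");
        System.out.println("Wrote " + file.length() + " bytes to " + file);
    }
}
```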

If you are reading XML from a stream, it is probably safer to state the encoding explicitly than to rely on the platform default.
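
One way to do this, sketched under the assumption that the incoming bytes are known to be UTF-8, is to decode the stream with an InputStreamReader and hand the parser an InputSource. The names Utf8XmlReaderDemo and parseUtf8 are made up for the example. Note that when the parser is given characters rather than bytes, it does not use the encoding named in the XML declaration, so this only makes sense when you really do know the encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class Utf8XmlReaderDemo {

    /** Parses XML from a byte stream that is assumed to be UTF-8 encoded. */
    static Document parseUtf8(InputStream in) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Decode the bytes ourselves so the platform default charset never gets involved.
        InputSource source = new InputSource(new InputStreamReader(in, StandardCharsets.UTF_8));
        return builder.parse(source);
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<name>Jos\u00e9</name>".getBytes(StandardCharsets.UTF_8);
        Document doc = parseUtf8(new ByteArrayInputStream(xml));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

If the documents carry a trustworthy encoding declaration instead, passing the raw InputStream to the parser lets it detect the encoding on its own.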

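When validating against a schema, note the .getBytes("UTF-8") used to turn the XML string into bytes: calling getBytes() with no argument would use the platform default encoding instead.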
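
A minimal sketch of such a validation using the JDK's javax.xml.validation API is shown here. The class name, the isValid helper and the tiny inline schema are all invented for illustration; getBytes(StandardCharsets.UTF_8) does the same job as getBytes("UTF-8") without the checked exception.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class SchemaValidationDemo {

    /** Returns true if the XML string validates against the given XSD. */
    static boolean isValid(String xsd, String xml) throws SAXException, IOException {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        // A broken schema is reported to the caller as a SAXException.
        Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
        Validator validator = schema.newValidator();

        // Encode explicitly as UTF-8 so the bytes match an encoding="UTF-8" declaration.
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        try {
            validator.validate(new StreamSource(new ByteArrayInputStream(utf8)));
            return true;
        } catch (SAXException e) {
            return false; // the document is malformed or does not match the schema
        }
    }

    public static void main(String[] args) throws Exception {
        String xsd = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
                + "<xs:element name='name' type='xs:string'/></xs:schema>";
        String xml = "<?xml version='1.0' encoding='UTF-8'?><name>Jos\u00e9</name>";
        System.out.println("valid: " + isValid(xsd, xml));
    }
}
```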

When marshalling, use a stream rather than a writer, so that the marshaller itself controls how characters are turned into bytes.
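
The sketch below illustrates the idea with the JDK's built-in Transformer as a stand-in, since it runs without extra libraries; it is not the marshalling code this post originally showed. The class and method names are hypothetical. Because the serializer writes straight to an OutputStream, the encoding named in the XML declaration and the bytes actually written cannot disagree.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class Utf8XmlWriterDemo {

    /** Serializes a DOM document to a byte stream as UTF-8. */
    static void writeUtf8(Document doc, OutputStream out) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Ask for UTF-8 explicitly; the serializer then does the character-to-byte conversion.
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.transform(new DOMSource(doc), new StreamResult(out));
    }

    /** Builds a one-element sample document containing a non-ASCII character. */
    static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element name = doc.createElement("name");
        name.setTextContent("Jos\u00e9");
        doc.appendChild(name);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeUtf8(sampleDocument(), out);
        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

With JAXB the same idea applies: set the Marshaller.JAXB_ENCODING property to "UTF-8" and call marshal(object, outputStream) rather than passing a Writer whose encoding was fixed somewhere else.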

Lastly, if you are getting XML from an unknown source, it may contain characters that fall outside the legal XML Unicode range and cannot be encoded. Those have to be removed before the document will parse.
The code I used for this came from another blog.
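
The referenced snippet is not reproduced here, so what follows is a minimal sketch of the same idea rather than that blog's code. It keeps only the characters permitted by the XML 1.0 Char production (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF). The class and method names are made up for the example.

```java
class XmlCharFilter {

    /** Returns true if the code point is allowed in an XML 1.0 document. */
    static boolean isLegalXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Returns a copy of the input with every character that is illegal in XML 1.0 removed. */
    static String stripInvalidXmlChars(String in) {
        if (in == null || in.isEmpty()) {
            return in;
        }
        StringBuilder out = new StringBuilder(in.length());
        // Walk by code point so that surrogate pairs are kept together
        // and unpaired surrogates are dropped.
        for (int i = 0; i < in.length(); ) {
            int codePoint = in.codePointAt(i);
            if (isLegalXmlChar(codePoint)) {
                out.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "a\u0000b\u0008c";
        System.out.println(stripInvalidXmlChars(dirty));
    }
}
```

Run the input through the filter before handing it to a parser or validator; escaping such characters as numeric references does not help, because XML 1.0 forbids those as well.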

