need to convert files to unicode (\u****)

hi,
I have some files written in Korean and Chinese, but I can't use them until I convert them to unicode.
By unicode here I mean converting characters to escape sequences, like $ to \u0024.
I am using Java, and the parser here only accepts unicode escapes (\u****).
Is there some tool or script available for this?

Hi,
if you know the charset used in the files, you could read them with something like this:



java.io.BufferedReader readfile = 
  new java.io.BufferedReader (
     new java.io.InputStreamReader(
        new java.io.FileInputStream(myfile),"ISO8859_1"));

(Here I presume the initial charset is ISO8859_1, i.e. Latin-1, which is close to the default Windows code page.)

You can then use a BufferedWriter to write the chars out in UTF-8 or whatever.
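For example, something along these lines (only a sketch; outfile is a placeholder name, and readfile is the reader from above):

java.io.BufferedWriter writefile = 
  new java.io.BufferedWriter (
     new java.io.OutputStreamWriter(
        new java.io.FileOutputStream(outfile), "UTF-8"));

String temp;
while ((temp = readfile.readLine()) != null) {
    writefile.write(temp);    // the chars get re-encoded as UTF-8 on the way out
    writefile.newLine();
}
writefile.close();
readfile.close();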
HTH

Lenwolf

wormulove wrote:

>
> hi,
> I have some files written in Korean and Chinese, but I can't use them
> until I convert them to unicode.
> By unicode here I mean converting characters to escape sequences, like $ to \u0024.
> I am using Java, and the parser here only accepts unicode escapes (\u****).
> Is there some tool or script available for this?
>
>
Is this what you are looking for (you may need to combine it with the
suggestion from lenwolf for the initial reading of the file)?
http://www.xinotes.org/notes/note/812/
Java: convert UTF-8 to unicode escape string



hi,
Thank you, guys.
The link given by Martin uses lenwolf's concept. It looks great so far; I need to test more.

However, the logic used to build the unicode escapes is still not very clear to me:

for (int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    if ((c >> 7) > 0) {
        sb.append("\\u");
        sb.append(hexChar[(c >> 12) & 0xF]); // hex character for the left-most 4 bits
        sb.append(hexChar[(c >> 8) & 0xF]);  // hex for the second group of 4 bits from the left
        sb.append(hexChar[(c >> 4) & 0xF]);  // hex for the third group
        sb.append(hexChar[c & 0xF]);         // hex for the last group, i.e. the right-most 4 bits
    }
    else {
        sb.append(c);
    }
}
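
To check my understanding, here is the same nibble logic written out by hand for a single character, 'é' (U+00E9), reusing the hexChar lookup table from the linked example (just my own little test, not part of that code):

char[] hexChar = { '0', '1', '2', '3', '4', '5', '6', '7',
                   '8', '9', 'A', 'B', 'C', 'D', 'E', 'F' };
char c = '\u00E9';                             // é = 0000 0000 1110 1001 in binary
// (c >> 7) > 0 because 0x00E9 >> 7 == 1, so this character gets escaped
System.out.print(hexChar[(c >> 12) & 0xF]);    // '0'  (bits 15..12)
System.out.print(hexChar[(c >> 8) & 0xF]);     // '0'  (bits 11..8)
System.out.print(hexChar[(c >> 4) & 0xF]);     // 'E'  (bits 7..4)
System.out.print(hexChar[c & 0xF]);            // '9'  (bits 3..0)
// prefixed with "\\u" this gives the escape \u00E9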

But things are looking fine so far.

Hi, actually, I’m not sure that this is necessary.

AFAIK, at least as of Java 1.5 (or Java 5), Java handles strings internally as UTF-16.

(See, eg. String (Java 2 Platform SE 5.0)).

So, provided you can get the right charset when opening the file, as I mentioned above, all you would need to do is read the strings from the file and write them to an output file.

 
 String temp="";
 java.io.BufferedReader readfile = 
    new java.io.BufferedReader (
      new java.io.InputStreamReader(
         new java.io.FileInputStream(inputFile),"ISO8859_1"));

 java.io.BufferedWriter sout = 
     new  java.io.BufferedWriter(
        new  java.io.OutputStreamWriter( 
             new  java.io.FileOutputStream(new java.io.File (outputFile)), "UTF-8"));
           

 while ((temp = readfile.readLine()) != null)
 {
      sout.write(temp);
      sout.newLine();
  }
  //close input & output etc

HTH

Lenwolf

lenwolf wrote:

>
> Hi, actually, I’m not sure that this is necessary.
>
> AFAIK, at least as of Java 1.5 (or Java 5), Java handles strings
> internally as UTF-16.
>
> (See, eg. ‘String (Java 2 Platform SE 5.0)’
> (http://tinyurl.com/3qpbn8f)).
>
> So, provided you can get the right charset when opening the file as I
> mentioned above, all you would need would be to read in the strings from
> the file and write them to an output file.
>
But reading the original description, what the OP wants is the escape
sequences, not a UTF-8 file; it is just described a bit oddly. (Conversion to
plain UTF-8 is in fact trivial and does not need Java at all; it can also be
done with iconv from the command line.)



The reason I gave a weird explanation of my need is that whenever I search or ask how to convert to unicode, the answers I get are "save as unicode in a text editor" or save with type "UTF-8" in Java, but that is not the requirement here. Here I want to convert the languages to the form \udddd, as these languages cannot be represented in any other form; whenever saving such a document we have to ensure the encoding type is unicode, or else it will be corrupted.
I don't know if that can be achieved with that API,
so that algorithm is very much required, I suppose.

If you save a file to a format with \uxxxx in it, you exactly do not save it
as unicode; you save it as a plain ASCII file with escape sequences, which is
what the piece of code I pointed you to does.
Do you understand the difference?
If your requirement is to save the file as UTF-8, you will NEVER see
something like \uxxxx in it (unless it is accidentally part of some text
about unicode escape sequences, of course).
You have the code for both use cases, so just find out what you really need
to do: saving as UTF-8, or saving as ASCII with unicode escape sequences. We
cannot know that.
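
To make the difference concrete for one character (a throwaway illustration, not from the linked page; the class name is made up):

public class EscapeVsUtf8 {
    public static void main(String[] args) throws Exception {
        String s = "\u00E9";                  // the single character é
        byte[] utf8 = s.getBytes("UTF-8");    // 2 bytes (0xC3 0xA9) -> contents of a UTF-8 file
        String escaped = "\\u00E9";           // 6 plain ASCII chars -> contents of an escaped ASCII file
        System.out.println(utf8.length + " bytes as UTF-8, "
                + escaped.length() + " chars as an escape sequence");
    }
}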



Hi,

Ok, apparently I misunderstood.

Just to be clear:

Let's say you have an input file with only the "é" character in it, and it is coded in the standard Windows code page. That is a file with one byte in it.
You want to convert that into a file containing the characters "\u00E9", i.e. the unicode escape representation of the "é" character. That would be a file with 6 bytes in it. Is that what you want?

Then the code in Martin_Helm's link does exactly that (if you amend it to send the bytes to a file instead of the console). You just have to make sure that you open the InputStreamReader with the correct charset!
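
The amendment would be roughly this (only a sketch: "output.txt" is a placeholder, r is the reader opened with the correct charset, and unicodeEscape is the method from the linked page); instead of printing to the console, you write the escaped line through an ASCII-encoded writer:

java.io.BufferedWriter sout = 
    new java.io.BufferedWriter(
       new java.io.OutputStreamWriter(
          new java.io.FileOutputStream("output.txt"), "US-ASCII"));

String line;
while ((line = r.readLine()) != null) {
    sout.write(unicodeEscape(line));   // the escaped text is pure ASCII
    sout.newLine();
}
sout.close();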

HTH

Lenwolf

hi,
With all your help so far, I have written the code up to this point:

import java.io.*;

/**
 * Reads a file in UTF-8 encoding and outputs to STDOUT in ASCII with unicode
 * escape sequences for characters outside of ASCII.
 */
public class UTF8ToAscii {
	public static void main(String[] args) throws Exception {
		// if (args.length < 1)
		java.io.BufferedWriter sout = new java.io.BufferedWriter(
				new java.io.OutputStreamWriter(new java.io.FileOutputStream(
						new java.io.File("resources/outputFile")), "ISO8859_1"));
		String line = "hi";
		{

			// Directory path here
			String path = "resources";
			String[] fl = new String[999];
			int c = 0;

			String files;
			File folder = new File(path);
			File[] listOfFiles = folder.listFiles();

			for (int i = 0; i < listOfFiles.length; i++) {

				if (listOfFiles[i].isFile()) {
					files = listOfFiles[i].getName();
					if (files.endsWith(".rc")) {
						fl[c] = files;
						System.out.println(fl[c] + "" + c);
						// System.out.println(files.getAbsolutePath());
						ListFiles ls = new ListFiles();
						{

							String Str = "resources/" + fl[c];

							BufferedReader r = new BufferedReader(
									new InputStreamReader(new FileInputStream(
											Str), "UTF-8"));

							line = r.readLine();
							while (line != null) {
								System.out.println(unicodeEscape(line));
								line = r.readLine();

								sout.write(line);
							}
							r.close();
						}
						c++;
					}
				}
			}

		}

	}

	private static final char[] hexChar = { '0', '1', '2', '3', '4', '5', '6',
			'7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F' };

	private static String unicodeEscape(String s) {
		StringBuilder sb = new StringBuilder();
		for (int i = 0; i < s.length(); i++) {
			char c = s.charAt(i);
			if ((c >> 7) > 0) {
				sb.append("\\u");
				sb.append(hexChar[(c >> 12) & 0xF]); // hex character for the
														// left-most 4 bits
				sb.append(hexChar[(c >> 8) & 0xF]); // hex for the second group
													// of 4 bits from the left
				sb.append(hexChar[(c >> 4) & 0xF]); // hex for the third group
				sb.append(hexChar[c & 0xF]); // hex for the last group, i.e.
												// the right-most 4 bits
			} else {
				sb.append(c);
			}
		}
		return sb.toString();
	}

}

However, I am getting an exception here:

Exception in thread "main" java.lang.NullPointerException
	at java.io.Writer.write(Writer.java:157)
	at UTF8ToAscii.main(UTF8ToAscii.java:47)

The output file is modified on each run but never gets any text. The System.out.println is successfully printing the unicode values,
but I am not able to write them to the file. I have changed the encoding type to ISO8859_1. What can be the cause of the error here?

Thanks

Changing the writer encoding type to UTF-8 also has no effect, and the same error is thrown.
Does this have something to do with the compiler? I am using OpenJDK 1.6.

Hi,

No, for some reason your output file doesn't get opened correctly.

For testing purposes, try using a hard-coded filename, e.g. "c:/textoutput" or "/tmp/testoutput" or some such.

HTH

Lenwolf

I am already using a hard-coded file name, if this is what you mean:

java.io.BufferedWriter sout = new java.io.BufferedWriter(
				new java.io.OutputStreamWriter(new java.io.FileOutputStream(
						new java.io.File("resources/outputFile")), "ISO8859_1"));

And you must have seen that the output file is written, as Eclipse says that the file has changed after running the program.
I am not able to figure out the actual cause here.

Thanks

Hi,

the problem lies in this part of the code:

                   

line = r.readLine();
while (line != null) {
    System.out.println(unicodeEscape(line));
    line = r.readLine();
    sout.write(line);

You read another line with r.readLine() just before you write it, after you have printed the older line with System.out. But the line is null at the end of the file! Hence the error.
Either test for this explicitly:

                   

line = r.readLine();
while (line != null) {
    System.out.println(unicodeEscape(line));
    line = r.readLine();
    if (line != null)
        sout.write(line);

or, better still, make the read part of the while condition:

                   

while ((line = r.readline()) != null) {
    System.out.println(unicodeEscape(line));

    sout.write(line);
}

HTH

Lenwolf

Those checks I did already. I am making some conceptual mistake here and cannot figure it out.

line = r.readLine();
while (line != null) {
    System.out.println(unicodeEscape(line));
    line = r.readLine();
    sout.write(line);

This check lets the program complete, as line is always null at the end, but the target, which is actually writing the output, is not achieved.

No you didn't!!!
Please read again what I wrote above.

You make TWO r.readLine() calls, but only check for null on the first one, not on the one inside the loop!!!
Use one of the two solutions I gave above.
Lenwolf


line = r.readLine();
while (line != null) {
    System.out.println(unicodeEscape(line));
    line = r.readLine();
    if (line != null)
        sout.write(line);

As the line value is always null, the program runs fine, but no write is done.


while ((line = r.readline()) != null) {
    System.out.println(unicodeEscape(line));

    sout.write(line);
}

Don't know why, but it says there is no readline method defined for BufferedReader.

Yes, as I said, the last line "read" is always null; that's how you know you have reached the end of the file.

It is r.readLine() (capital L) and not r.readline().

Lenwolf

import java.io.*;

/**
 * Reads a file in UTF-8 encoding and outputs to STDOUT in ASCII with unicode
 * escape sequences for characters outside of ASCII.
 */
public class UTF8ToAscii {
	public static void main(String[] args) throws Exception {
		// if (args.length < 1)

		String line = "hi";
		{

			// Directory path here
			String path = "resources";
			String[] fl = new String[999];
			int c = 0;

			String files;
			File folder = new File(path);
			File[] listOfFiles = folder.listFiles();

			for (int i = 0; i < listOfFiles.length; i++) {

				if (listOfFiles[i].isFile()) {
					files = listOfFiles[i].getName();
					if (files.endsWith(".rc")) {
						fl[c] = files;
						System.out.println(fl[c] + "" + c);
						// System.out.println(files.getAbsolutePath());
						ListFiles ls = new ListFiles();
						{

							String Str = "resources/" + fl[c];

							BufferedReader r = new BufferedReader(
									new InputStreamReader(new FileInputStream(
											Str), "UTF-8"));

							BufferedWriter sout = new BufferedWriter(
									new FileWriter("opt.txt"));

							while ((line = r.readLine()) != null) {

								String n1 = unicodeEscape(line);
								sout.append(n1);
								sout.append(System
										.getProperty("line.separator"));
								// System.out.println(unicodeEscape(line));
								
							}
							
							r.close();
							sout.close();
						}
						c++;
					}
				}
			}

		}

	}

	private static final char[] hexChar = { '0', '1', '2', '3', '4', '5', '6',
			'7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F' };

	private static String unicodeEscape(String s) {
		StringBuilder sb = new StringBuilder();
		for (int i = 0; i < s.length(); i++) {
			char c = s.charAt(i);
			if ((c >> 7) > 0) {
				sb.append("\\u");
				sb.append(hexChar[(c >> 12) & 0xF]); // hex character for the
														// left-most 4 bits
				sb.append(hexChar[(c >> 8) & 0xF]); // hex for the second group
													// of 4 bits from the left
				sb.append(hexChar[(c >> 4) & 0xF]); // hex for the third group
				sb.append(hexChar[c & 0xF]); // hex for the last group, i.e.
												// the right-most 4 bits
			} else {
				sb.append(c);
			}
		}
		return sb.toString();
	}

}

Got it working with this :P
Still analyzing the cause. Did I need to use append?

Thanks

Strange, I completely rewrote it, but anyway it is now working fine.