C# – How to deal with ISO-2022-JP ( and other character sets ) in a Twitter update

cencodingtwitter

Part of my application accepts arbitrary text and posts it as an Update to Twitter. Everything works fine, until it comes to posting foreign ( non ASCII/UTF7/8 ) character sets, then things no longer work.

For example, if someone posts:
に投稿できる

It ( within my code in Visual Studio debugger ) becomes:
=?ISO-2022-JP?B?GyRCJEtFajlGJEckLSRrGyhC?=

Googling has told me that this represents ( minus ? as delimiters )

=?ISO-2022-JP is the text encoding
?B means it is base64 encoded
?GyRCJEtFajlGJEckLSRrGyhC? Is the encoded string

For the life of me, I can't figure out how to get this string posted as an update to Twitter in it's original Japanese characters. As it stands now, sending '=?ISO-2022-JP?B?GyRCJEtFajlGJEckLSRrGyhC?=' to Twitter will result in exactly that getting posted. Ive also tried breaking the string up into pieces as above, using System.Text.Encoding to convert to UTF8 from ISO-2022-JP and vice versa, base64 decoded and not. Additionally, ive played around with the URL Encoding of the status update like this:


string[] bits = tweetText.Split(new char[] { '?' });
if (bits.Length >= 4)
{
textEncoding = System.Text.Encoding.GetEncoding(bits[1]);
xml = oAuth.oAuthWebRequest(TwitterLibrary.oAuthTwitter.Method.POST, url, "status=" +   System.Web.HttpUtility.UrlEncode(decodedText, textEncoding)); 
}

No matter what I do, the results never end up back to normal.

EDIT:
Got it in the end. For those following at home, it was pretty close to the answer listed below in the end. It was just Visual Studios debugger was steering me the wrong way and a bug in the Twitter Library I was using. End result was this:


decodedText = textEncoding.GetString(System.Convert.FromBase64String(bits[3]));
byte[] originalBytes = textEncoding.GetBytes(decodedText);
byte[] utfBytes = System.Text.Encoding.Convert(textEncoding, System.Text.Encoding.UTF8, originalBytes);
// now, back to string form
decodedText = System.Text.Encoding.UTF8.GetString(utfBytes);

Thanks all.

Best Answer

This produced the output you are looking for:

using System;
using System.Text;

class Program {
  static void Main(string[] args) {
    string input = "に投稿できる";
    Console.WriteLine(EncodeTwit(input));
    Console.ReadLine();
  }
  public static string EncodeTwit(string txt) {
    var enc = Encoding.GetEncoding("iso-2022-jp");
    byte[] bytes = enc.GetBytes(txt);
    char[] chars = new char[(bytes.Length * 3 + 1) / 2];
    int len = Convert.ToBase64CharArray(bytes, 0, bytes.Length, chars, 0);
    return "=?ISO-2022-JP?B?" + new string(chars, 0, len) + "?=";
  }
}

Standards are great, there are so many to choose from. ISO never disappoints, there are no less than 3 ISO-2022-JP encodings. If you have trouble then also try encodings 50221 and 50222.

Related Topic