I have seen articles online claiming that it is easy to open a web page encoded in Big5 HKSCS with Java's InputStream classes and convert it directly to UTF-8.
Following the method those articles describe, I wrote a Java program that downloads a Hong Kong Government web page, to check whether the content really comes out as UTF-8.
The source code is as follows:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Building {
    public static void main(String[] args) throws Exception {
        URL bdgSite = new URL("https://bmis.buildingmgt.gov.hk/chi/building.php?count=0&ordfield=&district_id=0");
        StringBuilder s = new StringBuilder();
        URLConnection fc = bdgSite.openConnection();
        // Decode the response bytes as Big5-HKSCS; Java strings are Unicode internally.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fc.getInputStream(), "Big5_HKSCS"))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                s.append(inputLine);
            }
        }
        // Capture the text of each <option> element (the district names).
        Pattern p1 = Pattern.compile("<option[^>]+>([^<]+)</option>");
        Matcher m1 = p1.matcher(s);
        while (m1.find()) {
            // System.out encodes this text with the platform default charset.
            System.out.println(m1.group(1));
        }
    }
}
In the program, I use a regular expression to pull out the Chinese names of the 18 districts. The one to watch is the character 「埗」 in 深水埗 (Sham Shui Po), which cannot be decoded correctly with plain Big5.
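The extraction step can be tried on its own, without any network access. The sketch below applies the same pattern to a small hand-written HTML fragment; the class name OptionExtractor, the method extractOptions, and the sample markup are made up for illustration and are not taken from the government page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionExtractor {
    // Same pattern as in the program: an <option> tag with at least one
    // attribute character, followed by text that contains no '<'.
    private static final Pattern OPTION =
            Pattern.compile("<option[^>]+>([^<]+)</option>");

    // Returns the visible label of every matching <option> element, in order.
    static List<String> extractOptions(String html) {
        List<String> labels = new ArrayList<String>();
        Matcher m = OPTION.matcher(html);
        while (m.find()) {
            labels.add(m.group(1));
        }
        return labels;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, only for demonstration.
        String html = "<select>"
                + "<option value=\"1\">Central and Western</option>"
                + "<option value=\"2\">Sham Shui Po</option>"
                + "</select>";
        for (String label : extractOptions(html)) {
            System.out.println(label);
        }
    }
}
```

Note that the pattern skips an option tag with no attributes at all (a bare `<option>`), because `[^>]+` needs at least one character.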
I ran it on a Mac. With Terminal set to UTF-8 the output was garbled; with Terminal set to Big5 HKSCS it looked normal. My first reading was that Java can read the page correctly but does not convert the content to Unicode or UTF-8, so the output is still in Big5 HKSCS.
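It may help to separate the two halves of the conversion: decoding Big5 HKSCS bytes into a Java String, and encoding that String as UTF-8 bytes. The sketch below does both in memory, assuming the JDK in use ships the Big5-HKSCS charset (it is an extended charset, so this is not guaranteed everywhere). The class name TranscodeSketch and the method transcodeToUtf8 are made up for illustration, and the Chinese text is written with Unicode escapes so that the source file encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeSketch {
    // Step 1: decode the bytes into a String using the source charset.
    // Step 2: encode that String as UTF-8 bytes.
    static byte[] transcodeToUtf8(byte[] input, Charset source) {
        String text = new String(input, source);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumption: this charset is available in the running JDK.
        Charset hkscs = Charset.forName("Big5-HKSCS");

        // "Sham Shui Po" in Chinese: U+6DF1 U+6C34 U+57D7.
        String district = "\u6df1\u6c34\u57d7";

        byte[] big5Bytes = district.getBytes(hkscs);
        byte[] utf8Bytes = transcodeToUtf8(big5Bytes, hkscs);

        System.out.println("Big5-HKSCS bytes: " + big5Bytes.length);
        System.out.println("UTF-8 bytes: " + utf8Bytes.length);
    }
}
```

Neither step touches System.out, so whatever the terminal shows depends on a third, separate encoding step when the text is printed.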
I then ran the same program on an Ubuntu 13.04 desktop with JDK 1.7.0_21, and it worked properly. Eventually I found that on Mac OS, Java's default file encoding is not UTF-8 (it is very likely ISO-8859-1), and this setting determines how the output is encoded. To get correct output, the program has to be run with the option -Dfile.encoding=UTF-8:
java -Dfile.encoding=UTF-8 Building
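A possible alternative to the command-line option, which I have not tested in the Mac setup described here, is to wrap the output in a PrintStream with an explicit charset, so the program no longer depends on the default encoding. A minimal sketch follows; the class name Utf8Output and the helper utf8Stream are made up for illustration.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Output {
    // Wraps any OutputStream in a PrintStream that always encodes as UTF-8,
    // regardless of the JVM's default charset. Auto-flush is enabled.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        // "Sham Shui Po" in Chinese, written with Unicode escapes so the
        // source file encoding does not matter.
        out.println("\u6df1\u6c34\u57d7");
    }
}
```

In the Building program this would mean printing through such a stream instead of calling System.out.println directly. The terminal still has to be set to UTF-8 for the text to display correctly.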