(My earlier note, "file()'s problem with UTF-16", was wrong; this is the updated version.
The earlier code could miss the last line of the string.)
file() seems to have a problem handling
UTF-16, with or without a BOM.
file() appears to treat every "\n" = LF (0A) byte as a line ending,
so not only "000A" but also "010A", "020A", ..., "FE0A", "FF0A", ...
are regarded as line endings.
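A minimal sketch of this effect, assuming PHP 7.0+ (for the "\u{...}" escape) and the mbstring extension. The sample text and the temporary file are only illustrative. U+010A is encoded as the bytes 01 0A in UTF-16BE, so file() sees one line break too many:

```php
<?php
// Illustrative sample: "A", U+010A, "B", a real line feed, then "C".
$utf8  = "A\u{010A}B\nC";
$utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');

// Temporary file used only for this demonstration.
$tmp = tempnam(sys_get_temp_dir(), 'u16');
file_put_contents($tmp, $utf16);

$lines = file($tmp);
unlink($tmp);

// The text has two lines, but file() also splits after the 0A byte of U+010A.
echo count($lines), "\n";
foreach ($lines as $line) {
    echo bin2hex($line), "\n";
}
```

This should print 3, followed by 0041010a, 0042000a and 0043: the first "line" ends in the middle of the text, right after U+010A.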
Moreover, file() causes a serious problem with UTF-16LE.
file() splits the first "0A00" right after the "0A", so the line feed loses its second byte!
The next line then begins with "00" (the rest of "0A00").
So every line after the first "0A" is shifted by one byte and decodes to totally different text.
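The byte shift can be seen with a small sketch like the following, again assuming the mbstring extension; the two-line sample and the temporary file are only illustrative:

```php
<?php
// Illustrative sample: two lines, "AB" and "CD", separated by a line feed.
$utf16le = mb_convert_encoding("AB\nCD", 'UTF-16LE', 'UTF-8');

// Temporary file used only for this demonstration.
$tmp = tempnam(sys_get_temp_dir(), 'u16');
file_put_contents($tmp, $utf16le);

$lines = file($tmp);
unlink($tmp);

// First chunk: "AB" plus only the first byte (0A) of the line feed 0A 00.
echo bin2hex($lines[0]), "\n";
// Second chunk: starts with the leftover 00, so it is no longer "CD" in UTF-16LE.
echo bin2hex($lines[1]), "\n";
```

This should print 410042000a and then 0043004400. Read as UTF-16LE, the second chunk starts one byte too early, so it no longer decodes as "CD".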
To avoid this phenomenon,
e.g. when the PHP script is UTF-8 and the file is UTF-16 with line ending "\r\n", use
<?php
// The script is UTF-8, so each pattern is converted to the file's encoding.
mb_regex_encoding('UTF-16');
$str = file_get_contents($file_path);
$to_encoding = 'UTF-16';
$from_encoding = 'UTF-8';
$file = array();

// Collect every line that ends with "\r\n".
$pattern1 = mb_convert_encoding('[^\r]*\r\n', $to_encoding, $from_encoding);
mb_ereg_search_init($str, $pattern1);
while ($res = mb_ereg_search_regs()) {
    $file[] = $res[0];
}

// Collect whatever follows the last "\r\n" (the final line).
$pattern2 = mb_convert_encoding('\A.*\r\n(.*)\z', $to_encoding, $from_encoding);
if (mb_ereg($pattern2, $str, $match)) {
    $file[] = $match[1];
}
?>
instead of
$file = file($file_path);
If the line ending is "\n", use
$pattern1 = mb_convert_encoding('[^\n]*\n', $to_encoding, $from_encoding);
and likewise replace "\r\n" with "\n" in $pattern2.
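A different approach (not the mb_ereg method above) is to convert the whole string to UTF-8 first, split it there, and convert each line back. This is only a sketch. The helper name read_utf16_lines() is hypothetical, and it assumes that the byte order is known and that the file has no BOM. Unlike the code above, it drops the line endings:

```php
<?php
/**
 * Hypothetical helper (not part of PHP): read a UTF-16 file as an array of lines.
 * Assumes the caller knows the byte order and that the file has no BOM.
 */
function read_utf16_lines(string $path, string $encoding = 'UTF-16LE'): array
{
    $raw = file_get_contents($path);
    if ($raw === false) {
        return [];
    }

    // In UTF-8 the byte 0A only ever means a line feed, so splitting is safe.
    $utf8  = mb_convert_encoding($raw, 'UTF-8', $encoding);
    $lines = preg_split('/\r\n|\n/', $utf8);

    // Convert each line back to the original encoding (line endings are dropped).
    return array_map(
        function ($line) use ($encoding) {
            return mb_convert_encoding($line, $encoding, 'UTF-8');
        },
        $lines
    );
}
```

For example, $file = read_utf16_lines($file_path, 'UTF-16BE'); would be used for a big-endian file. If the file ends with a line ending, the last element is an empty string.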