A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:
For 1-byte character, the first bit is a 0, followed by its unicode code.
For n-bytes character, the first n-bits are all one's, the n+1 bit is 0, followed by n-1 bytes with most significant 2 bits being 10.
This is how the UTF-8 encoding would work:
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Given an array of integers representing the data, return whether it is a valid utf-8 encoding.
Solution :
Count number of bytes and validate one by one, check for corner cases like number of bytes 1 or more than 5.
to understand UTF-8 : https://www.youtube.com/watch?v=MijmeoH9LT4
public class Solution {
public boolean validUtf8(int[] data) {
if(data == null || data.length == 0){
return true;
}
int i = 0;
int len = data.length;
int noOfBytes = 0;
while(i < len){
noOfBytes = byteLength(data[i]);
if(noOfBytes == 1 || noOfBytes >= 5)
return false;
if(noOfBytes > 0)
noOfBytes--;
while(noOfBytes > 0){
i++;
if(i >= len){
return false;
}
if(!isValidContinuation(data[i]))
return false;
noOfBytes--;
}
i++;
}
if(noOfBytes > 0)
return false;
else
return true;
}
public int byteLength(int number){
int count = 0;
while(true){
if((number & 128) > 0){
count++;
number = number << 1;
}else{
break;
}
}
return count;
}
public boolean isValidContinuation(int number){
if(((number & 128) > 0) && ((number << 1) & 128) == 0)
return true;
else
return false;
}
}