Before we begin the data migration, you'll need to clearly understand what UTF-8 is. UTF-8 is the 8-bit encoding of Unicode.
It is a variable-width encoding and a strict superset of ASCII: every character in the ASCII character set is available in UTF-8 at the same code point value. One Unicode character can occupy 1, 2, 3, or 4 bytes in UTF-8 encoding.
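The variable width is easy to see from Oracle itself, where LENGTHB returns a string's length in bytes while LENGTH counts characters. A minimal sketch, assuming a database whose character set is AL32UTF8 (the literals are illustrative):

```sql
-- Byte lengths of single characters in an AL32UTF8 (UTF-8) database.
-- LENGTH counts characters; LENGTHB counts bytes.
SELECT LENGTH('a')   AS chars_ascii, LENGTHB('a')   AS bytes_ascii,  -- 1 char, 1 byte
       LENGTH('é')   AS chars_latin, LENGTHB('é')   AS bytes_latin,  -- 1 char, 2 bytes
       LENGTH('中')  AS chars_cjk,   LENGTHB('中')  AS bytes_cjk     -- 1 char, 3 bytes
FROM dual;
```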
Oracle’s support for Unicode
Oracle started supporting Unicode as a database character set in version 7. Unicode characters can be stored in an Oracle database in two ways.
One way is to create a Unicode database, which lets you store UTF-8 encoded characters in the SQL CHAR datatypes (CHAR, VARCHAR2, CLOB, and LONG). For this you must specify a UTF-8 character set (UTF8 or AL32UTF8) as the database character set when creating the database. The other way is to use the SQL NCHAR datatypes (NCHAR, NVARCHAR2, and NCLOB), which store Unicode data in the national character set regardless of the database character set.
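The database character set is fixed at creation time. As a sketch, the relevant clauses of a (much longer) Oracle CREATE DATABASE statement would look like this, with all other clauses omitted and the database name illustrative:

```sql
-- Fragment of an Oracle CREATE DATABASE statement (most clauses omitted).
-- AL32UTF8 is Oracle's name for the full UTF-8 encoding; AL16UTF16 is the
-- default national character set used by the NCHAR datatypes.
CREATE DATABASE orcl
    CHARACTER SET AL32UTF8
    NATIONAL CHARACTER SET AL16UTF16;
```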
SQL Server support for Unicode
SQL Server has supported Unicode as a database character set since version 7.0. It uses the UCS-2 character encoding to store Unicode data.
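Because UCS-2 stores every character in two bytes, an N-prefixed Unicode string occupies twice as many bytes as it has characters, which you can verify directly:

```sql
-- UCS-2 stores each character as a 2-byte code unit.
SELECT LEN(N'abc')        AS char_count,   -- 3
       DATALENGTH(N'abc') AS byte_count;   -- 6
```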
Now let's move on to how to migrate an Oracle database containing UTF-8 data to SQL Server using SSMA (SQL Server Migration Assistant).
Unicode database solution
If the source is a Unicode database, Oracle is configured to store multi-byte strings in VARCHAR2/CHAR columns and variables. In this case, use customized mappings to map those character types to the Unicode types (nchar, nvarchar) in SQL Server 2005/2008. Otherwise, non-ASCII strings can be corrupted during data migration.
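For example, a customized mapping would turn an Oracle table like the first statement below into the SQL Server table in the second (table and column names are illustrative):

```sql
-- Oracle source: multi-byte UTF-8 strings stored in a CHAR-family column.
CREATE TABLE customers (name VARCHAR2(100));

-- SQL Server target after a customized VARCHAR2 -> nvarchar mapping.
-- The default varchar mapping could not hold arbitrary Unicode data.
CREATE TABLE customers (name NVARCHAR(100));
```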
Unicode data type solution
For the Unicode datatype solution, source strings declared with the national character types (NVARCHAR2 and NCHAR) are automatically mapped to nvarchar and nchar. The large object type NCLOB is automatically mapped to nvarchar(max). So there is no need to change the default mappings for these datatypes.
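Under these defaults, a conversion would look roughly like this (names are illustrative; the exact converted lengths depend on SSMA's length semantics):

```sql
-- Oracle source using the national character types.
CREATE TABLE docs (
    title NVARCHAR2(50),
    code  NCHAR(10),
    body  NCLOB
);

-- SQL Server target produced by SSMA's default mappings.
CREATE TABLE docs (
    title NVARCHAR(50),
    code  NCHAR(10),
    body  NVARCHAR(MAX)
);
```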
Customizing Data Type Mappings
You can customize type mappings at the project level, object category level (such as all tables), or object level. Settings are inherited from the higher level unless they are overridden at a lower level. For example, if you map VARCHAR2 to nvarchar at the project level, all objects in the project will use this mapping unless you customize it at the object or category level.
The following procedure shows how to map data types at the project, database, or object level:
To customize data type mapping for the whole project, open the Project Settings dialog box:
a. On the Tools menu, select Project Settings.
b. In the left pane, select Type Mapping. The type mapping chart and buttons appear in the right pane.
Or, to customize data type mapping at the database, table, view, or stored procedure level, select the database, object category, or object in Oracle Metadata Explorer:
c. In Oracle Metadata Explorer, select the folder or object to customize.
d. In the right pane, click the Type Mapping tab.
Author : Ajay(MSFT), SQL Developer Engineer, Microsoft
Reviewed by : Smat (MSFT), SQL Escalation Services, Microsoft
I work at Microsoft CSS and I've had quite a bit of trouble with data migration from Oracle to SQL Azure. I was scrambling to finish my migration demo projects. Many thanks for creating this concise yet precise and informative article for the data migration folks out here in the field. Thank you, Mr A!
SSMA has converted my (Unicode) VARCHAR2(N) to NVARCHAR(4xN). What I want is the same as Oracle, i.e. if I have a field that allows up to 10 Unicode characters (VARCHAR2(10)). Isn't that NVARCHAR(10)? Or in SQL Server can I only specify the length in bytes? Which would mean no real control. My NVARCHAR(40) could have 40 single byte characters or only 10 4-byte characters.
This is by design.
Conversion from Oracle's NVARCHAR2 -> SQL NVARCHAR doubles the length because, in SQL Server, NVARCHAR is a count of bytes rather than a count of characters. In the case of UTF-16, we have to account for surrogate pairs (a second 2 bytes of character data whose presence is indicated by the first two bytes).
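The surrogate-pair behavior described here can be seen directly in T-SQL. A sketch, using U+1F600 as an arbitrary supplementary character (under the default, non-SC collations, LEN counts UTF-16 code units):

```sql
-- Build U+1F600 (a supplementary character) from its UTF-16 surrogate pair.
DECLARE @c NVARCHAR(2) = NCHAR(0xD83D) + NCHAR(0xDE00);
SELECT LEN(@c)        AS code_units,  -- 2: one character, two UTF-16 code units
       DATALENGTH(@c) AS byte_count;  -- 4: one character, four bytes
```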
I don't believe NVARCHAR length is the count of bytes. Try this:
CREATE TABLE tmp_unicode (col1 NVARCHAR(6));
INSERT INTO tmp_unicode (col1) VALUES (N'ердийн');
SELECT col1, LEN(col1), DATALENGTH(col1) FROM tmp_unicode;
DROP TABLE tmp_unicode;
Prior to Oracle 11, nvarchar2 column lengths are measured in characters. SQL Server nvarchar columns are (and always have been) measured in bytes.
We must account for the possibility of surrogate pairs (see en.wikipedia.org/.../Surrogate_pair), which add up to two extra bytes per character in some of the various UTF encodings.
The only way we could prevent this is to actually scan the data at schema conversion time to determine there are no surrogate pairs present.
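Such a scan could be sketched as below, assuming SQL Server 2012 or later, where supplementary-character (SC) collations make LEN count a surrogate pair as a single character; the table, column, and collation names are illustrative:

```sql
-- Rows containing surrogate pairs: under an SC collation LEN counts a pair
-- as one character, so the two counts differ exactly when a pair is present.
SELECT *
FROM docs
WHERE LEN(title COLLATE Latin1_General_100_CI_AS_SC)
    < LEN(title COLLATE Latin1_General_100_CI_AS);
```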
I don't understand. You said "SQL Server nvarchar columns are (and always have been) measured in bytes". My SQL demonstrates that is not the case - it inserts 12 bytes in a NVARCHAR(6) field. Can you explain that please.
Can you explain what happens in case of surrogate pairs?