UMF

UMF stands for Universal Message Format.

It is a rather stupid IBM term. Why stupid? Because only IBM uses this format. It is a modified version of XML.

Used for

UMF is used to send data records to II (Identity Insight), and it is this use case that we are going to look at.

Incoming Data

The example data that I will show initially comes from the Phone_Ownership data set (Note: this is rather a virtual data set, as no real data set looks 100% like this one).

CSV Incoming

The data is coming in a CSV format and looks like this.

Serial_number,Phone_Number,PROOF_OF_ID,DOC_ID,NATIONALITY
2473,86-(774)602-7676,I,2925809041280280000,Japan

Depending on how you wish to model this data, you can interpret the columns as follows.

  • Serial Number - Has to be Unique
  • Phone Number - Should be Unique
  • Proof_Of_ID - Not unique i.e. ID Card, Passport
  • DOC_ID - Not unique - may be duplicated across countries
  • Nationality - Not unique

However, the combination of PROOF_OF_ID, DOC_ID and NATIONALITY should be unique.
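The uniqueness rule above can be checked before any UMF is generated. This is a minimal sketch using only the Python standard library. The function name duplicate_keys is my own, and the second and third rows in the sample are made up purely to show a duplicate; only the first row comes from the data set above.

```python
import csv
import io
from collections import Counter

# Columns whose combination should be unique, as described above.
KEY_COLUMNS = ("PROOF_OF_ID", "DOC_ID", "NATIONALITY")


def duplicate_keys(csv_text, key_columns=KEY_COLUMNS):
    """Return every composite key that appears more than once, with its count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(tuple(row[col] for col in key_columns) for row in reader)
    return {key: n for key, n in counts.items() if n > 1}


# First row is the example from above; the other two are hypothetical.
sample = """Serial_number,Phone_Number,PROOF_OF_ID,DOC_ID,NATIONALITY
2473,86-(774)602-7676,I,2925809041280280000,Japan
2474,86-(774)602-7677,I,2925809041280280000,Japan
2475,86-(774)602-7678,P,2925809041280280000,Japan
"""

for key, n in duplicate_keys(sample).items():
    print(f"{key} appears {n} times")
```

An empty result means the combination really is unique across the file.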

UMF Version of this

The UMF version of this data looks like this. It is a single line, and so does not display very nicely.

<UMF_ENTITY><DSRC_CODE>PH_REGISTER</DSRC_CODE><DSRC_ACCT>2473_2925809041280280000</DSRC_ACCT><DSRC_REF>2473_2925809041280280000</DSRC_REF><DSRC_ACTION>A</DSRC_ACTION><ENTITY_TYPE>PHONE</ENTITY_TYPE><NAME><NAME_TYPE>M</NAME_TYPE><FULL_NAME>86-(774)602-7676</FULL_NAME></NAME><NUMBER><NUM_TYPE>PH</NUM_TYPE><NUM_VALUE>86-(774)602-7676</NUM_VALUE></NUMBER><ATTRIBUTE><ATTR_TYPE>EIA_COMMON_KEY</ATTR_TYPE><ATTR_VALUE>2473_86-(774)602-7676</ATTR_VALUE></ATTRIBUTE><NUMBER><NUM_TYPE>NAT_ID</NUM_TYPE><NUM_VALUE>2925809041280280000</NUM_VALUE></NUMBER></UMF_ENTITY>

Re-indented like XML, it looks like this:

<UMF_ENTITY>
    <DSRC_CODE>PH_REGISTER</DSRC_CODE>
    <DSRC_ACCT>2473_2925809041280280000</DSRC_ACCT>
    <DSRC_REF>2473_2925809041280280000</DSRC_REF>
    <DSRC_ACTION>A</DSRC_ACTION>
    <ENTITY_TYPE>PHONE</ENTITY_TYPE>
    <NAME>
        <NAME_TYPE>M</NAME_TYPE>
        <FULL_NAME>86-(774)602-7676</FULL_NAME>
    </NAME>
    <NUMBER>
        <NUM_TYPE>PH</NUM_TYPE>
        <NUM_VALUE>86-(774)602-7676</NUM_VALUE>
    </NUMBER>
    <ATTRIBUTE>
        <ATTR_TYPE>EIA_COMMON_KEY</ATTR_TYPE>
        <ATTR_VALUE>2473_86-(774)602-7676</ATTR_VALUE>
    </ATTRIBUTE>
    <NUMBER>
        <NUM_TYPE>NAT_ID</NUM_TYPE>
        <NUM_VALUE>2925809041280280000</NUM_VALUE>
    </NUMBER>
</UMF_ENTITY>

Which is MUCH easier to read.
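If you need to do this re-indenting often, it can be scripted. A minimal sketch, assuming the one-line UMF record parses as well-formed XML (the records shown here do) and Python 3.9 or later for ET.indent. The function name pretty_umf is my own, and the sample is a shortened version of the record above.

```python
import xml.etree.ElementTree as ET


def pretty_umf(one_line, indent="    "):
    """Re-indent a one-line UMF record without changing its content."""
    root = ET.fromstring(one_line)
    ET.indent(root, space=indent)  # available from Python 3.9
    return ET.tostring(root, encoding="unicode")


# Shortened version of the record shown above.
sample = (
    "<UMF_ENTITY><DSRC_CODE>PH_REGISTER</DSRC_CODE>"
    "<NUMBER><NUM_TYPE>PH</NUM_TYPE>"
    "<NUM_VALUE>86-(774)602-7676</NUM_VALUE></NUMBER></UMF_ENTITY>"
)

print(pretty_umf(sample))
```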

Looking at the XML version (which has only been re-indented), you can see 2 clear parts:

  • Header
  • Data

UMF internals

Each UMF record needs a header that identifies it:

  • Header
    • Source - DSRC_CODE
    • Unique Record ID - DSRC_ACCT
    • Data Operation - DSRC_ACTION
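To make the header part concrete, here is a small sketch that pulls those three fields out of a record. It assumes the header tags are direct children of UMF_ENTITY, as in the example on this page. The function name read_header is my own, and the sample is a shortened version of the record shown earlier.

```python
import xml.etree.ElementTree as ET

# Header tags taken from the list above.
HEADER_TAGS = ("DSRC_CODE", "DSRC_ACCT", "DSRC_ACTION")


def read_header(umf_record):
    """Return the header fields of a UMF record (None if a tag is missing)."""
    root = ET.fromstring(umf_record)
    return {tag: root.findtext(tag) for tag in HEADER_TAGS}


# Shortened version of the record shown earlier.
sample = (
    "<UMF_ENTITY><DSRC_CODE>PH_REGISTER</DSRC_CODE>"
    "<DSRC_ACCT>2473_2925809041280280000</DSRC_ACCT>"
    "<DSRC_ACTION>A</DSRC_ACTION>"
    "<ENTITY_TYPE>PHONE</ENTITY_TYPE></UMF_ENTITY>"
)

for tag, value in read_header(sample).items():
    print(tag, "=", value)
```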

The data part is a little more complex, with new fields as well as attribute info.

In UMF terms some items have pre-built rules

  • Name
  • Id

Items can be marked as either unique identifiers or plain attributes, but of course UMF calls these Number and Attribute.

So the UMF item

<NUMBER>
    <NUM_TYPE>NAT_ID</NUM_TYPE>
    <NUM_VALUE>2925809041280280000</NUM_VALUE>
</NUMBER>

Equates to

This is a Unique ID of type NAT_ID (I am guessing this type means something to II as well).
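The same reading can be applied to a whole record. This sketch lists every NUMBER and ATTRIBUTE item with its type and value, assuming they sit directly under UMF_ENTITY as in the example. The function name typed_items is my own, and the sample is a shortened version of the record shown earlier.

```python
import xml.etree.ElementTree as ET

# For each item tag: the child tags that carry its type and its value.
ITEM_FIELDS = {
    "NUMBER": ("NUM_TYPE", "NUM_VALUE"),
    "ATTRIBUTE": ("ATTR_TYPE", "ATTR_VALUE"),
}


def typed_items(umf_record):
    """Return (kind, type, value) for each NUMBER/ATTRIBUTE in document order."""
    root = ET.fromstring(umf_record)
    items = []
    for element in root:
        if element.tag in ITEM_FIELDS:
            type_tag, value_tag = ITEM_FIELDS[element.tag]
            items.append(
                (element.tag, element.findtext(type_tag), element.findtext(value_tag))
            )
    return items


# Shortened version of the record shown earlier.
sample = (
    "<UMF_ENTITY><DSRC_CODE>PH_REGISTER</DSRC_CODE>"
    "<NUMBER><NUM_TYPE>PH</NUM_TYPE><NUM_VALUE>86-(774)602-7676</NUM_VALUE></NUMBER>"
    "<ATTRIBUTE><ATTR_TYPE>EIA_COMMON_KEY</ATTR_TYPE>"
    "<ATTR_VALUE>2473_86-(774)602-7676</ATTR_VALUE></ATTRIBUTE>"
    "<NUMBER><NUM_TYPE>NAT_ID</NUM_TYPE>"
    "<NUM_VALUE>2925809041280280000</NUM_VALUE></NUMBER></UMF_ENTITY>"
)

for kind, item_type, value in typed_items(sample):
    print(kind, item_type, value)
```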

Known Fields

In the UMF code, have a close look at the NAME element.

<NAME>
    <NAME_TYPE>M</NAME_TYPE>
    <FULL_NAME>86-(774)602-7676</FULL_NAME>
</NAME>

Here we have another example of an internal II code: NAME_TYPE. It controls how II processes the name (II already has some built-in processing for names).

CSV to UMF

This is not integrated (from what I can tell) with I2 in terms of schema building.

Remember that I2 and II need to share the same data (specifically, they need to share the Unique Document ID, DSRC_REF). II, however, needs to process the data its own way.

The complexity in this processing is not in converting the CSV to an XML-like format; that is simple. It is in deciding what should be an attribute and what should be a number.
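To show how simple the conversion itself is, here is a sketch that turns the example CSV row into the one-line UMF record shown earlier. The mapping (PH_REGISTER, PHONE, name type M, the two key formats) is simply copied from that one sample record; it is my reading of the example and not a documented II rule. The helper names _child and row_to_umf are my own.

```python
import csv
import io
import xml.etree.ElementTree as ET


def _child(parent, tag, text=None):
    """Append a child element with optional text and return it."""
    element = ET.SubElement(parent, tag)
    element.text = text
    return element


def row_to_umf(row):
    """Build a one-line UMF record from one CSV row (mapping copied from the sample)."""
    serial, phone, doc_id = row["Serial_number"], row["Phone_Number"], row["DOC_ID"]
    root = ET.Element("UMF_ENTITY")
    # Header part
    _child(root, "DSRC_CODE", "PH_REGISTER")
    _child(root, "DSRC_ACCT", f"{serial}_{doc_id}")
    _child(root, "DSRC_REF", f"{serial}_{doc_id}")
    _child(root, "DSRC_ACTION", "A")
    _child(root, "ENTITY_TYPE", "PHONE")
    # Data part
    name = _child(root, "NAME")
    _child(name, "NAME_TYPE", "M")
    _child(name, "FULL_NAME", phone)
    phone_number = _child(root, "NUMBER")
    _child(phone_number, "NUM_TYPE", "PH")
    _child(phone_number, "NUM_VALUE", phone)
    common_key = _child(root, "ATTRIBUTE")
    _child(common_key, "ATTR_TYPE", "EIA_COMMON_KEY")
    _child(common_key, "ATTR_VALUE", f"{serial}_{phone}")
    national_id = _child(root, "NUMBER")
    _child(national_id, "NUM_TYPE", "NAT_ID")
    _child(national_id, "NUM_VALUE", doc_id)
    return ET.tostring(root, encoding="unicode")


csv_text = """Serial_number,Phone_Number,PROOF_OF_ID,DOC_ID,NATIONALITY
2473,86-(774)602-7676,I,2925809041280280000,Japan
"""

for row in csv.DictReader(io.StringIO(csv_text)):
    print(row_to_umf(row))
```

Building the record with ElementTree rather than string concatenation means any special characters in the data get escaped properly.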

This is something that Pandas may be able to help with, although any form of select distinct query would be applicable.
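As a rough illustration of the select distinct idea, the sketch below counts distinct values per column and suggests NUMBER for columns where every value is distinct and ATTRIBUTE otherwise. It uses only the standard library rather than Pandas. The threshold, the function names and the second and third sample rows are all my own assumptions; II does not prescribe this heuristic, and the final decision still needs a human.

```python
import csv
import io


def distinct_ratio(csv_text):
    """Return distinct-value count divided by row count for each column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    return {col: len({row[col] for row in rows}) / len(rows) for col in rows[0]}


def suggest_umf_kind(ratios, threshold=1.0):
    """Suggest NUMBER for columns at or above the threshold, else ATTRIBUTE."""
    return {
        col: "NUMBER" if ratio >= threshold else "ATTRIBUTE"
        for col, ratio in ratios.items()
    }


# First row is the example from this page; the other two are hypothetical.
sample = """Serial_number,Phone_Number,PROOF_OF_ID,DOC_ID,NATIONALITY
2473,86-(774)602-7676,I,2925809041280280000,Japan
2474,86-(774)602-7677,P,2925809041280280001,Japan
2475,86-(774)602-7678,I,2925809041280280002,China
"""

for column, kind in suggest_umf_kind(distinct_ratio(sample)).items():
    print(column, "->", kind)
```

On a real file a threshold slightly below 1.0 may be more practical, since columns that should be unique often contain a few dirty duplicates.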