identity insights

Identity Insights (II) needs its data in UMF, an internal IBM format that is almost identical to XML.

Here is a very simple piece of XML:

<dogs>
    <hound>
        <name>Daisy</name>
    </hound>
    <hound>
        <name>Benson</name>
    </hound>
</dogs>

In UMF this looks like:

<umf_entity><name>Daisy</name></umf_entity>
<umf_entity><name>Benson</name></umf_entity>
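
As a rough illustration of the shape change shown above (one record per line, each wrapped in umf_entity), here is a minimal Python sketch using only the standard library. It is not how IBM's tooling works internally; the function name `to_umf_lines` is made up for this example.

```python
import xml.etree.ElementTree as ET

def to_umf_lines(xml_text, entity_tag="umf_entity"):
    """Return one single-line element per child of the XML root."""
    root = ET.fromstring(xml_text)
    lines = []
    for child in root:
        child.tag = entity_tag      # rename e.g. <hound> to <umf_entity>
        child.tail = None           # drop whitespace that followed the record
        for node in child.iter():
            # Remove indentation-only text so the record fits on one line.
            if node.text is not None and not node.text.strip():
                node.text = None
            if node is not child and node.tail is not None and not node.tail.strip():
                node.tail = None
        lines.append(ET.tostring(child, encoding="unicode"))
    return lines

dogs = """<dogs>
    <hound>
        <name>Daisy</name>
    </hound>
    <hound>
        <name>Benson</name>
    </hound>
</dogs>"""

for line in to_umf_lines(dogs):
    print(line)
```

Run against the dogs example, this prints the two umf_entity lines shown above.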

Converting XML to UMF

Because UMF is so similar to XML, converting between them is straightforward. In fact, IBM ships a tool for this called xutil.

Example

This is an XML file describing a person. It has some fields in it that are specific to II (more on this later).

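```xml
<?xml version="1.0"?>
<PEOPLE>
  <PERSON>
    <DSRC_CODE>NATIONAL_IDENTITY</DSRC_CODE>
    <DSRC_ACCT>OMN-100000017</DSRC_ACCT>
    <DSRC_REF>25061-2-122261538</DSRC_REF>
    <DSRC_ACTION>A</DSRC_ACTION>
    <NAME><NAME_TYPE>M</NAME_TYPE><FULL_NAME>RUBEL ADUD</FULL_NAME></NAME>
    <NAME><NAME_TYPE>M</NAME_TYPE><FULL_NAME>روبال ادود</FULL_NAME></NAME>
    <ATTRIBUTE><ATTR_TYPE>DOB</ATTR_TYPE><ATTR_VALUE>1985-05-10</ATTR_VALUE></ATTRIBUTE>
    <ATTRIBUTE><ATTR_TYPE>DOD</ATTR_TYPE><ATTR_VALUE>1900-01-01</ATTR_VALUE></ATTRIBUTE>
    <ATTRIBUTE><ATTR_TYPE>OCCUPATION</ATTR_TYPE><ATTR_VALUE>عامل تنظيف / مباني عامة</ATTR_VALUE></ATTRIBUTE>
    <ATTRIBUTE><ATTR_TYPE>GENDER</ATTR_TYPE><ATTR_VALUE>M</ATTR_VALUE></ATTRIBUTE>
    <NUMBER><NUM_TYPE>CIVIL_ID</NUM_TYPE><NUM_VALUE>100000017</NUM_VALUE><VALID_FROM_DT>2014-10-16</VALID_FROM_DT><VALID_THRU_DT>2016-10-13</VALID_THRU_DT></NUMBER>
    <NUMBER><NUM_TYPE>PP</NUM_TYPE><NUM_VALUE>BA0956348</NUM_VALUE><NUM_LOCATION>001</NUM_LOCATION><VALID_FROM_DT>2014-04-21</VALID_FROM_DT><VALID_THRU_DT>2019-04-20</VALID_THRU_DT></NUMBER>
  </PERSON>
  <PERSON>
    <DSRC_CODE>NATIONAL_IDENTITY</DSRC_CODE>
    <DSRC_ACCT>OMN-100000017</DSRC_ACCT>
    <DSRC_REF>25061-2-122261538</DSRC_REF>
    <DSRC_ACTION>A</DSRC_ACTION>
    <NAME><NAME_TYPE>M</NAME_TYPE><FULL_NAME>RUBEL ADUD</FULL_NAME></NAME>
    <NAME><NAME_TYPE>M</NAME_TYPE><FULL_NAME>روبال ادود</FULL_NAME></NAME>
    <ATTRIBUTE><ATTR_TYPE>DOB</ATTR_TYPE><ATTR_VALUE>1985-05-10</ATTR_VALUE></ATTRIBUTE>
    <ATTRIBUTE><ATTR_TYPE>DOD</ATTR_TYPE><ATTR_VALUE>1900-01-01</ATTR_VALUE></ATTRIBUTE>
    <ATTRIBUTE><ATTR_TYPE>OCCUPATION</ATTR_TYPE><ATTR_VALUE>عامل تنظيف / مباني عامة</ATTR_VALUE></ATTRIBUTE>
    <ATTRIBUTE><ATTR_TYPE>GENDER</ATTR_TYPE><ATTR_VALUE>M</ATTR_VALUE></ATTRIBUTE>
    <NUMBER><NUM_TYPE>CIVIL_ID</NUM_TYPE><NUM_VALUE>100000017</NUM_VALUE><VALID_FROM_DT>2014-10-16</VALID_FROM_DT><VALID_THRU_DT>2016-10-13</VALID_THRU_DT></NUMBER>
    <NUMBER><NUM_TYPE>PP</NUM_TYPE><NUM_VALUE>BA0956348</NUM_VALUE><NUM_LOCATION>001</NUM_LOCATION><VALID_FROM_DT>2014-04-21</VALID_FROM_DT><VALID_THRU_DT>2019-04-20</VALID_THRU_DT></NUMBER>
  </PERSON>
</PEOPLE>
```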

To convert this we use the xutil command 

    xutil xu -owide  -t person  < t.xml

This yields

```text
<PERSON><DSRC_CODE>NATIONAL_IDENTITY</DSRC_CODE><DSRC_ACCT>OMN-100000017</DSRC_ACCT><DSRC_REF>25061-2-122261538</DSRC_REF><DSRC_ACTION>A</DSRC_ACTION><NAME><NAME_TYPE>M</NAME_TYPE><FULL_NAME>RUBEL ADUD</FULL_NAME></NAME><NAME><NAME_TYPE>M</NAME_TYPE><FULL_NAME>روبال ادود</FULL_NAME></NAME><ATTRIBUTE><ATTR_TYPE>DOB</ATTR_TYPE><ATTR_VALUE>1985-05-10</ATTR_VALUE></ATTRIBUTE><ATTRIBUTE><ATTR_TYPE>DOD</ATTR_TYPE><ATTR_VALUE>1900-01-01</ATTR_VALUE></ATTRIBUTE><ATTRIBUTE><ATTR_TYPE>OCCUPATION</ATTR_TYPE><ATTR_VALUE>عامل تنظيف / مباني عامة</ATTR_VALUE></ATTRIBUTE><ATTRIBUTE><ATTR_TYPE>GENDER</ATTR_TYPE><ATTR_VALUE>M</ATTR_VALUE></ATTRIBUTE><NUMBER><NUM_TYPE>CIVIL_ID</NUM_TYPE><NUM_VALUE>100000017</NUM_VALUE><VALID_FROM_DT>2014-10-16</VALID_FROM_DT><VALID_THRU_DT>2016-10-13</VALID_THRU_DT></NUMBER><NUMBER><NUM_TYPE>PP</NUM_TYPE><NUM_VALUE>BA0956348</NUM_VALUE><NUM_LOCATION>001</NUM_LOCATION><VALID_FROM_DT>2014-04-21</VALID_FROM_DT><VALID_THRU_DT>2019-04-20</VALID_THRU_DT></NUMBER></PERSON>
<PERSON><DSRC_CODE>NATIONAL_IDENTITY</DSRC_CODE><DSRC_ACCT>OMN-100000017</DSRC_ACCT><DSRC_REF>25061-2-122261538</DSRC_REF><DSRC_ACTION>A</DSRC_ACTION><NAME><NAME_TYPE>M</NAME_TYPE><FULL_NAME>RUBEL ADUD</FULL_NAME></NAME><NAME><NAME_TYPE>M</NAME_TYPE><FULL_NAME>روبال ادود</FULL_NAME></NAME><ATTRIBUTE><ATTR_TYPE>DOB</ATTR_TYPE><ATTR_VALUE>1985-05-10</ATTR_VALUE></ATTRIBUTE><ATTRIBUTE><ATTR_TYPE>DOD</ATTR_TYPE><ATTR_VALUE>1900-01-01</ATTR_VALUE></ATTRIBUTE><ATTRIBUTE><ATTR_TYPE>OCCUPATION</ATTR_TYPE><ATTR_VALUE>عامل تنظيف / مباني عامة</ATTR_VALUE></ATTRIBUTE><ATTRIBUTE><ATTR_TYPE>GENDER</ATTR_TYPE><ATTR_VALUE>M</ATTR_VALUE></ATTRIBUTE><NUMBER><NUM_TYPE>CIVIL_ID</NUM_TYPE><NUM_VALUE>100000017</NUM_VALUE><VALID_FROM_DT>2014-10-16</VALID_FROM_DT><VALID_THRU_DT>2016-10-13</VALID_THRU_DT></NUMBER><NUMBER><NUM_TYPE>PP</NUM_TYPE><NUM_VALUE>BA0956348</NUM_VALUE><NUM_LOCATION>001</NUM_LOCATION><VALID_FROM_DT>2014-04-21</VALID_FROM_DT><VALID_THRU_DT>2019-04-20</VALID_THRU_DT></NUMBER></PERSON>
```

Note: we have "lost" the PEOPLE wrapper because I specified -t PERSON, which I think means "only take the PERSON stanza".

XML to UMF File

Putting the data into a file is simply a matter of redirection.

However, each line needs to be a UMF_ENTITY rather than a PERSON, so we rename the tag with sed:

xu -o wide -t PERSON < t.xml | sed -e 's/PERSON/UMF_ENTITY/g' 
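
One thing to keep in mind is that the sed expression replaces every occurrence of the string PERSON, including any that happen to appear inside a data value. If that is a concern, a small filter that rewrites only the opening and closing tags is one alternative. This is a sketch, and `rename_record_tag` is a hypothetical helper, not part of II.

```python
import re

def rename_record_tag(line, old="PERSON", new="UMF_ENTITY"):
    """Rename only <old> and </old> tags, leaving element values untouched."""
    pattern = re.compile(r"<(/?)" + re.escape(old) + r">")
    return pattern.sub(lambda m: "<" + m.group(1) + new + ">", line)

sample = "<PERSON><FULL_NAME>PERSON OF INTEREST</FULL_NAME></PERSON>"
print(rename_record_tag(sample))
```

To use it as a filter in place of sed, loop over `sys.stdin` and write each converted line to `sys.stdout`.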

Data Storage

II stores the data in a database (usually DB2 but Oracle is also supported).

Because we are developing new rules, we need to clear out the existing data between test runs:

TRUNCATE TABLE SYSTEM_SEQUENCE IMMEDIATE;
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('DSE_LOG_ID', 1, 100);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('ADDRESS_ID', 1, 1000);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('ATTRIBUTE_ID', 1, 1000);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('EMAIL_ADDR_ID', 1, 1000);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('NAME_ID', 1, 1000);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('NUMS_ID', 1, 1000);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('ROLE_ID', 1, 10);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('SEARCH_ID', 1, 10);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('SEARCH_RESULT_ID', 1, 10);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('ENTITY_ID', 1, 1000);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('NCOA_BATCH_ID', 1, 10);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('DSRC_ACCT_ID', 1, 1000);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('ATTR_TYPE_ID', 110, 1);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('NUM_TYPE_ID', 103, 1);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('ER_HISTORY_ID', 1, 1000);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('UMF_EXCEPT_ID', 1, 10);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('UMF_LOG_ID', 1, 1000);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('UMF_LOAD_GROUP_ID', 1, 5);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('UMF_LOAD_SUM_ID', 1, 5);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('UMF_SUM_DOCUMENT_ID', 1, 5);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('UMF_SUM_EXCEPTION_ID', 1, 5);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('UMF_SUM_MATCHLOG_ID', 1, 10);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('UMF_SUM_QUALITY_ID', 1, 10);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('UMF_SUM_RESOLUTION_ID', 1, 10);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('FORCED_RESOLVE_ID_VALUE', 1, 1);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('DISCLOSED_RELATIONS_ID', 1, 10);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('GEM_EVENT_ID', 1, 1000);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('GEM_EVENT_SITUATION_ID', 1, 1000);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('APP_ACTIVITY_HISTORY_ID', 10000, 10);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('APP_INBOX_ID', 10000, 10);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('ACTIVITY_ID', 1, 1000);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('SEP_ROLES_ID', 1, 1000);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('SEP_RELATIONS_ID', 1, 1000);
INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) VALUES ('SEP_CONFLICT_ID', 1, 1000);
TRUNCATE TABLE DQM_GENERICS IMMEDIATE;
TRUNCATE TABLE ADDRESS IMMEDIATE;
TRUNCATE TABLE ATTRIBUTE IMMEDIATE;
TRUNCATE TABLE EMAIL_ADDR IMMEDIATE;
TRUNCATE TABLE NAME IMMEDIATE;
TRUNCATE TABLE NUMS IMMEDIATE;
TRUNCATE TABLE DISCLOSED_RELATIONS IMMEDIATE;
TRUNCATE TABLE UMF_EXCEPT IMMEDIATE;
TRUNCATE TABLE UMF_LOAD_GROUP IMMEDIATE;
TRUNCATE TABLE UMF_LOAD_SUM IMMEDIATE;
TRUNCATE TABLE UMF_LOG IMMEDIATE;
TRUNCATE TABLE SEARCH IMMEDIATE;
TRUNCATE TABLE SEARCH_RESULT IMMEDIATE;
TRUNCATE TABLE UMF_SUM_DOCUMENT IMMEDIATE;
TRUNCATE TABLE UMF_SUM_EXCEPTION IMMEDIATE;
TRUNCATE TABLE UMF_SUM_MATCHLOG IMMEDIATE;
TRUNCATE TABLE UMF_SUM_QUALITY IMMEDIATE;
TRUNCATE TABLE UMF_SUM_RESOLUTION IMMEDIATE;
TRUNCATE TABLE ENTITY IMMEDIATE;
TRUNCATE TABLE DSRC_ACCT IMMEDIATE;
TRUNCATE TABLE ER_ACCT_SCORE IMMEDIATE;
TRUNCATE TABLE ER_DETAIL IMMEDIATE;
TRUNCATE TABLE ER_ENTITY_SCORE IMMEDIATE;
TRUNCATE TABLE ER_ENTITY_STATE IMMEDIATE;
TRUNCATE TABLE ER_HISTORY IMMEDIATE;
TRUNCATE TABLE ER_RELOCATION IMMEDIATE;
TRUNCATE TABLE INCOMPLETE_RESOLVE IMMEDIATE;
TRUNCATE TABLE ER_FORCED_LOG IMMEDIATE;
TRUNCATE TABLE ROLE IMMEDIATE;
TRUNCATE TABLE SEP_RELATIONS IMMEDIATE;
TRUNCATE TABLE SEP_ROLES IMMEDIATE;
TRUNCATE TABLE SEP_CONFLICT IMMEDIATE;
TRUNCATE TABLE SEP_CONFLICT_REL IMMEDIATE;
COMMIT;
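
Since the reset script is long and repetitive, it can also be generated from two lists. The sketch below assumes the statement shapes used in the script above; `build_reset_sql` is a made-up name and only a few sequences and tables are listed for brevity.

```python
def build_reset_sql(sequences, tables):
    """Build the reset script: reseed SYSTEM_SEQUENCE, then empty each data table."""
    lines = ["TRUNCATE TABLE SYSTEM_SEQUENCE IMMEDIATE;"]
    for name, next_value, cache_size in sequences:
        lines.append(
            "INSERT INTO SYSTEM_SEQUENCE (SEQUENCE_NAME,NEXT_SEQUENCE,CACHE_SIZE) "
            "VALUES ('{}', {}, {});".format(name, next_value, cache_size)
        )
    for table in tables:
        lines.append("TRUNCATE TABLE {} IMMEDIATE;".format(table))
    lines.append("COMMIT;")
    return "\n".join(lines)

# Only a few entries are shown here; the full lists are in the script above.
sequences = [("DSE_LOG_ID", 1, 100), ("ENTITY_ID", 1, 1000), ("ATTR_TYPE_ID", 110, 1)]
tables = ["NAME", "NUMS", "ENTITY"]
print(build_reset_sql(sequences, tables))
```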

Python Code to Build Entries

This is my Python builder; frankly, working with XML and UMF by hand is painful.

You need to be a little careful, as the generated UMF must match the UMF data definitions inside the II server.

If you are developing a new UMF importer, I suggest looking at the Console UMF Error screen (http://172.16.109.143:13510/console/UMFExceptionTab.do), as it gives a more readable and understandable error message.

from nested_dict import nested_dict
from pprint import pprint

import collections



class person(object):

    def __init__(self):
        self.namecode = collections.namedtuple('namecode', 'SHORT FULL')
        self.attrs  = collections.namedtuple('attrs', 'DOB JOB GENDER')
        self.numbrss = collections.namedtuple('numbrss', 'CIVIL PASSPORT')

        self.data = nested_dict()
        self.data['DSRC_CODE']='NATIONAL_IDENTITY'
        self.data['DSRC_ACCT']='p'
        self.data['DSRC_REF']='ref'
        self.data['DSRC_TIM']='ref'
        self.data['DSRC_ACTION']='A'

        self.data['NAME'][0]['NAME_TYPE'] = 'M'
        self.data['NAME'][0]['FULL_NAME'] = ''
        self.data['NAME'][1]['NAME_TYPE'] = 'M'
        self.data['NAME'][1]['FULL_NAME'] = ''
        self.data['ATTRIBUTE'][0]['ATTR_TYPE'] = 'DOB'
        self.data['ATTRIBUTE'][0]['ATTR_VALUE'] = ''

        self.data['ATTRIBUTE'][1]['ATTR_TYPE'] = 'OCC'
        self.data['ATTRIBUTE'][1]['ATTR_VALUE'] = ''

        self.data['ATTRIBUTE'][2]['ATTR_TYPE'] = 'GENDER'
        self.data['ATTRIBUTE'][2]['ATTR_VALUE'] = ''

        self.data['NUMBER'][0]['NUM_TYPE'] = 'NAT_ID'
        self.data['NUMBER'][0]['NUM_VALUE'] = ''

        self.data['NUMBER'][1]['NUM_TYPE'] = 'PP'
        self.data['NUMBER'][1]['NUM_VALUE'] = ''
        self.data['NUMBER'][1]['NUM_LOCATION'] = ''

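    def get_tuple_id(self, named_tupple, tup):
        # Return the position of field `tup` in the namedtuple's field list.
        # Raise instead of returning None, which would silently create a bad key.
        try:
            return named_tupple._fields.index(tup)
        except ValueError:
            raise KeyError("Key {} not found".format(tup))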

    def rec_id(self,n):
        self.data['DSRC_ACCT']='p'+str(n)
        self.data['DSRC_REF']='ref'+str(n)


    def dump(self):
        for keys_as_tuple, value in sorted(self.data.items_flat()):
            print("{} {}".format(len(keys_as_tuple),keys_as_tuple[0]))

    def name(self,idx,value):
        self.data['NAME'][idx]['FULL_NAME']=value

    def attr(self,idx,value):
        i=self.get_tuple_id(self.attrs,idx)
        self.data['ATTRIBUTE'][i]['ATTR_VALUE']=value

    def num(self,idx,value):
        i=self.get_tuple_id(self.numbrss,idx)
        self.data['NUMBER'][i]['NUM_VALUE']=value

    def to_xml(self,):
        last_key_num=''
        last_key=''
        level_1_codes=["DSRC_CODE","DSRC_ACCT","DSRC_REF","DSRC_ACTION"]
        for k in level_1_codes:

            for keys_as_tuple, value in sorted(self.data.items_flat()):
                #if len(keys_as_tuple)==1 and keys_as_tuple[0]==k:
                if  keys_as_tuple[0] ==k:
                    print('<{0}>{1}</{0}>'.format(keys_as_tuple[0],value))
                    #print("Level1 key is "+keys_as_tuple[0])
                    break
                    #print("%-20s == %r" % (keys_as_tuple, value))


        for keys_as_tuple, value in sorted(self.data.items_flat()):
            if len(keys_as_tuple) == 3 and len(value):
                 key_num='{}{}'.format(keys_as_tuple[0],keys_as_tuple[1])
                 if last_key_num != key_num:
                     if len(last_key_num):
                         print('</{}>'.format(last_key))
                         print('<{}>'.format(keys_as_tuple[0]))
                         last_key_num=key_num
                         last_key=keys_as_tuple[0]
                     else:
                        print('<{}>'.format(keys_as_tuple[0]))
                        last_key=keys_as_tuple[0]
                        last_key_num=key_num
                 print("<{0}>{1}</{0}>".format (keys_as_tuple[2], value))
        print('</{}>'.format(last_key))

if __name__=="__main__":

    people=[]

    p = person()
    p.name(0,'TIM SEED')
    p.name(1, 'TIMOTHY SEED')
    p.attr("DOB",'1965/12/29')
    p.attr("GENDER", 'M')
    p.attr("JOB", 'Computer')
    p.num("CIVIL",'123123123')
    p.num("PASSPORT", 'P123123123')
    people.append(p)

    p = person()
    p.name(0,'TIM  SEED')
    p.name(1, 'TIMOTHY SEED')
    p.attr("DOB",'1965/12/29')
    p.attr("GENDER", 'M')
    p.attr("JOB", 'Computer')
    p.num("CIVIL",'666666444')
    p.num("PASSPORT", 'P666666999')
    people.append(p)

    p = person()
    p.name(0,'J  SEED')
    p.name(1, 'Juliet SEED')
    p.attr("DOB",'1971/2/27')
    p.attr("GENDER", 'F')
    p.attr("JOB", 'Admin')
    p.num("CIVIL",'123123444')
    p.num("PASSPORT", 'P123123999')
    people.append(p)


    #i need to make this an XML File
    print("<PEOPLE>")
    n=0
    for p in people:
        print("<PERSON>")
        p.rec_id(n)
        n+=1
        p.to_xml()
        print("</PERSON>")
    print("</PEOPLE>")



#people[0].dump()
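
The builder above prints values without escaping them, so a name containing & or < would produce invalid XML. As an alternative sketch using only the standard library, xml.etree.ElementTree can build the same PERSON structure and take care of escaping. The `build_person` and `add_group` functions are illustrative names; the tag names are the ones used above.

```python
import xml.etree.ElementTree as ET

def add_group(parent, group_tag, fields):
    """Append <group_tag> to parent with one child per (tag, value) pair."""
    group = ET.SubElement(parent, group_tag)
    for tag, value in fields:
        ET.SubElement(group, tag).text = value
    return group

def build_person(n, names, attributes, numbers):
    """Build one PERSON element. Groups with an empty value are skipped entirely."""
    person = ET.Element("PERSON")
    header = [
        ("DSRC_CODE", "NATIONAL_IDENTITY"),
        ("DSRC_ACCT", "p{}".format(n)),
        ("DSRC_REF", "ref{}".format(n)),
        ("DSRC_ACTION", "A"),
    ]
    for tag, value in header:
        ET.SubElement(person, tag).text = value
    for full_name in names:
        if full_name:
            add_group(person, "NAME", [("NAME_TYPE", "M"), ("FULL_NAME", full_name)])
    for attr_type, attr_value in attributes:
        if attr_value:
            add_group(person, "ATTRIBUTE", [("ATTR_TYPE", attr_type), ("ATTR_VALUE", attr_value)])
    for num_type, num_value in numbers:
        if num_value:
            add_group(person, "NUMBER", [("NUM_TYPE", num_type), ("NUM_VALUE", num_value)])
    return person

people = ET.Element("PEOPLE")
people.append(build_person(
    0,
    ["TIM SEED", "TIMOTHY SEED"],
    [("DOB", "1965/12/29"), ("OCC", "Computer"), ("GENDER", "M")],
    [("NAT_ID", "123123123"), ("PP", "P123123123")],
))
print(ET.tostring(people, encoding="unicode"))
```

Whether the element order produced here matters to xutil or the II server is not something I have checked, so treat it as a starting point.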

To generate the data

python per_det.py > t.xml

To convert and load

rm test*
/home/db2inst1/IBM/ISII/bin/xutil -o wide -t PERSON < /media/sf_MacDoc/t.xml | sed -e 's/PERSON/UMF_ENTITY/g' > t.umf
./pipeline -d -c pipeline.ini -n test -f t.umf

You should see something like this:

09/23 14:19:36 [test:559101728] NOTE: Finished Processing [file:/home/db2inst1/tmp/t.umf] :
09/23 14:19:36 [test:559101728] NOTE:  Total Records Processed      [3]
09/23 14:19:36 [test:559101728] NOTE:  Records this Run             [3]
09/23 14:19:36 [test:559101728] NOTE:    Processed                  [3]
09/23 14:19:36 [test:559101728] NOTE:    Processed w/ Info          [0]
09/23 14:19:36 [test:559101728] NOTE:    Unhandled                  [0]
09/23 14:19:36 [test:559101728] NOTE:    Empty Lines                [0]
09/23 14:19:36 [test:559101728] NOTE:    Bad XML                    [0]
09/23 14:19:36 [test:559101728] NOTE:    (retries)                  [0]
09/23 14:19:36 [test:559101728] DBUG: Nodes stopped successfully.
09/23 14:19:36 [test:559101728] DBUG: Populating cache for sequence ['UMF_LOAD_SUM_ID']
09/23 14:19:36 [test:559101728] DBUG: Populating cache for sequence ['UMF_SUM_DOCUMENT_ID']
09/23 14:19:36 [test:559101728] DBUG: Populating cache for sequence ['UMF_SUM_MATCHLOG_ID']
09/23 14:19:36 [test:559101728] DBUG: Populating cache for sequence ['UMF_SUM_RESOLUTION_ID']
09/23 14:19:36 [test:559101728] DBUG: Populating cache for sequence ['UMF_SUM_QUALITY_ID']

3 records loaded - no issues with the data.
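
The pipeline summary reports counts for "Empty Lines" and "Bad XML", so it can save a round trip to check the UMF file locally before loading it. The sketch below only tests that each line is well-formed XML with the expected root tag; it knows nothing about the UMF data definitions in the II server, and its categories are my own rather than the pipeline's. `check_umf_lines` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def check_umf_lines(lines, expected_root="UMF_ENTITY"):
    """Count good, empty, malformed, and wrongly-rooted lines in UMF text."""
    counts = {"ok": 0, "empty": 0, "bad_xml": 0, "wrong_root": 0}
    for line in lines:
        line = line.strip()
        if not line:
            counts["empty"] += 1
            continue
        try:
            root = ET.fromstring(line)
        except ET.ParseError:
            counts["bad_xml"] += 1
            continue
        if root.tag == expected_root:
            counts["ok"] += 1
        else:
            counts["wrong_root"] += 1
    return counts

sample = [
    "<UMF_ENTITY><DSRC_CODE>NATIONAL_IDENTITY</DSRC_CODE></UMF_ENTITY>",
    "",
    "<UMF_ENTITY><DSRC_CODE>NATIONAL_IDENTITY</UMF_ENTITY>",      # mismatched tag
    "<PERSON><DSRC_CODE>NATIONAL_IDENTITY</DSRC_CODE></PERSON>",  # tag not renamed
]
print(check_umf_lines(sample))
```

For a real file, pass an open file object, for example `check_umf_lines(open("t.umf", encoding="utf-8"))`.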

Entity Resolution

Entity resolution is when one or more entities of the same type (e.g. Immigration, Civil, Watchlist) match other entities of that same type. Resolving entities means that the matching entities are combined into one.

It does not mean that an Entity from Immigration will resolve with an Entity from Civil.

To do this you need to define a Relationship

A Relationship (Resolving part1)

This is when two data records from two different entities appear to have similarities. However, just to make things more complex, this is done with a Role.

A Role

This is assigned to an Entity - so if you have a CIVIL entity create a CIVIL_ROLE, IMMIGRATION should also have an IMMIGRATION_ROLE.

Now back to resolving entities

Resolving part 2

With each entity having a role, we can now specify which roles to combine.

IMMIGRATION_ROLE AND CIVIL_ROLE

These hold very similar data records.

BUS_ROLE and IMMIGRATION_ROLE, however, may not have any attributes in common, so they cannot be used to combine/resolve entities.
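
To make the idea concrete, here is a toy Python model of roles and the role pairs that are allowed to resolve. It is purely conceptual and is not how II stores or configures roles; the entity type name BUS and the helper `can_resolve` are assumptions made for the example.

```python
# Hypothetical model of the idea, not II configuration.
ENTITY_ROLES = {
    "CIVIL": "CIVIL_ROLE",
    "IMMIGRATION": "IMMIGRATION_ROLE",
    "BUS": "BUS_ROLE",  # assumed entity type behind BUS_ROLE
}

# Role pairs we have declared as combinable (order does not matter).
RESOLVABLE_ROLE_PAIRS = {
    frozenset(["IMMIGRATION_ROLE", "CIVIL_ROLE"]),
}

def can_resolve(type_a, type_b):
    """Entities of the same type may resolve; across types only via a declared role pair."""
    if type_a == type_b:
        return True
    pair = frozenset([ENTITY_ROLES[type_a], ENTITY_ROLES[type_b]])
    return pair in RESOLVABLE_ROLE_PAIRS

print(can_resolve("CIVIL", "IMMIGRATION"))
print(can_resolve("BUS", "IMMIGRATION"))
```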