Fuzzing data with Python

Test Data

Gosh - I think I fell asleep... I tried to write about Test Data !!!

I am presently working on a project that requires some extensive testing. So what, I hear you mutter - well, this project requires data that varies in its quality.

I have verified that it works when given good data - but what happens when there is missing data... or incorrect data?

Helpful site

I came across a good site with an interesting data generator: http://www.mockaroo.com - I need people-type data, and this site seems to be able to generate just what I need.

As I am mean (and my employer is meaner) I can only get data in 5K chunks. But I can run the command several times per day.....

By appending the output to a file I can build up records quite quickly. But there is an issue: the ID number restarts at 1 in each chunk.

This little one-line command fixes that issue:

cat basic.csv | cut -d ',' -f 2- | nl -n ln  | sed -e 's/[[:space:]]\+/,/g' -e 's/^1,/id,/' > people.csv
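To see what each stage does, here is the same pipeline run on a tiny hand-made stand-in for a Mockaroo chunk (two records; the real export has eleven columns). Note that nl numbers the header line as 1, which is why the sed rule rewrites a leading "1," back to "id," - and that the sed stage also turns any spaces inside a field into commas, which is where oddities like Black,or,African,American in the sample data come from.

```shell
# a tiny stand-in for one downloaded chunk (ids restart at 1 every time)
printf 'id,name\n1,Ann\n2,Bob\n' > basic.csv

# cut drops the restarting id column, nl renumbers every line (header = 1),
# sed squashes the nl padding into commas and turns line 1's "1," into "id,"
cat basic.csv | cut -d ',' -f 2- | nl -n ln | sed -e 's/[[:space:]]\+/,/g' -e 's/^1,/id,/' > people.csv

cat people.csv
# id,name
# 2,Ann
# 3,Bob
```

Because nl keeps counting across everything fed to it, appending several chunks before this step gives one continuous id sequence.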

The first few lines of people.csv look something like this:

id,first_name,last_name,email,country,ip_address,City,Country,CCard,Race,Company
2,Janice,Gilbert,jgilbert0@odnoklassniki.ru,Russia,200.90.119.195,Pyt-Yakh,Lebanon,3577598273371963,Melanesian,Yakijo
3,Mary,Adams,madams1@youtube.com,China,35.18.31.5,Lücheng,Portugal,3586021087796977,Black,or,African,American,Jamia
4,Frank,Lopez,flopez2@blog.com,Belarus,179.220.143.145,Dashkawka,United,States,5602233033286350,Pakistani,Vinte
5,Marie,Martin,mmartin3@va.gov,Russia,161.72.215.42,Boguchar,China,5411850139268049,Blackfeet,Twimbo
6,Joyce,Price,jprice4@artisteer.com,Sweden,105.57.32.169,Horred,Mongolia,5100176874230168,Creek,Skippad
7,William,Riley,wriley5@shutterfly.com,Colombia,138.135.110.89,Guapi,Indonesia,374622899402654,Navajo,Yadel
8,Rose,Cunningham,rcunningham6@friendfeed.com,Colombia,54.204.6.137,Buenaventura,Indonesia,3581970954021361,Indonesian,Gigaclub
9,Beverly,Kim,bkim7@twitter.com,China,181.185.247.178,Meilisi,Philippines,5602237467769419,Apache,Jatri
10,Lois,Coleman,lcoleman8@imageshack.us,China,152.211.220.240,Zhihe,Jamaica,5577254622341886,Seminole,Twimm

So where is the missing data?

I do not just need missing/bad data - I also need good data (so I can test how accurately the program is working)... so I thought - let's make an obfuscator... but one that I can control.

fuzz my fuzzy friend

This is my small module to "fuzz" the data in a controlled yet flexible manner.

__author__ = 'tim'

from optparse import OptionParser
import random

parser = OptionParser()
parser.add_option("-f", "--file", dest="filename",
                  help="file to read", metavar="FILE")
parser.add_option("-d", "--data", dest="fields",
                  help="data fields to extract, i.e. 2,3,4")
parser.add_option("-l", "--loss", dest="loss",
                  help="percentage of data to lose per output field, i.e. 50,20,90")
(options, args) = parser.parse_args()
if not (options.filename and options.fields and options.loss):
    parser.error("-f, -d and -l are all required")
loss = [int(a) for a in options.loss.split(',')]
fields = [int(a) for a in options.fields.split(',')]
first_line = True

try:
    ifp = open(options.filename)
    for line in ifp:
        data_line = line.rstrip('\n').split(',')
        new_line = ''
        for pos, wanted in enumerate(fields):
            data = data_line[wanted - 1]
            if pos > 0:
                new_line += ","
            # keep the value unless the random draw falls inside the loss
            # percentage; the header row (first_line) is always kept intact
            if random.random() * 100 >= loss[pos] or first_line:
                new_line += data
        print(new_line)
        first_line = False
    ifp.close()
except FileNotFoundError:
    print("File %s not found" % options.filename)
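The heart of the fuzzing is one probabilistic comparison: keep a value only when a random draw clears the requested loss percentage (it is easy to get this comparison backwards, which silently inverts the meaning of -l). Pulled out into a standalone function - fuzz_field is my name, not part of the script - the idea is easy to sanity-check statistically:

```python
import random

def fuzz_field(value, loss_percent, rng=random):
    """Return value, or '' with probability loss_percent / 100."""
    return value if rng.random() * 100 >= loss_percent else ""

# quick statistical sanity check: with 30% loss, roughly 70% of
# 10,000 values should survive (seeded so the run is repeatable)
rng = random.Random(42)
kept = sum(1 for _ in range(10000) if fuzz_field("x", 30, rng))
print(kept)  # roughly 7000
```

Checking the two edge cases is just as quick: a loss of 0 always keeps the value, and a loss of 100 always drops it, because random() is in [0, 1).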

Usage

Using the people.csv file shown earlier (assuming the script is saved as fuzz.py)...

If you want the first 3 columns with 0% data loss:

python fuzz.py -f people.csv -d 1,2,3 -l 0,0,0

If you want fields 1, 4 and 5 with 0%, 50% and 70% loss respectively:

python fuzz.py -f people.csv -d 1,4,5 -l 0,50,70

Reuse

Please feel free to use/mod/hack this code as you want. I may add some extra checks (for example, that the number of loss values matches the number of requested fields).
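A sketch of what such a check might look like - the function name and messages are mine, not part of the script:

```python
def validate(fields, loss, num_columns):
    """Bail out early if the -d / -l options cannot work together."""
    if len(fields) != len(loss):
        raise SystemExit("need exactly one loss value per field")
    for f in fields:
        if not 1 <= f <= num_columns:
            raise SystemExit("field %d is out of range" % f)
    for pct in loss:
        if not 0 <= pct <= 100:
            raise SystemExit("loss %d is not a percentage" % pct)

validate([1, 4, 5], [0, 50, 70], 11)  # people.csv has 11 columns: passes silently
```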

Next task....

Write a scoring script .....
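One way a scoring script could start - entirely a sketch of my own, with hypothetical names - is to compare the fuzzed output against the original and report how much data actually went missing per column, which also double-checks that the requested loss percentages are being honoured:

```python
def observed_loss(original_rows, fuzzed_rows):
    """Percentage of values lost per column, comparing fuzzed rows
    against the originals. Assumes both lists are aligned row-for-row
    and exclude the header line."""
    cols = len(original_rows[0])
    lost = [0] * cols
    for orig, fuzz in zip(original_rows, fuzzed_rows):
        for i in range(cols):
            # a loss is a value that was present before and is empty now
            if orig[i] and not fuzz[i]:
                lost[i] += 1
    return [100 * n / len(original_rows) for n in lost]

orig = [["1", "Janice", "Gilbert"], ["2", "Mary", "Adams"]]
fuzz = [["1", "", "Gilbert"], ["2", "", ""]]
print(observed_loss(orig, fuzz))  # [0.0, 100.0, 50.0]
```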