Gosh - I think I fell asleep... I tried to write about Test Data !!!
I am presently working on a project that requires some extensive testing. So what, I hear you mutter. Well, this project requires data that varies in its quality.
I have verified that it works when given good data - but what happens when there is missing data... and incorrect data?
I came across a good site which has an interesting data generator, http://www.mockaroo.com - I need people-type data, and this site seems to be able to generate just what I need.
As I am mean (and my employer is meaner) I can only get data in 5K chunks. But I can run the command several times per day.
By appending the output to a file, I can build up records quite quickly. But there is an issue: the ID number restarts with every chunk.
This little one-line command fixes that issue. It drops the original ID column, renumbers every line, and renames the number on the header line to "id".
cat basic.csv | cut -d ',' -f 2- | nl -n ln | sed -e 's/^\([0-9]\+\)[[:space:]]\+/\1,/' -e 's/^1,/id,/' > people.csv
The first 10 lines of people.csv look something like this:
id,first_name,last_name,email,country,ip_address,City,Country,CCard,Race,Company
2,Janice,Gilbert,email@example.com,Russia,188.8.131.52,Pyt-Yakh,Lebanon,3577598273371963,Melanesian,Yakijo
3,Mary,Adams,firstname.lastname@example.org,China,184.108.40.206,Lücheng,Portugal,3586021087796977,Black or African American,Jamia
4,Frank,Lopez,email@example.com,Belarus,220.127.116.11,Dashkawka,United States,5602233033286350,Pakistani,Vinte
5,Marie,Martin,firstname.lastname@example.org,Russia,18.104.22.168,Boguchar,China,5411850139268049,Blackfeet,Twimbo
6,Joyce,Price,email@example.com,Sweden,22.214.171.124,Horred,Mongolia,5100176874230168,Creek,Skippad
7,William,Riley,firstname.lastname@example.org,Colombia,126.96.36.199,Guapi,Indonesia,374622899402654,Navajo,Yadel
8,Rose,Cunningham,email@example.com,Colombia,188.8.131.52,Buenaventura,Indonesia,3581970954021361,Indonesian,Gigaclub
9,Beverly,Kim,firstname.lastname@example.org,China,184.108.40.206,Meilisi,Philippines,5602237467769419,Apache,Jatri
10,Lois,Coleman,email@example.com,China,220.127.116.11,Zhihe,Jamaica,5577254622341886,Seminole,Twimm
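The same renumbering can also be sketched in Python. This is only a rough alternative, not what I actually ran: the function name renumber is made up for illustration, and it assumes each downloaded chunk starts with its own header row, so repeated headers are dropped rather than numbered. Unlike the shell version, the first data row gets id 1 instead of 2.

```python
import csv
import io


def renumber(lines):
    """Replace the first column of concatenated CSV chunks with a running id."""
    reader = csv.reader(lines)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    next_id = 1
    for row in reader:
        if not row:
            continue  # skip blank lines
        if header is None:
            # First row seen is treated as the header
            header = row
            writer.writerow(["id"] + row[1:])
        elif row == header:
            # Assumption: an identical row is a repeated header from an appended chunk
            continue
        else:
            writer.writerow([next_id] + row[1:])
            next_id += 1
    return out.getvalue()


if __name__ == "__main__":
    sample = [
        "id,first_name,country\n",
        "1,Janice,United States\n",
        "id,first_name,country\n",
        "1,Mary,China\n",
    ]
    print(renumber(sample), end="")
```

For a real file, something like renumber(open("basic.csv", newline="")) should work, since csv.reader accepts any iterable of lines.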
So where is the missing data?
I do not just need missing/bad data - I also need good data (so I can test how accurately the program is working). So I thought: let's make an obfuscator, but one that I can control.
fuzz my fuzzy friend
This is my small module to "fuzz" the data in a controlled yet flexible manner.
__author__ = 'tim'

from optparse import OptionParser
import random

parser = OptionParser()
parser.add_option("-f", "--file", dest="filename",
                  help="file to read", metavar="FILE")
parser.add_option("-d", "--data", type=str, dest="fields",
                  help="data fields to extract i.e. 2,3,4")
parser.add_option("-l", "--loss", type=str, dest="loss",
                  help="percentage of data to lose per output field, i.e. 50,20,90")
(options, args) = parser.parse_args()

# Percentage of values to blank out, one entry per requested field
loss = [int(a) for a in options.loss.split(',')]
# 1-based column numbers to copy to the output
fields = [int(a) for a in options.fields.split(',')]

first_line = True
try:
    with open(options.filename) as ifp:
        for line in ifp:
            # Strip the line ending so the last column does not carry a newline
            data_line = line.rstrip('\r\n').split(',')
            new_line = ''
            for pos, wanted in enumerate(fields):
                data = data_line[wanted - 1]
                if pos > 0:
                    new_line += ","
                # Always keep the header row; otherwise keep a value only
                # when the random draw falls outside the loss percentage
                if first_line or random.random() * 100 >= loss[pos]:
                    new_line += data
            print(new_line)
            first_line = False
except FileNotFoundError:
    print("File %s not found" % options.filename)
Using the people.csv file shown earlier (and assuming the script is saved as fuzz.py, a name picked here just for the examples):
If you want the first 3 columns with 0% data loss:
python fuzz.py -f people.csv -d 1,2,3 -l 0,0,0
If you want fields 1, 4 and 5 with 0%, 50% and 70% loss respectively:
python fuzz.py -f people.csv -d 1,4,5 -l 0,50,70
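To make the effect of the loss percentages a bit more concrete, here is a small sketch of the per-field rule pulled out into a function. The name fuzz_row and the seeded random.Random are my own additions for illustration (the script above uses the module-level random and has no seed option); a fixed seed just makes a run repeatable.

```python
import random


def fuzz_row(row, fields, loss, rng):
    """Return the selected fields, blanking each one with its own loss percentage."""
    out = []
    for wanted, pct in zip(fields, loss):
        value = row[wanted - 1]        # fields are 1-based, as in the script
        if rng.random() * 100 < pct:   # the draw falls inside the loss percentage
            value = ""                 # so this value is "lost"
        out.append(value)
    return out


if __name__ == "__main__":
    rng = random.Random(42)  # hypothetical fixed seed, so the same blanks come out each run
    row = ["2", "Janice", "Gilbert", "email@example.com", "Russia"]
    # Field 1 is always kept, field 4 is blanked about half the time, field 5 about 70%
    print(",".join(fuzz_row(row, [1, 4, 5], [0, 50, 70], rng)))
```

A loss of 0 never blanks a value and a loss of 100 always does, because random() returns a number from 0 up to (but not including) 1.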
Please feel free to use/mod/hack this code as you want. I may add some extra checks (number of fields is valid).
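The extra checks could look something like the sketch below. It is not part of the script yet; check_options and its messages are hypothetical, and n_columns is assumed to come from counting the fields in the header line of the input file.

```python
def check_options(fields, loss, n_columns):
    """Return a list of problems; an empty list means the options look usable."""
    problems = []
    if len(fields) != len(loss):
        problems.append("need one loss value per field: got %d fields and %d loss values"
                        % (len(fields), len(loss)))
    for f in fields:
        if not 1 <= f <= n_columns:
            problems.append("field %d is outside 1..%d" % (f, n_columns))
    for p in loss:
        if not 0 <= p <= 100:
            problems.append("loss %d is not a percentage between 0 and 100" % p)
    return problems


if __name__ == "__main__":
    # people.csv has 11 columns, so field 12 is reported as invalid
    for problem in check_options([1, 4, 12], [0, 50], 11):
        print(problem)
```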
Next job: write a scoring script.