Saturday, November 18, 2017

Teaching dinosaurs to dance digitally

It ain’t so much of #DigitalIndia yet. It's just PDF India. For those looking for meaningful, processible data, the gap between the two is huge. For some government institutions, digital means a PDF or JPEG image of a scanned printed report, spreadsheets and balance sheets included. One might argue that nowadays it's possible to extract data from PDFs using technology. But that's like garbage disposal and recycling. The mess shouldn't be created in the first place. Image processing algorithms were not invented for spreadsheets.

A good example of a bad PDF document is the District Census Handbook of 2011. It has reams and reams of pages, with one page showing the first half of a horizontal spreadsheet, titled “Industrial Category of”, and the next page showing the other half, titled “Marginal Workers”. Man, that must be some innovation in bad design.
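To make the cleanup cost concrete, here is a minimal sketch of what someone has to do after extracting the two halves of such a split table: stitch them back together on a shared row key. The sample rows and column names (`row_id`, `area`, `total_workers`, `marginal_workers`) are made up for illustration and are not the Handbook's actual layout.

```python
import csv
import io

# Hypothetical extracts of the two half-pages; not real Census data.
LEFT_HALF = """row_id,area,total_workers
1,Village A,120
2,Village B,80
"""

RIGHT_HALF = """row_id,marginal_workers
1,30
2,25
"""

def stitch(left_text, right_text, key="row_id"):
    """Join the left and right halves of a split table on a shared key."""
    # Index the right-hand half by key so each left row can find its partner.
    right = {r[key]: r for r in csv.DictReader(io.StringIO(right_text))}
    merged = []
    for row in csv.DictReader(io.StringIO(left_text)):
        other = right.get(row[key])
        if other is None:
            # A missing partner usually means the extraction went wrong.
            raise ValueError(f"no right-hand half for {key}={row[key]}")
        merged.append({**row, **other})
    return merged

for row in stitch(LEFT_HALF, RIGHT_HALF):
    print(row)
```

None of this would be needed if the table had been published as one wide sheet in the first place.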

Switch to a good example of presenting the same data. presents hundreds of datasets from the Census of 2011 in open formats for public access. You can access the data for the same district in CSV format, loadable into Excel. Did you know you could download the Indian Railways timetable in Excel? How sweet!

The salient rule in data collection or presentation must be that, at the raw source, data is captured in a format that automation can process. Human eyes should be needed only for reviews, for green/red flags and for the exception patterns that the automation throws up. Or for discovering insights of wisdom from "rich experience", which a chip can't discern. The days may already have come when chips outsmart elders in experiential wisdom as well, LoL.
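As a rough sketch of that division of labour, assume the data arrives as a CSV with the made-up columns below. A few lines of Python can then do the boring checking and hand only the red flags to a human. The sample rows and the consistency rule (main plus marginal workers should equal the total) are assumptions for illustration, not an actual Census schema.

```python
import csv
import io

# Hypothetical sample data; column names and values are invented.
SAMPLE_CSV = """district,total_workers,main_workers,marginal_workers
Alpha,1000,700,300
Beta,500,450,100
Gamma,800,,200
"""

def flag_rows(csv_text):
    """Return (district, reason) pairs for rows a human should review."""
    flags = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            total = int(row["total_workers"])
            main = int(row["main_workers"])
            marginal = int(row["marginal_workers"])
        except ValueError:
            # Blank or non-numeric cells cannot be checked automatically.
            flags.append((row["district"], "missing or non-numeric value"))
            continue
        if main + marginal != total:
            flags.append((row["district"], "main + marginal != total"))
    return flags

for district, reason in flag_rows(SAMPLE_CSV):
    print(f"RED FLAG {district}: {reason}")
```

Rows that pass never reach a human. With a scanned PDF, even this trivial check first needs an extraction step that can itself introduce errors.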

Companies like , where my friend Ramnath is involved, address this problem by helping their customers and website visitors make visual sense of the non-sense data in India's public domain. They also present it in beautiful and user-friendly ways and help project stakeholders glean useful insights from data. The sad fact is that the data shouldn't have been nonsensical in the first place, in its raw open form. That they converted rants like these into a business opportunity is their ingenuity. If raw data were disseminated in open processible formats, it should lead to an ecosystem of companies like these, which compete in discovering insights from data and presenting them, without having to spend too much time cleaning it. And it's not just the cleaning. Having to fight inconsistencies between multiple sources of data for the same item of information must, I guess, be another tedious task.

Taking a leaf out of international open data like, the open data platform for India is a big leap in this direction for sharing public datasets (though it doesn't have an option to bulk-download data).

On the other hand, some of the government's transactional websites can make life extremely tedious. If you are a high-volume transaction submitter, your life can become miserable, having to submit thousands of records into old-style web forms. Some of them must have been in a cave since AJAX was invented. They can put thousands of person names with option buttons on a single page, expecting the user to scroll down or use the browser's Find, choose one name and then submit. Wwwhaaat!

The format in which data is submitted to government websites must be pre-defined with digital processing in mind. For a billion-headed Titanosaur like India, it should definitely have scale in mind too. Ideally, it shouldn't even depend heavily on Windows, let alone require it. Kerala, for example, wants to dabble in Linux and it's a good thing. I hope they don't give up like Munich, the city that wanted to run on Linux. But that kind of thinking is good. It may lead at least to the adoption of open formats for seeking data, if not an open-source OS.

Not all is a sad story with Digital India. There are a handful of bright spots of good design that take scale and digital processibility into account. Aadhaar, no doubt, is a beautiful example. The Income Tax website often throws up sudden quirky differences between its Java tool and its Excel tool, with mysterious conclusions that the XML file cannot be generated. But it at least uses XML to upload data. That's a good thing. Even within the Income Tax Department, you may not find the same kind of good design for other tasks, for instance, applying for non-deduction of TDS. Another of my favorite examples of handling technology carefully at scale is SBI's transition to core banking and its merger with associate banks. It was not about open data or about government per se. But, at its scale, it's truly a project of teaching elephants to dance, and for their size, they did a mighty good job of it. The GSTN must be the next Aadhaar-like unifier, once the initial troubles ease out.

This is the intermission, the end of Part 1. 

In Part 2, I mention some of the woes I faced while extracting data from PDFs on the SEBI website. It has a sad ending that rounds up by saying:

Some data dinosaurs have to be taught to dance digitally, because evolution is binary. 
You either become distinct or you become extinct.  

Lots of technical debris ahead in Part 2. Blissful poets, musicians and other non-tech readers not allowed beyond this point. :-) :-)

-->> Part 2 

