It ain’t so much of #DigitalIndia yet. It's just PDF India. For those looking for meaningful, processible data, the gap between the two makes a world of difference. For some government institutions, digital means a PDF or JPEG image of scanned print reports, including spreadsheets and balance sheets. One might argue that nowadays it's possible to extract data from PDFs using technology. But that's like garbage disposal and recycling: the mess shouldn't be created in the first place. Image processing algorithms were not invented for spreadsheets.
A good example of a bad PDF document is the District Census Handbook of 2011. It has reams and reams of pages, with one page showing the first half of a horizontal spreadsheet, titled “Industrial Category of”, and the next page showing the other half, titled “Marginal Workers”. Man, that must be some innovation in bad design.
Switch to a good example of presenting the same data. data.gov.in presents hundreds of datasets from the Census of 2011 in open formats for public access. You can access the data for the same district in CSV format, loadable into Excel. Did you know you could download the Indian Railways timetable in Excel? How sweet!
The salient rule in data collection or presentation must be that, at the raw source, the data is collected in a format that is processible by automation. Human eyes should be needed only for reviews, green/red flags and the exception patterns that the automation throws up. Or for discovering insights of wisdom from "rich experience", which a chip can't discern. The days may have come when chips outsmart elders in experiential wisdom as well, LoL.
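To make the "machines process, humans review the flags" idea concrete, here is a minimal sketch in Python. The column names and numbers are made up for illustration and are not taken from any actual census file. It only shows that once data arrives as CSV, a few lines of code can check every row and hand a person nothing but the exceptions.

```python
import csv
import io

# Hypothetical sample in the spirit of a district-level extract.
# The column names and figures are invented for illustration only.
SAMPLE_CSV = """district,total_workers,main_workers,marginal_workers
Alpha,1000,700,300
Beta,800,500,250
Gamma,600,,100
"""


def flag_exceptions(csv_text):
    """Return (line_number, district, reason) for rows that need human review."""
    flags = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Data rows start at line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        try:
            total = int(row["total_workers"])
            main = int(row["main_workers"])
            marginal = int(row["marginal_workers"])
        except ValueError:
            # An empty or non-numeric cell cannot be checked automatically.
            flags.append((line_number, row["district"], "missing or non-numeric value"))
            continue
        # Consistency rule assumed for this sketch: the parts must add up.
        if main + marginal != total:
            flags.append((line_number, row["district"], "main + marginal != total"))
    return flags


if __name__ == "__main__":
    for flag in flag_exceptions(SAMPLE_CSV):
        print(flag)
```

With the sample above, only the Beta and Gamma rows are reported; the clean row never needs a human eye. Doing the same check on a scanned PDF would first need OCR and table reconstruction, which is exactly the avoidable mess.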
Companies like HowIndiaLives.com, where my friend Ramnath is involved, address this problem by helping their customers and website visitors make visual sense out of the nonsense data in India's public domain. They also present it in beautiful and user-friendly ways and help project stakeholders glean useful insights from data. That the data shouldn't have been nonsensical in the first place, in its raw open form, is the sad fact. That they converted rants like these into a business opportunity is their ingenuity. Raw data must be disseminated in open processible formats, and that should lead to an ecosystem of companies like these, which compete in discovering insights from data and presenting them, without having to spend too much time cleaning it. And beyond cleaning, having to fight inconsistencies between multiple sources of data for the same item of information must, I guess, be another tedious task.
Taking a leaf out of international open data platforms like data.un.org, the open data platform for India is a big leap in this direction for sharing public datasets (though it doesn’t have an option to bulk download data).
On the other hand, some of the government's transactional websites can make life extremely tedious. If you are a high-volume transaction submitter, your life can become miserable, having to submit thousands of records into old-style web forms. Some of them must have been in a cave since AJAX was invented. They can put thousands of person names with option buttons on a single page, expecting the user to scroll down or use the browser's Find, choose one name and then submit. Wwwhaaat!
The format in which data must be submitted to government websites must be pre-defined with digital processing in mind. For a billion-headed Titanosaur like India, it should definitely have scale in mind too. Ideally, it shouldn't even depend on Windows. Kerala, for example, wants to dabble in Linux and it's a good thing. I hope they don't give up like Munich, the city that wanted to run on Linux. But that kind of thinking is good. It may lead at least to the adoption of open formats for seeking data, if not an open-source OS.
Not all is a sad story with Digital India. There are a handful of bright spots of good design that take scale and digital processibility into account. Aadhaar, no doubt, is a beautiful example. The Income Tax website often has some sudden quirky differences between its Java tool and its Excel tool, with one or the other mysteriously concluding that it cannot generate the XML file. But it at least uses XML to upload data. That's a good thing. Even within the Income Tax Department, you may not find the same kind of good design for other tasks, for instance, for applying for non-deduction of TDS. Another of my favorite examples of handling technology at scale carefully is SBI's transition to core banking and its merger with associate banks. It was not about open data or about government per se. But, at its scale, it's truly a project of teaching elephants to dance, and for their size, they did a mighty good job at it. The GSTN must be the next Aadhaar-like unifier, once the initial troubles ease out.
This is the intermission, the end of Part 1.
In Part 2, I describe some of the woes I faced while extracting data from PDFs on the SEBI website. It has a sad ending that wraps up by saying:
Some data dinosaurs have to be taught to dance digitally, because evolution is binary.
You either become distinct or you become extinct.
Lots of technical debris ahead in Part 2. Blissful poets, musicians and other non-tech readers not allowed beyond this point. :-) :-)
-->> Part 2