About

Project Akshara

Hi, my name is Bhuvan, and this is my passion project.

I love reading. In one form or another, I’ve read for as long as I can remember. If I can confidently claim to know a thing or two, or even anything at all, it is because pretty much everything that has happened to me revolves around words and pages. I owe a deep debt of gratitude to these magical squiggly lines that have given me so much.

Somewhere along my reading journey, I came across the concept of the public domain. At the time, I was too dumb to immediately realise its importance. It took a few years for it to sink in just how extraordinary the idea really is. The fact that something like 99% of the written wisdom of our world is available for free, out of copyright, on the internet is not an easy thing to wrap one’s head around. For me, it was like a slow-release drug. But once it finally clicked, it fundamentally changed how I think about knowledge, books, writing, and what it means to read and write.

Around the same time, I came across Project Gutenberg. I think it was when I discovered Blaise Pascal’s Pensées, available for free in a clean and readable format. That was when it hit me just how bloody awesome that website is.

And then one question began to bother me deeply: why isn’t there a Project Gutenberg for India?

There is no shortage of jingoistic rhetoric about the richness of Indian civilisation: how it spans thousands of years, how we have supposedly given the world everything. Here I can’t resist making the plastic surgery and flying planes joke. We constantly beat our chests about being a great nation, the oldest civilisation. And yet, we have done a fucking piss-poor job of preserving our written heritage.

Despite the sheer amount of wisdom accumulated over centuries, much of it is still not easily accessible to anyone genuinely interested in learning about India.

I felt a strong urge to do something about it. The problem was that I’m not a technical person. I’m a coder with a K, not a C. I didn’t even know where to begin.

Then large language models entered our lives. At first, I thought of them like a bad smell wafting in through an open window. Okay, I’m joking. But suddenly, even idiots like me had access to coding tools like Claude Code and Cursor. I started using them to build small websites.

I couldn’t figure out how to build even a terrible, small-scale version of Project Gutenberg. But then an idea popped into my head: if I couldn’t build a site where all these texts were readable, I could at least build a simple directory. A collection of links to Indian literary and historical works on Project Gutenberg, Archive.org, and other places where noble souls had already done the hard work of digitising these old texts.

That project became Dhwani. Dhwani as in the Kannada word for sound.

But while browsing Archive.org, I kept coming across so many extraordinary works across multiple languages, rich with historical and cultural detail. The urge to do something more kept growing.

A couple of colleagues introduced me to OCR, optical character recognition. I started experimenting with Tesseract, but the output was poor. Then, by accident, I came across a tweet announcing a new OCR model from Baidu. And luckily, because the Chinese are generous, they had a free API tier that allowed OCR for up to 3,000 pages a day.

Baidu’s PaddleOCR turned out to be ridiculously good at extracting text from old, horrible, unreadable PDFs. All of a sudden, 70–80% accurate text extraction was feasible.

And so, like any self-respecting Indian, I created three accounts to scam Baidu and started running OCR on my laptop through the night.

But the error rates were still high. Large chunks of text couldn’t be extracted. Tables were a mess. For the first four or five books, I extracted text using PaddleOCR, then manually copy-pasted and reassembled chunks in Gemini’s AI Studio. I did this for a couple of thousand pages. Very quickly, my eyes started hurting. Then Kailash suggested using LLM APIs directly, things like Gemini. Suddenly, accuracy shot up to 90–95%. The downsides were hallucinations, poor formatting, and other issues, but it was still much better than manual grunt work.

After Kailash showed me how to use LLM APIs to extract text, he jokingly said: “That’s one small step for man, one giant leap for an a**hole like you.” It was funny, but also true.

That’s essentially the backstory of this project.

Right now, I’m doing this alone. I’d love to turn this into a volunteer effort someday.

The cheaper API models are reasonably good, but the costs still add up. Ideally, I’d love to use the costlier models like Sonnet and Gemini Pro, but they are prohibitively expensive. Hopefully, they will become cheaper over time, so that I can publish more books here.

I don’t know where this will go, but I’m genuinely excited. There are so many amazing books trapped inside horribly scanned PDFs, just waiting to be read.

You might ask: why do this? Initially, I told myself it would help scholars and researchers. But if I’m honest, that’s not the real reason. The simple reason is this: do I think this should exist? Yes. And that’s it.

Whether anyone visits the site, whether people find it useful, I honestly don’t know. Any other justification feels like a way to masturbate my own ego.

At a very high level, this site is a love letter to the public domain. There is an insane amount of wealth and wisdom freely available there. And we Indians have done a horrible job of protecting our public knowledge commons. I just want to contribute, in a very small way, to making that better.

I don’t harbour grand illusions that I’ll change everything. I’d simply rather do something than nothing.

Another reason comes from a podcast by Venkatesh Rao. He beautifully articulated something I had been thinking about for a while:

Every word I write is probably pregnant with 15 references I don’t recognize that owe to cultural heritage, and maybe one or two I do; I cannot even begin to acknowledge all of them. The right attitude towards it is: maybe I’m adding something personal here, but fundamentally I’m working with an ocean of commons. I am taking a little bit of value out of it, hopefully putting back more value than I take out, taking just enough to sustain myself, but in general not pretending that I own even my part of it.

That framing stuck with me. It’s pretty much how I think about reading and writing: you take from the common pool of wisdom, and you try to add back a little more than you take out.

We live in a moment where books are being banned, authors are being kicked out, atavistic tendencies are on the rise, and hope often feels thin. There are attempts, across the world, to rewrite history itself.

A while ago, I read Salman Rushdie’s Knife. In the book, he recounts remarks he had made at an event:

We are engaged in a world war of stories, a war between incompatible versions of reality, and we need to learn how to fight it.

“World war of stories.” What a line. It hit me like a ton of bricks, and I keep thinking about it. It’s a powerful metaphor that captures the essence of our current moment.

One grandiose hope, if you, the discerning and judgmental reader, will allow me, is that this site acts as a small antidote: a bulwark against people’s fetish for rewriting history to suit their own idiotic purposes. I hope that making our history and culture readable, and preserving it in some form, contributes towards that larger cause.

The final reason, one that will fully expose my spectacular lack of IQ, is simply this: I think words are truly amazing.

Just think about what words actually are. Random sequences of squiggly lines on a piece of paper or a screen. And yet they can unleash profound visions and wild reveries, throw you into deep spirals of contemplation, and hold up a mirror showing you who you truly are, and who we truly are.

That is magical.

None of this would have been possible without my friend Joice and the great Kailash, who showed me how to use these LLM APIs.

Having watched these models progress, I genuinely believe they’ll unlock new use cases around text, especially text extracted through this project. I can imagine AI-assisted books, strange magazines, experiments grounded in historical source material. Those are things I hope to do someday.

But for now, this is a humble start.

Why Akshara?

Akshara was the first word that popped into my head. It’s a Kannada word for “letter.” But when I looked it up on alar.ink, I discovered it also means “eternal.”

ಅಕ್ಷರ (akṣara)
adjective: undecaying; imperishable; not degenerating; eternal.

noun: a letter of the alphabet; a syllable; a sign representing a sound of speech; that which is written; hand-writing; a written message.

I can’t think of a better name than that.

Words are truly eternal.

Want to get involved? Get in touch.