You are throwing a party. It is a big one. You need to create a list of 200+ guests that has to go to your organizers, planners, printers, registries, and all the other behind-the-scenes magicians who make these types of events happen. Your contacts are all over the place - some in your personal email, some in your work email, and some in that semi-organized contacts app on your phone. You start with the easiest - email. You bulk-paste the addresses into Excel. But they are unformatted... All you want is a clean spreadsheet with first name, last name, and email address. Is that too much to ask? Didn't they promise that computers in the future would be able to read your mind? If you know your way around Excel, you will do some advanced acrobatics with split columns and a combo of "search & replace". Perhaps you are an advanced user and can create something like **LEFT(e_address, FIND("separator", e_address) - 1)**. If you don't know how to Excel, your only option is to spend about 60 minutes manually retyping things. But if you can code in one of the programming languages, you can use regex!
With regex, things like this take 3 lines of code (depending on which language you are using) and run in under 0.03 seconds (depending on which machine you are running it on). Magic: /^(\w+)\..*\.(\w+)@/. Regex is meant for problems like this - it is a Swiss Army knife for slicing and manipulating text data. In a nutshell, regular expressions, or regex for short, are a set of symbols for finding patterns in strings. However, this definition hugely understates the universality and utility of regex. Regex is language agnostic[1], so it can be used across codebases, disciplines, and domains. Regex is almost* universal (*there are multiple flavors), since most popular programming languages either have a regex library or have regex built right into the language. And the applications are endless: verify whether input fits a text pattern, find text that matches the pattern within a larger body of text, replace text matching the pattern with other text or with rearranged bits of the matched text, split a block of text into a list of subtexts, and shoot yourself in the foot.
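To make the guest-list cleanup concrete, here is a minimal sketch in Python using the standard `re` module. It assumes addresses shaped like first.last@domain (optionally with middle parts), so the pattern is a slight variation on the one above that also accepts addresses without a middle part. The guest addresses and the helper name `extract_names` are made up for illustration.

```python
import re

# Assumed address shape: first.last@domain or first.middle.last@domain.
# This pattern is an illustration, not the only way to slice an address.
NAME_PATTERN = re.compile(r"^(\w+)\.(?:.*\.)?(\w+)@")


def extract_names(email):
    """Return (first, last, email), or None if no dot-separated name is found."""
    match = NAME_PATTERN.match(email)
    if match is None:
        return None
    first, last = match.groups()
    return first.capitalize(), last.capitalize(), email


# Hypothetical guest addresses, invented for this example.
guests = [
    "ada.lovelace@example.com",
    "grace.b.hopper@example.org",
    "partyplanner@example.net",  # no first.last shape, so no names to extract
]
rows = [extract_names(guest) for guest in guests]
for row in rows:
    print(row)

# The other everyday jobs from the list above, with the same module.
# EMAIL is a deliberately loose pattern, nowhere near a full email validator.
EMAIL = r"[\w.]+@[\w.]+\.\w+"
is_valid = re.fullmatch(EMAIL, guests[0]) is not None                 # verify
found = re.findall(EMAIL, f"Write to {guests[0]} or {guests[1]}")     # find
swapped = re.sub(r"^(\w+)\.(\w+)@", r"\2.\1@", guests[0])             # rearrange
parts = re.split(r"[;,]\s*", "ada; grace, alan")                      # split
```

Addresses that do not follow the assumed shape come back as `None`, so they can be set aside and handled by hand rather than silently mangled.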
But like with everything in life, there is a catch. Regexes are hard. They are hard to read. They are hard to write. They are hard to document. They are also hard to master: while the rules of regex are finite and straightforward, learning them without useful applications is like learning Latin for modern diplomacy. And even if you once became extremely proficient in regex, if you don't use it often you will have to look things up again every time. So yes, it will take 0.03 seconds to run, but it can take you a good 30 minutes to first figure out what the regular expression should be.
For all their simple construction, regular expressions have a history full of metamorphoses. Neuroscientist Warren S. McCulloch and logician Walter Pitts worked on logical calculus models to describe how the human nervous system works[2]. Mathematician Stephen Kleene extended these models with an algebraic notation that he called regular sets, or regular expressions[3]. Computer scientist Ken Thompson implemented the idea of regular expressions inside the text editor ‘ed’ to deal with mundane tasks. The result was almost magical - users could apply “wildcard” matching pretty much anywhere in the operating system that text search was required[4]. The last boost that brought regular expressions to the commoners was given by Larry Wall when he made regex a core feature of his text-oriented programming language, Perl. In fact, regex made Perl the “duct tape” of 1990s web development[5]. Today regular expressions can easily be found as part of the core libraries of most programming languages. But no matter how many times you have written regex and how experienced a developer you are, the unfortunate fact is that if you don't use regex often, you end up re-learning regular expressions over and over again.
Yet for the first time in the history of computer programming, writing regex might not be a chore. A new wave of AI tools comes to the rescue, letting regular humans use machine learning to generate regex from a plain-language description. While tooling to make regex easier is nothing new (there is no shortage of regex generators and aided regex constructors)[6], what is new and different this time is that tools like autoregex.xyz allow you to type in plain English what you want, such as "extract first and last name from email address" or "replace celsius with fahrenheit", and get a straightforward translation into regex[7]. Who would have thought that in the age of overhyped AI capabilities promising everything from curing cancer to helping you babysit your kids, a headline you will likely see is "Machine Learning will help you conquer regex".
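A generated pattern is still worth a quick look before it touches the real guest list. The sketch below, in Python, runs the pattern quoted in footnote 7 against two made-up addresses and prints what each group captures. The sample addresses and variable names are assumptions for illustration only.

```python
import re

# Pattern quoted in footnote 7 as the autoregex.xyz response.
generated = re.compile(r"([a-zA-Z]+)@([a-zA-Z]+)\.([a-zA-Z]+)")

# Hypothetical addresses, invented for this check.
samples = ["jane@example.com", "john.smith@example.com"]

captures = {}
for sample in samples:
    match = generated.search(sample)
    # groups() holds the three captured pieces, in order of appearance
    captures[sample] = match.groups() if match else None
    print(sample, "->", captures[sample])
```

For these samples the first group holds only the run of letters directly before the @, followed by the domain name and its ending, so a dotted address like the second one would still need a little extra work to yield a first name.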
The AI-nization of regex is one instance of the new wave of computer-aided code writing. In 2021 GitHub released Copilot - an AI pair programmer tool that turns natural language prompts into coding suggestions. In 2022 AWS announced CodeWhisperer - an ML-powered coding companion (a similar thing, but in AWS). Both tools let developers describe in plain English what the code is supposed to do, and the tool recommends a code snippet that fits the description. Both tools sit on top of a branch of machine learning - Large Language Models (LLMs). LLMs are "trained" on large text-based datasets, and the trained models can recognise written requests and generate things like articles and dialogue. And while they have shocked and awed us over the last couple of years, in my opinion computer-aided programming, including the aforementioned autoregex, is probably one of the most pragmatic AI use cases out there. Some might point out that translating plain English to code is the essence of a programmer's job, and hence an AI code whisperer might diminish programmers' thinking ability just like Google Maps destroyed our navigation abilities[9]. However, for years classic computer science books have argued that the best way to write clear code is to first declare in plain English what it is supposed to do, and have encouraged "pen & paper programming". Hence typing instructions to our co-pilots to achieve clarity of code is probably what Uncle Bob[8] wanted us to do all along.
"We make our tools and thereafter they shape us" famously noted Marshall McLuhan. He was an academic celebrity who shared a lot of thoughts on how media shapes our environment, our outlook and eventually us. (his other famous quote is "the medium is the message"). He argued when we introduce a new medium in society - it changes how we feel, how we relate to the world and in the end, changes how we behave and think. We, humans, are tool builders: we build things that make our life safer, easier, and more pleasant. For a really long time the work of programming computers required patience, attention, ability to retain mental models, systematically debug programs when they go wrong and creatively translate tasks to machine language. When StackOverflow10 came along it changed programming, letting developers around the world solve problems, seek snippets of code for reuse, improve their own code, and discuss technical concepts and freeing our overloaded memory box. StackOverflow flipped the industry from "remembering" to "being skilled in asking questions and searching for answers"11. Did they change how programmers think in code? Maybe, but StackOverflow defintely made programming more accessible to more people.
AI code generation tools mark the next milestone in the professional evolution of code writing. Our concerns around these tools come from the standard Man-vs-Machine theme. If the mighty regex is going to fall to automation, is that what is going to happen to other parts of the programmer's job? It is a bit ironic that one of the first jobs to become AI-automated belongs to the profession that introduced automation in the first place. And while my AI pair programmer may eventually take my job, I feel a bit of relief thinking that I will never have to construct a regex again and can finally get a comprehensive guest list to the planners.
[1] Almost, as scripting languages tend to have their own regular expression flavor built in.
[2] The year is 1943 and their idea looks like this: the neuron allows only binary states, i.e., 0s and 1s, so it is called a binary activated neuron. These neurons are connected by directed weighted paths. A connection path can be excitatory or inhibitory: weights are excitatory when positive and inhibitory when negative. The neuron fires if the net input to the neuron reaches the threshold. The threshold is set so that inhibition is absolute, because any non-zero inhibitory input will prevent the neuron from firing. Here Y is a McCulloch-Pitts neuron, which can receive signals from any other neurons, W denotes the weights of the neuron, and Yin is its net input. The McCulloch-Pitts neuron has the activation function f(Yin) = 1 if Yin ≥ θ, and f(Yin) = 0 if Yin < θ.
[3] The year is 1956 and the definition looks like this: "By a regular expression, we shall mean a particular way of expressing a regular set of tables starting with single-table sets and applying zero or more times the three operations (passing from E and F to E ∨ F, EF or E∗F)."
[4] The year is 1968 and the expression looks like g/pinky/p, i.e. g/regular expression/p, where g and p are modifiers: g tells the editor to search for the pattern throughout the document and p prints the results to the screen. Global regular expression print, or, as we call it now, grep.
[5] The year is 1987 and the expression looks like $foo =~ m/fee|fie|foe|fum/
[6] A variety of tools let one experiment with regex, for example https://regex-generator.olafneumann.org/ and regexr.com: you paste text, point at the parts you want, and get the actual regular expression generated.
[7] For the curious, the response from autoregex.xyz is ([a-zA-Z]+)@([a-zA-Z]+)\.([a-zA-Z]+)
[8] Robert Cecil Martin (colloquially known as Uncle Bob) is an American software engineer and author. He is a co-author of the Agile Manifesto and is famous for writing the book Clean Code, about keeping code manageable.
[9] Numerous books, like Wayfinding: The Science and Mystery of How Humans Navigate the World by M.R. O’Connor, Pinpoint: How GPS Is Changing Technology, Culture, and Our Minds by Greg Milner, and Never Lost Again: The Google Mapping Revolution That Sparked New Industries and Augmented Our Reality by Bill Kilday, attempt to measure the impact of GPS on our lives, showing a significant impact on our spatial memory.
[10] StackOverflow is a question-and-answer website for professional and enthusiast programmers.
[11] The anonymously published manual "Copying and Pasting from Stack Overflow" is billed as the quintessence of software development techniques: mastering this art will not only make you the most desired developer in the market, but it will transform the craziest deadline into "Consider it done, Sir". https://www.goodreads.com/book/show/29437996-copying-and-pasting-from-stack-overflow