The New Cabalists (2nd ed.)

Preamble: The Cabalist Conundrum

As initially introduced, large language models (LLMs) like ChatGPT revisit the Cabalist project of computing knowledge from numbers supposedly inherent to letters: massive corpora of documents and datasets are tokenised, morphed into neural networks, and then into semantic ones serving as interpreters and editors. Their sudden (and unexpected) success with end users' peripheral activities has been accompanied by a flurry of boutiques pitching the alchemists' confusion between numerical correlations and reasoning capabilities. Concomitantly, these wide and hasty attempts have shed light on the intrinsic flaws of generative schemes: while machines can learn from facts, they cannot learn from hearsay, whatever the volumes. As a result, the momentum has turned towards outsourcing as much of the learning capabilities as possible to external knowledge sources, bringing back the core issue of artificial intelligence: where and how to draw the line between language and knowledge; or, more specifically, the Cabalist conundrum: how to trade numbers for meanings.

Language & Communication

Direct vs Mediated Communication

Languages are meant to support two kinds of communication: conversational (direct) and mediated. The difference is critical: while conversational communication can derive meanings from the immediate context, that is not possible for mediated communication, which must therefore rely on shared symbolic representations.

*Conversational & Mediated Communication*

Online search engines being an advertising cash cow, they have instantly become the locus of fierce competition between big tech companies eager to defend (Google) or extend (Microsoft) the pastures under their guardianship. Given the potential disruption of barriers to entry, incumbents try to keep their technology under wraps, swapping transparency for precautionary advice meant to prevent liabilities.

Compared to search engines, driven by business opportunism in open-ended contexts, editing tools are driven by clearly defined purposes set in bounded contexts. That transparency enables a wide range of knowledge-based supporting functions (tutoring, training, teaching, discovery, etc.), with significant momentum already observed for publishing and programming tools.

That parallel split with regard to function (search vs editing) and transparency (or lack thereof) reflects the paradox of generative languages, which are meant to deal with direct (context-driven) as well as mediated (content-driven) communication.

The Matter of Communication

Languages can be summarily characterized by the way they use grammars and words to associate terms (facts) to meanings (concepts):

Layered models deal separately with words (lexicon), structures (syntax), sentences (semantics), and contexts (pragmatics)
Semantic grammars combine syntax (roles) and semantics (references)
Generative models build meanings by applying machine learning algorithms to massive samples of actual texts

*Generative (solid, left) vs Grammar-based (dotted, right) Language Models*

Compared to layered and semantic grammars, which extract meanings through explicit categories (dotted line, right), generative languages (solid line, left) rely instead on implicit structures and meanings emerging from massive and comprehensive corpora of tokenised documents and datasets.
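To make the contrast concrete, here is a minimal sketch of how meaning-like affinities can emerge from token statistics alone, without any explicit category. The toy corpus, the stopword list, and the helpers `tokenise`, `cooccurrence`, and `affinity` are assumptions made for illustration; actual generative models learn dense vectors with neural networks rather than raw sentence-level counts.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (an assumption for illustration, not taken from any real dataset)
CORPUS = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the bank approved the loan",
    "the bank refused the loan",
]

STOPWORDS = {"the"}


def tokenise(text: str) -> list[str]:
    """Split on whitespace and drop stopwords; real tokenisers use subword units."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def cooccurrence(corpus: list[str]) -> Counter:
    """Count how often two distinct tokens appear in the same sentence."""
    counts: Counter = Counter()
    for sentence in corpus:
        unique_tokens = sorted(set(tokenise(sentence)))
        for a, b in combinations(unique_tokens, 2):
            counts[(a, b)] += 1
    return counts


def affinity(counts: Counter, a: str, b: str) -> int:
    """Symmetric lookup of the co-occurrence count for a pair of tokens."""
    return counts[tuple(sorted((a, b)))]


counts = cooccurrence(CORPUS)
print(affinity(counts, "bank", "loan"))  # pair seen together in two sentences
print(affinity(counts, "cat", "loan"))   # pair never seen together
```

Nothing in the counts says that "bank" and "loan" belong to a financial category; the affinity is purely numerical, which is precisely where the lack of explicit mediation comes from.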

That comparison points to the main caveat of generative approaches: their reliance on conversational shortcuts between context and meanings (i.e. no mediation through explicit categories) induces an intrinsic lack of transparency regarding sources and reasoning, and consequently regarding the value of outcomes.

Value Chains

As means of exchange, words, and more generally languages, can be compared to currencies, suggesting a parallel between the value of information as a medium of exchange and as a store of value. And the relationship between information and money is not merely metaphorical:

With regard to context-driven (direct) communication, online disinformation and fake news on social networks are debasing the value of information, an echo of Gresham's law, which states that bad money drives out good
With regard to content-driven (mediated) communication, training large language models can cost $20 million, with requests costing around ten times as much as a Google search
With regard to the storing of value, the training of LLMs raises clear and inescapable issues about property rights

Hence the importance of clarifying the components of the value chains supported by large language models. From a functional perspective three kinds of processing can be combined:

Numerical, for the mapping of labelled tokens (terms) in neural networks into words in semantic ones
Semantic, for the assignment of meanings to words in bounded contexts
Pragmatic, for the final alignment of meanings with users' intents

At the operational level technologies are of two kinds:

Retrieval Augmented Generation (RAG) uses information queries to target specific subsets of resources (documents and datasets)
Fine-tuning (FT) uses targeted examples, typically prompts paired with expected responses, to focus conversations and enhance the relevance of outcomes
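The retrieval side of that split can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any particular product: the `DOCUMENTS` store, the `retrieve` and `build_prompt` helpers, and the ranking by shared tokens are all hypothetical, with token overlap standing in for the vector search that actual RAG pipelines typically use. No language model is called; the sketch stops at the augmented prompt.

```python
import re

# Hypothetical document store; names and contents are illustrative only
DOCUMENTS = {
    "coase": "Transaction costs explain why firms outsource some activities",
    "gresham": "Bad money drives out good money from circulation",
    "owl": "OWL ontologies organise concepts into class hierarchies",
}


def tokens(text: str) -> set[str]:
    """Lowercase alphabetic tokens of a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, documents: dict[str, str], k: int = 1) -> list[str]:
    """Return the names of the k documents sharing the most tokens with the query."""
    query_tokens = tokens(query)
    ranked = sorted(
        documents,
        key=lambda name: len(query_tokens & tokens(documents[name])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, documents: dict[str, str], k: int = 1) -> str:
    """Prepend the retrieved passages to the question before it goes to a model."""
    context = "\n".join(documents[name] for name in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


print(build_prompt("Why do firms outsource activities?", DOCUMENTS))
```

The point of the design is that the knowledge stays outside the model: swapping the document store changes the answers without touching any weights, whereas fine-tuning changes the model itself.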

From their respective initial remits, upstream for RAG and downstream for FT, both approaches are continuously improved and progressively extended across the LLM value chain, typically through the embedding of intermediate outcomes. That could enable a degree of transparency regarding the respective returns of numeric, semantic, and pragmatic operations.

Paths & Parser

Knowledge Paths

By now the elephant in the generative room has made itself perfectly clear: while machines can learn from facts, they cannot learn from hearsay, whatever the volumes. As a result, R&D is now focused on outsourcing as much of the learning capabilities as possible to external knowledge sources, which raises a dual philosophical and economic issue: respectively, how to draw the line between language and knowledge, and how to price outsourcing solutions.

On that account, modeling languages provide a good starting point, considering they are meant to manage knowledge through an explicit and bounded combination of syntactic, lexical, and semantic rules. Models' attachments to knowledge can be empirical and/or formal, the former through facts and terms (south path), the latter through concepts and meanings (north path). But since LLMs are supposed to bypass explicit categories, their only way to get consistent access to knowledge is through a pairing of the north and south paths.

To that end they need to sort out the terms' syntactic and lexical roles (a), set apart words' roles and semantic references (b), and manage the pragmatics of meanings (c). That can be achieved through an ontological parser.

Semantic Connectors

The primary objective of an ontological parser is to set apart language and knowledge contents. The proposed approach relies on semantic networks of stars (#) and planets (+), with pre-defined connectors representing semantics as gravitational forces:

OWL hierarchies (yellow lines) are used to organise concepts without bearing a particular meaning regarding inheritance.

*OWL hierarchies (yellow lines) vs Ontological Connectors (blue lines)*

Category Connectors

Ontological categories can be grammatical or conceptual.

Taking a cue from Role and Reference Grammar (RRG), an ontological parser must be able to deal with overlaps between semantic and grammatical categories.

Last but not least, conceptual categories can be ascertained through ontological modalities.

These connectors can serve as a Swiss Army Knife for Cabalists trying to cross the no-man’s land between numbers and knowledge.

Embeds & Trade

Besides its mystic origins, the outsourcing of knowledge processing to external agents can be examined along two congruent perspectives: cybernetics and economics.

From a cybernetic angle, the question is how information is exchanged; from an economic angle, it is how to characterise transaction costs and returns on investment.

Embeddings & Communication Channels

Vectors are the nuts and bolts of LLMs, as they are used to represent numeric affinities between terms as well as semantic connexions between words; as such they can be used to embed meanings accrued across LLM stages:

Users' intents (prompts)
Domain-specific concepts (meanings)
Thesaurus footprints (words)
Facts, patterns (terms)
Targeted contents (queries)

One step further, and assuming these embeds can be normalised, they could be turned into packaged units exchanged with external agents depending on communication channels: digital (terms), symbolic (queries, words, meanings), or natural (prompts).
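As a minimal sketch of what "numeric affinities" between vectors amounts to, the example below compares hand-made embeddings with cosine similarity. The three-dimensional vectors and the `cosine` and `nearest` helpers are assumptions for illustration only; actual embeddings are learned and have hundreds or thousands of dimensions.

```python
import math

# Hand-made vectors (assumptions for illustration, not learned embeddings)
EMBEDDINGS = {
    "money": [0.9, 0.1, 0.0],
    "currency": [0.8, 0.2, 0.1],
    "grammar": [0.0, 0.2, 0.9],
}


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(term: str, embeddings: dict[str, list[float]]) -> str:
    """Name of the other entry whose vector is most similar to that of `term`."""
    others = (name for name in embeddings if name != term)
    return max(others, key=lambda name: cosine(embeddings[term], embeddings[name]))


print(nearest("money", EMBEDDINGS))
print(round(cosine(EMBEDDINGS["money"], EMBEDDINGS["grammar"]), 3))
```

Because the same operation applies whether a vector stands for a term, a word, a query, or a prompt, vectors are a plausible candidate for the normalised, packaged units discussed above; what the operation cannot say is which of those levels a given vector belongs to.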

Transaction Costs & Lean Processes

The concept of transaction costs, popularised by Nobel laureate Ronald Coase, is about weighing the outsourcing of products or services against the cost of producing them internally. Applied to generative language models, the concept has a two-pronged relevance:

Cybernetics, regarding the flows of information and the balance of entropy between artificial and natural intelligence systems
Economics, regarding the returns on investment given the massive cost of building and running LLMs

While reliable yardsticks for entropy or returns on knowledge are clearly off the map, guidelines can be identified that will minimise both entropy and costs.

Regarding entropy, and taking for granted an ascending hierarchy between the information processing capabilities of digital, symbolic, and natural languages, no trade should downgrade information contents through downward exchanges along the natural, symbolic, and digital hierarchy. Concomitantly, trades should aim at lean and just-in-time processes that would eliminate wasted time and computing power:

Information (targets, symbolic) and knowledge (intents, natural) input flows should only impact the corresponding embeddings
The weighting of terms (digital) should not be affected by meanings from within (embeddings) or without (words) LLMs
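One possible reading of these guidelines can be written down as simple checks. The sketch below is an interpretation, not a specification: the `Channel` ordering follows the hierarchy assumed in the text, while the `EMBED_CHANNEL` and `INPUT_TARGET` mappings and the `is_downgrade` and `may_update` helpers are hypothetical names introduced here.

```python
from enum import IntEnum


class Channel(IntEnum):
    """Ascending hierarchy of information processing capabilities."""
    DIGITAL = 1   # terms
    SYMBOLIC = 2  # queries, words, meanings
    NATURAL = 3   # prompts


# Kinds of embeds mapped to their communication channel (assumed mapping)
EMBED_CHANNEL = {
    "terms": Channel.DIGITAL,
    "queries": Channel.SYMBOLIC,
    "words": Channel.SYMBOLIC,
    "meanings": Channel.SYMBOLIC,
    "prompts": Channel.NATURAL,
}

# Input flows mapped to the only embed they should impact (assumed reading)
INPUT_TARGET = {"targets": "queries", "intents": "prompts"}


def is_downgrade(source: str, target: str) -> bool:
    """True when contents move to a channel lower in the hierarchy."""
    return EMBED_CHANNEL[target] < EMBED_CHANNEL[source]


def may_update(input_flow: str, embed: str) -> bool:
    """True when the input flow is allowed to impact the given embed."""
    return INPUT_TARGET.get(input_flow) == embed


print(is_downgrade("prompts", "terms"))  # natural to digital: flagged
print(may_update("intents", "prompts"))  # intents only impact prompts
print(may_update("intents", "terms"))    # weighting of terms stays untouched
```

Such checks measure neither entropy nor costs; they only flag the exchanges that the guidelines would rule out.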

These guidelines can provide a starting point for an economics of knowledge.
