Your initial observation is accurate: the "brain" of an AI is a complex structure, not simply an accumulation of "data." The parameters of a model represent learned patterns and relationships, not the data itself [6].
Regarding your central question about what percentage of data is hosted locally, the technical answer is: everything the model "stores" (its parameters) is hosted locally, but for a wide range of queries the AI also draws on external data in real time.
There is no fixed "percentage" of outside data because an AI is not a database. It is a reasoning machine that decides, question by question, whether it needs to search for external information.
AI models do not store data like a search engine
To understand this better, let's look at the fundamental difference between a search engine like Google and a language model like DeepSeek or ChatGPT:
Google: It acts as a gigantic index of the web. It stores and catalogs billions of web pages and, in response to your query, returns links to the data it has already indexed on its own servers.
A conversational AI: Its internal "knowledge" is not raw data, but parameters (the statistical patterns it learned during its training) [9]. This knowledge is static and has a cutoff date. Therefore, for anything requiring updated or highly specific information, the model needs to activate an external search [10].
How often and why do they need to search the Internet?
A study on ChatGPT reveals very telling data about this behavior [2][5]:
Frequency: ChatGPT performs an internet search in approximately 31% of the queries it receives. That is, in almost one out of every three questions, the AI decides that its "brain" is not sufficient and needs external help [2][5].
Types of queries: 59% of searches with local intent (such as "the best Italian restaurant near me") trigger a search, as do 41% of shopping-related queries [2][5].
What are they looking for? When they go online, they're usually looking for reviews, comparisons, or very recent information, such as "the best electric cars of 2026" [2].
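To make the "decides, question by question" idea concrete, here is a minimal sketch in Python. It is only an illustration: real assistants let the model itself decide when to call a search tool, and nothing here reflects how ChatGPT or DeepSeek actually implement it. The function name `needs_search`, the hint list, and the `cutoff_year` default are all hypothetical choices made for this example.

```python
import re

# Hypothetical trigger words, chosen only for illustration:
# local intent, shopping intent, and recency.
HINT_PHRASES = ("near me", "nearby", "buy", "price", "cheapest", "latest", "today")


def needs_search(query, cutoff_year=2024):
    """Guess whether a query needs external data.

    cutoff_year is an assumed training cutoff, not a real model's value.
    """
    text = query.lower()

    # Local, shopping, or recency wording suggests the answer lives outside the model.
    for phrase in HINT_PHRASES:
        if re.search(rf"\b{re.escape(phrase)}\b", text):
            return True

    # Any year after the assumed cutoff cannot be covered by static parameters.
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", text)]
    return any(year > cutoff_year for year in years)


if __name__ == "__main__":
    for q in (
        "the best Italian restaurant near me",
        "the best electric cars of 2026",
        "What is a neural network parameter?",
    ):
        print(q, "->", "search" if needs_search(q) else "answer from parameters")
```

A keyword rule like this is far cruder than what a real model does, but it shows the shape of the decision: some questions can be answered from the parameters alone, while others signal that external information is needed.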
The technology behind the search: RAG
The technique that allows this connection between the internal "brain" and the outside world is called Retrieval-Augmented Generation (RAG) [6][9].
Instead of having all the data indexed locally like Google, the model performs one or more searches on external sources (web search engines, private databases, etc.), reads the results it considers relevant, extracts the key information, and integrates it in real time with its own reasoning to generate an answer for you [3][6]. Data sources can be very varied: from company websites (which represent 58% of sources in local searches) to Wikipedia (39% of mentions) [8].
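If it helps, the retrieve-then-generate flow described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not a production RAG system: `DOCUMENTS` is a made-up in-memory stand-in for external sources, retrieval is simple word overlap rather than a real search engine or vector index, and `generate` is a placeholder for whatever language model you would call.

```python
import re
from collections import Counter

# Hypothetical stand-in for external sources (web results, private databases).
DOCUMENTS = [
    "Trattoria Roma is an Italian restaurant with strong reviews for its pasta.",
    "The 2026 electric car lineup includes several new long-range models.",
    "Language models store learned parameters, not the training documents.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, documents, k=2):
    """Rank documents by how many query words they share; keep the top k."""
    query_words = Counter(tokenize(query))
    scored = []
    for doc in documents:
        overlap = sum((query_words & Counter(tokenize(doc))).values())
        if overlap > 0:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_prompt(query, passages):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using the context above."


def answer(query, generate):
    """Retrieve, augment the prompt, then hand it to the model (generate)."""
    passages = retrieve(query, DOCUMENTS)
    return generate(build_prompt(query, passages))


if __name__ == "__main__":
    # The "model" here just echoes its prompt so the augmentation is visible.
    print(answer("best electric cars of 2026", generate=lambda prompt: prompt))
```

The point of the design is that the model's parameters never change: the external text is only placed into the prompt for that one question, which is why the answer can be current even though the stored knowledge is static.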
In short, the fundamental knowledge and "intelligence" always reside within the model's parameters (hosted locally). But its ability to search for and process external information in real time is what makes it such a powerful tool for current or specific queries.
I hope this explanation clarifies your question. Would you like us to delve deeper into any specific aspect, such as RAG technology or the difference between RAG and traditional search engines?