Entity Signals: Building a Verifiable Identity Without Wikidata

In short: most professionals and organisations do not have, and cannot have, a Wikidata entry. That does not prevent them from building a semantic identity readable by AI systems. The signals that matter are elsewhere: JSON-LD markup on their own website, official profiles declared as verifiable anchors, sector-specific identifiers, an llms.txt file that addresses models directly. Each of these is a piece of data an AI system can use to resolve entity linking without having to invent details. The work is not about writing more content; it is about making what already exists verifiable.

The starting point: why Wikidata is not the answer for everyone

When discussing semantic identity and entity linking, Wikidata is often cited as the solution. That is understandable: it is the largest and most open knowledge graph on the web, used by Google, the major AI systems and dozens of search engines as a source of verifiable facts. Having an entry there means being a node in the graph machines consult.

The problem is that Wikidata requires notability. It is not an arbitrary formal requirement but a precise editorial choice: entries must describe entities that have received significant coverage from independent sources. A lawyer with twenty years of experience, an established consultant in their field, a nonprofit with projects funded by the European Union: none of these automatically clears that bar. An entry created without meeting the criteria is likely to be deleted. The attempt is then counterproductive, and in some cases it leaves a negative trace.

And yet the problem to solve is real and independent of Wikidata: how does an AI system recognise who you are, separate you from your namesakes, cite you accurately? The answer does not necessarily run through a public knowledge graph. It runs through the signals you control directly.

What an entity signal does

An entity signal is any piece of structured data that helps an AI system link a mention of you or your organisation to the correct entity. It is not descriptive text, a bio or a social profile: it is data in a form a machine can read without having to interpret it.

The distinction is fundamental. A text that says "Mario Rossi is a lawyer specialising in employment law in Milan" is useful for a human reader but ambiguous for a model: how many lawyers named Mario Rossi are there? Where does this information come from? Is it verifiable? An entity signal resolves that ambiguity by connecting the name to unique references, declaring properties in a format that requires no interpretation, and anchoring statements to sources that hold up under checking.

The logic is the same as entity linking: the model needs the correct answer to be the easiest one to find. Signals are the coordinates that make the correct answer easier than any other.

Markup on your website: JSON-LD and Schema.org

The most direct point of control is your own website. Every page can contain structured markup in JSON-LD, the format Google recommends for structured data, which lets search engines and AI systems read metadata without having to extract it from the page text.

Schema.org provides the vocabulary: a set of recognised types and properties that name entities and their characteristics. The Person type describes a natural person, with properties such as name, jobTitle, worksFor, knowsAbout, sameAs. The Organization type describes an organisation, with name, description, foundingDate, areaServed, sameAs. Both types support identifier, which holds unique, verifiable references.

The markup is invisible to the human visitor: it lives in a <script type="application/ld+json"> block, typically in the page head. But for an AI system it is among the most reliable information on the page, because it has been declared explicitly by the site owner, not extracted from an ambiguous text.
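As a minimal sketch of what such a block looks like, the Python below serialises a small Person description and wraps it in the script element. The name, job title and URL are placeholders, and to_script_tag is a hypothetical helper written for this note, not part of any library.

```python
import json

# Placeholder data for illustration only: none of these values is a real reference.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Mario Rossi",
    "jobTitle": "Employment lawyer",
    "url": "https://www.example.com/",
}


def to_script_tag(data: dict) -> str:
    """Serialise a JSON-LD dict and wrap it in a script element (hypothetical helper)."""
    payload = json.dumps(data, ensure_ascii=False, indent=2)
    # "</" inside a script element could close it early; "<\/" is an equivalent JSON escape.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


print(to_script_tag(person))
```

The printed element can be pasted into the page template. Generating it from a single data structure, rather than editing the JSON by hand on each page, is one way to keep the declaration identical across the site.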

The properties that actually disambiguate

Not all Schema.org properties carry the same weight for disambiguation. Some are descriptive and help build the profile; others are structurally decisive because they pinpoint a unique entity.

The sameAs property is the most important. It declares that the entity described in the markup is the same one appearing on an external profile or page. Linking the entity to your LinkedIn profile, your GitHub, your institutional website turns those profiles from separate pages into confirmations of the same node. Each sameAs added is a thread connecting different sources to a single verifiable identity.

The identifier property holds sector-specific unique codes: the ORCID for researchers and authors, the tax code or VAT number for Italian professionals and companies, the LEI code for financial entities. These identifiers are not names shared with others: they are numbers assigned once to a single entity. For an AI system they are the equivalent of an identity document.

The worksFor, affiliation and alumniOf properties link the entity to other already-known entities. If the organisation you are affiliated with has a Wikidata entry, connecting to that entry transfers some of its anchoring. You are not in the knowledge graph as a standalone node, but you are hooked to nodes that are already there.

The knowsAbout property, filled with references to defined terms, qualifies expertise in a verifiable way instead of relying on keywords. The difference between saying "works on entity linking" and linking the term to its Wikidata entry is that in the second case the model knows exactly what is being discussed.
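The four groups of properties described in this section can be sketched together in one Person description. Everything here is an assumption made for illustration: the profile URLs, the ORCID value and the Wikidata IDs (Q00000000) are placeholders to be replaced with real references, and declared_anchors is a hypothetical helper.

```python
import json

# Placeholder values only: replace every URL and identifier with a real, verifiable one.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "@id": "https://www.example.com/#me",
    "name": "Mario Rossi",
    # sameAs: external profiles that describe the same entity.
    "sameAs": [
        "https://www.linkedin.com/in/example-profile",
        "https://github.com/example-user",
    ],
    # identifier: a sector-specific unique code, expressed as a PropertyValue.
    "identifier": {
        "@type": "PropertyValue",
        "propertyID": "ORCID",
        "value": "0000-0000-0000-0000",
    },
    # worksFor: hooks the person to an organisation that may already be in a knowledge graph.
    "worksFor": {
        "@type": "Organization",
        "name": "Example Law Firm",
        "sameAs": "https://www.wikidata.org/wiki/Q00000000",
    },
    # knowsAbout: expertise linked to a defined term instead of a bare keyword.
    "knowsAbout": [
        {
            "@type": "DefinedTerm",
            "name": "Entity linking",
            "sameAs": "https://www.wikidata.org/wiki/Q00000000",
        }
    ],
}

DISAMBIGUATING = ("sameAs", "identifier", "worksFor", "knowsAbout")


def declared_anchors(entity: dict) -> list[str]:
    """List which of the disambiguating properties are actually filled in."""
    return [prop for prop in DISAMBIGUATING if entity.get(prop)]


print(json.dumps(person, ensure_ascii=False, indent=2))
print(declared_anchors(person))
```

Schema.org also defines dedicated properties such as taxID, vatID and leiCode, which may fit some identifiers better than the generic identifier property.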

Official profiles as anchors

Profiles on recognised platforms, declared as sameAs in the markup, become anchors of the semantic identity. Not because the platforms are knowledge graphs, but because they are sources AI systems already index and weight — and because the explicit declaration of the link reduces ambiguity.

LinkedIn is the most relevant for professionals: it is among the sources most likely to surface for current professional data. A well-structured LinkedIn profile, linked from your website with sameAs and linking back to it, works as a bidirectional confirmation: the site declares that the profile is the same node, and the profile describes the node with data the model can read.

GitHub carries similar value for those working in technical or research fields: it is a verifiable source of contributions and projects, with a traceable history. For organisations, profiles on Charity Navigator, official European Union portals, or sector-specific registries serve the same function: they are external authoritative references confirming the entity's existence and characteristics.

The principle is verifiable redundancy: the more independent sources describe the same entity with consistent data, the harder it becomes for a model to confuse it with another or invent details. Each declared profile is a constraint that narrows the margin for error.
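Consistency across sources can be checked before any markup is written. The sketch below assumes you have copied a few fields from each profile by hand into a dictionary; the source names, field names and values are all hypothetical, and inconsistencies is a helper invented for this note.

```python
def inconsistencies(profiles: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Return the fields whose values differ between sources.

    Values are compared after collapsing whitespace and ignoring case,
    so purely cosmetic differences are not reported.
    """
    by_field: dict[str, dict[str, str]] = {}
    for source, fields in profiles.items():
        for field, value in fields.items():
            normalised = " ".join(value.split()).casefold()
            by_field.setdefault(field, {})[source] = normalised
    return {
        field: values
        for field, values in by_field.items()
        if len(set(values.values())) > 1
    }


# Hypothetical data, transcribed manually from each profile.
profiles = {
    "website": {"name": "Mario Rossi", "jobTitle": "Employment lawyer"},
    "linkedin": {"name": "Mario  Rossi", "jobTitle": "Labour law consultant"},
    "github": {"name": "mario rossi"},
}

for field, values in inconsistencies(profiles).items():
    print(field, values)
```

In this made-up example only jobTitle is reported, because the three spellings of the name normalise to the same string. Each reported field is a place where two sources disagree and the margin for error widens again.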

llms.txt: instructing models directly

There is a more recent and less widely known tool that approaches the problem from a different direction: the llms.txt file. It is not a W3C standard or a formal vocabulary. It is an emerging community convention, analogous to robots.txt but designed for language models rather than crawlers.

An llms.txt file is a plain-text document (the public proposal uses Markdown) placed at the root of the site, readable by both machines and humans, that declares who the site owner is, what they do, which pages contain relevant information and how the information should be used. It does not have the formal structure of JSON-LD, but it has the advantage of being direct: it instructs the model in controlled natural language, with an authority that comes from being hosted on the domain itself.

Its value is not to replace structured markup but to complement it. JSON-LD speaks to the parser of an AI system; llms.txt speaks to the model at the moment it indexes or consults the site. Used together, they cover different layers of the same problem: formal readability and contextual understanding.
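A sketch of how such a file could be generated is shown below. The layout (a title line, a quoted summary, then sections of annotated links) follows the public llms.txt proposal as far as I understand it; the section names, URLs and wording are placeholders, and build_llms_txt is a hypothetical helper.

```python
def build_llms_txt(name: str, summary: str,
                   sections: dict[str, list[tuple[str, str, str]]]) -> str:
    """Assemble llms.txt content: title, quoted summary, then sections of links."""
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Placeholder content for illustration only.
content = build_llms_txt(
    name="Mario Rossi",
    summary="Employment lawyer based in Milan. This site is the authoritative source about him.",
    sections={
        "Identity": [
            ("About", "https://www.example.com/about", "biography and affiliations"),
        ],
        "Work": [
            ("Publications", "https://www.example.com/publications", "articles and talks"),
        ],
    },
)

print(content)
# To publish, save the text as llms.txt at the root of the site, for example:
# pathlib.Path("llms.txt").write_text(content, encoding="utf-8")
```

Keeping the summary line consistent with the name, jobTitle and description in the JSON-LD means the two layers confirm rather than contradict each other.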

The Solid Pod: the frontier of data sovereignty

There is a further level, still uncommon but consistent with the direction of the semantic web: the Solid Pod. The Solid Protocol, developed by Tim Berners-Lee, lets you store your data in a personal container that you control, with granular access managed by you rather than by the platform hosting the data.

A Solid Pod is not a social profile and not a website in the traditional sense: it is a storage point for structured data in RDF, accessible according to rules the owner establishes. Linking your Solid Pod to the entity described in the site markup is a declaration of data sovereignty: it tells the model where to find up-to-date, verifiable information under the direct control of the person it concerns.

We are far from widespread adoption, and it is not realistic to propose a Solid Pod as a first step for someone just starting to build their semantic identity. But it is worth naming because it points in the right direction: less dependence on intermediaries, more direct control over how your entity is represented in the web machines read.

The mistake of treating it as a technical problem

At this point the temptation is to automate. Plugins, generators and scripts exist that produce JSON-LD markup in seconds: fill in some fields, copy the code into the page, done. The markup will be produced, and it will even be formally valid. The tool worked.

But a generator does not know which of your namesakes you are. It does not choose which of your affiliations actually disambiguate your identity. It does not find the authoritative profiles worth declaring as sameAs. It does not decide which competencies to link to verifiable terms and which to leave as free text. It does not know your professional history well enough to understand what needs to be said and what can be left out.

These are judgements, not mechanical operations. The markup is the result of an analysis: what distinguishes this entity from others sharing the same name? What evidence exists that holds up under checking? Where are the authoritative sources that confirm who you are? The JSON-LD syntax is the last step, not the first.

How to verify it is working

The most direct test is also the most useful as a starting point. Before acting, ask the main AI systems — ChatGPT, Gemini, Perplexity — who you are, what you do, what your key projects or outputs are. Note the answers: what is correct, what is missing, what is invented, what belongs to a namesake.

That snapshot serves as a baseline. After building the signals, repeat the same questions. The comparison is verifiable: can the model now separate you from your namesakes? Are the data cited those declared in the markup? Are the affiliations correct? Have you appeared in answers where you were previously ignored?
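The before-and-after comparison can be kept honest with a small script. This is only a sketch: the answers are pasted in by hand (no AI system is called), the declared facts and both answers are invented, and a plain substring match is a crude stand-in for actually reading the response.

```python
def score_answer(answer: str, declared_facts: dict[str, str]) -> dict[str, bool]:
    """Check which declared facts appear, case-insensitively, in an answer."""
    text = answer.casefold()
    return {label: fact.casefold() in text for label, fact in declared_facts.items()}


# Facts declared in the markup (hypothetical values).
declared = {
    "profession": "employment law",
    "city": "Milan",
    "affiliation": "Example Law Firm",
}

# Answers copied manually from an AI system, before and after building the signals.
baseline = "Mario Rossi is an Italian footballer born in 1990."
follow_up = ("Mario Rossi is a lawyer in Milan specialising in employment law "
             "at Example Law Firm.")

before = score_answer(baseline, declared)
after = score_answer(follow_up, declared)

# Facts that were absent in the baseline and are present in the follow-up.
gained = [label for label in declared if after[label] and not before[label]]
print("gained:", gained)
```

Saving the baseline answers with their date and the exact question asked makes the later comparison repeatable across ChatGPT, Gemini and Perplexity.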

There is also an immediate technical check: the Schema.org validator lets you verify that the markup is formally correct and that the properties are recognised. It is a necessary check but not a sufficient one: valid markup does not guarantee that the content choices are the right ones to disambiguate this specific entity.
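Alongside the online validator, a quick local check can confirm that a page really contains parseable JSON-LD and that the properties you consider essential are present. The sketch below uses only the Python standard library; the sample HTML and the list of expected properties are assumptions, and it does not replace the Schema.org validator.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script block."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.blocks: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            # json.loads raises ValueError if the block is not valid JSON.
            self.blocks.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def missing_properties(block: dict, expected=("name", "sameAs", "identifier")) -> list[str]:
    """Return the expected properties that are absent or empty in a block."""
    return [prop for prop in expected if not block.get(prop)]


# Sample page for illustration; in practice, read the saved HTML of your own page.
html = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Person", "name": "Mario Rossi",
 "sameAs": ["https://www.linkedin.com/in/example-profile"]}
</script>
</head><body></body></html>"""

extractor = JsonLdExtractor()
extractor.feed(html)
extractor.close()

missing = missing_properties(extractor.blocks[0])
print("missing:", missing)
```

Which properties count as expected is exactly the judgement the validator cannot make for you: the tuple passed to missing_properties should come from the analysis of what distinguishes this entity, not from a generic template.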

Not more content: different structure

It is worth clarifying what this work does not require. It does not require writing more texts, publishing more articles, or building a larger presence on social media. It does not require building a richer website or a more detailed profile in the narrative sense.

It requires taking what already exists — affiliations, competencies, projects, outputs — and making it verifiable. Moving from a form a human can read to a form a machine can use to make decisions. It is a transformation of structure, not of volume.

Those who already have a consolidated digital presence often have more verifiable material than they realise. The problem is not a lack of data; it is that the data exists in formats AI systems cannot use with precision. Building entity signals is the work of closing that gap.

Where to start

The first step is the baseline snapshot: ask the main AI systems who you are today, and note what they describe accurately, what they invent and where they confuse you with a namesake. From that response you can tell where the entity-linking process is failing and which signals are missing.

The second step is an inventory: what verifiable evidence already exists? Which affiliations are documented on authoritative sources? Which unique identifiers apply to you? Which official profiles do you already control, and are they consistent with each other? What is found here is the raw material of the signals.

The third step is structure: translating that inventory into JSON-LD markup on the site, consistent sameAs declarations across platforms, and an llms.txt that describes the entity in a controlled way. Each of these is one more signal that makes the correct answer easier to find for any system trying to link a mention of your name to you.

I work with professionals, SMEs and impact organisations to build this kind of presence: structured, verifiable, readable by the AI systems that are becoming the first point of contact between those who search and those who exist.

If you want to understand which signals you are missing and where to start, book a 30-minute call.