The End of Practical Obscurity: Deanonymization via LLMs
Burner Accounts Beware
The preservation of online anonymity has traditionally relied on a simple economic reality: piecing together scattered, unstructured digital footprints is highly labor-intensive.
For inventors, intellectual property (IP) professionals, and patent attorneys, this “practical obscurity” has provided a comfortable buffer, allowing employees and researchers to participate in public technical forums without immediately compromising trade secrets or telegraphing corporate initiatives. Generally speaking, your “burner” accounts were relatively safe from being connected or doxxed.
Now, a recent paper by researchers at ETH Zurich titled “Large-scale online deanonymization with LLMs” demonstrates that this buffer is rapidly evaporating.
The research addresses a critical vulnerability in digital privacy, demonstrating that Large Language Models (LLMs) can automate the extraction and matching of identity signals from raw, unstructured text at scale. As the authors succinctly state, the core thesis of the research is disruptive:
“We show that the practical obscurity that has long protected pseudonymous users (the assumption that deanonymization, while theoretically possible, is too costly to execute broadly) no longer holds.” (Paper, p. 1).
This article examines the mechanisms behind this automated deanonymization pipeline, the empirical results of the study, and the pragmatic implications for professionals tasked with safeguarding proprietary information and navigating litigation in an increasingly transparent digital ecosystem.
Full Citation: Swanson, J., Lermen, S., Aerni, M., Paleka, D., Carlini, N., & Tramèr, F. (2026). Large-scale online deanonymization with LLMs. arXiv preprint arXiv:2602.16800v2.
The Problem: The Collapse of the High-Cost Barrier to Deanonymization
The ability to uniquely identify individuals from sparse data points is not a novel discovery. Researchers have long understood the fragility of anonymized datasets.
However, executing these attacks at a broad scale has always hit a logistical bottleneck. The authors articulate this historical limitation perfectly:
“For decades, it has been known that individuals can be uniquely identified from surprisingly few attributes. Sweeney’s seminal work demonstrated that 87% of the U.S. population could be uniquely identified by just zip code, birth date, and gender [34]. Narayanan and Shmatikov showed that anonymous Netflix ratings could be linked to public IMDb profiles using only a handful of movie preferences [24]...
“Despite these attacks, pseudonymous online accounts (Reddit throwaways, anonymous forums, review profiles, etc) have remained largely unaffected by deanonymization attempts. The reason is simple: applying such attacks in practice has required structured data amenable to algorithmic matching or substantial manual effort by skilled investigators reserved for high-value targets [13].” (p. 1).
This perspective is highly credible and deeply relevant to the fields of cybersecurity and IP management. Security often relies heavily on economic deterrence rather than absolute cryptographic perfection.
If an adversary must spend thousands of dollars in human labor to unmask a single pseudonymous user, widespread dragnet surveillance of forum posts remains economically unviable.
By demonstrating that LLMs can process unstructured text—the very fabric of online communication—with the speed and cost-efficiency of algorithmic matching, the researchers highlight a fundamental shift in the threat model.
The moat protecting casual online discourse has been drained.
Proposed Solution: The Extract-Search-Reason-Calibrate (ESRC) Pipeline
To prove that LLMs can overcome the traditional barriers of unstructured data, the researchers designed a scalable, four-stage attack pipeline.
This methodology does not require bespoke, highly specialized models; rather, it strings together commercially available LLM capabilities to mimic the workflow of a human investigator.
The Extract Stage
“[W]e ask LLMs to identify and structure relevant features from unstructured posts: demographics, writing style, interests, incidental disclosures, etc.” (p. 2).
Instead of relying on neatly organized database columns, this initial step utilizes an AI to read messy, colloquial forum comments and distill them into a structured, easily searchable profile of habits, locations, and technical traits.
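The paper does not publish its prompts or schema, but the shape of this step can be sketched. In the toy version below, `llm_call`, the prompt wording, and the `Profile` fields are all illustrative assumptions, not the authors' implementation:

```python
import json
from dataclasses import dataclass, field

@dataclass
class Profile:
    # Structured identity signals distilled from unstructured posts.
    # Field names are illustrative, not the paper's actual schema.
    demographics: list = field(default_factory=list)
    interests: list = field(default_factory=list)
    writing_style: list = field(default_factory=list)
    disclosures: list = field(default_factory=list)

EXTRACT_PROMPT = (
    "Read the forum posts below and return JSON with keys "
    "'demographics', 'interests', 'writing_style', 'disclosures', "
    "each a list of short strings.\n\nPOSTS:\n{posts}"
)

def extract_profile(posts, llm_call):
    # llm_call is any callable (prompt string -> JSON string); in a real
    # pipeline it would wrap a commercial LLM API.
    raw = llm_call(EXTRACT_PROMPT.format(posts="\n---\n".join(posts)))
    data = json.loads(raw)
    return Profile(
        demographics=data.get("demographics", []),
        interests=data.get("interests", []),
        writing_style=data.get("writing_style", []),
        disclosures=data.get("disclosures", []),
    )
```

The key design point is that the LLM's output is forced into a fixed, machine-readable structure, which is what makes the next stage searchable.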
The Search Stage
“[W]e encode extracted features into dense embeddings enabling efficient search over thousands or millions of candidate profiles.” (p. 2).
Once a user’s unstructured ramblings are converted into a structured profile, this data is translated into mathematical vectors. This allows the system to rapidly and automatically scan across massive, internet-scale datasets to find candidates with similar attributes, narrowing the field from millions to a manageable shortlist.
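Mechanically, this stage is a nearest-neighbor lookup over embedding vectors. A minimal sketch using cosine similarity follows; a production system would use an approximate-nearest-neighbor index over millions of profiles, and the tiny vectors here are purely illustrative:

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def shortlist(query_vec, candidate_vecs, k=3):
    # Rank candidate profiles by embedding similarity to the query
    # profile and keep only the top k for the reasoning stage.
    ranked = sorted(
        candidate_vecs,
        key=lambda name: cosine_similarity(query_vec, candidate_vecs[name]),
        reverse=True,
    )
    return ranked[:k]
```

This is what collapses the field from millions of candidates to a shortlist cheap enough to hand to a more expensive reasoning model.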
The Reason Stage
“[W]e use extended reasoning on top candidates from the search step to identify the most likely match given all available context[.]” (p. 2).
Embedding searches are fast but can lack nuance. In this stage, a more advanced language model acts as a digital detective, reviewing the shortlist of potential matches to cross-reference subtle contextual clues and confirm the identity with a higher degree of accuracy.
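The reasoning step can be sketched as a reranking prompt over the shortlist. Again, the prompt text and answer-parsing convention below are assumptions for illustration, not the paper's method:

```python
REASON_PROMPT = (
    "An anonymous user wrote the posts below. Which of the numbered "
    "candidate profiles is most likely the same person? Reply with "
    "the number only, or 'none'.\n\nPOSTS:\n{posts}\n\nCANDIDATES:\n{cands}"
)

def pick_match(posts, candidates, llm_call):
    # candidates: list of (name, profile_text) pairs from the search stage.
    # llm_call: any callable (prompt -> reply string), e.g. an LLM API wrapper.
    cands = "\n".join(f"{i}. {text}" for i, (_, text) in enumerate(candidates))
    reply = llm_call(
        REASON_PROMPT.format(posts="\n".join(posts), cands=cands)
    ).strip()
    if not reply.isdigit():
        return None  # model declined to pick, or gave an unparseable answer
    idx = int(reply)
    return candidates[idx][0] if idx < len(candidates) else None
```

Allowing a "none" answer matters: forcing the model to always pick a candidate would inflate false positives, which the calibration stage then has to filter out.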
The Calibrate Stage
“[W]e prompt LLMs to provide confidences in identified matches (either absolute or relative to other matches), which lets us calibrate the attack to a desired false positive rate.” (p. 2).
A critical feature for any practical intelligence-gathering tool is the ability to self-assess certainty. By forcing the LLM to score its own confidence, operators can filter out weak, speculative guesses and only act on highly probable matches, ensuring the output is actionable and reliable.
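Calibration can be made concrete with a little arithmetic. Given a labeled validation set of (confidence, correct?) pairs, one can find the lowest confidence cutoff whose accepted matches meet a target precision, and then act only on matches above that cutoff. This sketch is a generic thresholding routine, not the paper's calibration procedure:

```python
def threshold_for_precision(scored, target_precision):
    # scored: list of (confidence, is_correct) pairs from a labeled
    # validation set. Returns the lowest confidence cutoff whose
    # accepted matches reach the target precision, or None if none does.
    for cutoff in sorted({c for c, _ in scored}):
        accepted = [ok for c, ok in scored if c >= cutoff]
        if accepted and sum(accepted) / len(accepted) >= target_precision:
            # Lowest qualifying cutoff keeps recall as high as possible.
            return cutoff
    return None
```

Choosing the lowest qualifying cutoff is deliberate: raising it further would discard correct matches without improving the precision guarantee.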
Examples: Real-World Testing Across Online Communities
To evaluate the efficacy of the ESRC pipeline, the researchers tested their framework against three distinct datasets featuring known ground-truth data (p. 1).
The experiments included matching users across different Reddit movie discussion communities (p. 1), linking a single user’s Reddit history split across different time periods (p. 1), and linking Hacker News accounts to professional LinkedIn profiles (p. 1).
The Hacker News to LinkedIn experiment is particularly illustrative of the risks posed to corporate and IP ecosystems.
The researchers collected 987 LinkedIn profiles explicitly linked to Hacker News accounts, carefully stripped the Hacker News profiles of all direct identifiers to simulate pseudonymity, and fed them into the pipeline to see if the system could re-establish the connection (p. 6).
The results were striking when compared to classical, non-LLM methods:
“For example, we improve recall from 0.1% to 45.1% at 99% precision when linking Hacker News accounts to LinkedIn profiles[.]” (p. 2).
An improvement from a near-zero success rate to nearly half of all targets accurately identified—at a 99% precision threshold—demonstrates that unstructured professional chatter is highly distinctive.
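To make those metrics concrete: precision is the fraction of claimed matches that are correct, and recall is the fraction of all true identities recovered. The worked numbers below are hypothetical counts merely consistent with the reported "45.1% recall at 99% precision" over a 987-profile set, not figures from the paper:

```python
def precision_recall(true_positives, false_positives, total_targets):
    # Precision: how often a claimed match is right.
    # Recall: what share of all targets were successfully linked.
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / total_targets
    return precision, recall

# Hypothetical counts: ~445 correct links with ~4 false positives
# over 987 targets yields roughly 99% precision and 45.1% recall.
p, r = precision_recall(true_positives=445, false_positives=4, total_targets=987)
```

The asymmetry is the point: at a 99%-precision operating point, almost every unmasking the attacker acts on is correct.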
If an engineer asks a highly specific question about a novel polymer on Hacker News under a pseudonym, adversaries utilizing this pipeline could trivially link that question back to the engineer’s real-world LinkedIn profile.
The Illusion of the Throwaway Account
The study’s findings severely undermine the presumed safety of “burner,” “throwaway,” or alternative accounts commonly used by professionals to ask sensitive questions without exposing their identity.
By explicitly testing the challenge of “matching a user’s main account to their alt-account” (p. 9), the researchers demonstrated that individuals carry distinctive semantic fingerprints across different platforms and time periods.
Consequently, a highly specific technical inquiry or legal question posted under a newly created pseudonym can still be traced back to the author’s primary public identity by an LLM analyzing their unique traits.
For inventors and attorneys, this confirms that superficially compartmentalizing digital identities is no longer a reliable method for protecting trade secrets or case strategies.
Weaponizing Deanonymization: Litigation, Evidence, and Extortion
The implications of these capabilities extend far beyond corporate espionage; they represent a formidable new frontier in legal strategy and litigation.
The capacity to autonomously match anonymous online histories to verified identities introduces profound risks regarding evidence discovery, witness credibility, and the security of legal counsel.
The paper highlights the broader societal vulnerabilities this creates, noting how threat actors could exploit these systems:
“Hostile groups could identify important employees and decision makers and build online rapport with them to eventually leverage in various forms.” (p. 13).
In a litigation context, this “leverage” takes several concrete forms. First, deanonymization could be weaponized to uncover fraud or undermine testimony.
If an inventor testifies that a specific technology was conceptualized on a certain date, opposing counsel could utilize an LLM pipeline to identify the inventor’s pseudonymous Reddit or GitHub accounts. If those unmasked accounts reveal the inventor seeking troubleshooting advice for that exact technology years prior, the testimony is immediately undermined, and potential fraud is exposed.
Furthermore, the same tools can be directed at the legal system itself. Hackers or opposing entities could target patent attorneys, litigators, clerks, or even judges by unmasking their private, pseudonymous online activities.
The researchers cite related work demonstrating how unmasked data fuels highly targeted attacks:
“[A]dversaries can launch tailored attacks on a user-by-user basis, fundamentally changing the cost-benefit calculus for attackers.” (p. 13).
“[R]ecent work... demonstrate[s] that LLM agents can autonomously crawl public information to construct profiles that were comprehensive for 88% of targets, using them to generate spear phishing emails with click-through rates on par with human experts.” (p. 13).
By uncovering an attorney’s anonymous vents on a legal forum or a judge’s pseudonymous political commentary, malicious actors could formulate sophisticated spear-phishing campaigns to steal confidential case strategies, or worse, use the unmasked activities for direct extortion.
However, a pragmatic view requires acknowledging current limitations within highly sanitized legal documents.
The authors note that applying these models directly to heavily redacted court files remains challenging:
“Nyffenegger et al. [26] evaluate LLM re-identification capabilities on court decisions, finding that despite high re-identification rates on Wikipedia, even the best LLMs struggled with anonymized legal documents.” (p. 12).
While official court redactions may currently present a hurdle, the informal, unstructured text generated by legal professionals and witnesses across the broader internet remains highly vulnerable to automated unmasking.
Closing Thoughts
“Large-scale online deanonymization with LLMs” is a pragmatic wake-up call for data privacy, intellectual property management, and the legal profession. The paper successfully proves that the cost of parsing unstructured text—long considered the ultimate shield for online anonymity—has plummeted.
While the underlying models utilized in the research are not entirely new, the systemic, automated application of these tools to deanonymize users at scale is a sobering development.
For IP attorneys, inventors, and corporate risk officers, this research necessitates a paradigm shift. Trade secrets, competitive intelligence, and litigation strategies can no longer be considered safe simply because related chatter occurs under the guise of an anonymous handle.
As offensive capabilities become cheaper and more automated, the legal and technological industries must look toward funding and developing new defensive protocols—perhaps leveraging the very same LLMs to proactively sanitize employee outputs, audit the digital footprints of key witnesses, or obfuscate identifying semantic markers before they can be weaponized in court.
In the short term, silence may be the only safe path.
As we like to say around here: Dance like no one is watching. Tweet like one day it will be read in open court.
Disclaimer: This is provided for informational purposes only and does not constitute legal or financial advice. To the extent there are any opinions in this article, they are the author’s alone and do not represent the beliefs of his firm or clients. The strategies expressed are purely speculation based on publicly available information. The information expressed is subject to change at any time and should be checked for completeness, accuracy and current applicability. For advice, consult a suitably licensed attorney and/or patent professional.