A model that generates output one token at a time, where each new token is predicted based on all the tokens that came before it. Every modern LLM — Claude, GPT, Llama, Gemini — is autoregressive.
Why it matters: Understanding autoregressive generation explains most LLM behaviors: why responses stream token by token, why models sometimes contradict themselves, why longer outputs are slower, and why you can't ask a model to "go back and fix the beginning."
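A minimal sketch of the autoregressive loop, assuming a hypothetical `model` callable that returns next-token probabilities (the names are illustrative, not any specific API):

```python
import numpy as np

def generate(model, prompt_tokens, max_new_tokens=50, eos_id=0):
    """Autoregressive decoding: each new token is predicted from ALL tokens so far."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)            # P(next token | every previous token)
        next_id = int(np.argmax(probs))  # greedy choice (sampling also works)
        tokens.append(next_id)           # the new token becomes part of the context
        if next_id == eos_id:            # stop at end-of-sequence
            break
    return tokens
```

Because each step depends on the previous output, generation is inherently sequential — which is why longer outputs take proportionally longer and why earlier tokens can't be revised.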
The broad field of building machines that can perform tasks typically requiring human intelligence — understanding language, recognizing images, making decisions, solving problems. AI ranges from narrow systems that excel at one specific task (spam filters, chess engines) to the aspirational goal of general intelligence that can handle any intellectual task a human can.
Why it matters: AI is the umbrella that covers everything else in this wiki — machine learning, deep learning, LLMs, computer vision, robotics. Understanding that "AI" is a spectrum from simple rule-based systems to frontier language models helps you evaluate claims, cut through hype, and understand what today's systems actually are: extraordinarily capable pattern matchers, not thinking machines.
A mathematical function applied to a neuron's output that introduces non-linearity into the network. Without activation functions, a neural network — no matter how many layers deep — would only be able to learn linear relationships. ReLU, GELU, and SiLU/Swish are the most common in modern architectures.
Why it matters: Activation functions are the reason deep learning works at all. A stack of linear transformations is just one big linear transformation. Activation functions between layers let the network learn complex, non-linear patterns — the curves, edges, and subtle relationships that make neural networks powerful.
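A small numpy sketch of the common activations; the GELU here uses the standard tanh approximation:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):  # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):  # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), gelu(x), silu(x), sep="\n")
```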
The study of moral questions raised by AI development and deployment: What biases do AI systems perpetuate? Who is harmed when AI makes mistakes? How should AI decisions be explained? Who is responsible when an autonomous system causes damage? AI ethics encompasses fairness, transparency, accountability, privacy, and the societal impact of AI systems.
Why it matters: AI systems make decisions affecting hiring, lending, criminal justice, healthcare, and content moderation for billions of people. These decisions encode values — whose data was included, what outcomes were optimized for, who was consulted. AI ethics isn't an abstract philosophical exercise; it's the practical question of whether AI systems make the world more fair or less.
Laws and policies governing the development and deployment of AI systems. The EU AI Act (2024) is the most comprehensive, classifying AI systems by risk level and imposing requirements accordingly. The US has taken a more sector-specific approach with executive orders and agency guidelines. China has regulations targeting generative AI, deepfakes, and recommendation algorithms.
Why it matters: Regulation shapes what AI companies can build, how they must build it, and what they must disclose. The EU AI Act affects any company serving European users. Understanding the regulatory landscape is increasingly necessary for anyone building or deploying AI — non-compliance can mean fines, bans, or liability.

Apple's on-device and cloud AI system, integrated across iPhone, iPad, and Mac. Apple Intelligence runs smaller models locally on Apple Silicon for privacy-sensitive tasks (text rewriting, summarization, image generation) and routes complex requests to Apple's Private Cloud Compute servers. It also integrates external models (like ChatGPT) with user consent for tasks beyond its own capabilities.
Why it matters: Apple Intelligence represents the consumer AI strategy of the world's most valuable company, reaching over a billion devices. Its emphasis on privacy (on-device processing, Private Cloud Compute with verifiable security) offers a different model than the cloud-first approach of OpenAI and Google. If Apple gets AI right, it normalizes on-device AI for billions of non-technical users.
An Israeli AI company known for Jamba, the first production-grade hybrid architecture that combines Transformer attention layers with Mamba SSM layers. AI21 was founded by AI researchers (including Yoav Shoham) and has been building language models since 2017, predating ChatGPT. Their models are available via API and through cloud providers.
Why it matters: AI21 Labs matters because Jamba proved that hybrid Transformer-SSM architectures work in practice, not just in research papers. By interleaving attention and Mamba layers, Jamba achieves a 256K context window with lower memory usage than pure Transformer models of similar quality. This hybrid approach may be the future of LLM architecture.
The convolutional neural network that won the 2012 ImageNet competition by a massive margin, triggering the deep learning revolution. Created by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, AlexNet cut the top-5 image classification error rate from roughly 26% to 15% — a gap so large it convinced the computer vision community that deep learning was fundamentally superior to hand-engineered features.
Why it matters: AlexNet is the "before and after" moment in AI history. Before 2012, most AI researchers worked on feature engineering and non-neural methods. After AlexNet, deep learning became the dominant paradigm. Every modern AI system — GPT, Claude, Stable Diffusion — traces its lineage to the paradigm shift that AlexNet triggered. It's the Big Bang of modern AI.
Amazon Web Services' managed platform for accessing and deploying foundation models from multiple providers (Anthropic, Meta, Mistral, Cohere, Stability AI, and Amazon's own Titan models) through a unified API. Bedrock handles model hosting, scaling, and fine-tuning, letting enterprises use AI without managing GPU infrastructure. It also provides guardrails, knowledge bases (RAG), and agent capabilities.
Bria is the clearest test case for whether AI image generation built entirely on licensed training data can remain commercially competitive. In an industry facing an avalanche of copyright lawsuits, their approach offers enterprises a path to adopting generative AI without legal risk — a value proposition that grows more attractive with every new lawsuit against a competitor. If Bria succeeds, it validates the entire premise of responsible AI development; if it falters, it suggests the market ultimately doesn't value data provenance enough to pay a premium for it.
A Transformer-based model from Google (2018) that revolutionized NLP by introducing bidirectional pre-training — every token can attend to every other token, giving the model deep contextual understanding. BERT is an encoder-only model: it excels at understanding text (classification, search, NER) but can't generate text like GPT or Claude.
Why it matters: BERT is the most influential NLP paper of the modern era. It proved that pre-training on unlabeled text then fine-tuning on specific tasks could crush every existing benchmark. Even though LLMs have stolen the spotlight, BERT-style models still power most production search engines, embedding systems, and classification pipelines because they're smaller, faster, and cheaper than LLMs for non-generative tasks.
Batch size is how many training examples the model processes before updating its parameters. An epoch is one complete pass through the entire training dataset. A model trained for 3 epochs on 1 million examples with batch size 1,000 processes 1,000 examples per update, takes 1,000 updates per epoch, and 3,000 updates total.
Why it matters: Batch size and epochs are the most fundamental controls in training. Batch size affects training speed, memory usage, and even what the model learns (small batches add noise that can help generalization; large batches converge faster but may generalize worse). Number of epochs determines how many times the model sees each example — too few and it underfits, too many and it overfits.
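The arithmetic from the example above, as a quick sanity check:

```python
dataset_size = 1_000_000   # training examples
batch_size   = 1_000       # examples per parameter update
epochs       = 3           # full passes over the dataset

updates_per_epoch = dataset_size // batch_size   # 1,000 updates per epoch
total_updates     = updates_per_epoch * epochs   # 3,000 updates total
print(updates_per_epoch, total_updates)          # 1000 3000
```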
The most common algorithm for building tokenizer vocabularies. BPE starts with individual bytes or characters and iteratively merges the most frequent adjacent pair into a new token. After thousands of merges, common words become single tokens ("the," "function") while rare words are split into subword pieces ("un" + "common"). Used by GPT, Claude, Llama, and most modern LLMs.
Why it matters: BPE is the reason your tokenizer works the way it does. It explains why common words are cheap (one token), why rare words are expensive (many tokens), and why non-English text costs more (fewer merges allocated to non-English character pairs). Understanding BPE helps you predict token counts, optimize prompts, and understand why different tokenizers produce different results for the same text.
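A toy sketch of the BPE training loop over a tiny corpus — real tokenizers operate on bytes and far larger corpora, but the merge logic is the same:

```python
from collections import Counter

def bpe_merges(words, num_merges=10):
    """words: dict mapping a word (as a tuple of symbols) to its corpus frequency."""
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_words = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] = freq
        words = new_words
    return merges, words

corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
print(bpe_merges(corpus, num_merges=5))
```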
A decoding strategy that maintains multiple candidate sequences (the "beam") simultaneously, expanding each by one token at each step and keeping only the top-scoring candidates. Unlike greedy decoding (always pick the best next token) or sampling (randomly pick), beam search explores multiple paths and finds the overall highest-probability sequence. Commonly used for translation and summarization.
Why it matters: Beam search shows that the locally best choice isn't always globally best. Greedy decoding might pick "The" as the first word when "In" would lead to a much better overall sentence. By keeping multiple candidates, beam search avoids committing too early. However, for open-ended generation (chat, creative writing), sampling produces more diverse and natural text than beam search.
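A compact beam-search sketch over a hypothetical `next_token_logprobs(sequence)` function (an assumption standing in for a real model); each beam carries its cumulative log-probability:

```python
def beam_search(next_token_logprobs, start_tokens, beam_width=3, max_steps=20, eos_id=0):
    """Keep the `beam_width` highest-scoring partial sequences at every step."""
    beams = [(0.0, list(start_tokens))]           # (cumulative log-prob, tokens)
    for _ in range(max_steps):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos_id:                 # finished sequences carry over unchanged
                candidates.append((score, seq))
                continue
            for token_id, logp in next_token_logprobs(seq).items():
                candidates.append((score + logp, seq + [token_id]))
        # Keep only the top `beam_width` candidates by total log-probability.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == eos_id for _, seq in beams):
            break
    return beams[0]                               # highest-probability sequence overall
```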
The algorithm that computes how much each parameter in a neural network contributed to the error, enabling gradient descent to update parameters efficiently. Backpropagation applies the chain rule of calculus in reverse through the network: starting from the loss at the output, it propagates gradients backward through each layer to determine each weight's share of the blame.
Why it matters: Backpropagation is the algorithm that makes neural network training possible. Without an efficient way to compute gradients for billions of parameters, gradient descent would be computationally infeasible. Every model you use — from a small classifier to a 400B LLM — was trained using backpropagation. It's the single most important algorithm in deep learning.
Cartesia matters because they proved that state space models aren't just an academic curiosity but a commercially viable architecture for real-time voice AI. Their sub-100-millisecond latency made genuinely natural conversational AI possible for the first time, closing the gap between "talking to a robot" and "talking to a person." As the industry shifts toward voice-first AI agents, Cartesia's architectural advantage in streaming speed could make them the infrastructure layer everyone else builds on.
A software interface that lets you interact with an AI model through conversation. Modern AI chatbots (Claude, ChatGPT, Gemini) are powered by large language models and can handle open-ended dialogue, answer questions, write code, and use tools.
Why it matters: Chatbots are how most people interact with AI. Understanding conversation history, system prompts, context windows, and token limits helps you use them more effectively.
The unresolved legal questions around AI and intellectual property: Can AI training on copyrighted data constitute fair use? Who owns AI-generated content? Can AI output infringe copyright?
Why it matters: Every major AI model was trained on copyrighted material. Ongoing lawsuits (The New York Times v. OpenAI, Getty Images v. Stability AI) will reshape the economics of AI training and whether creators get compensated.
An AI-native code editor built as a fork of VS Code, integrating LLMs deeply into the editing experience: inline code generation, multi-file editing, and codebase-aware context.
Why it matters: Cursor represents a bet that AI will fundamentally change how code is written. Its rapid adoption makes it one of the most tangible examples of AI changing knowledge work.
The task of assigning an input to one of a predefined set of categories. "Is this email spam or not?" (binary classification). "Is this image a cat, dog, or bird?" (multi-class). "Which of these tags apply to this article?" (multi-label). Classification is the most common supervised learning task and the foundation of countless real-world AI applications.
Why it matters: Classification is where most people first encounter machine learning in practice — spam filters, content moderation, medical diagnosis, fraud detection, sentiment analysis. Understanding classification helps you understand the entire supervised learning pipeline: labeled data in, trained model, predictions out.
A neural network architecture designed to process grid-like data (images, audio spectrograms) by sliding small filters (kernels) across the input to detect local patterns like edges, textures, and shapes. CNNs dominated computer vision from 2012 (AlexNet) until Vision Transformers emerged around 2020. They're still widely used in production, especially on edge devices.
Why it matters: CNNs kicked off the deep learning revolution. AlexNet's 2012 ImageNet victory proved that deep neural networks could dramatically outperform hand-engineered features, triggering the current AI boom. Understanding CNNs helps you understand why Transformers work (many of the same ideas — hierarchical features, parameter sharing — apply), and CNNs remain the best choice for many vision tasks on resource-constrained devices.
An alignment technique developed by Anthropic where a model is trained to follow a set of principles (a "constitution") rather than relying solely on human feedback for every decision. The model critiques and revises its own outputs based on these principles, then is trained on the revised outputs. This reduces the need for human labelers and makes the alignment criteria explicit and auditable.
Why it matters: Constitutional AI addresses two problems with RLHF: it's expensive (human labelers for every training example) and opaque (the criteria are implicit in labeler judgments). By making the principles explicit, CAI makes alignment more transparent, scalable, and consistent. It's a core part of how Claude is trained.
When a neural network trained on a new task loses its ability to perform previously learned tasks. Fine-tuning a model on customer support data might make it great at support but terrible at coding. The new learning overwrites the weights that encoded the old capabilities, "forgetting" them.
Why it matters: Catastrophic forgetting is the central challenge of fine-tuning and continual learning. It's why you can't just keep fine-tuning a model on task after task and expect it to do everything well. It's also why techniques like LoRA (which only modify a small subset of parameters) and careful learning rate selection are critical for preserving base model capabilities.
When benchmark test data appears in a model's training data, inflating its scores without reflecting genuine capability. If a model "studied the answer key" by seeing test questions during training, its benchmark performance is meaningless. Contamination is a growing problem as training datasets get larger and scrape more of the internet, where benchmark data is often published.
Why it matters: Contamination undermines the entire benchmark system that the AI industry uses to compare models. A model that scores 90% on MMLU because it memorized the answers isn't smarter than one scoring 80% that never saw them. As more benchmarks leak into training data, the community is forced to create new benchmarks constantly, and private held-out evaluations become more important than public leaderboards.
A crowdsourced platform (by LMSYS) where users chat with two anonymous AI models side-by-side and vote for which response is better. The results are used to compute Elo ratings — the same ranking system used in chess — creating a continuously updated leaderboard of model quality based on real human preferences rather than automated benchmarks.
Why it matters: Chatbot Arena is arguably the most trusted model comparison today because it's resistant to contamination (questions are novel), reflects real user preferences (not synthetic benchmarks), and pits models head-to-head (relative comparison is more reliable than absolute scores). When people say "Claude is better than GPT for coding" or vice versa, the Arena rankings are often the evidence.
A chip company that builds wafer-scale AI processors — chips the size of an entire silicon wafer, more than 50x larger than the biggest standard GPU dies. The Cerebras WSE-3 (Wafer Scale Engine) contains 4 trillion transistors and 900,000 cores. Their CS-3 systems are designed for both training and inference, offering an alternative to clusters of thousands of individual GPUs.
Why it matters: Cerebras represents the most radical rethinking of AI hardware. Instead of connecting thousands of small chips with limited bandwidth, they put everything on one massive chip with enormous on-chip memory bandwidth. The potential advantage is eliminating the communication bottleneck that limits multi-GPU training. Whether wafer-scale computing can compete with NVIDIA's massive ecosystem is the billion-dollar question.
A measure of similarity between two vectors based on the angle between them, ignoring their magnitude. Cosine similarity of 1 means the vectors point in the same direction (identical meaning). 0 means they're perpendicular (unrelated). -1 means opposite directions. It's the standard similarity metric for comparing text embeddings in semantic search, RAG, and recommendation systems.
Why it matters: Every time you do semantic search, use RAG, or compare embeddings, cosine similarity is (probably) the metric deciding what's "similar." Understanding it helps you debug retrieval quality, choose between cosine and alternatives (dot product, Euclidean distance), and understand why some searches miss obvious matches.
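Cosine similarity in numpy — the dot product of the two vectors divided by the product of their norms:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.2, 0.9, 0.1])
b = np.array([0.25, 0.80, 0.05])   # similar direction -> similarity near 1
c = np.array([-0.2, -0.9, -0.1])   # opposite direction -> similarity near -1
print(cosine_similarity(a, b), cosine_similarity(a, c))
```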
A model from OpenAI (2021) that learns to connect images and text by training on 400 million image-caption pairs. CLIP encodes images and text into the same embedding space, where matching image-text pairs are close together and non-matching pairs are far apart. It's the bridge between language and vision in most modern multimodal AI systems.
Why it matters: CLIP is the backbone of text-to-image generation (Stable Diffusion, DALL-E), image search, zero-shot image classification, and multimodal understanding. When you type a prompt and get an image, CLIP (or a descendant) is what connects your words to visual concepts. It proved that you can learn powerful visual representations from natural language supervision alone, without labeled image datasets.
An architecture that adds spatial control to image generation models. Instead of just describing what you want in text ("a person standing"), ControlNet lets you specify how — providing an edge map, depth map, pose skeleton, or segmentation map that guides the composition. The generated image follows the spatial structure of your control input while filling in details from the text prompt.
Why it matters: ControlNet made AI image generation usable for professional workflows. Without it, you get random compositions and hope for the best. With it, you specify the exact pose, layout, or structure you need. This is the difference between "generate something vaguely like what I want" and "generate exactly this composition with these details" — critical for design, advertising, and production work.
A self-supervised learning approach that trains models by contrasting positive pairs (similar items that should be close in embedding space) against negative pairs (dissimilar items that should be far apart). CLIP contrasts matching image-text pairs against non-matching ones. SimCLR contrasts augmented views of the same image against views of different images. The model learns representations where similarity in embedding space reflects real-world similarity.
Why it matters: Contrastive learning is how most embedding models are trained — the models that power semantic search, RAG, and recommendations. It's also the training approach behind CLIP, which connects language and vision. Any time you use embeddings to measure similarity, contrastive learning is likely how those embeddings were created.
A saved snapshot of a model's state during training — the weights, optimizer state, learning rate schedule, and training step. Checkpoints let you resume training after interruptions (hardware failure, preemption), evaluate intermediate versions of the model, and roll back to an earlier version if training degrades. Saving checkpoints every few thousand steps is standard practice.
Why it matters: Training large models takes days to months. Without checkpoints, a GPU failure at step 90,000 of a 100,000-step training run means starting over. Checkpoints are insurance: they save progress incrementally so you only lose work since the last checkpoint. They also enable model selection — sometimes an earlier checkpoint performs better on your evaluation metrics than the final one.
The ecosystem of libraries, frameworks, and platforms that make building AI-powered applications easier — orchestration frameworks (LangChain, LlamaIndex), inference servers (vLLM, llama.cpp), fine-tuning tools (Axolotl, Unsloth), evaluation frameworks (LMSYS, Braintrust), and full-stack platforms (Vercel AI SDK, Hugging Face). The tooling landscape changes every month.
DeepL is strong evidence that a focused AI company can consistently outperform competitors worth hundreds of billions of dollars at its core competency. In a field where bigger is usually assumed to be better, DeepL's edge over Google and Microsoft in translation quality remains measurable and meaningful, especially for European languages and professional use cases. Their success challenges the assumption that general-purpose AI models will eventually commoditize specialized tasks — for the tens of thousands of businesses that depend on accurate cross-language communication, that specialization is worth paying for.
Why it matters: Decart AI demonstrated technology most people assumed was years away: a neural network generating playable, interactive 3D worlds in real time, with no traditional game engine. Their Oasis demo is a proof of concept for AI-native world simulation, a technology whose implications reach far beyond gaming — from autonomous driving to robotics to spatial computing. If real-time world models reach production-grade usefulness, Decart's early work on inference optimization and interactive generation will be foundational.
Deepgram proved that a startup could build speech recognition from scratch with end-to-end deep learning and compete head-on with Google, Amazon, and Microsoft on accuracy while beating them on speed. Their developer-first API approach brought modern infrastructure patterns to voice AI, making it as easy to add transcription to an application as adding payments with Stripe. As conversational AI agents go mainstream, Deepgram is positioning itself as the critical voice infrastructure layer — the plumbing that makes voice-first AI actually work in production.
Training a smaller "student" model to mimic a larger "teacher" model by learning from the teacher's soft probability distributions rather than hard labels.
Why it matters: Distillation is how the industry makes powerful AI accessible. A 70B model distilled into 7B can capture 90% of the capability at 10% of the cost.
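A minimal PyTorch-style sketch of a distillation loss, assuming teacher and student logits for the same batch; the temperature and the mix with the hard-label loss are typical but illustrative values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of soft-target loss (match the teacher's distribution)
    and ordinary hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```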
An alternative to RLHF for aligning language models with human preferences. DPO directly optimizes the model using pairs of preferred and rejected responses, without needing a separate reward model.
Why it matters: DPO democratized alignment by collapsing RLHF's complex pipeline into a single training step. Many recent open-weight models use DPO instead of RLHF.
A structured collection of data used to train, evaluate, or test a machine learning model. Datasets can be labeled (each example has a known correct answer) or unlabeled (raw data without annotations). The quality, size, diversity, and representativeness of a dataset fundamentally determine what a model can learn.
Why it matters: Garbage in, garbage out. The most elegant architecture trained on a bad dataset will produce bad results. Conversely, a simple model trained on excellent data often outperforms a complex model trained on noise. Dataset curation is arguably the most impactful and least glamorous part of AI development.
A regularization technique that randomly "turns off" a fraction of neurons during each training step by setting their outputs to zero. This prevents the network from relying too heavily on any single neuron, forcing it to learn distributed, robust representations. At inference time, all neurons are active but scaled accordingly.
Why it matters: Dropout is the simplest and most widely-used defense against overfitting. Without regularization, large neural networks memorize training data instead of learning generalizable patterns. Dropout (and its cousin weight decay) are why models can be much larger than their training sets without just memorizing everything.
An architecture that replaces the U-Net backbone traditionally used in diffusion models with a Transformer. DiT applies the attention mechanism to image generation, enabling the same scaling behavior that made LLMs so powerful. Sora, Flux, Stable Diffusion 3, and most state-of-the-art image and video generators use DiT or variants.
Why it matters: DiT unified the worlds of language and image generation under a single architectural paradigm: the Transformer. This means the scaling laws, training techniques, and optimization strategies developed for LLMs largely transfer to image and video generation. It's why image quality has improved so rapidly — the field is riding the same scaling curve as language.
Techniques that artificially expand a training dataset by creating modified versions of existing examples. For images: flipping, rotating, cropping, color shifting. For text: paraphrasing, back-translation, synonym substitution. For audio: speed changes, noise injection. The goal is to teach the model invariances — a cat is a cat whether the image is flipped, darkened, or cropped.
Why it matters: Data augmentation is the cheapest way to improve model performance when you have limited data. It reduces overfitting by showing the model many variations of each example, teaching it to focus on essential features rather than superficial details. In computer vision, augmentation routinely provides 2–5% accuracy improvements for free.
Running AI models directly on end-user devices — phones, laptops, cars — rather than in the cloud. Your data never leaves your device, latency is near-zero, and it works offline.
Why it matters: Edge AI is where privacy, latency, and cost intersect. A fast 3B model on your phone beats a slow 400B model in a data center for many tasks.
A model architecture with an encoder that compresses input and a decoder that generates output from it. T5 and BART are encoder-decoder. GPT/Claude/Llama are decoder-only. BERT is encoder-only.
Why it matters: Understanding encoder-decoder vs. decoder-only explains why different models excel at different tasks and why the field converged on decoder-only for LLMs.
A lookup table that maps each token in the vocabulary to a dense vector (the token's embedding). When the model receives token ID 42, the embedding layer returns row 42 of a learned matrix. This vector is the model's initial representation of that token — the starting point for all subsequent processing through attention and feedforward layers.
Why it matters: The embedding layer is where text becomes math. Every LLM starts by converting discrete tokens (words, subwords) into continuous vectors that the neural network can process. The embedding table is also one of the largest components of small models — a 128K vocabulary with 4096-dimensional embeddings is over 500 million parameters. Understanding this helps you reason about model sizes and vocabulary design.
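A sketch using PyTorch's nn.Embedding; the vocabulary size and dimension are the example figures from the paragraph above:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 128_000, 4096
embed = nn.Embedding(vocab_size, d_model)     # a learned lookup table

token_ids = torch.tensor([[42, 7, 1023]])     # one sequence of three token IDs
vectors = embed(token_ids)                    # shape: (1, 3, 4096)

print(vectors.shape)
print(sum(p.numel() for p in embed.parameters()))  # 128_000 * 4096 = 524,288,000 weights
```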
Providing example input-output pairs in your prompt to teach the model a pattern. Zero-shot = no examples, one-shot = one, few-shot = 2–10. The model learns the pattern without any training.
Why it matters: Few-shot is the fastest, cheapest way to customize model behavior. It works because LLMs are extraordinary pattern matchers — one of the most surprising capabilities to emerge from scale.
A generative technique that transforms noise into data by following smooth, direct paths. Fewer steps than diffusion models for comparable quality, making generation faster.
Why it matters: Flow matching is replacing diffusion for image and video generation. Flux, Stable Diffusion 3, and several video models use it. Fewer steps = faster inference = lower costs.
A structured way for AI models to request execution of external functions during a conversation. You define functions with names, descriptions, and parameter schemas. When the model determines a function would help answer a query, it outputs a structured function call (with arguments) instead of text. Your code executes the function and returns the result for the model to incorporate.
Why it matters: Function calling is what turns a chatbot into an agent. Without it, a model can only generate text. With it, a model can search databases, call APIs, run calculations, book appointments, send emails — anything you can expose as a function. It's the mechanism behind every AI assistant that actually does things rather than just talking about them.
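A sketch of the function-calling loop. The schema loosely follows the JSON-Schema style most providers use, but the field names and the dispatch logic here are illustrative assumptions, not any specific vendor's API:

```python
import json

# 1. Describe the function the model is allowed to call.
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 18, "condition": "cloudy"}   # stand-in implementation

def handle(model_response):
    # 2. If the model emitted a structured call instead of text, execute it ...
    if model_response.get("type") == "function_call":
        args = json.loads(model_response["arguments"])
        result = get_weather(**args)
        # 3. ... and return the result so the model can produce its final answer.
        return {"role": "tool", "content": json.dumps(result)}
    return model_response["content"]
```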
Floating Point Operations — the standard measure of computational work in AI. Training a model requires a certain number of FLOPs (total operations). Hardware is rated in FLOP/s (operations per second). An H100 GPU delivers roughly 1,000 TFLOP/s of dense FP16 compute (about 2,000 with structured sparsity) — on the order of a quadrillion operations per second. GPT-4's training is estimated at ~10^25 FLOPs — a number so large it's hard to comprehend.
Why it matters: FLOPs are the currency of AI compute. Scaling laws are expressed in FLOPs. Training budgets are measured in FLOPs. GPU comparisons use FLOP/s. Understanding FLOPs helps you estimate training costs, compare hardware, and understand why AI progress is so closely tied to compute scaling. When people say "scaling compute," they mean spending more FLOPs.
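A back-of-the-envelope estimate using the common approximation that training costs roughly 6 FLOPs per parameter per training token (the "6ND" rule); the model size, token count, and cluster figures below are illustrative assumptions:

```python
params = 70e9          # 70B-parameter model (illustrative)
tokens = 15e12         # 15 trillion training tokens (illustrative)

train_flops = 6 * params * tokens          # ~6.3e24 FLOPs
gpu_flops_per_sec = 1e15                   # ~1 PFLOP/s peak per GPU (rough)
gpus, utilization = 10_000, 0.4            # cluster size and realistic efficiency

seconds = train_flops / (gpus * gpu_flops_per_sec * utilization)
print(f"{train_flops:.1e} FLOPs, ~{seconds / 86400:.0f} days on {gpus} GPUs")
```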
The algorithm that trains neural networks by iteratively adjusting parameters to reduce the loss. Computes how much each parameter contributed to the error and nudges it in the direction that reduces it.
Why it matters: Every model you use was trained by gradient descent. Understanding it explains why learning rate matters, why training can diverge, and why optimizers like Adam work.
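The core update rule in a few lines of numpy, minimizing a toy quadratic loss; real training does the same thing with billions of parameters and gradients computed by backpropagation:

```python
def loss(w):            # toy loss with its minimum at w = 3
    return (w - 3.0) ** 2

def grad(w):            # derivative of the loss with respect to w
    return 2.0 * (w - 3.0)

w, learning_rate = 10.0, 0.1
for step in range(50):
    w -= learning_rate * grad(w)      # nudge w opposite to the gradient
print(w)                              # ~3.0 after enough steps
```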
A chip company building custom AI inference processors (LPUs) purpose-built for sequential token generation, achieving 500–800 tokens/sec — often 10x faster than GPU alternatives.
Why it matters: Groq demonstrated that LLM inference doesn't have to be slow. Their speed comes from hardware, not software, suggesting GPUs may not be the long-term winner for inference.
An attention variant where multiple query heads share a single key-value head, reducing the KV cache size without significantly reducing quality. Instead of every query head having its own K and V projections (standard MHA), groups of query heads share K and V projections. Llama 2 70B, Mistral, Gemma, and most modern LLMs use GQA.
Why it matters: GQA is the practical solution to the KV cache memory problem. Standard multi-head attention with 64 heads needs 64 sets of K and V tensors per layer in the cache. GQA with 8 KV heads reduces this to 8 sets — an 8x memory reduction. This directly translates to serving more concurrent users or handling longer contexts on the same hardware.
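The memory arithmetic behind that 8x figure, per token and per layer, assuming fp16 (2 bytes) and 128-dimensional heads:

```python
head_dim, bytes_per_value = 128, 2      # fp16

def kv_bytes_per_token_per_layer(num_kv_heads):
    # 2 tensors (K and V) x heads x head_dim x bytes per value
    return 2 * num_kv_heads * head_dim * bytes_per_value

mha = kv_bytes_per_token_per_layer(64)  # every query head has its own K/V: 32 KB
gqa = kv_bytes_per_token_per_layer(8)   # 8 shared KV heads: 4 KB
print(mha, gqa, mha // gqa)             # 32768 4096 8
```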
A memory-saving technique that trades compute for memory during training. Instead of storing all intermediate activations from the forward pass (needed for backpropagation), gradient checkpointing only stores activations at certain "checkpoint" layers and recomputes the others during the backward pass. This reduces memory usage by up to 5–10x at the cost of ~30% more compute.
Why it matters: Gradient checkpointing is what makes it possible to fine-tune large models on limited GPU memory. Without it, a 7B model might need 80+ GB just for activations during training, exceeding a single GPU's capacity. With gradient checkpointing, the same model can be fine-tuned on a 24GB consumer GPU. It's the most commonly used memory optimization for training.
Why it matters: HeyGen turned AI video avatars from a research topic into a genuine enterprise tool, proving that making video content creation as easy as writing a document can generate real revenue. Their lip-synced dubbing technology matters especially for global businesses — it collapses the cost and time of video localization from weeks and thousands of dollars to minutes and cents. As one of the few AI video companies with solid recurring revenue, HeyGen is also a case study in building a real business on generative AI rather than just a demo.
The central hub of open-source AI. Hosts 500K+ models, 100K+ datasets, the Transformers library, and Spaces for demos. To AI what GitHub is to code.
Why it matters: If you work with open-weight models, you use Hugging Face. Every Llama, Mistral, and Qwen download comes from there. The Transformers library is the de facto standard.
A specific two-attention-head circuit discovered in Transformers that implements in-context learning by pattern matching. If the model has seen the pattern "A B" earlier in the context and now sees "A" again, the induction head predicts "B" will follow. This simple mechanism is believed to be a fundamental building block of how LLMs learn from examples in their context.
Why it matters: Induction heads are the best-understood circuit in mechanistic interpretability — a concrete example of how Transformers implement a useful algorithm from learned weights. They explain why few-shot prompting works: when you give examples, induction heads detect the pattern and apply it. Understanding induction heads provides a foundation for understanding more complex learned behaviors.
Jina AI built the embedding and retrieval infrastructure that thousands of RAG systems depend on, proving that focused search tooling can beat trying to do everything. Their long-context embedding models and Reader API solve two of the hardest practical problems in AI-powered search — accurately representing long documents and extracting clean text from messy web pages — while keeping their core models open source. In an ecosystem dominated by general-purpose labs, Jina shows that doing one thing well and making it dead simple for developers can be a real business.
A memory optimization storing previously computed attention key/value tensors so they don't need recomputation for each new token. Trades memory for speed.
Why it matters: The KV cache is why LLM inference is memory-bound. A 100K context can consume tens of GB for cache alone. It's why long contexts cost more and why paged attention matters.
Techniques for modifying specific facts in a trained model without retraining it. If a model incorrectly states "The president of France is Macron" after a new election, knowledge editing can update this specific fact by modifying targeted weights, without affecting the model's other knowledge or capabilities. The goal is surgical precision: change one fact, leave everything else intact.
Why it matters: Knowledge editing addresses a practical problem: models become outdated, and retraining is expensive. If you could update specific facts cheaply, models could stay current between major training runs. It also has safety implications: could you edit out dangerous knowledge? The field is promising but immature — edits often have unintended side effects on related knowledge.
An MIT spinoff exploring fundamentally different neural network architectures inspired by biological neural circuits. Their Liquid Foundation Models use continuous-time dynamics instead of fixed-weight Transformers, promising greater efficiency and adaptability.
Why it matters: Liquid AI represents the best-funded challenge to the assumption that Transformers are the only architecture that matters. By building production-grade foundation models on biologically inspired continuous-time dynamics, they are testing whether the AI industry's all-in bet on attention was premature. Even if LFMs never displace Transformers outright, their efficiency advantages for edge deployment and long-sequence processing could carve out crucial niches in robotics, mobile AI, and embedded systems — markets where running a 70B-parameter Transformer simply isn't feasible.
Luma AI democratized AI video generation much as Stable Diffusion democratized image generation — by making it free, fast, and accessible to anyone with a browser. Their evolution from a 3D capture startup into a leading video generation company gives them unusual technical depth in spatial understanding, making them one of the few companies positioned to bridge AI video, 3D content, and the next generation of immersive media formats.
A mathematical function measuring how wrong a model's predictions are. For LLMs, cross-entropy loss measures how surprised the model is by the actual next token. Training minimizes this number.
Why it matters: The loss function is the compass of training. Everything a model learns serves to reduce it. Understanding loss helps you interpret training curves and diagnose problems.
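Cross-entropy loss for a single next-token prediction, in numpy — the negative log-probability the model assigned to the token that actually occurred:

```python
import numpy as np

def cross_entropy(logits, target_id):
    # softmax over the vocabulary, then -log of the true token's probability
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return -np.log(probs[target_id])

logits = np.array([2.0, 0.5, -1.0, 0.1])   # raw scores over a 4-token vocabulary
print(cross_entropy(logits, target_id=0))  # small loss: the model favored the right token
print(cross_entropy(logits, target_id=2))  # large loss: the model was "surprised"
```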
Why it matters: LSTMs were the backbone of 2010s NLP: machine translation, speech recognition, text generation, and sentiment analysis all ran on them. Understanding LSTMs helps you understand why Transformers replaced them (parallelism and long-range attention vs. sequential processing and a compressed state) and why SSMs like Mamba are interesting (they revisit the gated-state idea with modern improvements).
A strategy for changing the learning rate during training rather than keeping it constant. Most modern training uses warmup (gradually increase from near-zero to peak) followed by decay (gradually decrease toward zero). Cosine annealing is the most common decay schedule. The learning rate controls how large each gradient update step is — arguably the most important hyperparameter in training.
Why it matters: Getting the learning rate schedule right can make or break a training run. Too high and the model diverges (loss spikes, training fails). Too low and it trains too slowly or gets stuck. The schedule interacts with batch size, model size, and data — there's no universal setting. Understanding learning rate schedules helps you interpret training curves and diagnose training issues.
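A typical warmup-then-cosine schedule written as a plain function of the training step; the peak rate and step counts are illustrative:

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2_000, total_steps=100_000, min_lr=3e-5):
    if step < warmup_steps:                       # linear warmup from ~0 up to the peak
        return peak_lr * step / warmup_steps
    # cosine decay from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 1_000, 2_000, 50_000, 100_000):
    print(s, lr_at(s))
```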
Meta AI fundamentally changed the economics of AI by proving that frontier-class models can be released as open weights. Llama and its derivatives power thousands of applications, startups, and research projects that would never have had access to models of this caliber. PyTorch underpins most of the world's AI research and production systems. And with more than 3 billion users across its apps, Meta has distribution no other AI lab can match — when they ship an AI feature, it reaches a third of the planet overnight.
Mistral proved that you don't need a US hyperscaler's budget to build frontier AI models. Their efficient architectures — especially their early work on sparse Mixture of Experts — influenced how the whole industry thinks about model design, and their open-weight releases gave developers worldwide access to high-quality models without depending on an API. As the first European AI company to reach genuine frontier competitiveness, Mistral also carries strategic weight: their success or failure will shape whether Europe is a participant in AI or merely its regulator.
A selective state space model architecture challenging the Transformer. Achieves competitive performance with linear scaling in sequence length by maintaining a compressed, selectively updated hidden state.
Why it matters: Mamba is the most credible challenge to Transformer dominance. Linear-time processing with comparable quality would mean longer contexts, faster inference, lower costs. Hybrid architectures are already shipping.
Reverse-engineering what happens inside neural networks at the level of neurons, circuits, and features — not just what the model outputs, but how it computes those outputs.
Why it matters: If we trust AI with important decisions, we need to understand how it makes them. Researchers have identified specific circuits inside Transformers. Central to Anthropic's safety research.
An AI image generation company known for aesthetically refined output. Operates through Discord and web. Runs profitably with a small team focused on artistic quality over benchmarks.
Why it matters: The most popular AI image generator for creative use. Proves that AI success isn't just about architecture; curation and user experience matter enormously.
The infrastructure and software that runs trained AI models in production, handling incoming requests, managing GPU memory, batching for efficiency, and returning responses. Model serving frameworks like vLLM, TGI (Text Generation Inference), and TensorRT-LLM handle the complex engineering of making LLM inference fast and cost-effective at scale.
Why it matters: The gap between "I have a model" and "I can serve 10,000 users simultaneously" is enormous. Model serving frameworks solve GPU memory management, request scheduling, KV cache optimization, and continuous batching — problems that are hard to solve from scratch. Choosing the right serving stack is one of the highest-leverage decisions in production AI.
Running multiple attention operations in parallel, each with its own learned projection of the queries, keys, and values. Instead of one attention function looking at the full model dimension, multi-head attention splits the dimension into multiple "heads" (e.g., 32 heads of 128 dimensions each for a 4096-dimension model). Each head can focus on different types of relationships simultaneously.
Why it matters: Multi-head attention is why Transformers are so expressive. One head might focus on syntactic relationships (subject-verb), another on positional patterns (nearby words), another on semantic similarity. This parallel specialization lets the model capture many types of dependencies simultaneously, which a single attention head can't do as effectively.
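A shape-level PyTorch sketch of splitting a 4096-dimensional model into 32 heads of 128 dimensions; learned projections and causal masking are omitted to keep the focus on the head split:

```python
import torch

batch, seq_len, d_model, n_heads = 1, 10, 4096, 32
head_dim = d_model // n_heads                      # 128 dimensions per head

x = torch.randn(batch, seq_len, d_model)
# Reshape so each head attends over its own 128-dim slice, in parallel.
q = x.view(batch, seq_len, n_heads, head_dim).transpose(1, 2)   # (1, 32, 10, 128)
k, v = q.clone(), q.clone()                        # real models use separate learned projections

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5 # (1, 32, 10, 10): one attention map per head
weights = scores.softmax(dim=-1)
out = (weights @ v).transpose(1, 2).reshape(batch, seq_len, d_model)
print(out.shape)                                   # back to (1, 10, 4096)
```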
A self-supervised training objective where random tokens in the input are replaced with a [MASK] token, and the model must predict the original tokens from context. BERT popularized MLM: mask 15% of tokens, use bidirectional attention to look at both left and right context, and predict the masked words. This creates powerful text understanding models (as opposed to text generation models).
Why it matters: MLM is the training objective that created BERT and the entire family of encoder models that still power most production search, classification, and embedding systems. Understanding MLM vs. causal language modeling (next-token prediction) explains the fundamental split between understanding models (BERT) and generation models (GPT) — and why each excels at different tasks.
Combining the weights of multiple fine-tuned models into a single model without any additional training. If model A is great at coding and model B is great at creative writing, merging them can produce a model that's good at both. Popular merging methods include SLERP (spherical interpolation), TIES (resolving sign conflicts), and DARE (randomly dropping parameters before merging).
Why it matters: Model merging is the open-source community's secret weapon. It costs zero compute (just math on weight tensors) and can produce models that outperform their components. Many top models on the Open LLM Leaderboard are merges. It's also how practitioners combine multiple LoRA fine-tunes into a single versatile model. Understanding merging unlocks a powerful, free capability for anyone working with open models.
Techniques that stabilize neural network training by normalizing the values flowing through the network to have consistent scale. Layer Normalization (LayerNorm) normalizes across features within each example. RMSNorm is a simplified variant. Batch Normalization (BatchNorm) normalizes across the batch. Every Transformer uses some form of normalization between layers.
Why it matters: Without normalization, deep networks are extremely difficult to train — activations can explode or vanish across layers, making gradient descent unstable. Normalization is one of those unglamorous techniques that is absolutely essential: remove it from any modern architecture and training collapses.
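RMSNorm, the variant most recent LLMs use, in a few lines of PyTorch; LayerNorm additionally subtracts the mean and adds a learned bias:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))   # learned per-feature scale
        self.eps = eps

    def forward(self, x):
        # Divide by the root-mean-square of the features, then rescale.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.weight * x / rms

x = torch.randn(2, 5, 4096)
print(RMSNorm(4096)(x).shape)
```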
OpenAI is the most influential organization in bringing AI from research labs into mainstream consciousness. ChatGPT was generative AI's iPhone moment — the product that made hundreds of millions of people intuitively grasp what large language models could do. Their API built the infrastructure layer on which thousands of AI startups were founded, and the GPT series established scaling as the dominant paradigm in AI research for years. Even OpenAI's controversies — the governance crisis, the shift from nonprofit to for-profit, the departures of safety-focused researchers — have shaped the broader debate about how AI companies should be structured and governed.
A measurement of how well a language model predicts text. Represents how many tokens the model is choosing between at each step. Lower = better predictions.
Why it matters: The most fundamental metric for comparing language models. But perplexity alone doesn't tell you if a model is helpful or safe.
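Perplexity is just the exponential of the average cross-entropy loss per token — a quick conversion:

```python
import math

avg_loss_nats = 2.3      # mean cross-entropy per token, in nats (illustrative)
perplexity = math.exp(avg_loss_nats)
print(perplexity)        # ~10: the model is about as uncertain as choosing among 10 tokens
```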
The text you give to an AI model to get a response. A prompt can be a question, an instruction, a creative brief, or code you want explained. Its quality directly shapes the output.
Why it matters: The prompt is the interface. A vague prompt gets a vague answer; a specific one extracts expert-level output from the same model. Step one of using AI effectively.
A mechanism that tells a Transformer model the order of tokens in a sequence. Unlike RNNs which process tokens sequentially (so position is implicit), Transformers process all tokens in parallel and have no inherent sense of order. Positional encodings inject position information so the model knows that "dog bites man" and "man bites dog" are different.
Why it matters: Without positional information, a Transformer treats a sentence as a bag of words — word order is lost. The choice of positional encoding also determines how well a model handles sequences longer than those seen during training, which is why techniques like RoPE and ALiBi are critical for long-context models.
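The original sinusoidal encoding from "Attention Is All You Need" in numpy; modern models more often use RoPE, but the idea of injecting position as a deterministic signal is the same:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe                                            # added to the token embeddings

print(sinusoidal_positions(seq_len=8, d_model=16).shape)
```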
A memory management technique for KV cache that borrows from operating system virtual memory. Instead of allocating a contiguous block of GPU memory for each request's KV cache (which wastes memory through fragmentation), PagedAttention stores cache in non-contiguous blocks ("pages") that are allocated on demand and can be shared across requests with common prefixes.
Why it matters: PagedAttention is the innovation behind vLLM and is now adopted by most LLM serving frameworks. It increased serving throughput by 2–4x compared to naive implementations by eliminating memory waste from fragmentation. Without it, serving long-context models to many concurrent users would be dramatically more expensive.
An operation that reduces the spatial dimensions of data by summarizing a region into a single value. Max pooling takes the maximum value in each region. Average pooling takes the mean. In CNNs, pooling layers downsample feature maps between convolutional layers. In Transformers, pooling combines token representations into a single vector (e.g., for classification).
Why it matters: Pooling is how neural networks go from local features to global understanding. A CNN might start with 224×224 feature maps and pool down to 7×7 by the final layer, progressively summarizing spatial information. In NLP, mean pooling over token embeddings is the standard way to create a single sentence embedding from a sequence of token representations.
Resemble AI matters because they recognized early that voice cloning without safety infrastructure is a liability, not a product. By shipping deepfake detection and neural watermarking alongside their voice synthesis tools, they set a template for responsible voice AI that the rest of the industry is now scrambling to follow. As regulation of synthetic media tightens worldwide, Resemble's head start on provenance and consent verification positions them as the voice AI company enterprises can actually trust.
Runway is the company that turned AI video generation from a research curiosity into a filmmaking tool, staying ahead through a relentless cadence of model releases even as well-funded competitors entered the field. Their creative-tools-first philosophy — rooted in artists, not just engineers — gives them an understanding of professional workflows that pure research labs struggle to replicate, and their bet on building a full platform rather than just a model may prove to be the right long-term play.
A neural network that processes sequences by maintaining a hidden state that gets updated at each step — it "remembers" what it's seen so far. LSTMs and GRUs are improved variants that solve the original RNN's tendency to forget long-range dependencies. RNNs dominated NLP and speech before Transformers replaced them around 2018–2020.
Why it matters: RNNs are the ancestors of modern language models. Understanding why they failed (slow sequential processing, difficulty with long-range dependencies) explains why Transformers succeeded (parallel processing, attention over all positions). The SSM/Mamba architecture is, in some ways, a return to the RNN idea with modern fixes.
A connection that bypasses one or more layers by adding the input directly to the output: output = layer(x) + x. Instead of each layer learning a complete transformation, it only needs to learn the "residual" — the difference from the identity function. Residual connections are in every Transformer layer and are essential for training deep networks.
Why it matters: Without residual connections, deep networks are nearly impossible to train — gradients vanish or explode across many layers. Residual connections provide a gradient highway that lets information (and gradients) flow directly from early layers to late layers, bypassing any number of intermediate transformations. They're why we can train 100+ layer networks at all.
A variant of RLHF where the preference labels come from an AI model instead of human annotators. A strong AI model compares response pairs and indicates which is better, providing the feedback signal for reinforcement learning. This scales alignment beyond the bottleneck of human labeling while maintaining reasonable quality.
Why it matters: RLAIF is how alignment scales. Human annotation is expensive ($10–50+ per hour), slow, and inconsistent. AI feedback is instant, cheap, and tireless. Constitutional AI (Anthropic) uses RLAIF as a core component — an AI critiques responses against principles, providing preference data at scale. The key question is whether AI feedback is good enough: it bootstraps from human judgment but may inherit and amplify biases.
A critique of large language models arguing that they are merely sophisticated pattern matchers, stitching together plausible text without any understanding of meaning. The term was coined by Emily Bender, Timnit Gebru, and colleagues in their influential 2021 paper "On the Dangers of Stochastic Parrots," which warned that LLMs encode biases from their training data, consume enormous resources, and create an illusion of understanding that leads users to trust them more than they should.
Sarvam AI offers the most credible answer to a question the global AI industry has largely ignored: who builds foundation models for the languages actually spoken by a fifth of the world's population? With deep roots in India's AI research community and government institutions, and a product architecture built for India's linguistic diversity, Sarvam represents both a commercial opportunity and a strategic imperative. Their success or failure will show whether the AI revolution is genuinely global or remains an English-first phenomenon with translation bolted on.
Empirical power-law relationships: model performance improves predictably with more parameters, data, and compute. You can estimate how good a model will be before spending millions training it.
Why it matters: Scaling laws turned training from guesswork into engineering. They also explain the AI arms race: predictable returns on compute investment drive ever-larger clusters.
Training where the model generates its own supervision from unlabeled data by hiding part of the input and predicting it. For LLMs: predict the next token.
Why it matters: The breakthrough that made modern AI possible. Unlocked training on the entire internet instead of expensive hand-labeled datasets.
A small draft model generates candidate tokens, then the large model verifies them all at once. Correct guesses (common for predictable tokens) accept multiple tokens in one step.
Why it matters: Speeds up inference 2–3x with zero quality loss — the output is mathematically identical to the large model alone. One of the few free lunches in AI.
Sending model output token by token as generated, via Server-Sent Events. This is why chat interfaces show text appearing word by word rather than all at once.
Why it matters: A response building word by word feels fine. The same response after seconds of blank screen feels broken. Streaming also lets users interrupt bad responses early.
Getting AI to respond in a machine-parseable format like JSON. Most providers support this natively: you define a schema and the model's output is constrained to conform to it.
Why it matters: The moment you build an application (not just a chatbot), you need structured output. Your code can't parse free-form text. This makes AI usable as a software component.
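What the request typically involves: a JSON Schema describing the shape you want back, so the response can be parsed directly. The schema below is illustrative; exact parameter names differ between providers:

```python
import json

# Schema for the structure we want the model to return.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "due_date": {"type": "string", "format": "date"},
    },
    "required": ["vendor", "total", "due_date"],
}

# Hypothetical raw model output constrained to that schema:
raw = '{"vendor": "Acme Corp", "total": 1299.5, "due_date": "2025-03-01"}'
invoice = json.loads(raw)          # parses cleanly because the output is guaranteed JSON
print(invoice["total"])
```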
Training from labeled examples where the correct answer is provided. The model adjusts to minimize the difference between its predictions and the known answers.
Why it matters: The workhorse behind most practical ML: spam filters, medical imaging, fraud detection, and LLM fine-tuning. When you have labeled data, start here.
Training data generated by AI models rather than collected from real sources. A frontier model generates examples used to train or fine-tune other models.
Why it matters: Reshaping AI development because real labeled data is expensive. A frontier model can generate millions of examples overnight. Quality control is critical — bad synthetic data amplifies errors.
A function that converts a vector of raw numbers (logits) into a probability distribution — all values become positive and sum to 1. Softmax amplifies the differences between values: the largest input gets the highest probability, and smaller inputs get exponentially smaller probabilities. It appears in attention mechanisms, classification outputs, and token prediction.
Why it matters: Softmax is everywhere in modern AI. Every time a language model predicts the next token, softmax converts raw model outputs into probabilities. Every attention head uses softmax to compute attention weights. Every classifier uses softmax to produce class probabilities. Understanding softmax helps you understand temperature, top-p sampling, and why models are "confident" even when wrong.
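Softmax with a temperature knob in numpy — the same function behind both attention weights and next-token probabilities:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                       # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
print(softmax(logits))                   # moderate preference for the first option
print(softmax(logits, temperature=0.5))  # sharper: low temperature exaggerates differences
print(softmax(logits, temperature=2.0))  # flatter: high temperature evens things out
```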
The largest AI data labeling company, providing the human-annotated training data that most major AI models rely on. Scale AI labels images, text, video, and 3D data for autonomous driving, government, and AI companies. They also offer evaluation services, RLHF data collection, and data curation for fine-tuning. Major customers include OpenAI, Meta, the US Department of Defense, and numerous self-driving car companies.
Why it matters: Scale AI occupies a critical position in the AI supply chain: between raw data and trained models. The quality of labeled data directly determines model quality, and Scale is the largest provider. Their RLHF data collection services mean they help shape how AI models are aligned — the human preferences that train Claude, GPT, and others often come through labeling platforms like Scale.
An attention mechanism where a sequence attends to itself — every token computes its relevance to every other token in the same sequence. The queries, keys, and values all come from the same input. This lets each token gather information from all other tokens, weighted by relevance. Self-attention is the core operation in every Transformer layer.
Why it matters: Self-attention is what makes Transformers work. It replaced the sequential processing of RNNs with parallel, direct connections between all positions. The word "bank" in "river bank" attends to "river" to resolve its meaning, regardless of how far apart they are. This ability to directly connect any two positions is why Transformers handle long-range dependencies so well.
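Scaled dot-product self-attention for a single head, in PyTorch; real layers add learned Q/K/V projections, multiple heads, and a causal mask:

```python
import torch

def self_attention(x):
    """x: (seq_len, d) token representations attending to each other."""
    d = x.shape[-1]
    scores = x @ x.T / d ** 0.5            # every token's relevance to every other token
    weights = scores.softmax(dim=-1)       # rows sum to 1
    return weights @ x                     # each output is a relevance-weighted mix of all tokens

x = torch.randn(6, 64)                     # 6 tokens, 64-dim representations
print(self_attention(x).shape)             # (6, 64)
```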
A neural network trained to reconstruct a model's internal activations through a bottleneck with a sparsity constraint — only a few features can be active at once. The learned features often correspond to interpretable concepts (specific topics, linguistic patterns, reasoning strategies), making SAEs the primary tool for disentangling the superposed features inside large language models.
Why it matters: Sparse autoencoders are the microscope of mechanistic interpretability. LLMs pack thousands of features into each layer through superposition, making individual neurons uninterpretable. SAEs decompose these superposed representations into individual, interpretable features. Anthropic used SAEs to identify millions of features in Claude, including features for deception, specific concepts, and safety-relevant behaviors.
A gated activation function used in the feedforward layers of modern Transformers. SwiGLU combines the SiLU/Swish activation with a gating mechanism: SwiGLU(x) = SiLU(x·W1) ⊗ (x·W3), where ⊗ is element-wise multiplication. This lets the network learn what information to pass through, consistently outperforming standard ReLU or GELU feedforward layers.
Why it matters: SwiGLU is the feedforward activation used by LLaMA, Mistral, Qwen, Gemma, and most modern LLMs. Understanding it helps you read model architectures and explains why modern FFN layers have three weight matrices instead of two. It's a small architectural choice with outsized impact on model quality.
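A PyTorch sketch of the full feedforward block in the Llama style, showing why there are three weight matrices; the hidden size is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model=4096, d_hidden=11008):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)   # "gate" projection
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)   # "up" projection
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)   # "down" projection

    def forward(self, x):
        # SwiGLU: SiLU(x W1) gates (x W3), then W2 projects back down.
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(1, 8, 4096)
print(SwiGLUFFN()(x).shape)     # (1, 8, 4096)
```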
A mathematical function that squashes any real number into the range (0, 1): σ(x) = 1 / (1 + e^(−x)). Historically the default activation function in neural networks, now largely replaced by ReLU and GELU for hidden layers but still used for binary classification outputs, gating mechanisms (in LSTMs and GLU), and attention-like operations where you need values between 0 and 1.
Why it matters: Sigmoid appears everywhere in AI even though it's no longer the default hidden activation. LSTM gates use sigmoid. The SiLU/Swish activation is x · sigmoid(x). Binary classifiers use sigmoid as the output activation. Understanding sigmoid — and why it was replaced by ReLU for hidden layers — is foundational knowledge for understanding neural network design choices.
The algorithm converting raw text into tokens before the model sees it. Different models use different tokenizers — the same sentence tokenizes differently for Claude, GPT, and Llama.
Why it matters: The invisible layer between your text and the model. Determines why some languages cost more, why code uses context faster than prose, and why you hit unexpected context limits.
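A quick way to see tokenization in action, using OpenAI's tiktoken library as one example (other models ship their own tokenizers, so counts will differ):

```python
import tiktoken   # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")        # the encoding used by GPT-4-era models
for text in ["hello world", "internationalization", "变压器模型"]:
    ids = enc.encode(text)
    print(f"{text!r}: {len(ids)} tokens -> {ids}")
```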
Using knowledge learned from one task or dataset to improve performance on a different but related task. Instead of training from scratch every time, you start with a model that already understands general patterns (language structure, visual features) and adapt it to your specific need. Pre-training then fine-tuning is the dominant paradigm in modern AI.
Why it matters: Transfer learning is why AI became practical. Training a language model from scratch costs millions of dollars. Fine-tuning a pre-trained model on your specific task costs tens of dollars and a few hours. This economics is what enabled the explosion of AI applications — you don't need Google's budget to build something useful.
The total number of tokens a system can generate per second across all concurrent requests. Distinct from latency (how fast a single request is served). A system with high throughput serves many users simultaneously. A system with low latency serves each individual user quickly. The two often trade off against each other.
Why it matters: When building AI products, throughput determines your serving costs and capacity. A system that generates 100 tokens/second per user but can only serve one user at a time has low throughput even though individual latency is great. Throughput is what you optimize when you're paying GPU bills for thousands of concurrent users.
A cloud platform for running and training open-source AI models. Together AI provides inference APIs for popular open models (Llama, Mistral, Qwen, etc.) at competitive prices, plus fine-tuning and custom training infrastructure. Founded by AI researchers, they also contribute to open-source research and have released their own models.
Why it matters: Together AI is the leading alternative to self-hosting for teams that want to use open models. Instead of managing your own GPU servers and model serving infrastructure, you call their API and get Llama-70B or Mistral at a fraction of OpenAI/Anthropic prices. They represent the "open model cloud" layer of the AI stack that makes open-weight models practical for production use.
Finding patterns in data without labels. Clustering, dimensionality reduction, and anomaly detection are classic tasks. The model discovers structure on its own.
Why it matters: Most real-world data is unlabeled. Unsupervised learning finds patterns impossible to discover manually. It's the basis for embeddings, semantic search, and RAG.
A Transformer architecture applied to images by splitting an image into fixed-size patches (e.g., 16×16 pixels), treating each patch as a "token," and processing the sequence of patches with standard Transformer attention. ViT (Dosovitskiy et al., 2020) showed that Transformers could match or exceed CNNs on image tasks when trained on enough data, unifying the architectures for language and vision.
Why it matters: ViT proved that the Transformer is a universal architecture — not just for text but for images too. This unification enabled the explosion of multimodal models: if images and text are both sequences of tokens processed by the same architecture, combining them becomes natural. ViT is the image encoder in CLIP, the backbone of DiT, and the foundation of modern computer vision.
An open-source LLM serving engine that achieves high throughput through PagedAttention and continuous batching. vLLM handles the complex engineering of GPU memory management, request scheduling, and KV cache optimization, providing an OpenAI-compatible API that makes it easy to self-host open models (Llama, Mistral, Qwen) in production.
Why it matters: vLLM is the most popular open-source LLM serving solution. If you're self-hosting an open model, you're probably using vLLM (or should be). Its PagedAttention innovation increased serving throughput by 2–24x compared to naive implementations. It's the infrastructure layer that makes open models practical for production use.
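A minimal offline-inference sketch with vLLM's Python API (a GPU and downloaded weights are assumed; the model name is just an example):

```python
from vllm import LLM, SamplingParams   # pip install vllm

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")       # loads the weights onto the GPU
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```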
Wan-AI fundamentally changed access to high-quality video generation by releasing open-weight models that anyone can run, fine-tune, and deploy without licensing fees. This forced the entire video AI industry to re-examine the value proposition of closed models and accelerated innovation across the ecosystem. As part of Alibaba's broader open-source AI strategy alongside Qwen, Wan makes a credible case that open-weight releases from large companies can match — or exceed — what well-funded startups build behind closed doors.
Embedding invisible signals in AI-generated content for later detection. Text watermarking subtly biases token selection so detectors can statistically identify AI text.
Why it matters: As AI content becomes indistinguishable from human content, watermarking could help distinguish them at scale. Matters for misinformation, academic integrity, and provenance.
The dominant MLOps platform for tracking machine learning experiments. W&B lets you log metrics, hyperparameters, model outputs, and system performance during training, then compare runs visually. It's become the standard tool for ML researchers and engineers to track what they tried, what worked, and why — essentially version control for experiments.
Why it matters: Without experiment tracking, ML development is chaos: which hyperparameters produced that good result? Which dataset version was used? Why did training diverge? W&B solved this problem so well that it's now used by most AI labs, from solo researchers to OpenAI. If you're training models, you're almost certainly using W&B or something inspired by it.
Dense vector representations of words where words with similar meanings have similar vectors. Word2Vec (2013) and GloVe (2014) pioneered this: they train on word co-occurrence patterns to produce vectors where "king − man + woman ≈ queen." Word embeddings were the precursor to modern contextual embeddings (BERT, sentence-transformers) and remain foundational to understanding how neural networks represent language.
Why it matters: Word embeddings were the breakthrough that made neural NLP practical. Before them, words were represented as one-hot vectors (no notion of similarity). Word embeddings proved that distributed representations could capture meaning, analogy, and semantic relationships. This insight — represent discrete symbols as learned continuous vectors — is the foundation of all modern language models.
Why it matters: Windsurf represents the intensifying competition in AI coding tools, proving the market for AI-native editors is big enough for multiple players. Its "Cascade" feature for multi-step coding tasks and its free tier attracted a large user base. The Cursor vs. Windsurf vs. Copilot vs. Claude Code rivalry is driving rapid innovation in how developers interact with AI.
Elon Musk's AI company, known for Grok models. Has access to X (Twitter) data and one of the largest GPU clusters (Colossus, 100K+ H100s).
Why it matters: Matters for its scale and unique data access. Whether the X firehose and massive compute translate into frontier-quality models is the open question.