OpenAI Introduces Sora, an AI Video Generation Model
Sora is a video generation model developed by OpenAI. Its defining capability is generating videos up to one minute long from text. These videos can render complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background.
OpenAI also points out that Sora currently has weaknesses. The model may struggle to accurately simulate the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person in a video might take a bite out of a cookie, yet afterward the cookie shows no bite mark. The model may also confuse spatial details, such as mixing up left and right.
Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
Research techniques
Sora is a diffusion model, which generates a video by starting off with one that looks like static noise and gradually transforms it by removing the noise over many steps.
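As a rough illustration of that denoising process, here is a minimal sketch of a diffusion sampling loop. The `denoiser` model and the simplified update rule are assumptions for illustration only; a real sampler (e.g. DDPM or DDIM) weights each step using the noise schedule's coefficients.

```python
import torch

def sample_video(denoiser, shape=(16, 3, 64, 64), num_steps=50):
    # Start from pure Gaussian noise -- a clip that "looks like static".
    x = torch.randn(shape)
    for t in reversed(range(num_steps)):
        predicted_noise = denoiser(x, t)  # hypothetical noise-prediction model
        # Simplified update: remove a fraction of the predicted noise.
        # A real DDPM/DDIM sampler scales this with the noise schedule.
        x = x - predicted_noise / num_steps
    return x  # denoised video tensor (frames, channels, height, width)
```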
Sora is capable of generating entire videos all at once or extending generated videos to make them longer. By giving the model foresight of many frames at a time, we’ve solved a challenging problem of making sure a subject stays the same even when it goes out of view temporarily.
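One way to picture how joint denoising gives the model "foresight" of many frames when extending a clip is the sketch below: the known frames are clamped back to their given values at every step while the new frames are denoised jointly alongside them, so the subject's appearance is shaped by the whole sequence. This reuses the illustrative `denoiser` and update rule from above and is not OpenAI's published method.

```python
import torch

def extend_video(denoiser, known_frames, num_new=8, num_steps=50):
    t_frames, c, h, w = known_frames.shape
    # Denoise old and new frames jointly so the model "sees" many
    # frames at a time, clamping the known frames at every step.
    x = torch.randn(t_frames + num_new, c, h, w)
    for t in reversed(range(num_steps)):
        x[:t_frames] = known_frames
        x = x - denoiser(x, t) / num_steps  # simplified denoising update
    x[:t_frames] = known_frames
    return x  # original clip followed by num_new generated frames
```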
Similar to GPT models, Sora uses a transformer architecture, unlocking superior scaling performance.
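To make the GPT analogy concrete, the denoising backbone can be pictured as a standard transformer encoder operating on a sequence of patch tokens, so the same scaling recipe applies. The dimensions below are illustrative, not Sora's actual configuration:

```python
import torch
import torch.nn as nn

# Illustrative sizes only; Sora's real architecture is not public.
embed_dim, num_heads, num_layers = 512, 8, 12
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True),
    num_layers=num_layers,
)

tokens = torch.randn(1, 1024, embed_dim)  # (batch, patch tokens, dim)
out = backbone(tokens)                    # one output per token, same shape
print(out.shape)                          # torch.Size([1, 1024, 512])
```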
We represent videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions and aspect ratios.
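A minimal sketch of what such patches could look like for video, assuming a simple spacetime blocking scheme (the patch sizes here are made up for illustration): each (time, height, width) block is flattened into one token-like vector, analogous to a token in GPT.

```python
import torch

def patchify(video, patch=(4, 16, 16)):
    # Split a (frames, channels, height, width) tensor into small
    # spacetime blocks and flatten each block into one vector.
    pt, ph, pw = patch
    t, c, h, w = video.shape
    x = video.reshape(t // pt, pt, c, h // ph, ph, w // pw, pw)
    x = x.permute(0, 3, 5, 1, 2, 4, 6)      # group the patch axes together
    return x.reshape(-1, pt * c * ph * pw)  # (num_patches, patch_dim)

video = torch.randn(16, 3, 64, 64)          # 16 frames of 64x64 RGB
tokens = patchify(video)
print(tokens.shape)                         # torch.Size([64, 3072])
```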
Sora builds on past research in DALL·E and GPT models. It uses the recaptioning technique from DALL·E 3, which involves generating highly descriptive captions for the visual training data. As a result, the model is able to follow the user’s text instructions in the generated video more faithfully.
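In pseudocode terms, recaptioning amounts to a preprocessing pass over the training data. `caption_model.describe` below is a hypothetical stand-in for whatever captioning model is used, not a real API:

```python
def recaption_dataset(videos, caption_model):
    # Pair each training example with a highly descriptive caption
    # produced by a captioning model, replacing its original caption.
    dataset = []
    for video in videos:
        detailed_caption = caption_model.describe(video)  # hypothetical API
        dataset.append((video, detailed_caption))
    return dataset
```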
In addition to being able to generate a video solely from text instructions, the model is able to take an existing still image and generate a video from it, animating the image’s contents with accuracy and attention to small detail. The model can also take an existing video and extend it or fill in missing frames.
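Both image animation and frame in-filling can be sketched as one masked-denoising idea, similar to the extension sketch above: a boolean mask marks which frames are given (only the first frame when animating a still image; scattered frames when filling gaps), and those frames are held fixed while the rest are denoised. Again, this is an illustrative assumption, not the published implementation.

```python
import torch

def inpaint_frames(denoiser, frames, known_mask, num_steps=50):
    # known_mask: boolean tensor over the time axis. True frames are
    # given (frame 0 alone animates a still image; scattered True
    # values fill in missing frames); the rest start as noise.
    x = torch.randn_like(frames)
    for t in reversed(range(num_steps)):
        x[known_mask] = frames[known_mask]  # hold known content fixed
        x = x - denoiser(x, t) / num_steps  # simplified denoising update
    x[known_mask] = frames[known_mask]
    return x
```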
Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.