The Chinese version of Sora is here! Made by Tsinghua University! The startup behind it has raised hundreds of millions of yuan

April 27, 2024

Two months after the release of Sora, a Tsinghua team has built a Chinese counterpart to the video model.

On April 27, Vidu, China's first long-duration, highly consistent and highly dynamic video model, was officially released at the Future Artificial Intelligence Pioneer Forum of the Zhongguancun Forum. The model, jointly released by Tsinghua University and the large-model startup Shengshu Technology, can generate high-definition video with a resolution of 1080p and a length of 16 seconds with one click.

A reporter from the Securities Times learned exclusively that Zhu Jun, deputy dean of the Institute for Artificial Intelligence at Tsinghua University and chief scientist of Shengshu Technology, said after the Vidu release: "Vidu, we do, we did, we do together! I would like to thank my colleagues for persisting day and night; their work has blossomed and borne fruit in the laboratory." According to reports, this is also the first major breakthrough in video models worldwide since OpenAI released Sora.

Since the release of Sora, several teams in the industry have claimed they would catch up with and reproduce it, and the Vidu team is the first to deliver, within two months. The reporter's review found that Shengshu Technology has deep experience in multimodal large models and is currently one of the most highly valued startups in that field. The company has completed three rounds of financing totaling hundreds of millions of yuan, with investors including Qiming Venture Partners, Zhipu AI, Baidu Ventures (BV), Jinqiu Fund and other institutions.

Benchmarked against Sora: generating coherent high-definition video

"A ship in a studio sailing toward the camera": entering a simple prompt like this is enough to produce a video with realistic effects and consistent shots. In the Vidu-generated video samples released by Shengshu Technology, the overall quality of the video is almost comparable to that of Sora.

According to Shengshu Technology, the Vidu model uses U-ViT, the team's original architecture that fuses Diffusion and Transformer, and supports one-click generation of high-definition video at a resolution of up to 1080p and a length of up to 16 seconds. "The U-ViT architecture was proposed by the team as early as September 2022, earlier than the DiT architecture adopted by Sora, and is the world's first architecture that integrates Diffusion and Transformer," Shengshu Technology says.

The reporter noted that after Shengshu Technology completed a new round of financing in March this year, the company said publicly that although the emergence of Sora shows the United States leads in multimodal large models, "China is not starting completely from scratch." According to reports, Zhu Jun proposed UniDiffuser, a Transformer-based multimodal diffusion model, in January 2023. It uses U-ViT, which follows exactly the same architectural route as Sora. The difference is that UniDiffuser is mainly used for image generation tasks, although it can also serve as a basis for extending to video tasks.

Building on its long experience in machine learning and multimodal large models, the team broke through a number of key technologies for representing and processing long videos in just two months, and successfully developed and launched the Vidu video model. Speaking at the Vidu launch, Zhu Jun said that Vidu has the following main characteristics and advantages:

First, it simulates the real physical world: it can generate complex, richly detailed scenes in which lighting, shadows and facial expressions conform to real physical laws.

Second, it is imaginative: it can invent scenes and produce surreal imagery.

Third, it has a multi-shot camera language: it is no longer limited to a fixed shot, and can switch dynamically between long shots, close shots, medium shots and close-ups while keeping the subject consistent.

Fourth, it has excellent video duration: it supports generating videos up to 16 seconds long while keeping the shots and the subject consistent.

Fifth, it understands Chinese elements: it is better at understanding and generating pandas, dragons and other images with Chinese cultural characteristics.

Videos generated by Vidu featuring dragons, pandas and other Chinese cultural elements

The team behind it comes from Tsinghua University and has raised hundreds of millions of yuan

Behind Vidu is Shengshu Technology, a star startup from Tsinghua University.

According to public information, Shengshu Technology was established in March 2023. Its core members come from the Institute for Artificial Intelligence at Tsinghua University, and it is dedicated to independently developing world-leading controllable multimodal general-purpose large models. The company's CEO, Tang Jiayu, earned his bachelor's and master's degrees in the Computer Science Department of Tsinghua University. The chief scientist is Zhu Jun, deputy dean of the Tsinghua Institute for Artificial Intelligence, while CTO Bao Fan is a doctoral student in the Computer Science Department of Tsinghua University and a member of Professor Zhu Jun's research group.

The reporter noticed that in March this year, Tang Jiayu told the media at a briefing that the company's large model would certainly match the current version of Sora within the year, "but it is hard to say whether it will take three months or half a year." Vidu has now delivered an impressive result ahead of schedule, mainly because the team was one of the earliest in China to work on multimodal large models and has built up deep experience in the field over the years.

According to Tang Jiayu, Shengshu Technology is currently pursuing a two-track approach across the model layer and the application layer. On the one hand, it is building a foundational general-purpose large model covering text, image, video, 3D and other multimodal capabilities, and providing model services to business customers. On the other hand, it is building vertical applications for image generation, video generation and other scenarios, charged by subscription and other means. The main application areas are content creation scenarios such as game production and film and television post-production.

The reporter's review found that Shengshu Technology has attracted considerable investor attention since its founding. Tianyancha data show that the company has completed three rounds of financing in total. In June 2023, it completed an angel round of nearly 100 million yuan, with investors including Ant Group, Baidu Ventures (BV), Zhuoyuan Asia and Zhuoyuan Capital. In August 2023, it completed an angel+ round of tens of millions of yuan, with Jinqiu Fund as the investor. In March 2024, it completed a Series A round of hundreds of millions of yuan; besides new investors such as Qiming Venture Partners, Datai Capital and Zhipu AI, the round included two existing shareholders, Baidu Ventures (BV) and Zhuoyuan Asia.

Backed by hundreds of millions of yuan across three rounds of financing, Shengshu Technology has become one of the most highly valued startups in China's multimodal large model field. According to the company, the arrival of Vidu is not only another successful validation of the U-ViT fusion architecture on large-scale visual tasks, but also represents its continued innovation and leadership in native multimodal large models.

Editor: Zhu Yumeng

Proofreader: Ran Yanqing