
DeepSeek R1 Just Got Eyes with Fireworks AI Document Inlining
By Fireworks AI|2/5/2025
A smart reasoning LLM is good, but a smart reasoning VLM is better! So let’s give DeepSeek R1 eyes.
We’re excited to demonstrate how DeepSeek R1, a state-of-the-art reasoning model from DeepSeek AI, can now process and reason over both text and image inputs using the Fireworks AI Document Inlining feature. This capability extends DeepSeek R1’s powerful reasoning to multimodal analysis, opening new avenues for research and application in AI.
DeepSeek R1 has been making waves in the AI research community, consistently performing at the top of industry benchmarks and rivaling even some of the most prominent closed-source models. With its massive 671-billion-parameter (671B) configuration, it has demonstrated top-tier performance across a wide range of benchmarks, positioning itself as a leading open-source alternative in the AI space.
For this demo, we are running DeepSeek R1 on the Fireworks AI platform, where it generates output at an impressive 100 tokens per second (100 t/s). Until now, however, the model's API was limited to textual reasoning, leaving a gap in scenarios that demand visual understanding.
While DeepSeek R1 excels in complex reasoning tasks, many practical use cases, such as document analysis, research paper summarization, and multimedia content understanding, require a multimodal approach. This is where Fireworks AI Document Inlining comes into play. With this new feature, you can effortlessly extend DeepSeek R1’s capabilities into the visual domain.
The concept is deceptively simple yet powerful: by appending #transform=inline to the end of your PDF or image URL, you instruct Fireworks to parse the document and inline its contents into the request. This turns your text-only reasoning model into a fully capable vision-language model (VLM) without any additional overhead.
For example, if you have a research paper URL like this:
https://arxiv.org/pdf/2501.12948
simply modify it to:
https://arxiv.org/pdf/2501.12948#transform=inline
This small tweak instructs Fireworks to process and inline the visual content of the document, empowering the model to analyze both the textual and visual components seamlessly.
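Expressed in code, the same tweak is a simple string operation (the arXiv URL is the one used in the API example below):
# Append the Document Inlining fragment to a public PDF or image URL.
paper_url = "https://arxiv.org/pdf/2501.12948"
inlined_url = paper_url + "#transform=inline"
print(inlined_url)  # https://arxiv.org/pdf/2501.12948#transform=inline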
Below, we provide a detailed technical breakdown of how to integrate and utilize the DeepSeek R1 model with the Fireworks AI API for multimodal analysis.
Before making the API call, ensure you have the necessary import. The example below only needs the client from the fireworks-ai Python package:
from fireworks.client import Fireworks
Set up the Fireworks client by initializing it with your API key:
client = Fireworks(api_key="YOUR_API_KEY")
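If you prefer not to hardcode the key, here is a minimal sketch that reads it from an environment variable instead (FIREWORKS_API_KEY is an assumed variable name, not mandated by this post):
import os
from fireworks.client import Fireworks

# Read the API key from the environment rather than embedding it in source code.
client = Fireworks(api_key=os.environ["FIREWORKS_API_KEY"])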
The request is designed to handle multimodal input, combining both text and images:
response = client.chat.completions.create(
    # Use your DeepSeek R1 deployment name, e.g. "accounts/fireworks/models/deepseek-r1"
    model="model-deployment-name",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    # The #transform=inline fragment tells Fireworks to inline the PDF
                    "type": "image_url",
                    "image_url": {
                        "url": "https://arxiv.org/pdf/2501.12948#transform=inline"
                    },
                },
                {
                    "type": "text",
                    "text": "Summarize the research paper with top 5 findings, and the main contributions of the paper"
                },
            ],
        }
    ],
    response_format={"type": "text"}
)
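Because R1 produces a lengthy reasoning trace before its answer, you may prefer to stream tokens as they are generated rather than wait for the full completion. A minimal sketch, assuming the client's OpenAI-compatible streaming interface (stream=True with incremental delta content):
# Stream the reply token-by-token instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="model-deployment-name",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://arxiv.org/pdf/2501.12948#transform=inline"},
                },
                {"type": "text", "text": "Summarize the research paper in one paragraph"},
            ],
        }
    ],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental piece of the reply in delta.content
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)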
Finally, extract and print the model's reply from the API response:
# The completion follows the OpenAI-compatible schema; the reply text lives in choices[0].message.content
print(response.choices[0].message.content)
The printed output contains the summary generated by DeepSeek R1, which now leverages both its reasoning and newly acquired visual analysis capabilities to deliver a comprehensive understanding of the research paper.
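Because R1 wraps its chain of thought in <think>...</think> tags (visible in the transcripts below), it is often useful to separate the reasoning trace from the final answer. A minimal sketch, assuming the reply contains at most one such block:
reply = response.choices[0].message.content
# Split the reasoning trace from the final answer at the closing </think> tag.
if "</think>" in reply:
    reasoning, answer = reply.split("</think>", 1)
    reasoning = reasoning.replace("<think>", "").strip()
    answer = answer.strip()
else:
    reasoning, answer = "", reply.strip()
print(answer)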
We tested DeepSeek R1 using Fireworks AI Document Inlining with the following two prompts:
{"text": "Summarize the research paper with top 5 findings, and the main contributions of the paper"}
Result:
<think>
Okay, so I need to summarize this research paper. Let me start by reading through the user-provided content carefully. The paper is titled "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" by DeepSeek-AI. The main idea seems to be using reinforcement learning (RL) to enhance the reasoning capabilities of large language models (LLMs) without relying on supervised fine-tuning (SFT) initially.
First, the paper introduces two models: DeepSeek-R1-Zero and DeepSeek-R1. The former is trained using RL directly on the base model, while the latter incorporates some cold-start data and a multi-stage training pipeline. The authors claim that DeepSeek-R1-Zero achieves performance comparable to OpenAI's o1-0912 and that DeepSeek-R1 matches OpenAI-o1-1217 on reasoning tasks. They also mention distilling these models into smaller versions, which performed well on benchmarks.
The key contributions mentioned are using RL without SFT, developing a pipeline for DeepSeek-R1, and showing that smaller models can be empowered through distillation. The evaluations cover various reasoning tasks, knowledge benchmarks, and other applications like creative writing.
Top findings probably include the effectiveness of RL in improving reasoning, the self-evolution observed in R1-Zero, the impact of cold-start data in enhancing readability and performance, comparison with existing models like OpenAI's, and the success of distillation. The main contributions would be the RL approach without SFT, the multi-stage training pipeline, distillation results, and releasing open-source models.
Now, I need to structure the summary with five key findings and main contributions. Let me verify each point from the paper. For instance, the emergence of self-verification and reflection in R1-Zero through RL is a finding. The performance comparisons showing DeepSeek-R1's parity with OpenAI-o1-1217 is another. The distillation leading to better small models than RL alone is a contribution. Also, the unsuccessful attempts section provides insights into methods that didn't work, which might be relevant but perhaps not a top finding.
Finally, make sure to highlight the benchmarks and open-source contributions. Alright, I think I have all elements needed for the summary.
</think>
Summary of "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning"
{"text": "How does the R1 model perform on the benchmarks compared to other models?"}
<think>
Okay, let me try to figure out how the R1 model performs compared to other models based on the provided context. First, I'll start by skimming through the sections where benchmark results are mentioned. The paper has several sections discussing experiments and evaluations, like Section 3.1 and 3.2. In Table 4, there's a detailed comparison between DeepSeek-R1 and other models like Claude-3.5, GPT-4o, DeepSeek-V3, OpenAI-o1-mini, and OpenAI-o1-1217. Looking at the numbers, it seems that DeepSeek-R1 matches or surpasses these models on various benchmarks. For instance, on AIME 2024, R1 scores 79.8% pass@1, which is slightly higher than OpenAI-o1-1217's 79.2%. On MATH-500, it's even better with 97.3% versus 96.4%. The Codeforces rating for R1 is 2029, which is just under OpenAI's 2061 but still outperforms 96.3% of human participants. For tasks like MMLU and MMLU-Pro, R1 scores 90.8% and 84.0% respectively, trailing only slightly behind OpenAI-o1-1217 but still leading others. In some areas like AlpacaEval 2.0 and ArenaHard, R1 shows strong performance with 87.6% and 92.3% win rates, which suggests excellent general capabilities beyond just reasoning. The distilled models also perform well, with the 32B and 70B versions surpassing QwQ-32B-Preview and even OpenAI-o1-mini on several benchmarks. The paper mentions that distillation helps smaller models achieve competitive results without RL, indicating R1's effectiveness as a teacher model. The discussion also notes that R1 has some limitations, like sensitivity to prompts and issues with non-English languages, but overall, it's positioned as competitive with top models like OpenAI's, especially in reasoning tasks.
</think>
The DeepSeek-R1 model demonstrates competitive performance across various benchmarks compared to other state-of-the-art models, matching or slightly exceeding OpenAI-o1-1217 on many reasoning tasks while maintaining strong general capabilities.
Overall, DeepSeek-R1 sets new benchmarks for open-source models and matches leading proprietary models like OpenAI-o1-1217 in core reasoning tasks while offering scalability via distillation.
These results illustrate DeepSeek R1's ability to match or surpass other state-of-the-art models, particularly in reasoning tasks, while also demonstrating that distillation can produce high-performing smaller models.
Integrating document inlining into DeepSeek R1 represents a significant advancement for AI engineers looking to build systems that require deep multimodal analysis. By seamlessly combining textual reasoning with visual comprehension, the Fireworks AI Document Inlining feature addresses critical gaps in current use cases such as document analysis, research paper summarization, and multimedia content understanding.
The integration of Fireworks AI Document Inlining with the DeepSeek R1 model heralds a new era of multimodal reasoning. By simply appending #transform=inline to your document URLs, you unlock a vast new set of capabilities, transforming DeepSeek R1 from a powerful textual reasoning model into a comprehensive vision-language model.
For AI engineers eager to harness these capabilities, the code example provided above offers a straightforward pathway to integrate and experiment with this hybrid approach. As the AI community continues to push the boundaries of what’s possible, innovations like this pave the way for richer, more context-aware systems that are well-equipped to tackle real-world challenges.
Happy coding and exploring: fireworks.ai