职贝云数AI新零售门户

Title: DeepSeek-OCR Model: Image & PDF Recognition

Author: OZQ  Time: 2025-12-31 17:03
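Next, we walk through implementing seven practical application scenarios for the DeepSeek-OCR model. The scenarios are:

Plain-text OCR extraction: performs free-form text recognition (Free OCR) on arbitrary images, quickly extracting all text in the picture without relying on layout structure. It suits lightweight cases such as screenshots, receipts, and contract excerpts where the text is needed quickly.

Layout-preserving OCR extraction: the model automatically recognizes and reconstructs a document's layout structure, including paragraphs, headings, headers and footers, lists, and multi-column layouts, producing "structured text output". This restores a scanned document to editable, formatted text for further editing and archiving.

Chart & table parsing: DeepSeek-OCR not only recognizes text but also parses structured information in images, such as tables, flowcharts, and architectural floor plans. It identifies cell boundaries, field alignment, and the correspondence between data items, and can generate machine-readable tables or text descriptions.

Image description: using its multimodal understanding, the model can analyze a whole image at the semantic level and describe it in detail, producing a natural-language summary. This applies to visual report generation, understanding figures in research papers, and explaining complex visual scenes.

Locating specified elements: the "visual grounding" feature pinpoints specific target elements in an image. For example, given the input "Locate signature in the image", the model returns the coordinates of the signature region, enabling semantics-based image retrieval and object detection.

Markdown document conversion: converts complete document images directly into structured Markdown text, automatically recognizing heading levels, paragraph structure, tables, and list formats. It is an important building block for document digitization, knowledge-base construction, and multimodal RAG.

Object Detection: in multimodal extension tasks, DeepSeek-OCR can also recognize and locate multiple objects in an image. Given a detection prompt, the model generates a labeled bounding box for each target, enabling precise visual recognition and annotation.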
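All of the scenarios above are selected purely through the prompt text. The following is a minimal sketch of a lookup table built from the prompt strings quoted in this post. The names `PROMPTS` and `get_prompt` and the task keys are hypothetical, and the `<image>\n` prefix on the `ocr_text_only` and `describe` entries is an assumption, since the post quotes those two prompts without it.

```python
# Hypothetical task-to-prompt table; the strings come from the examples in this post.
PROMPTS = {
    "free_ocr": "<image>\nFree OCR.",
    "parse_figure": "<image>\nParse the figure.",
    "markdown": "<image>\n<|grounding|>Convert the document to markdown.",
    # Assumption: these two use the same "<image>\n" prefix as the prompts above.
    "ocr_text_only": "<image>\nOCR this image.",
    "describe": "<image>\nDescribe this image in detail.",
}


def get_prompt(task: str) -> str:
    """Return the prompt for a task key, failing loudly on unknown keys."""
    try:
        return PROMPTS[task]
    except KeyError:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(PROMPTS)}"
        ) from None
```

Keeping the prompts in one place makes it harder to mistype a prompt when the same image is run through several modes.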

1. Recognizing and Parsing Chart Images

1.1 Sample Images

Figure 1:
Figure 2:

1.2 Recognition Process

Free OCR: extracts the image content and converts it to Markdown text.
Recognition results:
Figure 1:

prompt = "<image>\nFree OCR."
image_file = './pictures/图1.png'
output_path = './image_output/free_OCR'
res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path=output_path,
                  base_size=1024, image_size=640, crop_mode=True,
                  save_results=True, test_compress=True)
Figure 2:

prompt = "<image>\nFree OCR."
image_file = './pictures/图2.png'
output_path = './image_output/free_OCR'
res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path=output_path,
                  base_size=1024, image_size=640, crop_mode=True,
                  save_results=True, test_compress=True)
Parse the figure: extracts the image content and converts it to HTML text.
Recognition results:
Figure 1:

prompt = "<image>\nParse the figure."
image_file = './pictures/图1.png'
output_path = './image_output/free_OCR'
res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path=output_path,
                  base_size=1024, image_size=640, crop_mode=True,
                  save_results=True, test_compress=True)

Figure 2:

prompt = "<image>\nParse the figure."
image_file = './pictures/图2.png'
output_path = './image_output/free_OCR'
res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path=output_path,
                  base_size=1024, image_size=640, crop_mode=True,
                  save_results=True, test_compress=True)
OCR this image: extracts only the text and ignores all formatting.
Describe this image in detail: uses the model as a VLM to understand and summarize the image content.
2. Recognizing Visualization Images

Figure 3:
prompt = "<image>\nParse the figure."
image_file = './pictures/图3.png'
output_path = './image_output/free_OCR'
res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path=output_path,
                  base_size=1024, image_size=640, crop_mode=True,
                  save_results=True, test_compress=True)
Figure 4:
prompt = "<image>\nParse the figure."
image_file = './pictures/图4.png'
output_path = './image_output/free_OCR'
res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path=output_path,
                  base_size=1024, image_size=640, crop_mode=True,
                  save_results=True, test_compress=True)
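The `model.infer` calls in this post differ only in the prompt and the image path, so the shared settings can be factored out. This is a minimal sketch: the wrapper name `run_ocr` and its `overrides` parameter are hypothetical, and it assumes `model` and `tokenizer` have already been loaded as in the calls above.

```python
def run_ocr(model, tokenizer, image_file, prompt,
            output_path="./image_output/free_OCR", **overrides):
    """Call model.infer with the settings used throughout this post."""
    settings = dict(
        base_size=1024,
        image_size=640,
        crop_mode=True,
        save_results=True,
        test_compress=True,
    )
    # Allow per-call changes, e.g. crop_mode=False.
    settings.update(overrides)
    return model.infer(
        tokenizer,
        prompt=prompt,
        image_file=image_file,
        output_path=output_path,
        **settings,
    )


# Example usage (requires a loaded model and tokenizer):
# for name in ("图3.png", "图4.png"):
#     run_ocr(model, tokenizer, f"./pictures/{name}", "<image>\nParse the figure.")
```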
3. Recognizing Formulas and Handwritten Text

prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = './pictures/图5.png'
output_path = './image_output/free_OCR'
res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path=output_path,
                  base_size=1024, image_size=640, crop_mode=True,
                  save_results=True, test_compress=True)
4. Recognizing CAD Drawings, Decoration Drawings, and Flowcharts

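=====================
BASE:  torch.Size([1, 256, 1280])
PATCHES:  torch.Size([4, 100, 1280])
=====================
This image is an architectural floor plan showing two connected apartments or rooms within a residential building. The names of the functional areas are clearly labeled, including bedrooms, living areas, bathrooms, a kitchen, and auxiliary spaces such as a utility room.

Starting from the top-left corner:
* A bedroom measuring approximately 3600mm × 3300mm;
* Adjacent on the right is another bedroom of roughly the same size, about 3600mm × 3300mm, but slightly smaller than the first;
* Below the two bedrooms are ancillary rooms, including a bathroom of about 1600mm × 1200mm adjoining the bedrooms on both sides.

In the middle area near the lower left of the plan:
* A larger open space is visible, presumably a shared activity area such as a dining room or recreation area, about 2700mm to 3400mm wide, depending on how each partition is marked.

In the lower half, toward the right and near the bottom center:
* A room labeled "主卧" (master bedroom) occupies nearly half the width of the whole structure. It is about 4200mm long and similar in height, indicating that it is the main bedroom and fairly spacious.

Adjacent on the right of the master bedroom:
* Another area is labeled "主卧室" (master bedroom). Its length is similar to the previous one but slightly shorter, and it may serve as a secondary bedroom, study, or office. The drawing does not state its exact function.

In addition, several smaller rooms are distributed in the central area of the plan, possibly used as storage rooms ("储物间") or small bathrooms ("卫生间").

Overall, the dimensions in the drawing are laid out in fine detail, giving a sensible distribution of the functional areas while balancing living comfort and practicality. It is a typical layout in modern residential design.
==================================================
Image size: (1010, 904)
Valid image tokens: 629
Output text tokens (valid): 362
Compression ratio: 0.58
===============save results:===============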
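=====================
BASE:  torch.Size([1, 256, 1280])
PATCHES:  torch.Size([3, 100, 1280])
=====================
A woman wearing a long green dress and a brown cloak stands in a dim forest. She holds a bow and arrow, drawing the bowstring fully and aiming at a target outside the frame. To her right is a tall tree, and to her left is a huge rock. The tree trunks in the forest are thick and straight, with dense foliage. The ground is covered with fallen leaves and dead branches. Daylight falls from above through the treetops, casting dappled light across the woods. The woman's posture is relaxed but her expression is focused, her gaze locked on the target, appearing calm and resolute. The image is a photograph taken from a slightly downward-looking angle, conveying a quiet yet tense atmosphere.
==================================================
Image size: (1586, 568)
Valid image tokens: 391
Output text tokens (valid): 108
Compression ratio: 0.28
===============save results:===============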
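The compression ratio printed in both logs is consistent with dividing the valid text token count by the valid image token count. The sketch below reproduces that arithmetic from the logged numbers. The function name `compression_ratio` is hypothetical, and the formula is inferred from the two logs above rather than taken from the model's source.

```python
def compression_ratio(text_tokens: int, image_tokens: int) -> float:
    """Ratio of valid text tokens to valid image tokens, rounded to 2 places."""
    if image_tokens <= 0:
        raise ValueError("image_tokens must be positive")
    return round(text_tokens / image_tokens, 2)


print(compression_ratio(362, 629))  # floor plan log    -> 0.58
print(compression_ratio(108, 391))  # forest photo log  -> 0.28
```

A ratio below 1 means the generated text used fewer tokens than the image did, as in both examples here.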

5. Converting PDF to Markdown

conda activate deepseek-ocr
cd /root/autodl-tmp/test/DeepSeek-OCR/DeepSeek-OCR-master/DeepSeek-OCR-vllm
python run_dpsk_ocr_pdf.py
Conversion results:

Adding image parsing as a further step:

import base64
import io
import json
import os
import re

import requests
from PIL import Image

DEFAULT_PROMPT = (
    "You are an OCR & document understanding assistant.\n"
    "Analyze this image region and produce:\n"
    "1) ALT: a very short alt text (<=12 words).\n"
    "2) CAPTION: a 1-2 sentence concise caption.\n"
    "3) CONTENT_MD: if the image contains a table, output a clean Markdown table;"
    " if it contains a formula, output LaTeX ($...$ or $$...$$);"
    " otherwise provide 3-6 bullet points summarizing key content, in Markdown.\n"
    "Return strictly in the following format:\n"
    "ALT: <short alt>\n"
    "CAPTION: <one or two sentences>\n"
    "CONTENT_MD:\n"
    "<markdown content here>\n"
)

# Match Markdown image references of the form ![alt](path) and capture the path.
IMG_PATTERN = re.compile(r'!\[[^\]]*\]\(([^)]+)\)')


def call_deepseek_ocr_image(vllm_url, model, img_path,
                            temperature=0.2, max_tokens=2048,
                            prompt=DEFAULT_PROMPT):
    """Call vLLM (deepseek-ocr) to parse an image; return {alt, caption, content_md}."""
    # Re-encode the image as PNG so it can be sent as a base64 data URL.
    with Image.open(img_path) as im:
        bio = io.BytesIO()
        im.save(bio, format="PNG")
        img_bytes = bio.getvalue()

    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{base64.b64encode(img_bytes).decode()}",
                               "detail": "auto"}},
            ],
        }],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    r = requests.post(vllm_url, json=payload, timeout=180)
    r.raise_for_status()
    text = r.json()["choices"][0]["message"]["content"].strip()

    # Parse the response into its ALT / CAPTION / CONTENT_MD parts.
    alt, caption, content_md_lines = "", "", []
    mode = None
    for line in text.splitlines():
        l = line.strip()
        if l.upper().startswith("ALT:"):
            alt = l.split(":", 1)[1].strip()
            mode = None
        elif l.upper().startswith("CAPTION:"):
            caption = l.split(":", 1)[1].strip()
            mode = None
        elif l.upper().startswith("CONTENT_MD:"):
            mode = "content"
        elif mode == "content":
            content_md_lines.append(line.rstrip())
    return {
        "alt": alt or "Figure",
        "caption": caption or alt or "",
        "content_md": "\n".join(content_md_lines).strip(),
    }


def augment_markdown(md_path, out_path,
                     vllm_url="http://localhost:8001/v1/chat/completions",
                     model="deepseek-ocr",
                     temperature=0.2, max_tokens=2048,
                     image_root=".",
                     cache_json=None):
    with open(md_path, "r", encoding="utf-8") as f:
        md_lines = f.read().splitlines()

    # Load cached results, if any, so each image is parsed only once.
    cache = {}
    if cache_json and os.path.exists(cache_json):
        try:
            with open(cache_json, "r", encoding="utf-8") as f:
                cache = json.load(f)
        except Exception:
            cache = {}

    out_lines = []
    for line in md_lines:
        out_lines.append(line)
        m = IMG_PATTERN.search(line)
        if not m:
            continue
        img_rel = m.group(1).strip().split("?")[0]
        img_path = img_rel if os.path.isabs(img_rel) else os.path.join(image_root, img_rel)
        if not os.path.exists(img_path):
            # Image file is missing: emit an empty line and move on.
            out_lines.append("")
            continue
        if cache_json and img_path in cache:
            result = cache[img_path]
        else:
            result = call_deepseek_ocr_image(vllm_url, model, img_path,
                                             temperature, max_tokens)
            if cache_json:
                cache[img_path] = result
        alt, cap, body = result["alt"], result["caption"], result["content_md"]
        if cap:
            out_lines.append(cap)
        if body:
            out_lines.append("<details><summary>Parsed content</summary>\n")
            out_lines.append(body)
            out_lines.append("\n</details>")

    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(out_lines))
    if cache_json:
        with open(cache_json, "w", encoding="utf-8") as f:
            json.dump(cache, f, ensure_ascii=False, indent=2)
    print(f"✅ Augmented Markdown written to: {out_path}")


augment_markdown(
    md_path="output.md",                 # Markdown generated in the first step
    out_path="output_augmented.md",      # augmented Markdown
    vllm_url="http://localhost:8001/v1/chat/completions",  # your vLLM service
    model="deepseek-ocr",
    image_root=".",                      # root directory for relative image paths
    cache_json="image_cache.json",       # optional cache file
)

Comparison of results:

This enables higher-precision visual retrieval.
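The two text-processing steps in the image-parsing script can be checked without a running vLLM service. The sketch below isolates them: the image-link pattern and the ALT / CAPTION / CONTENT_MD parser. The function name `parse_response`, the `sample` reply, and the example image path are made up for illustration and are not real model output.

```python
import re

# Match Markdown image references of the form ![alt](path) and capture the path.
IMG_PATTERN = re.compile(r'!\[[^\]]*\]\(([^)]+)\)')


def parse_response(text):
    """Split a reply in the ALT / CAPTION / CONTENT_MD format into a dict."""
    alt, caption, content = "", "", []
    in_content = False
    for line in text.splitlines():
        stripped = line.strip()
        upper = stripped.upper()
        if upper.startswith("ALT:"):
            alt, in_content = stripped.split(":", 1)[1].strip(), False
        elif upper.startswith("CAPTION:"):
            caption, in_content = stripped.split(":", 1)[1].strip(), False
        elif upper.startswith("CONTENT_MD:"):
            in_content = True
        elif in_content:
            content.append(line.rstrip())
    return {
        "alt": alt or "Figure",
        "caption": caption or alt,
        "content_md": "\n".join(content).strip(),
    }


# Made-up reply in the requested format, for illustration only.
sample = (
    "ALT: Sales chart\n"
    "CAPTION: Quarterly sales by region.\n"
    "CONTENT_MD:\n"
    "- Q1 is lowest\n"
    "- Q4 is highest\n"
)

result = parse_response(sample)
match = IMG_PATTERN.search("See ![fig](images/0_1.jpg) for details")
print(result["alt"], "|", match.group(1))
```

Testing the parser on a fixed string like this makes it easy to notice when the model deviates from the requested format, in which case `alt` falls back to "Figure" and `content_md` is empty.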

Welcome to 职贝云数AI新零售门户 (https://www.taojin168.com/cloud/)