Commit Graph

1 Commits

Author SHA1 Message Date
chiguyong b55c896794 feat(rag_platform): U3+U7 — document processing pipeline + upload security
U3: Document processing pipeline (document_processor.py)
- DocumentProcessor class wrapping parse → segment → vectorize
- parse() uses memory/document_loader.py for multi-format extraction
- segment() uses LlamaIndex SentenceSplitter
- preview() returns chunks for read-only preview (no vectorization)
- vectorize() embeds chunks and stores in pgvector (all-or-nothing)
- process() orchestrates full pipeline with status transitions:
  pending → parsing → segmenting → vectorizing → indexed | failed

U7: Upload security & content sanitization (sanitize.py)
- ALLOWED_FILE_TYPES whitelist (pdf/docx/xlsx/pptx/txt/md/csv/html)
- MAX_FILE_SIZE 50MB limit
- validate_file_type() / validate_file_size() guards
- check_zip_bomb() for ZIP-based formats (ratio > 100:1 or > 500MB)
- check_image_bomb() for pixel count > 100MP (PNG/JPEG/GIF header parsing)
- is_safe_ip() SSRF protection (loopback/RFC1918/link-local/ULA denied)
- sanitize_markdown() removes dangerous HTML tags (script/iframe/object/embed)
- sanitize_content() main entry point for text format sanitization
- parse_xml_safe() XXE protection (forbid_dtd/forbid_entities/forbid_external)

Preview API (preview.py)
- PreviewChunk / PreviewResult Pydantic models
- generate_preview() returns read-only segmentation preview

Tests: 112 tests passing (45 new + 67 existing)
- test_sanitize.py: file type/size, markdown sanitization, SSRF, zip/image bomb
- test_document_processor.py: parse/segment, preview, vectorize, failure status
2026-06-25 11:21:42 +08:00