Platform Sensitive-Word Lexicon

Overview

This document describes the sensitive-word categories and the sensitive-word configuration of each platform.

Sensitive-Word Categories

SENSITIVE_WORDS

Location: backend/app/services/content/sensitive_filter.py

SENSITIVE_WORDS = {
    "politics": [
        "台湾", "西藏", "新疆", "香港", "澳门",
        "分裂", "独立", "抗议", "游行", "示威",
        "政治", "敏感词",
    ],
    "medical": [
        "药品", "治疗", "疗效", "治愈",
        "处方", "医生", "医院", "手术",
        "医疗", "敏感词",
    ],
    "finance": [
        "投资", "理财", "收益率", "回报",
        "股票", "基金", "债券", "期货",
    ],
    "adult": [
        "色情", "赌博", "毒品", "暴力",
    ],
}

SENSITIVE_CATEGORIES

Location: backend/app/services/distribution/platform_rules.py

SENSITIVE_CATEGORIES = {
    "politics": ["政治敏感词库"],
    "medical": ["医疗敏感词库"],
    "finance": ["金融敏感词库"],
    "adult": ["低俗敏感词库"],
}
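
A platform's `categories` list has to be resolved into concrete words before any matching can happen. The sketch below shows one way this might be done. `words_for_categories` is a hypothetical helper that does not appear in the source, and the lexicon here is a trimmed copy used only for illustration.

```python
# Trimmed, illustrative copy of SENSITIVE_WORDS; the full dictionary lives in
# backend/app/services/content/sensitive_filter.py.
SENSITIVE_WORDS = {
    "politics": ["抗议", "游行", "示威"],
    "finance": ["股票", "基金"],
    "adult": ["赌博", "毒品"],
}


def words_for_categories(categories, lexicon=SENSITIVE_WORDS):
    """Return the union of words for the given category names.

    Hypothetical helper: unknown category names are ignored, not raised.
    """
    words = set()
    for name in categories:
        words.update(lexicon.get(name, []))
    return words


# Example: a platform that checks only "politics" and "adult".
selected = words_for_categories(["politics", "adult"])
print(sorted(selected))
```

Returning a set removes duplicates, which matters because the full lexicon repeats some entries across categories.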

Platform Sensitive-Word Configuration

Zhihu (zhihu)

"sensitive_words": {
    "check_required": True,
    "categories": ["politics", "medical", "finance", "adult"],
    "max_tolerance": 0,   # zero tolerance
    "auto_filter": True,
}

WeChat Official Accounts (wechat)

"sensitive_words": {
    "check_required": True,
    "categories": ["politics", "medical", "finance", "adult"],
    "max_tolerance": 0,
    "auto_filter": True,
}

Baijiahao (baijiahao)

"sensitive_words": {
    "check_required": True,
    "categories": ["politics", "medical", "finance", "adult"],
    "max_tolerance": 0,
    "auto_filter": True,
}

Toutiao (toutiao)

"sensitive_words": {
    "check_required": True,
    "categories": ["politics", "medical", "finance", "adult"],
    "max_tolerance": 0,
    "auto_filter": True,
}

Weibo (weibo)

"sensitive_words": {
    "check_required": True,
    "categories": ["politics", "adult"],  # politics and vulgar content only
    "max_tolerance": 2,  # a small number of hits is allowed
    "auto_filter": True,
}

Xiaohongshu (xiaohongshu)

"sensitive_words": {
    "check_required": True,
    "categories": ["adult"],  # vulgar content only
    "max_tolerance": 0,
    "auto_filter": True,
}

Bilibili (bilibili)

"sensitive_words": {
    "check_required": True,
    "categories": ["politics", "adult"],
    "max_tolerance": 0,
    "auto_filter": True,
}

Jianshu (jianshu)

"sensitive_words": {
    "check_required": True,
    "categories": ["politics", "adult"],
    "max_tolerance": 0,
    "auto_filter": True,
}

Juejin (juejin)

"sensitive_words": {
    "check_required": True,
    "categories": ["politics"],  # political content only
    "max_tolerance": 0,
    "auto_filter": True,
}

Douyin (douyin)

"sensitive_words": {
    "check_required": True,
    "categories": ["politics", "adult"],
    "max_tolerance": 0,
    "auto_filter": True,
}
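
The configuration blocks above do not say how `max_tolerance` is applied. One plausible reading is that it is the largest number of sensitive-word hits a piece of content may contain and still pass. The sketch below encodes that assumption. `PLATFORM_RULES` and `within_tolerance` are hypothetical names, and only two platforms are reproduced.

```python
# Hypothetical container for the per-platform blocks (two platforms shown).
PLATFORM_RULES = {
    "zhihu": {
        "sensitive_words": {
            "check_required": True,
            "categories": ["politics", "medical", "finance", "adult"],
            "max_tolerance": 0,
            "auto_filter": True,
        }
    },
    "weibo": {
        "sensitive_words": {
            "check_required": True,
            "categories": ["politics", "adult"],
            "max_tolerance": 2,
            "auto_filter": True,
        }
    },
}


def within_tolerance(platform, hit_count):
    """Return True if hit_count does not exceed the platform's max_tolerance.

    Assumption: max_tolerance counts hits; the source does not confirm this.
    """
    config = PLATFORM_RULES[platform]["sensitive_words"]
    if not config["check_required"]:
        return True  # platform does not require a sensitive-word check
    return hit_count <= config["max_tolerance"]


print(within_tolerance("weibo", 2))  # two hits on Weibo
print(within_tolerance("zhihu", 1))  # one hit on Zhihu
```

Under this reading, Weibo accepts content with up to two hits, while every platform configured with `max_tolerance` of 0 rejects content on the first hit.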

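Filtering Mechanism

SensitiveFilter

Location: backend/app/services/content/sensitive_filter.py

class SensitiveFilter:
    def filter(self, content: str, platform: str) -> FilterResult:
        """
        Filter sensitive words.
        1. Load the platform's sensitive-word configuration.
        2. Merge the base lexicon with the custom lexicon.
        3. Check each word and replace any match.
        """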

Filter Result

from dataclasses import dataclass

@dataclass
class FilterResult:
    filtered_content: str      # content after filtering
    found_words: list          # sensitive words that were found
    replacements: dict         # replacement mapping

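Replacement Rules

  • Each sensitive word is replaced with * characters.
  • The number of * characters equals the length of the original word.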
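
The two rules above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: `mask_word` and `mask_all` are hypothetical names, and processing longer words first is a design choice of this sketch that the source does not mention.

```python
def mask_word(word):
    """Return a run of '*' with the same length as the word."""
    return "*" * len(word)


def mask_all(content, words):
    """Replace every listed word found in content with a same-length mask.

    Returns the masked content and a mapping of word -> mask for each word
    that actually occurred.
    """
    replacements = {}
    # Longer words first, so a short word never splits a longer one.
    for word in sorted(words, key=len, reverse=True):
        if word and word in content:
            replacements[word] = mask_word(word)
            content = content.replace(word, replacements[word])
    return content, replacements


masked, mapping = mask_all("这只股票的收益率很高", ["股票", "收益率"])
print(masked)   # 这只**的***很高
print(mapping)
```

Because each mask has the same length as the word it replaces, the filtered text keeps the original length.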

AI-Writing Markers

AI_PATTERNS

Location: backend/app/services/distribution/platform_rules.py

AI_PATTERNS = {
    "banned_transitions": [
        "总之", "综上所述", "值得注意的是", "让我们",
        "总而言之", "不可否认", "毋庸置疑",
        "首先", "其次", "最后", "最后但同样重要",
        "换句话说", "也就是说", "更重要的是", "可以说",
    ],
    "banned_modifiers": [
        "至关重要", "不可或缺", "举足轻重", "蓬勃发展",
        "日新月异", "深远影响", "全面提升", "显著成效",
        "重大突破", "核心要素",
    ],
    "banned_structures": [
        r"第一[,、].*第二[,、].*第三",  # symmetric three-part structure
        r"一方面[,、].*另一方面",
    ],
    "safe_patterns": [
        "根据研究表明", "调研数据显示", "经验告诉我们",
        "事实上", "说白了", "说实话", "说真的",
    ],
}
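
The source lists the patterns but not how they are applied. A minimal sketch, assuming the phrase lists are matched as plain substrings and `banned_structures` as regular expressions, might look like the following. `find_ai_markers` is a hypothetical function, and the pattern table is trimmed for illustration.

```python
import re

# Trimmed, illustrative subset of AI_PATTERNS.
AI_PATTERNS = {
    "banned_transitions": ["总之", "综上所述", "首先"],
    "banned_structures": [r"一方面[,、].*另一方面"],
}


def find_ai_markers(text, patterns=AI_PATTERNS):
    """Return the banned phrases and structure patterns that occur in text."""
    return {
        # Phrase lists: plain substring test.
        "banned_transitions": [
            phrase for phrase in patterns["banned_transitions"] if phrase in text
        ],
        # Structure patterns: regex search; DOTALL lets '.*' span line breaks.
        "banned_structures": [
            pattern
            for pattern in patterns["banned_structures"]
            if re.search(pattern, text, flags=re.DOTALL)
        ],
    }


hits = find_ai_markers("总之,一方面、成本下降,另一方面效率提升。")
print(hits)
```

How `safe_patterns` is meant to be used is not stated in the source, so the sketch leaves it out.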

Platform AI Sensitivity

Platform      Detection level   humanization_required
Zhihu         high              true
WeChat        medium            true
Baijiahao     high              true
Toutiao       high              true
Weibo         low               false
Xiaohongshu   low               false
Bilibili      medium            true
Jianshu       medium            true
Juejin        high              true
Douyin        low               false

Custom Sensitive Words

Custom sensitive words can be added:

# Named sensitive_filter so the built-in filter() is not shadowed
sensitive_filter = SensitiveFilter()
sensitive_filter.add_custom_words("custom_category", ["词1", "词2", "词3"])
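
The source shows only the call, not what `add_custom_words` does internally. The sketch below is one possible shape, assuming custom words are stored per category and duplicates are skipped. `SketchFilter` is a stand-in class written for this example; the real `SensitiveFilter` may differ.

```python
class SketchFilter:
    """Stand-in for SensitiveFilter, written only for illustration."""

    def __init__(self):
        # category name -> list of custom words, in insertion order
        self.custom_words = {}

    def add_custom_words(self, category, words):
        """Append words to a category, skipping empty strings and duplicates."""
        bucket = self.custom_words.setdefault(category, [])
        for word in words:
            if word and word not in bucket:
                bucket.append(word)


sketch = SketchFilter()
sketch.add_custom_words("custom_category", ["词1", "词2"])
sketch.add_custom_words("custom_category", ["词2", "词3"])  # "词2" is not added twice
print(sketch.custom_words)
```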

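Detection Flow

1. Get the sensitive-word categories configured for the platform.
2. Merge the base sensitive-word lexicon for those categories.
3. Add the custom sensitive words.
4. Scan the content for each word.
5. Replace each match and record it.
6. Return the filter result.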
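
The six steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the project's implementation: `run_filter`, `BASE_WORDS`, and `PLATFORM_CATEGORIES` are hypothetical names, the word lists are trimmed, and matching is a plain substring test.

```python
from dataclasses import dataclass, field


@dataclass
class FilterResult:
    filtered_content: str
    found_words: list = field(default_factory=list)
    replacements: dict = field(default_factory=dict)


# Illustrative data only; names and contents are assumptions.
BASE_WORDS = {"politics": ["抗议"], "adult": ["赌博"], "finance": ["股票"]}
PLATFORM_CATEGORIES = {"weibo": ["politics", "adult"], "juejin": ["politics"]}


def run_filter(content, platform, custom_words=()):
    # Steps 1-2: collect base words for the platform's configured categories.
    words = []
    for category in PLATFORM_CATEGORIES.get(platform, []):
        words.extend(BASE_WORDS.get(category, []))
    # Step 3: add custom words.
    words.extend(custom_words)

    result = FilterResult(filtered_content=content)
    # Steps 4-5: scan, replace with a same-length mask, and record each hit.
    for word in words:
        if word and word in result.filtered_content:
            mask = "*" * len(word)
            result.found_words.append(word)
            result.replacements[word] = mask
            result.filtered_content = result.filtered_content.replace(word, mask)
    # Step 6: return the result.
    return result


print(run_filter("股票和赌博", "weibo").filtered_content)
print(run_filter("股票和赌博", "weibo", custom_words=["股票"]).filtered_content)
```

In the first call "股票" is left untouched because Weibo's configuration does not include the finance category. In the second call it is masked because it was supplied as a custom word.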