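# Platform Sensitive-Word Library

## Overview

This document describes the sensitive-word categories and the sensitive-word configuration of each platform.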

## Sensitive-Word Categories

### SENSITIVE_WORDS

Location: `backend/app/services/content/sensitive_filter.py`

```python
SENSITIVE_WORDS = {
    "politics": [
        "台湾", "西藏", "新疆", "香港", "澳门",
        "分裂", "独立", "抗议", "游行", "示威",
        "政治", "敏感词",
    ],
    "medical": [
        "药品", "治疗", "疗效", "治愈",
        "处方", "医生", "医院", "手术",
        "医疗", "敏感词",
    ],
    "finance": [
        "投资", "理财", "收益率", "回报",
        "股票", "基金", "债券", "期货",
    ],
    "adult": [
        "色情", "赌博", "毒品", "暴力",
    ],
}
```
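
The lists above include the entry "敏感词" under both `politics` and `medical`, so one hit can belong to more than one category. A minimal sketch of a reverse index from word to categories follows; the helper name `build_word_index` and the shortened word lists are assumptions made for illustration and are not taken from `sensitive_filter.py`.

```python
from collections import defaultdict

# Shortened excerpt of SENSITIVE_WORDS, for illustration only.
SENSITIVE_WORDS = {
    "politics": ["抗议", "游行", "敏感词"],
    "medical": ["药品", "处方", "敏感词"],
    "adult": ["赌博"],
}


def build_word_index(word_lists: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each word to the sorted categories that list it (hypothetical helper)."""
    index = defaultdict(set)
    for category, words in word_lists.items():
        for word in words:
            index[word].add(category)
    return {word: sorted(categories) for word, categories in index.items()}


word_index = build_word_index(SENSITIVE_WORDS)
# Words that appear under more than one category.
shared = {word: cats for word, cats in word_index.items() if len(cats) > 1}
print(shared)
```

An index like this makes it easy to report every category a hit belongs to, or to spot entries that were copied into several lists by accident.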

### SENSITIVE_CATEGORIES

Location: `backend/app/services/distribution/platform_rules.py`

```python
SENSITIVE_CATEGORIES = {
    "politics": ["政治敏感词库"],  # political word list
    "medical": ["医疗敏感词库"],  # medical word list
    "finance": ["金融敏感词库"],  # financial word list
    "adult": ["低俗敏感词库"],  # vulgar-content word list
}
```

## Per-Platform Sensitive-Word Configuration

### Zhihu (zhihu)

```python
"sensitive_words": {
    "check_required": True,
    "categories": ["politics", "medical", "finance", "adult"],
    "max_tolerance": 0,  # zero tolerance
    "auto_filter": True,
}
```

### WeChat Official Accounts (wechat)

```python
"sensitive_words": {
    "check_required": True,
    "categories": ["politics", "medical", "finance", "adult"],
    "max_tolerance": 0,
    "auto_filter": True,
}
```

### Baijiahao (baijiahao)

```python
"sensitive_words": {
    "check_required": True,
    "categories": ["politics", "medical", "finance", "adult"],
    "max_tolerance": 0,
    "auto_filter": True,
}
```

### Toutiao (toutiao)

```python
"sensitive_words": {
    "check_required": True,
    "categories": ["politics", "medical", "finance", "adult"],
    "max_tolerance": 0,
    "auto_filter": True,
}
```

### Weibo (weibo)

```python
"sensitive_words": {
    "check_required": True,
    "categories": ["politics", "adult"],  # checks political and vulgar content only
    "max_tolerance": 2,  # a small number of occurrences is allowed
    "auto_filter": True,
}
```

### Xiaohongshu (xiaohongshu)

```python
"sensitive_words": {
    "check_required": True,
    "categories": ["adult"],  # checks vulgar content only
    "max_tolerance": 0,
    "auto_filter": True,
}
```

### Bilibili (bilibili)

```python
"sensitive_words": {
    "check_required": True,
    "categories": ["politics", "adult"],
    "max_tolerance": 0,
    "auto_filter": True,
}
```

### Jianshu (jianshu)

```python
"sensitive_words": {
    "check_required": True,
    "categories": ["politics", "adult"],
    "max_tolerance": 0,
    "auto_filter": True,
}
```

### Juejin (juejin)

```python
"sensitive_words": {
    "check_required": True,
    "categories": ["politics"],  # checks political content only
    "max_tolerance": 0,
    "auto_filter": True,
}
```

### Douyin (douyin)

```python
"sensitive_words": {
    "check_required": True,
    "categories": ["politics", "adult"],
    "max_tolerance": 0,
    "auto_filter": True,
}
```
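
One possible reading of these settings is sketched below: `categories` selects which word lists apply, and content passes when the number of hits does not exceed `max_tolerance`. The source does not define how `max_tolerance` is counted, so counting total occurrences is an assumption. The names `PLATFORM_SENSITIVE_CONFIG`, `words_for_platform`, and `within_tolerance`, and the shortened word lists, are also hypothetical.

```python
# Shortened word lists and two platform entries, for illustration only.
SENSITIVE_WORDS = {
    "politics": ["抗议", "游行"],
    "medical": ["药品", "处方"],
    "adult": ["赌博", "暴力"],
}

PLATFORM_SENSITIVE_CONFIG = {  # hypothetical container for the fragments above
    "weibo": {"check_required": True, "categories": ["politics", "adult"],
              "max_tolerance": 2, "auto_filter": True},
    "xiaohongshu": {"check_required": True, "categories": ["adult"],
                    "max_tolerance": 0, "auto_filter": True},
}


def words_for_platform(platform: str) -> list[str]:
    """Collect the words from every category the platform checks."""
    config = PLATFORM_SENSITIVE_CONFIG[platform]
    if not config["check_required"]:
        return []
    words = []
    for category in config["categories"]:
        words.extend(SENSITIVE_WORDS.get(category, []))
    return words


def within_tolerance(content: str, platform: str) -> bool:
    """Assumed semantics: total occurrences must not exceed max_tolerance."""
    hits = sum(content.count(word) for word in words_for_platform(platform))
    return hits <= PLATFORM_SENSITIVE_CONFIG[platform]["max_tolerance"]


text = "游行现场出现暴力"
print(within_tolerance(text, "weibo"))        # two hits against a tolerance of 2
print(within_tolerance(text, "xiaohongshu"))  # one hit against a tolerance of 0
```

Under this reading, the same text can be acceptable on Weibo and rejected on Xiaohongshu, because the two platforms differ in both the categories checked and the tolerance.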

## Filtering Mechanism

### SensitiveFilter

Location: `backend/app/services/content/sensitive_filter.py`

```python
class SensitiveFilter:
    def filter(self, content: str, platform: str) -> FilterResult:
        """
        Filter sensitive words.

        1. Load the platform's sensitive-word configuration.
        2. Merge the base word list with the custom word list.
        3. Check each word and replace any matches.
        """
```

### Filter Result

```python
from dataclasses import dataclass


@dataclass
class FilterResult:
    filtered_content: str  # content after filtering
    found_words: list  # sensitive words that were found
    replacements: dict  # replacement mapping
```

### Replacement Rules

- Each sensitive word is replaced with `*` characters.
- The number of `*` characters equals the length of the original word.
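
The two rules above amount to a length-preserving mask. A minimal sketch follows, assuming plain substring matching and a hypothetical helper name `mask_words`; the source does not show how `SensitiveFilter` performs the replacement.

```python
def mask_words(content: str, words: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each listed word with '*' repeated to the word's length."""
    replacements = {}
    # Longer words first, so a short word never splits a longer match.
    for word in sorted(set(words), key=len, reverse=True):
        if word and word in content:
            mask = "*" * len(word)
            content = content.replace(word, mask)
            replacements[word] = mask
    return content, replacements


original = "这只基金的收益率很高"
masked, mapping = mask_words(original, ["基金", "收益率"])
print(masked)   # 这只**的***很高
print(mapping)
```

Because every word is replaced by a mask of equal length, the filtered text keeps the same character count as the original, so offsets computed before filtering stay valid.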

## AI Writing Patterns

### AI_PATTERNS

Location: `backend/app/services/distribution/platform_rules.py`

```python
AI_PATTERNS = {
    "banned_transitions": [
        "总之", "综上所述", "值得注意的是", "让我们",
        "总而言之", "不可否认", "毋庸置疑",
        "首先", "其次", "最后", "最后但同样重要",
        "换句话说", "也就是说", "更重要的是", "可以说",
    ],
    "banned_modifiers": [
        "至关重要", "不可或缺", "举足轻重", "蓬勃发展",
        "日新月异", "深远影响", "全面提升", "显著成效",
        "重大突破", "核心要素",
    ],
    "banned_structures": [
        r"第一[,、].*第二[,、].*第三",  # symmetric three-part structure
        r"一方面[,、].*另一方面",
    ],
    "safe_patterns": [
        "根据研究表明", "调研数据显示", "经验告诉我们",
        "事实上", "说白了", "说实话", "说真的",
    ],
}
```
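
How these lists are consumed is not shown here. One plausible use, sketched below, is to scan a draft for banned phrases with substring checks and for banned structures with `re.search`. The function name `find_ai_patterns`, the shortened pattern lists, and the use of `re.DOTALL` are assumptions.

```python
import re

# Shortened excerpt of AI_PATTERNS, for illustration only.
AI_PATTERNS = {
    "banned_transitions": ["总之", "综上所述", "首先", "其次", "最后"],
    "banned_modifiers": ["至关重要", "不可或缺"],
    "banned_structures": [r"一方面[,、].*另一方面"],
}


def find_ai_patterns(text: str) -> dict[str, list[str]]:
    """Return the banned phrases and structure patterns found in text."""
    hits = {}
    for key in ("banned_transitions", "banned_modifiers"):
        found = [phrase for phrase in AI_PATTERNS[key] if phrase in text]
        if found:
            hits[key] = found
    # DOTALL lets a structure span line breaks; that choice is an assumption.
    structures = [pattern for pattern in AI_PATTERNS["banned_structures"]
                  if re.search(pattern, text, flags=re.DOTALL)]
    if structures:
        hits["banned_structures"] = structures
    return hits


draft = "总之,一方面,成本在下降,另一方面,质量至关重要。"
report = find_ai_patterns(draft)
print(report)
```

An empty result means none of the listed phrases or structures were found; it says nothing about whether a platform's own detector would flag the text.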

## Platform AI-Detection Sensitivity

| Platform | Detection level | humanization_required |
|----------|-----------------|-----------------------|
| Zhihu | high | true |
| WeChat | medium | true |
| Baijiahao | high | true |
| Toutiao | high | true |
| Weibo | low | false |
| Xiaohongshu | low | false |
| Bilibili | medium | true |
| Jianshu | medium | true |
| Juejin | high | true |
| Douyin | low | false |
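
The table maps naturally onto a lookup structure. The sketch below encodes it as a dictionary keyed by the platform identifiers used earlier in this document. The names `AI_SENSITIVITY` and `needs_humanization`, and the choice to treat unknown platforms as requiring humanization, are assumptions rather than the contents of `platform_rules.py`.

```python
AI_SENSITIVITY = {  # hypothetical name; values copied from the table above
    "zhihu": ("high", True),
    "wechat": ("medium", True),
    "baijiahao": ("high", True),
    "toutiao": ("high", True),
    "weibo": ("low", False),
    "xiaohongshu": ("low", False),
    "bilibili": ("medium", True),
    "jianshu": ("medium", True),
    "juejin": ("high", True),
    "douyin": ("low", False),
}


def needs_humanization(platform: str) -> bool:
    """Return humanization_required; unknown platforms default to True (assumption)."""
    _level, required = AI_SENSITIVITY.get(platform, ("high", True))
    return required


print(needs_humanization("zhihu"))
print(needs_humanization("weibo"))
```

In the table, `humanization_required` is false exactly for the platforms whose detection level is `low`.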
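
## Custom Sensitive Words

Custom sensitive words can be added under a category name of your choice:

```python
# Named sensitive_filter to avoid shadowing the built-in filter().
sensitive_filter = SensitiveFilter()
sensitive_filter.add_custom_words("custom_category", ["词1", "词2", "词3"])  # placeholder words
```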
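
## Detection Flow

```
1. Get the sensitive-word categories configured for the platform
2. Merge the base sensitive-word lists
3. Add custom sensitive words
4. Scan the content for each word
5. Replace matches and record them
6. Return the filter result
```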
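
The six steps can be sketched end to end as follows. This is an illustrative stand-in, not the project's `SensitiveFilter`: the class name `SketchFilter`, the constants `BASE_WORDS` and `PLATFORM_CATEGORIES`, the shortened word lists, and the decision to apply custom words on every platform are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class FilterResult:
    filtered_content: str
    found_words: list = field(default_factory=list)
    replacements: dict = field(default_factory=dict)


BASE_WORDS = {"politics": ["抗议"], "adult": ["赌博", "暴力"]}  # shortened lists
PLATFORM_CATEGORIES = {"juejin": ["politics"], "douyin": ["politics", "adult"]}


class SketchFilter:
    """Illustrative stand-in for SensitiveFilter."""

    def __init__(self) -> None:
        self.custom_words: dict[str, list[str]] = {}

    def add_custom_words(self, category: str, words: list[str]) -> None:
        self.custom_words.setdefault(category, []).extend(words)

    def filter(self, content: str, platform: str) -> FilterResult:
        # Steps 1-2: categories configured for the platform, then their base words.
        words = []
        for category in PLATFORM_CATEGORIES.get(platform, []):
            words.extend(BASE_WORDS.get(category, []))
        # Step 3: custom words apply to every platform in this sketch (assumption).
        for custom in self.custom_words.values():
            words.extend(custom)
        # Steps 4-5: scan, mask with '*' of equal length, and record each hit.
        result = FilterResult(filtered_content=content)
        for word in sorted(set(words), key=len, reverse=True):
            if word and word in result.filtered_content:
                mask = "*" * len(word)
                result.filtered_content = result.filtered_content.replace(word, mask)
                result.found_words.append(word)
                result.replacements[word] = mask
        # Step 6: return the result.
        return result


sketch = SketchFilter()
sketch.add_custom_words("custom_category", ["词1"])
out = sketch.filter("抗议、赌博与词1", "juejin")
print(out.filtered_content)
```

In this example "赌博" is left untouched because `juejin` only checks the `politics` category, while the custom word is masked regardless of platform.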