Add community monitoring MVP
parent d7f8450123
commit 912057de0a
61
.codex/tasks/steam-monitor-mvp.md
Normal file
@@ -0,0 +1,61 @@
# Steam Monitor MVP

## Confirmed requirements

- Product: 《帝国幻想乡~TOHOTOPIA》
- Steam AppID: `3774440`
- Sources: Steam reviews, plus Steam discussion topics and replies
- Refresh: every 30 minutes; the first run is a full scan, later runs are incremental
- Classification model: OpenRouter `deepseek/deepseek-v4-pro`
- API key: `.env` / environment variable `OPENROUTER_API_KEY`
- Dashboard: shows the classification, original link, whether a reply is recommended, handling status, and producer/handler notes

## Current plan

- [x] T1 Set up the Python/FastAPI + SQLite MVP.
- [x] T2 Implement fetching from the Steam reviews API.
- [x] T3 Implement fetching of Steam discussion topics and replies.
- [x] T4 Implement SQLite deduplication, handling status, and sync cursors.
- [x] T5 Implement structured classification via OpenRouter.
- [x] T6 Implement the dashboard, manual sync, and status updates.
- [x] T7 Run a local smoke test and start the LAN service.
- [ ] T8 Integrate the next community platform.

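## Execution log

- 2026-05-16: Created the task record and started implementing the project skeleton.
- 2026-05-16: Completed the Python/FastAPI + SQLite MVP: fetching of Steam reviews, discussion topics, and replies; dashboard display; manual sync; background incremental sync every 30 minutes; handling-status updates.
- 2026-05-16: The local smoke test fetched 384 items: 132 reviews, 75 discussion topics, 177 replies. `OPENROUTER_API_KEY` was not configured, so model analysis went to `error` as expected; it can be re-run once `.env` is configured.
- 2026-05-16: The service is running at `http://127.0.0.1:8000`.
- 2026-05-16: After the user filled in `.env`, the "补跑分析" (re-run analysis) button showed no visible response. The cause was that the old uvicorn process had not read the new `.env`, and the re-run endpoint waited synchronously on the model calls. The button now returns immediately, the re-run happens in the background in batches of 20, and the dashboard shows "已分析 / 待补跑" (analyzed / pending re-run).
- 2026-05-16: The service now listens on the LAN at `0.0.0.0:8000`; the LAN address was detected as `http://10.27.16.17:8000` at the time.
- 2026-05-16: Fixed the discussion sort order. Root cause: `published_at` was not being parsed for Steam discussions. The parser now supports `x 小时以前`, `3 月 7 日 下午 4:52`, and `2025 年 8 月 9 日 下午 3:29`, and 252 discussion records were backfilled.
- 2026-05-16: At the user's request, re-ran analysis for content published after 2026-05-01. 209 items in total: 132 reviews, 26 discussion topics, 51 discussion replies; all ended as `done`.
- 2026-05-16: Added "最近更新时间" (last updated) to the dashboard header. It prefers the finish time of the most recent successful sync and falls back to the latest collection time.
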
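The `published_at` fix in the log above can be illustrated with a minimal sketch. It is not the project's parser: the function name `parse_steam_time`, the regular expressions, and the choice to assume the current year for year-less dates are all assumptions made for this example. It only covers the three text shapes named in the log.

```python
from __future__ import annotations

import re
from datetime import datetime, timedelta

# Relative form, e.g. "5 小时以前" (5 hours ago).
_HOURS_AGO = re.compile(r"(\d+)\s*小时以前")
# Absolute form with an optional year, e.g. "3 月 7 日 下午 4:52"
# or "2025 年 8 月 9 日 下午 3:29".
_ABSOLUTE = re.compile(
    r"(?:(\d{4})\s*年\s*)?(\d{1,2})\s*月\s*(\d{1,2})\s*日\s*"
    r"(上午|下午)\s*(\d{1,2}):(\d{2})"
)


def parse_steam_time(text: str, now: datetime) -> datetime | None:
    """Hypothetical helper: turn a Steam discussion time label into a datetime."""
    match = _HOURS_AGO.search(text)
    if match:
        return now - timedelta(hours=int(match.group(1)))
    match = _ABSOLUTE.search(text)
    if not match:
        return None  # unknown shape: leave published_at empty rather than guess
    year, month, day, meridiem, hour, minute = match.groups()
    hour_24 = int(hour) % 12          # 12 AM -> 0, 12 PM -> 0 before the shift
    if meridiem == "下午":             # PM: shift into the afternoon
        hour_24 += 12
    # Assumption: a label without a year refers to the current year.
    return datetime(int(year) if year else now.year, int(month), int(day), hour_24, int(minute))
```

Passing `now` explicitly keeps the relative form testable; a caller would convert the result to a Unix timestamp before storing it in `published_at`.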
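## Resume entry points

- Plan document: `任务/方案/steam社区监控一期计划.md`
- README: `README.md`
- CLI: `python -m app.cli sync --full`, `python -m app.cli analyze-pending --since 2026-05-01 --limit 20`
- Dashboard: `python -m uvicorn app.main:app --host 0.0.0.0 --port 8000`
- Current service: listening on the LAN at `0.0.0.0:8000`

## Current status

- The Steam phase-one MVP is complete.
- Current data file: `data/tohotopia_monitor.sqlite3`.
- The dashboard currently has no login or authentication; anyone with LAN access can view items and change their handling status.
- Current sort order: items with a recommended reply come first; within each group, newest publication time first.
- Current background task: after FastAPI starts, an incremental sync runs every 30 minutes.
- Current OpenRouter key: `OPENROUTER_API_KEY` from `.env`.

## Entry point for the next phase

When adding other community platforms:

- First read `AGENTS.md`, `README.md`, this task document, and `任务/方案/steam社区监控一期计划.md`.
- A new platform collector should output `app.models.RawItem`.
- Keep reusing `raw_items`, `analysis_results`, and `work_items`.
- Do not put platform-private fields of a new platform directly into dashboard query conditions; store them in `raw_json` and map them to the unified fields first.
- For platforms that need a login session, an API, anti-scraping workarounds, or browser automation, verify the current facts before implementing.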
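The collector rules above can be sketched as follows. The `RawItem` class here is a stand-in whose field names mirror the keyword arguments used in `app/main.py`; the real `app.models.RawItem` may differ. The platform name `example_forum` and the payload keys (`id`, `url`, `author`, `text`, `created`, `upvotes`) are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class RawItem:
    """Stand-in for app.models.RawItem (an assumption, not the project's class)."""
    source: str
    source_item_id: str
    source_url: str
    content_type: str
    author_id: str | None
    author_name: str | None
    title: str | None
    published_at: int | None
    published_at_text: str | None
    updated_at_source: int | None
    content: str
    raw: dict[str, Any] = field(default_factory=dict)


def to_raw_item(post: dict[str, Any]) -> RawItem:
    """Map one payload from a hypothetical platform onto the unified fields."""
    return RawItem(
        source="example_forum",
        source_item_id=str(post["id"]),
        source_url=post["url"],
        content_type="example_forum_post",
        author_id=None,
        author_name=post.get("author"),
        title=None,
        published_at=post.get("created"),
        published_at_text=None,
        updated_at_source=None,
        content=post["text"],
        # Platform-private fields such as "upvotes" stay in the raw payload
        # instead of becoming dashboard query columns.
        raw=dict(post),
    )
```

Keeping the whole payload in `raw` means a later phase can promote a field to a unified column without re-collecting.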
86
.codex/tasks/twitter-monitor-mvp.md
Normal file
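@@ -0,0 +1,86 @@
# Twitter Monitor MVP

Date: 2026-05-16
Status: completed

## Background

The user asked to add collection and handling of player feedback from X.com/Twitter on top of the completed Steam MVP. The target source is `https://x.com/Tohotopia`, and the scope is all posts and all replies. The first run is a full scan and later runs are incremental by time, reusing the existing `RawItem -> raw_items -> OpenRouter -> analysis_results -> work_items -> dashboard` flow.

## Confirmed requirements

- In scope: collect the account posts of the X.com/Twitter account `Tohotopia` and the replies under each post, normalize them to `RawItem`, and feed them into the existing sync, analysis, and dashboard flow.
- Out of scope: do not change the dashboard's core data structures; do not promote Twitter-private fields to dashboard query fields; do not fabricate an empty result when not logged in.
- Success criteria: when a logged-in session is available locally, the CLI/sync can collect Twitter posts and replies and store them with deduplication, and new content flows into OpenRouter analysis and the dashboard.
- Key constraints: the current X.com pages, API, and login state are dynamic facts, so verify them first with a local smoke test; a failed collection must not delete existing data.
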
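The rule "do not fabricate an empty result when not logged in" can be sketched like this. It is an illustration only: `LoginRequiredError` and `run_twitter_step` are hypothetical names, and the stats keys simply echo the `twitter_errors` and `twitter_fetched` counters and the `partial` status that appear in this task's log.

```python
from __future__ import annotations

from typing import Any, Callable


class LoginRequiredError(RuntimeError):
    """Hypothetical error raised when the scraper hits the X.com login prompt."""


def run_twitter_step(
    fetch: Callable[[], list[dict[str, Any]]],
    stats: dict[str, int],
) -> tuple[list[dict[str, Any]], str, str]:
    """Run one collection step and report failure explicitly.

    A login failure is counted as an error and the run is marked "partial";
    it is never reported as a successful run that found nothing, and nothing
    already stored is touched.
    """
    try:
        items = fetch()
    except LoginRequiredError as exc:
        stats["twitter_errors"] = stats.get("twitter_errors", 0) + 1
        return [], "partial", str(exc)
    stats["twitter_fetched"] = len(items)
    return items, "success", ""
```

The caller can then write the returned status and message into its run record while leaving existing rows in place.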
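## Docs/code pre-read

- Project AGENTS: each new channel encapsulates its own collection, parsing, rate limiting, login state, and failure handling; operational judgments must be traceable to the platform, original link, collection time, or batch.
- Relevant docs: `README.md` and `任务/方案/后续社区平台接入指南.md` state that a new platform collector outputs `app.models.RawItem` and reuses the three-layer data model.
- Relevant code: `app/sync.py` currently collects Steam only; `RawItem` in `app/models.py` can hold Twitter data; `app/db.py` already has `raw_json` and the `(source, source_item_id)` unique key.
- Confirmed facts: the existing `social-media-scraper` skill supports X.com user timelines and single-post replies; it intercepts API responses through a logged-in Chrome/CDP profile and outputs JSON/CSV.
- Conflicts / ambiguities: the user was unsure whether a login session is required; the local smoke test verified that the current profile hits the X.com login prompt.

## Terms and conflicts

- Resolved terms: the Twitter/X platform uses `twitter` as its configuration prefix in code; the source types are `twitter_posts` and `twitter_replies`.
- Conflicts: none.
- Follow-up CONTEXT / glossary updates: there is no project-level `CONTEXT.md` yet, so the terms from this task are recorded in this task document.

## Current plan

- [x] T1 Pre-read the docs and the existing Steam flow code.
- [x] T2 Verify the X.com target page and the prerequisites of the available collection tool.
- [x] T3 Draft the Twitter integration plan and data mapping.
- [x] T4 Implement the collector and wire it into the sync flow.
- [x] T5 Add CLI/configuration/docs and the task record.
- [x] T6 Run a smoke test to verify storage, analysis, and the dashboard.

## Key judgments and evidence

| Judgment | Type (stable principle / current fact / inference) | Evidence | Verified on | Unverified | Decision impact |
|------|--------------------------------|------|----------|----------|----------|
| A new platform should output `RawItem` and then reuse the sync pipeline | Stable principle | README, the follow-up platform integration guide, `app/models.py` | 2026-05-16 | None | Keeps the dashboard from depending directly on Twitter-private fields |
| Collecting from X.com currently requires a login session | Current fact | `social-media-scraper` not-logged-in prompt; a small sample fetched 18 items after login | 2026-05-16 | Total reply count and duration of a full run | The implementation must handle the not-logged-in failure explicitly and state the prerequisite |
| Reusing the existing CDP collection script is safer than rewriting against the X.com API | Inference | The existing skill supports UserTweets/TweetDetail; after login the project sync entry point stored and analyzed items successfully | 2026-05-16 | Duration of a full run | Add an in-project adapter layer that reads the JSON and converts it to RawItem |
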
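## Execution log

- 14:00: Read `AGENTS.md`, `README.md`, `.codex/tasks/steam-monitor-mvp.md`, and `任务/方案/后续社区平台接入指南.md` to confirm the rules for integrating a new platform.
- 14:05: Read the `social-media-scraper` skill and confirmed that X.com user timelines and single-post replies are supported and that the output location can be specified.
- 14:10: Ran `python C:\Users\jiajiankun\.codex\skills\social-media-scraper\scraper.py https://x.com/Tohotopia --max-no-new 1 --output-dir 任务/验证/twitter-smoke`; the result was that the current Chrome profile is not logged in to X.com.
- 14:25: Added `app/twitter.py`, which converts the timeline/thread JSON output by `social-media-scraper` into `RawItem`, with content types `twitter_post` / `twitter_reply` and sources `twitter_posts` / `twitter_replies`.
- 14:35: Extended `app/config.py`, `app/sync.py`, `app/cli.py`, and `app/main.py` to support `TWITTER_ENABLED`, platform-level sync, a Twitter-only CLI run, and dashboard type filtering.
- 14:42: Updated `.env.example`, `README.md`, and `requirements.txt` to document the Twitter login prerequisite, configuration, and dependencies.
- 14:50: Fixed the Twitter incremental high-water mark. It changed from "the finish time of the most recent sync" to "the maximum publication/collection time of Twitter content already stored", so that content published before the sync finished is not missed.
- 14:55: Verified that `python -m compileall app` passes; with the default configuration, `python -m app.cli sync --platform twitter` returns `twitter_skipped=1`; with Twitter temporarily enabled it returns `twitter_errors=1` and `sync_runs.status=partial`, and no empty Twitter data is inserted.
- 19:17: After the user logged in to X.com in the CDP Chrome profile, ran a small-sample check with `social-media-scraper` and fetched 18 timeline items from `Tohotopia`.
- 19:21: Ran a small-scope project sync check with `TWITTER_ENABLED=true`, `TWITTER_INCREMENTAL_MAX_NO_NEW=1`, `TWITTER_THREAD_MAX_NO_NEW=1`, `TWITTER_INCREMENTAL_REPLY_PARENT_LIMIT=2`, and `python -m app.cli sync --platform twitter`. Result: `twitter_fetched=26`, 22 new, 22 analyzed, 4 already seen.
- 19:25: Confirmed in the database that 18 `twitter_posts` and 4 `twitter_replies` were stored; the latest sync `id=12` has status `success`.
- 19:24-19:51: After setting `TWITTER_ENABLED=true`, the user started `python -m app.cli sync --platform twitter --full`. After the user interrupted the command, a process was still alive, but there was no file growth within 30 seconds and CPU usage barely changed, so it was judged to have stalled.
- 19:53: Stopped the leftover full-sync process `pid=81152` and changed `sync_runs id=13` from `running` to `partial`, keeping the data already stored. The final Twitter data is 34 posts and 139 replies, 173 items in total, all with analysis status `done`.
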
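The 14:50 high-water-mark fix can be sketched with a small query. This is an assumption about how such a cursor could be computed, not the code in `app/sync.py`: the function name `twitter_high_water_mark` is hypothetical, the table is reduced to three columns of `raw_items`, and `steam_reviews` is a made-up source value used only to show that other platforms are ignored.

```python
from __future__ import annotations

import sqlite3

TWITTER_SOURCES = ("twitter_posts", "twitter_replies")


def twitter_high_water_mark(conn: sqlite3.Connection) -> int | None:
    """Return the newest publication time among stored Twitter items.

    Falls back to the collection time when a row has no publication time.
    Returns None when nothing has been stored yet, which a caller could
    treat as "run a full scan".
    """
    row = conn.execute(
        """
        SELECT MAX(COALESCE(published_at, collected_at))
        FROM raw_items
        WHERE source IN (?, ?)
        """,
        TWITTER_SOURCES,
    ).fetchone()
    return row[0]


# Reduced, in-memory stand-in for the raw_items table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_items (source TEXT, published_at INTEGER, collected_at INTEGER)"
)
conn.executemany(
    "INSERT INTO raw_items VALUES (?, ?, ?)",
    [
        ("twitter_posts", 1000, 5000),
        ("twitter_replies", None, 1200),
        ("steam_reviews", 9000, 9000),  # hypothetical non-Twitter source
    ],
)
mark = twitter_high_water_mark(conn)
```

Because the cursor comes from what is actually stored, a run that finished at time 5000 without having stored a post published at 3000 does not move the cursor past that post.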
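## Current status

- Done: docs/code pre-read; verification of the X.com login prerequisite; the Twitter collection adapter layer, configuration, sync, CLI, dashboard wording, and documentation updates; compilation, the not-logged-in failure path, and a small-scope end-to-end check after login.
- Blockers: none.
- Next step: to run the first full scan of "all posts and all replies", set `TWITTER_ENABLED=true` in `.env` and run `python -m app.cli sync --platform twitter --full`.

## Five-layer change candidates

- None.

## Resume entry points

Read these first when resuming:

- Key files: `app/twitter.py`, `app/sync.py`, `app/config.py`, `app/cli.py`, `app/main.py`.
- Current goal: bring the posts and replies of `https://x.com/Tohotopia` into the existing RawItem flow.
- Current status: the implementation is complete; the X.com login session is saved in the CDP profile; the small-scope sync succeeded; one full sync was interrupted, after which the leftover process was cleaned up and the 173 analyzed items were kept.
- Most recently completed: cleaned up the leftover full-sync process and marked `sync_runs id=13` as partial.
- Next step: to continue the full scan, run `python -m app.cli sync --platform twitter --full` again; the existing deduplication skips content that is already stored.
- Do not: treat a failure caused by not being logged in as "no data"; change the dashboard data model.
- Files changed: `.codex/tasks/twitter-monitor-mvp.md`, `app/twitter.py`, `app/config.py`, `app/sync.py`, `app/cli.py`, `app/main.py`, `.env.example`, `README.md`, `requirements.txt`.
- Verification results: `python -m compileall app` passes; with the default configuration Twitter is disabled and skipped; a not-logged-in run ends as partial; the project sync succeeds after login; Twitter currently has 34 posts and 139 replies, and all 173 items are `done`.
- Current blockers: none.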
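Re-running a full sync is safe because of the `(source, source_item_id)` unique key. A minimal sketch of that idea follows; it is not the project's `upsert_raw_item`, the helper name `insert_if_new` is hypothetical, and the table is cut down to the columns needed for the example.

```python
from __future__ import annotations

import sqlite3


def insert_if_new(conn: sqlite3.Connection, source: str, source_item_id: str, content: str) -> bool:
    """Insert a row unless (source, source_item_id) already exists.

    Returns True when a new row was written, False when it was skipped.
    """
    cursor = conn.execute(
        "INSERT OR IGNORE INTO raw_items (source, source_item_id, content) VALUES (?, ?, ?)",
        (source, source_item_id, content),
    )
    return cursor.rowcount == 1


# Reduced, in-memory stand-in for the raw_items table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE raw_items (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        source TEXT NOT NULL,
        source_item_id TEXT NOT NULL,
        content TEXT NOT NULL,
        UNIQUE(source, source_item_id)
    )
    """
)
first = insert_if_new(conn, "twitter_posts", "1001", "hello")
second = insert_if_new(conn, "twitter_posts", "1001", "hello again")
other = insert_if_new(conn, "twitter_replies", "1001", "same id, different source")
total = conn.execute("SELECT COUNT(*) FROM raw_items").fetchone()[0]
```

The same item ID under a different source is a separate row, which is why the key includes `source`. Detecting edited content would need an additional comparison, for example on `content_hash`, which this sketch leaves out.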
22
.env.example
Normal file
@@ -0,0 +1,22 @@
OPENROUTER_API_KEY=
APP_ID=3774440
PRODUCT_NAME=帝国幻想乡~TOHOTOPIA
DATABASE_PATH=data/tohotopia_monitor.sqlite3
SYNC_INTERVAL_MINUTES=30
AUTO_SYNC_ENABLED=true
TWITTER_ENABLED=false
TWITTER_USERNAME=Tohotopia
TWITTER_BROWSER_PROVIDER=existing
TWITTER_OUTPUT_DIR=任务/社媒数据/twitter-monitor
TWITTER_FULL_MAX_NO_NEW=6
TWITTER_INCREMENTAL_MAX_NO_NEW=2
TWITTER_THREAD_MAX_NO_NEW=3
TWITTER_COMMAND_TIMEOUT_SECONDS=900
TWITTER_FULL_REPLY_POST_LIMIT=0
TWITTER_INCREMENTAL_REPLY_PARENT_LIMIT=20
DISCUSSION_FULL_SCAN_MAX_PAGES=500
DISCUSSION_INCREMENTAL_MAX_PAGES=5
FULL_SCAN_TIME_LIMIT_SECONDS=7200
OPENROUTER_MODEL=deepseek/deepseek-v4-pro
OPENROUTER_REFERER=http://localhost:8000
OPENROUTER_TITLE=TOHOTOPIA Steam Monitor
13
.gitignore
vendored
Normal file
@@ -0,0 +1,13 @@
.env
.venv/
__pycache__/
*.pyc
*.pyo
.pytest_cache/
.mypy_cache/

data/
任务/社媒数据/
任务/验证/**/*.json
任务/验证/**/*.csv
任务/验证/**/*.log
33
AGENTS.md
Normal file
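@@ -0,0 +1,33 @@
# AGENTS.md

## Project positioning

This project is a community monitoring and handling platform for newly released indie games. It adds collection, organization, analysis, and handling capabilities for community channels in stages.

The goal is not to cover every channel at once, but to first get a verifiable operations loop working: discover community content → normalize it into the database or a record → analyze priority → turn it into actionable work items → track the handling outcome.

## Domain boundaries

- The platform focuses on the community operations workflow; it is more than a collection of scraper scripts.
- Community content handling should distinguish between raw content, normalized records, analysis conclusions, and manual handling status.
- Operational judgments must be traceable to the source platform, original link, collection time, or collection batch.
- When integrating a new channel, first clarify its purpose in operations: feedback collection, sentiment monitoring, player support, content opportunities, competitor observation, or launch-effect tracking.

## Channel integration principles

- Each channel encapsulates its own collection, parsing, rate limiting, login state, and failure handling logic.
- Channel output should be normalized to stable fields where possible, so that upper-layer logic does not depend directly on page structure or platform-private fields.
- Repeated collection, edits, deleted or hidden content, and permission changes for the same item must each have an explicit handling strategy in the channel plan.
- Where an external platform's current API, page structure, rate limits, or terms of service are involved, real-time verification results take precedence.

## Data priorities

Prioritize keeping information that supports operational decisions and traceability:

- Source platform and original link
- Author identifier
- Publication time and collection time
- Body text or summary
- Engagement metrics
- Topic, sentiment, issue type, or handling tags
- Current handling status and owner record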
1
app/__init__.py
Normal file
@@ -0,0 +1 @@
"""TOHOTOPIA community monitor."""
61
app/cli.py
Normal file
@@ -0,0 +1,61 @@
from __future__ import annotations

import argparse
import json
import time

from .config import get_settings
from .db import init_db, session
from .sync import analyze_pending, run_sync


def _platforms(value: str | None) -> list[str] | None:
    if not value:
        return None
    selected = [part.strip().lower() for part in value.split(",") if part.strip()]
    allowed = {"steam", "twitter"}
    unknown = sorted(set(selected) - allowed)
    if unknown:
        raise argparse.ArgumentTypeError(f"Unsupported platform(s): {', '.join(unknown)}")
    return selected


def main() -> None:
    parser = argparse.ArgumentParser(description="TOHOTOPIA community monitor")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("init-db", help="Initialize SQLite database")

    sync_parser = sub.add_parser("sync", help="Fetch community content and analyze new items")
    sync_parser.add_argument("--full", action="store_true", help="Run first full scan")
    sync_parser.add_argument(
        "--platform",
        type=_platforms,
        help="Comma-separated platform list: steam,twitter. Defaults to all enabled platforms.",
    )

    analyze_parser = sub.add_parser("analyze-pending", help="Analyze pending/error items")
    analyze_parser.add_argument("--limit", type=int, default=50)
    analyze_parser.add_argument("--since", help="Only analyze items since YYYY-MM-DD")

    args = parser.parse_args()
    settings = get_settings()
    with session(settings.database_path) as conn:
        init_db(conn)
        if args.command == "init-db":
            result = {"database": str(settings.database_path)}
        elif args.command == "sync":
            result = run_sync(conn, settings, full=args.full, platforms=args.platform)
        elif args.command == "analyze-pending":
            since_ts = None
            if args.since:
                parsed = time.strptime(args.since, "%Y-%m-%d")
                since_ts = int(time.mktime(parsed))
            result = analyze_pending(conn, settings, limit=args.limit, since_ts=since_ts)
        else:
            raise SystemExit(f"Unknown command: {args.command}")
    print(json.dumps(result, ensure_ascii=False, indent=2))


if __name__ == "__main__":
    main()
94
app/config.py
Normal file
@@ -0,0 +1,94 @@
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
import os

from dotenv import load_dotenv


ROOT_DIR = Path(__file__).resolve().parent.parent
load_dotenv(ROOT_DIR / ".env")


def _int_env(name: str, default: int) -> int:
    value = os.getenv(name)
    if not value:
        return default
    return int(value)


def _bool_env(name: str, default: bool) -> bool:
    value = os.getenv(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    app_id: str
    product_name: str
    database_path: Path
    sync_interval_minutes: int
    auto_sync_enabled: bool
    twitter_enabled: bool
    twitter_username: str
    twitter_scraper_path: Path
    twitter_output_dir: Path
    twitter_browser_provider: str
    twitter_full_max_no_new: int
    twitter_incremental_max_no_new: int
    twitter_thread_max_no_new: int
    twitter_command_timeout_seconds: int
    twitter_full_reply_post_limit: int
    twitter_incremental_reply_parent_limit: int
    discussion_full_scan_max_pages: int
    discussion_incremental_max_pages: int
    full_scan_time_limit_seconds: int
    openrouter_api_key: str | None
    openrouter_model: str
    openrouter_referer: str
    openrouter_title: str


def get_settings() -> Settings:
    database_path = Path(os.getenv("DATABASE_PATH", "data/tohotopia_monitor.sqlite3"))
    if not database_path.is_absolute():
        database_path = ROOT_DIR / database_path
    twitter_scraper_path = Path(
        os.getenv(
            "TWITTER_SCRAPER_PATH",
            str(Path.home() / ".codex" / "skills" / "social-media-scraper" / "scraper.py"),
        )
    )
    if not twitter_scraper_path.is_absolute():
        twitter_scraper_path = ROOT_DIR / twitter_scraper_path
    twitter_output_dir = Path(os.getenv("TWITTER_OUTPUT_DIR", "任务/社媒数据/twitter-monitor"))
    if not twitter_output_dir.is_absolute():
        twitter_output_dir = ROOT_DIR / twitter_output_dir
    return Settings(
        app_id=os.getenv("APP_ID", "3774440"),
        product_name=os.getenv("PRODUCT_NAME", "帝国幻想乡~TOHOTOPIA"),
        database_path=database_path,
        sync_interval_minutes=_int_env("SYNC_INTERVAL_MINUTES", 30),
        auto_sync_enabled=_bool_env("AUTO_SYNC_ENABLED", True),
        twitter_enabled=_bool_env("TWITTER_ENABLED", False),
        twitter_username=os.getenv("TWITTER_USERNAME", "Tohotopia"),
        twitter_scraper_path=twitter_scraper_path,
        twitter_output_dir=twitter_output_dir,
        twitter_browser_provider=os.getenv("TWITTER_BROWSER_PROVIDER", "existing"),
        twitter_full_max_no_new=_int_env("TWITTER_FULL_MAX_NO_NEW", 6),
        twitter_incremental_max_no_new=_int_env("TWITTER_INCREMENTAL_MAX_NO_NEW", 2),
        twitter_thread_max_no_new=_int_env("TWITTER_THREAD_MAX_NO_NEW", 3),
        twitter_command_timeout_seconds=_int_env("TWITTER_COMMAND_TIMEOUT_SECONDS", 900),
        twitter_full_reply_post_limit=_int_env("TWITTER_FULL_REPLY_POST_LIMIT", 0),
        twitter_incremental_reply_parent_limit=_int_env("TWITTER_INCREMENTAL_REPLY_PARENT_LIMIT", 20),
        discussion_full_scan_max_pages=_int_env("DISCUSSION_FULL_SCAN_MAX_PAGES", 500),
        discussion_incremental_max_pages=_int_env("DISCUSSION_INCREMENTAL_MAX_PAGES", 5),
        full_scan_time_limit_seconds=_int_env("FULL_SCAN_TIME_LIMIT_SECONDS", 7200),
        openrouter_api_key=os.getenv("OPENROUTER_API_KEY"),
        openrouter_model=os.getenv("OPENROUTER_MODEL", "deepseek/deepseek-v4-pro"),
        openrouter_referer=os.getenv("OPENROUTER_REFERER", "http://localhost:8000"),
        openrouter_title=os.getenv("OPENROUTER_TITLE", "TOHOTOPIA Steam Monitor"),
    )
120
app/db.py
Normal file
@@ -0,0 +1,120 @@
from __future__ import annotations

from contextlib import contextmanager
from pathlib import Path
import json
import sqlite3
from typing import Any, Iterator


def connect(database_path: Path) -> sqlite3.Connection:
    database_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(database_path)
    conn.row_factory = sqlite3.Row
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA foreign_keys=ON")
    return conn


@contextmanager
def session(database_path: Path) -> Iterator[sqlite3.Connection]:
    conn = connect(database_path)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS raw_items (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            source TEXT NOT NULL,
            source_item_id TEXT NOT NULL,
            source_url TEXT NOT NULL,
            content_type TEXT NOT NULL,
            author_id TEXT,
            author_name TEXT,
            title TEXT,
            published_at INTEGER,
            published_at_text TEXT,
            collected_at INTEGER NOT NULL,
            updated_at_source INTEGER,
            content TEXT NOT NULL,
            raw_json TEXT NOT NULL,
            content_hash TEXT NOT NULL,
            analysis_status TEXT NOT NULL DEFAULT 'pending',
            UNIQUE(source, source_item_id)
        );

        CREATE TABLE IF NOT EXISTS analysis_results (
            raw_item_id INTEGER PRIMARY KEY,
            model TEXT NOT NULL,
            sentiment TEXT NOT NULL,
            is_positive INTEGER NOT NULL,
            is_negative INTEGER NOT NULL,
            has_actionable_feedback INTEGER NOT NULL,
            feedback_types TEXT NOT NULL,
            reply_recommended INTEGER NOT NULL,
            reply_priority TEXT NOT NULL,
            reply_suggestion TEXT NOT NULL,
            summary TEXT NOT NULL,
            priority TEXT NOT NULL,
            confidence REAL NOT NULL,
            reason TEXT NOT NULL,
            model_json TEXT NOT NULL,
            analyzed_at INTEGER NOT NULL,
            FOREIGN KEY(raw_item_id) REFERENCES raw_items(id) ON DELETE CASCADE
        );

        CREATE TABLE IF NOT EXISTS work_items (
            raw_item_id INTEGER PRIMARY KEY,
            status TEXT NOT NULL DEFAULT 'new',
            owner TEXT NOT NULL DEFAULT '',
            notes TEXT NOT NULL DEFAULT '',
            last_handled_at INTEGER,
            created_at INTEGER NOT NULL,
            updated_at INTEGER NOT NULL,
            FOREIGN KEY(raw_item_id) REFERENCES raw_items(id) ON DELETE CASCADE
        );

        CREATE TABLE IF NOT EXISTS sync_state (
            key TEXT PRIMARY KEY,
            value TEXT NOT NULL,
            updated_at INTEGER NOT NULL
        );

        CREATE TABLE IF NOT EXISTS sync_runs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            started_at INTEGER NOT NULL,
            finished_at INTEGER,
            mode TEXT NOT NULL,
            status TEXT NOT NULL,
            message TEXT NOT NULL DEFAULT '',
            stats_json TEXT NOT NULL DEFAULT '{}'
        );

        CREATE INDEX IF NOT EXISTS idx_raw_items_collected_at ON raw_items(collected_at DESC);
        CREATE INDEX IF NOT EXISTS idx_raw_items_content_type ON raw_items(content_type);
        CREATE INDEX IF NOT EXISTS idx_raw_items_analysis_status ON raw_items(analysis_status);
        CREATE INDEX IF NOT EXISTS idx_work_items_status ON work_items(status);
        """
    )


def encode_json(value: Any) -> str:
    return json.dumps(value, ensure_ascii=False, separators=(",", ":"))


def decode_json(value: str | None, default: Any = None) -> Any:
    if value is None:
        return default
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        return default
717
app/main.py
Normal file
@@ -0,0 +1,717 @@
from __future__ import annotations

from hashlib import sha1
from html import escape
import threading
import time
from typing import Any

from fastapi import FastAPI, Form, Query
from fastapi.responses import HTMLResponse, RedirectResponse

from .config import Settings, get_settings
from .db import decode_json, init_db, session
from .models import RawItem
from .openrouter import OpenRouterClient
from .sync import analyze_pending, run_sync, save_analysis, upsert_raw_item


app = FastAPI(title="TOHOTOPIA Steam Monitor")
sync_lock = threading.Lock()
analysis_lock = threading.Lock()
stop_event = threading.Event()


def current_settings() -> Settings:
    return get_settings()


def _fmt_ts(value: int | None) -> str:
    if not value:
        return ""
    return time.strftime("%Y-%m-%d %H:%M", time.localtime(int(value)))


def _badge(text: str, cls: str = "") -> str:
    return f'<span class="badge {cls}">{escape(text)}</span>'


def _manual_item_id(source_url: str, source_name: str, title: str, author_name: str, content: str) -> str:
    seed = source_url.strip() or "\n".join(
        [source_name.strip(), title.strip(), author_name.strip(), content.strip()]
    )
    return sha1(seed.encode("utf-8", errors="ignore")).hexdigest()


def _looks_chinese(text: str) -> bool:
    letters = [char for char in text if char.isalpha()]
    if not letters:
        return True
    cjk_count = sum(1 for char in letters if "\u4e00" <= char <= "\u9fff")
    return cjk_count / len(letters) >= 0.2


def _query(filters: dict[str, str]) -> tuple[str, list[Any]]:
    where = []
    params: list[Any] = []
    if filters.get("content_type"):
        where.append("r.content_type = ?")
        params.append(filters["content_type"])
    if filters.get("sentiment"):
        where.append("a.sentiment = ?")
        params.append(filters["sentiment"])
    if filters.get("status"):
        where.append("w.status = ?")
        params.append(filters["status"])
    if filters.get("reply") == "1":
        where.append("a.reply_recommended = 1")
    if filters.get("actionable") == "1":
        where.append("a.has_actionable_feedback = 1")
    if filters.get("q"):
        where.append("(r.content LIKE ? OR r.title LIKE ? OR a.summary LIKE ?)")
        like = f"%{filters['q']}%"
        params.extend([like, like, like])
    clause = "WHERE " + " AND ".join(where) if where else ""
    return clause, params


@app.on_event("startup")
def startup() -> None:
    settings = current_settings()
    with session(settings.database_path) as conn:
        init_db(conn)
    if settings.auto_sync_enabled:
        thread = threading.Thread(target=_sync_loop, name="steam-sync-loop", daemon=True)
        thread.start()


@app.on_event("shutdown")
def shutdown() -> None:
    stop_event.set()


def _sync_loop() -> None:
    settings = current_settings()
    interval_seconds = max(settings.sync_interval_minutes, 1) * 60
    while not stop_event.wait(interval_seconds):
        if not sync_lock.acquire(blocking=False):
            continue
        try:
            with session(settings.database_path) as conn:
                run_sync(conn, settings, full=False)
        except Exception:
            # Sync failures are recorded in sync_runs by run_sync when possible.
            pass
        finally:
            sync_lock.release()


@app.get("/", response_class=HTMLResponse)
def index(
    content_type: str = Query(""),
    sentiment: str = Query(""),
    status: str = Query(""),
    reply: str = Query(""),
    actionable: str = Query(""),
    q: str = Query(""),
    manual: str = Query(""),
    notice: str = Query(""),
) -> str:
    settings = current_settings()
    filters = {
        "content_type": content_type,
        "sentiment": sentiment,
        "status": status,
        "reply": reply,
        "actionable": actionable,
        "q": q,
    }
    with session(settings.database_path) as conn:
        clause, params = _query(filters)
        rows = conn.execute(
            f"""
            SELECT r.*, a.sentiment, a.is_positive, a.is_negative,
                   a.has_actionable_feedback, a.feedback_types, a.reply_recommended,
                   a.reply_priority, a.reply_suggestion, a.summary, a.priority,
                   a.confidence, a.reason, w.status, w.owner, w.notes
            FROM raw_items r
            LEFT JOIN analysis_results a ON a.raw_item_id = r.id
            LEFT JOIN work_items w ON w.raw_item_id = r.id
            {clause}
            ORDER BY
                COALESCE(a.reply_recommended, 0) DESC,
                COALESCE(r.published_at, r.collected_at) DESC,
                r.collected_at DESC,
                r.id DESC
            LIMIT 200
            """,
            params,
        ).fetchall()
        metrics = conn.execute(
            """
            SELECT
                COUNT(*) AS total,
                SUM(CASE WHEN w.status = 'new' THEN 1 ELSE 0 END) AS new_count,
                SUM(CASE WHEN a.is_negative = 1 THEN 1 ELSE 0 END) AS negative_count,
                SUM(CASE WHEN a.has_actionable_feedback = 1 THEN 1 ELSE 0 END) AS actionable_count,
                SUM(CASE WHEN a.reply_recommended = 1 THEN 1 ELSE 0 END) AS reply_count,
                SUM(CASE WHEN a.priority = 'high' THEN 1 ELSE 0 END) AS high_count,
                SUM(CASE WHEN r.analysis_status = 'done' THEN 1 ELSE 0 END) AS analyzed_count,
                SUM(CASE WHEN r.analysis_status = 'pending' THEN 1 ELSE 0 END) AS pending_count,
                SUM(CASE WHEN r.analysis_status = 'error' THEN 1 ELSE 0 END) AS error_count
            FROM raw_items r
            LEFT JOIN analysis_results a ON a.raw_item_id = r.id
            LEFT JOIN work_items w ON w.raw_item_id = r.id
            """
        ).fetchone()
        last_runs = conn.execute(
            "SELECT * FROM sync_runs ORDER BY started_at DESC LIMIT 5"
        ).fetchall()
        last_success = conn.execute(
            """
            SELECT finished_at FROM sync_runs
            WHERE status = 'success' AND finished_at IS NOT NULL
            ORDER BY finished_at DESC
            LIMIT 1
            """
        ).fetchone()
        latest_collected = conn.execute(
            "SELECT MAX(collected_at) AS collected_at FROM raw_items"
        ).fetchone()

    items_html = "\n".join(_render_item(row) for row in rows)
    runs_html = "\n".join(
        f"<li>{_fmt_ts(run['started_at'])} {escape(run['mode'])} "
        f"{escape(run['status'])} {escape(run['stats_json'] or '')} {escape(run['message'] or '')}</li>"
        for run in last_runs
    )
    return f"""
    <!doctype html>
    <html lang="zh-CN">
    <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width, initial-scale=1">
      <title>{escape(settings.product_name)} 社区监控</title>
      <style>{CSS}</style>
    </head>
    <body>
      <header>
        <div>
          <h1>{escape(settings.product_name)} 社区监控</h1>
          <p>Steam 与社区平台内容,每 {settings.sync_interval_minutes} 分钟刷新</p>
          <p>最近更新时间:{_last_update_text(last_success, latest_collected)}</p>
        </div>
        <div class="actions">
          <form method="post" action="/sync"><button>增量同步</button></form>
          <form method="post" action="/sync?full=1"><button class="secondary">全量同步</button></form>
          <form method="post" action="/analyze-pending"><button class="secondary">补跑分析</button></form>
          <a class="button secondary" href="/?manual=1">手动添加</a>
        </div>
      </header>
      <section class="metrics">
        {_metric("总内容", metrics["total"])}
        {_metric("未处理", metrics["new_count"])}
        {_metric("差评/负面", metrics["negative_count"])}
        {_metric("具体反馈", metrics["actionable_count"])}
        {_metric("建议回复", metrics["reply_count"])}
        {_metric("高优先级", metrics["high_count"])}
        {_metric("已分析", metrics["analyzed_count"])}
        {_metric("待补跑", (metrics["pending_count"] or 0) + (metrics["error_count"] or 0))}
      </section>
      {f'<div class="notice">{escape(notice)}</div>' if notice else ''}
      {_render_manual_form() if manual == '1' else ''}
      <form class="filters" method="get">
        {_select("content_type", content_type, {"": "全部类型", "review": "Steam 评测", "discussion_topic": "Steam 帖子", "discussion_reply": "Steam 回复", "twitter_post": "Twitter 帖子", "twitter_reply": "Twitter 回复", "manual_note": "手动添加"})}
        {_select("sentiment", sentiment, {"": "全部情绪", "positive": "正面", "negative": "负面", "mixed": "混合", "neutral": "中性"})}
        {_select("status", status, {"": "全部状态", "new": "未处理", "read": "已读", "needs_reply": "待回复", "replied": "已回复", "needs_fix": "待修复", "archived": "已归档"})}
        <label><input type="checkbox" name="reply" value="1" {'checked' if reply == '1' else ''}> 建议回复</label>
        <label><input type="checkbox" name="actionable" value="1" {'checked' if actionable == '1' else ''}> 具体反馈</label>
        <input name="q" placeholder="搜索正文/摘要" value="{escape(q)}">
        <button>筛选</button>
      </form>
      <main>{items_html or '<div class="empty">暂无数据。先运行同步。</div>'}</main>
      <aside>
        <h2>最近同步</h2>
        <ul>{runs_html or '<li>暂无同步记录</li>'}</ul>
      </aside>
    </body>
    </html>
    """


@app.post("/sync")
def sync(full: int = Query(0)) -> RedirectResponse:
    if sync_lock.acquire(blocking=False):
        thread = threading.Thread(target=_run_sync_background, args=(bool(full),), daemon=True)
        thread.start()
        return RedirectResponse("/?notice=同步已在后台开始,稍后刷新查看结果", status_code=303)
    return RedirectResponse("/?notice=已有同步任务正在运行", status_code=303)


@app.post("/analyze-pending")
def analyze() -> RedirectResponse:
    if analysis_lock.acquire(blocking=False):
        thread = threading.Thread(target=_run_analysis_background, kwargs={"limit": 20}, daemon=True)
        thread.start()
        return RedirectResponse("/?notice=补跑分析已在后台开始,每批最多 20 条,稍后刷新查看结果", status_code=303)
    return RedirectResponse("/?notice=已有补跑分析正在运行", status_code=303)


@app.post("/manual-items")
def create_manual_item(
    source_name: str = Form(...),
    source_url: str = Form(""),
    title: str = Form(""),
    author_name: str = Form(""),
    published_at_text: str = Form(""),
    content: str = Form(...),
    status: str = Form("new"),
    owner: str = Form(""),
    notes: str = Form(""),
) -> RedirectResponse:
    source_name = source_name.strip()
    source_url = source_url.strip()
    title = title.strip()
    author_name = author_name.strip()
    published_at_text = published_at_text.strip()
    content = content.strip()
    status = status if status in _work_status_options() else "new"

    if not source_name or not content:
        return RedirectResponse("/?manual=1&notice=来源社群和正文不能为空", status_code=303)

    original_content = content
    translated = False
    analysis_error = ""
    settings = current_settings()
    analyzer = OpenRouterClient(settings)
    try:
        if not _looks_chinese(content):
            content = analyzer.translate_to_chinese(content)
            translated = content != original_content
    except Exception as exc:  # noqa: BLE001 - keep manual entry even if translation fails
        analysis_error = f"翻译失败,已保留原文并标记待补跑:{exc}"

    item = RawItem(
        source="manual",
        source_item_id=_manual_item_id(source_url, source_name, title, author_name, content),
        source_url=source_url,
        content_type="manual_note",
        author_id=None,
        author_name=author_name or source_name,
        title=title or f"{source_name} 手动信息",
        published_at=None,
        published_at_text=published_at_text,
        updated_at_source=None,
        content=content,
        raw={
            "source_name": source_name,
            "source_url": source_url,
            "title": title,
            "author_name": author_name,
            "published_at_text": published_at_text,
            "original_content": original_content,
            "translated_to_zh": translated,
            "manual": True,
        },
    )
    now = int(time.time())
    try:
        with session(settings.database_path) as conn:
            raw_item_id, inserted = upsert_raw_item(conn, item)
            conn.execute(
                """
                UPDATE work_items
                SET status = ?, owner = ?, notes = ?, updated_at = ?,
                    last_handled_at = CASE WHEN ? != 'new' THEN ? ELSE last_handled_at END
                WHERE raw_item_id = ?
                """,
                (status, owner.strip(), notes.strip(), now, status, now, raw_item_id),
            )
            if not analysis_error:
                try:
                    analysis = analyzer.analyze(item)
                    save_analysis(conn, raw_item_id, settings.openrouter_model, analysis)
                except Exception as exc:  # noqa: BLE001 - keep pending/error for analyze-pending
                    analysis_error = f"分析失败,已标记待补跑:{exc}"
|
||||
conn.execute(
|
||||
"UPDATE raw_items SET analysis_status = 'error' WHERE id = ?",
|
||||
(raw_item_id,),
|
||||
)
|
||||
finally:
|
||||
analyzer.close()
|
||||
|
||||
parts = ["已添加手动信息" if inserted else "已更新同来源手动信息"]
|
||||
if translated:
|
||||
parts.append("已翻译成中文")
|
||||
if analysis_error:
|
||||
parts.append(analysis_error)
|
||||
else:
|
||||
parts.append("已生成是否回复和回复建议")
|
||||
notice = ",".join(parts)
|
||||
return RedirectResponse(f"/?notice={notice}", status_code=303)
|
||||
|
||||
|
||||
@app.post("/items/{raw_item_id}/work")
def update_work(
    raw_item_id: int,
    status: str = Form(...),
    owner: str = Form(""),
    notes: str = Form(""),
) -> RedirectResponse:
    # Reject unknown status values instead of writing them to the database.
    if status not in _work_status_options():
        return RedirectResponse("/?notice=无效的处理状态", status_code=303)
    settings = current_settings()
    now = int(time.time())
    with session(settings.database_path) as conn:
        conn.execute(
            """
            UPDATE work_items
            SET status = ?, owner = ?, notes = ?, updated_at = ?,
                last_handled_at = CASE WHEN ? != 'new' THEN ? ELSE last_handled_at END
            WHERE raw_item_id = ?
            """,
            (status, owner.strip(), notes.strip(), now, status, now, raw_item_id),
        )
    return RedirectResponse("/", status_code=303)
|
||||
|
||||
|
||||
def _run_sync_background(full: bool) -> None:
    # Everything runs inside try so the lock taken by the /sync handler is
    # released even if loading settings or opening the database fails.
    try:
        settings = current_settings()
        with session(settings.database_path) as conn:
            run_sync(conn, settings, full=full)
    finally:
        sync_lock.release()


def _run_analysis_background(limit: int) -> None:
    # Same pattern as above for the lock taken by the /analyze-pending handler.
    try:
        settings = current_settings()
        with session(settings.database_path) as conn:
            analyze_pending(conn, settings, limit=limit)
    finally:
        analysis_lock.release()
|
||||
|
||||
|
||||
def _notice_text(stats: dict[str, Any]) -> str:
|
||||
if not stats:
|
||||
return "无待处理项目"
|
||||
return ",".join(f"{key}={value}" for key, value in stats.items())
|
||||
|
||||
|
||||
def _last_update_text(last_success: Any, latest_collected: Any) -> str:
|
||||
if last_success and last_success["finished_at"]:
|
||||
return _fmt_ts(last_success["finished_at"])
|
||||
if latest_collected and latest_collected["collected_at"]:
|
||||
return _fmt_ts(latest_collected["collected_at"])
|
||||
return "暂无"
|
||||
|
||||
|
||||
def _metric(label: str, value: Any) -> str:
|
||||
return f'<div class="metric"><span>{escape(label)}</span><strong>{int(value or 0)}</strong></div>'
|
||||
|
||||
|
||||
def _select(name: str, current: str, options: dict[str, str]) -> str:
|
||||
option_html = "".join(
|
||||
f'<option value="{escape(value)}" {"selected" if value == current else ""}>{escape(label)}</option>'
|
||||
for value, label in options.items()
|
||||
)
|
||||
return f'<select name="{escape(name)}">{option_html}</select>'
|
||||
|
||||
|
||||
def _work_status_options() -> dict[str, str]:
|
||||
return {
|
||||
"new": "未处理",
|
||||
"read": "已读",
|
||||
"needs_reply": "待回复",
|
||||
"replied": "已回复",
|
||||
"needs_fix": "待修复",
|
||||
"archived": "已归档",
|
||||
}
|
||||
|
||||
|
||||
def _render_manual_form() -> str:
|
||||
return f"""
|
||||
<section class="manual-panel">
|
||||
<h2>手动添加社区信息</h2>
|
||||
<form class="manual-form" method="post" action="/manual-items">
|
||||
<input name="source_name" placeholder="来源社群/平台,例如 Discord、小红书、QQ群" required>
|
||||
<input name="source_url" placeholder="原始链接,可留空">
|
||||
<input name="title" placeholder="标题,可留空">
|
||||
<input name="author_name" placeholder="作者/昵称,可留空">
|
||||
<input name="published_at_text" placeholder="发布时间文本,可留空">
|
||||
<textarea name="content" placeholder="正文/摘要" required></textarea>
|
||||
{_select("status", "new", _work_status_options())}
|
||||
<input name="owner" placeholder="制作人/处理人">
|
||||
<input name="notes" placeholder="备注">
|
||||
<button>添加</button>
|
||||
</form>
|
||||
</section>
|
||||
"""
|
||||
|
||||
|
||||
def _render_item(row: Any) -> str:
    feedback_types = ", ".join(decode_json(row["feedback_types"], [])) if row["feedback_types"] else ""
    cls = "item urgent" if row["reply_recommended"] or row["priority"] == "high" else "item"
    badges = [
        _badge(row["content_type"] or "", "type"),
        _badge(row["sentiment"] or "pending", row["sentiment"] or ""),
        _badge(row["priority"] or "low", "priority"),
    ]
    if row["has_actionable_feedback"]:
        badges.append(_badge("具体反馈", "action"))
    if row["reply_recommended"]:
        badges.append(_badge("建议回复", "reply"))
    # Truncate the raw text before escaping so an HTML entity such as &amp;
    # is never cut in half at the 900-character boundary.
    raw_content = row["content"] or ""
    if len(raw_content) > 900:
        raw_content = raw_content[:900] + "..."
    content = escape(raw_content)
    return f"""
    <article class="{cls}">
      <div class="item-head">
        <div>
          <h2>{escape(row['summary'] or row['title'] or '未分析')}</h2>
          <div class="meta">{' '.join(badges)} <span>{escape(row['author_name'] or '')}</span> <span>{_fmt_ts(row['published_at']) or escape(row['published_at_text'] or '')}</span></div>
        </div>
        {_source_link(row['source_url'])}
      </div>
      <p class="content">{content}</p>
      <p class="reason">{escape(row['reason'] or '')}</p>
      <p class="reply-suggestion">{escape(row['reply_suggestion'] or '')}</p>
      <p class="types">{escape(feedback_types)}</p>
      <form class="work" method="post" action="/items/{row['id']}/work">
        {_select("status", row["status"] or "new", _work_status_options())}
        <input name="owner" placeholder="制作人/处理人" value="{escape(row['owner'] or '')}">
        <input name="notes" placeholder="备注" value="{escape(row['notes'] or '')}">
        <button>保存</button>
      </form>
    </article>
    """
|
||||
|
||||
|
||||
def _source_link(source_url: str | None) -> str:
|
||||
if not source_url:
|
||||
return '<span class="source muted">无原始链接</span>'
|
||||
if not source_url.startswith(("http://", "https://")):
|
||||
return f'<span class="source muted">{escape(source_url)}</span>'
|
||||
return (
|
||||
f'<a class="source" href="{escape(source_url)}" target="_blank" '
|
||||
f'rel="noreferrer">原始链接</a>'
|
||||
)
|
||||
|
||||
|
||||
CSS = """
|
||||
:root {
|
||||
color-scheme: light;
|
||||
font-family: Inter, "Segoe UI", "Microsoft YaHei", sans-serif;
|
||||
background: #f6f7f9;
|
||||
color: #1f2933;
|
||||
}
|
||||
body {
|
||||
margin: 0;
|
||||
}
|
||||
header {
|
||||
display: flex;
|
||||
justify-content: space-between;
|
||||
gap: 24px;
|
||||
align-items: center;
|
||||
padding: 24px 32px;
|
||||
background: #ffffff;
|
||||
border-bottom: 1px solid #d9dee7;
|
||||
}
|
||||
h1 {
|
||||
margin: 0 0 4px;
|
||||
font-size: 24px;
|
||||
}
|
||||
p {
|
||||
line-height: 1.5;
|
||||
}
|
||||
header p {
|
||||
margin: 0;
|
||||
color: #64748b;
|
||||
}
|
||||
.actions {
|
||||
display: flex;
|
||||
gap: 8px;
|
||||
flex-wrap: wrap;
|
||||
}
|
||||
button, .button, select, input, textarea {
|
||||
min-height: 36px;
|
||||
border: 1px solid #cbd5e1;
|
||||
border-radius: 6px;
|
||||
padding: 0 12px;
|
||||
background: #fff;
|
||||
font: inherit;
|
||||
}
|
||||
button, .button {
|
||||
display: inline-flex;
|
||||
align-items: center;
|
||||
background: #166534;
|
||||
color: white;
|
||||
border-color: #166534;
|
||||
cursor: pointer;
|
||||
text-decoration: none;
|
||||
}
|
||||
button.secondary, .button.secondary {
|
||||
background: #334155;
|
||||
border-color: #334155;
|
||||
}
|
||||
.metrics {
|
||||
display: grid;
|
||||
grid-template-columns: repeat(6, minmax(120px, 1fr));
|
||||
gap: 12px;
|
||||
padding: 18px 32px;
|
||||
}
|
||||
.metric {
|
||||
background: #fff;
|
||||
border: 1px solid #d9dee7;
|
||||
border-radius: 8px;
|
||||
padding: 14px;
|
||||
}
|
||||
.metric span {
|
||||
display: block;
|
||||
color: #64748b;
|
||||
font-size: 13px;
|
||||
}
|
||||
.metric strong {
|
||||
display: block;
|
||||
font-size: 26px;
|
||||
margin-top: 6px;
|
||||
}
|
||||
.filters {
|
||||
display: flex;
|
||||
gap: 10px;
|
||||
flex-wrap: wrap;
|
||||
align-items: center;
|
||||
padding: 0 32px 18px;
|
||||
}
|
||||
.manual-panel {
|
||||
margin: 0 32px 18px;
|
||||
padding: 18px;
|
||||
border: 1px solid #d9dee7;
|
||||
border-radius: 8px;
|
||||
background: #fff;
|
||||
}
|
||||
.manual-panel h2 {
|
||||
margin: 0 0 12px;
|
||||
font-size: 17px;
|
||||
}
|
||||
.manual-form {
|
||||
display: grid;
|
||||
grid-template-columns: repeat(3, minmax(160px, 1fr));
|
||||
gap: 10px;
|
||||
}
|
||||
.manual-form textarea {
|
||||
grid-column: 1 / -1;
|
||||
min-height: 120px;
|
||||
padding: 10px 12px;
|
||||
resize: vertical;
|
||||
}
|
||||
.notice {
|
||||
margin: 0 32px 18px;
|
||||
padding: 12px 14px;
|
||||
border: 1px solid #86efac;
|
||||
border-radius: 8px;
|
||||
background: #f0fdf4;
|
||||
color: #166534;
|
||||
}
|
||||
main {
|
||||
display: grid;
|
||||
gap: 14px;
|
||||
padding: 0 32px 24px;
|
||||
}
|
||||
.item {
|
||||
background: #fff;
|
||||
border: 1px solid #d9dee7;
|
||||
border-radius: 8px;
|
||||
padding: 18px;
|
||||
}
|
||||
.item.urgent {
|
||||
border-color: #dc2626;
|
||||
box-shadow: inset 4px 0 0 #dc2626;
|
||||
}
|
||||
.item-head {
|
||||
display: flex;
|
||||
justify-content: space-between;
|
||||
gap: 16px;
|
||||
align-items: flex-start;
|
||||
}
|
||||
.item h2 {
|
||||
margin: 0 0 8px;
|
||||
font-size: 17px;
|
||||
}
|
||||
.meta {
|
||||
display: flex;
|
||||
gap: 8px;
|
||||
align-items: center;
|
||||
flex-wrap: wrap;
|
||||
color: #64748b;
|
||||
font-size: 13px;
|
||||
}
|
||||
.badge {
|
||||
display: inline-flex;
|
||||
align-items: center;
|
||||
min-height: 24px;
|
||||
padding: 0 8px;
|
||||
border-radius: 999px;
|
||||
background: #e2e8f0;
|
||||
color: #334155;
|
||||
}
|
||||
.badge.negative, .badge.reply {
|
||||
background: #fee2e2;
|
||||
color: #991b1b;
|
||||
}
|
||||
.badge.positive {
|
||||
background: #dcfce7;
|
||||
color: #166534;
|
||||
}
|
||||
.badge.action {
|
||||
background: #fef3c7;
|
||||
color: #92400e;
|
||||
}
|
||||
.source {
|
||||
color: #166534;
|
||||
white-space: nowrap;
|
||||
}
|
||||
.source.muted {
|
||||
color: #64748b;
|
||||
}
|
||||
.content {
|
||||
white-space: pre-wrap;
|
||||
}
|
||||
.reason, .reply-suggestion, .types {
|
||||
color: #475569;
|
||||
margin: 8px 0;
|
||||
}
|
||||
.reply-suggestion {
|
||||
font-weight: 600;
|
||||
}
|
||||
.work {
|
||||
display: grid;
|
||||
grid-template-columns: 150px minmax(140px, 220px) 1fr 80px;
|
||||
gap: 8px;
|
||||
margin-top: 12px;
|
||||
}
|
||||
aside {
|
||||
padding: 0 32px 32px;
|
||||
color: #475569;
|
||||
}
|
||||
.empty {
|
||||
background: #fff;
|
||||
border: 1px solid #d9dee7;
|
||||
border-radius: 8px;
|
||||
padding: 32px;
|
||||
}
|
||||
@media (max-width: 900px) {
|
||||
header, .item-head {
|
||||
flex-direction: column;
|
||||
}
|
||||
.metrics {
|
||||
grid-template-columns: repeat(2, minmax(120px, 1fr));
|
||||
}
|
||||
.work {
|
||||
grid-template-columns: 1fr;
|
||||
}
|
||||
.manual-form {
|
||||
grid-template-columns: 1fr;
|
||||
}
|
||||
}
|
||||
"""
|
||||
20
app/models.py
Normal file
@ -0,0 +1,20 @@
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class RawItem:
    source: str
    source_item_id: str
    source_url: str
    content_type: str
    author_id: str | None
    author_name: str | None
    title: str | None
    published_at: int | None
    published_at_text: str | None
    updated_at_source: int | None
    content: str
    raw: dict[str, Any]
238
app/openrouter.py
Normal file
@ -0,0 +1,238 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import re
|
||||
from typing import Any
|
||||
|
||||
import httpx
|
||||
|
||||
from .config import Settings
|
||||
from .models import RawItem
|
||||
|
||||
|
||||
DEFAULT_ANALYSIS = {
|
||||
"sentiment": "neutral",
|
||||
"is_positive": False,
|
||||
"is_negative": False,
|
||||
"has_actionable_feedback": False,
|
||||
"feedback_types": [],
|
||||
"reply_recommended": False,
|
||||
"reply_priority": "none",
|
||||
"reply_suggestion": "",
|
||||
"summary": "",
|
||||
"priority": "low",
|
||||
"confidence": 0.0,
|
||||
"reason": "",
|
||||
}
|
||||
|
||||
|
||||
TRANSLATION_SCHEMA = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"translated_content": {"type": "string"},
|
||||
},
|
||||
"required": ["translated_content"],
|
||||
"additionalProperties": False,
|
||||
}
|
||||
|
||||
|
||||
SCHEMA = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"sentiment": {"type": "string", "enum": ["positive", "negative", "mixed", "neutral"]},
|
||||
"is_positive": {"type": "boolean"},
|
||||
"is_negative": {"type": "boolean"},
|
||||
"has_actionable_feedback": {"type": "boolean"},
|
||||
"feedback_types": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "string",
|
||||
"enum": [
|
||||
"bug",
|
||||
"suggestion",
|
||||
"balance",
|
||||
"ui",
|
||||
"localization",
|
||||
"performance",
|
||||
"pricing",
|
||||
"content",
|
||||
"question",
|
||||
"other",
|
||||
],
|
||||
},
|
||||
},
|
||||
"reply_recommended": {"type": "boolean"},
|
||||
"reply_priority": {"type": "string", "enum": ["none", "low", "medium", "high"]},
|
||||
"reply_suggestion": {"type": "string"},
|
||||
"summary": {"type": "string"},
|
||||
"priority": {"type": "string", "enum": ["low", "medium", "high"]},
|
||||
"confidence": {"type": "number", "minimum": 0, "maximum": 1},
|
||||
"reason": {"type": "string"},
|
||||
},
|
||||
"required": [
|
||||
"sentiment",
|
||||
"is_positive",
|
||||
"is_negative",
|
||||
"has_actionable_feedback",
|
||||
"feedback_types",
|
||||
"reply_recommended",
|
||||
"reply_priority",
|
||||
"reply_suggestion",
|
||||
"summary",
|
||||
"priority",
|
||||
"confidence",
|
||||
"reason",
|
||||
],
|
||||
"additionalProperties": False,
|
||||
}
|
||||
|
||||
|
||||
class OpenRouterClient:
|
||||
def __init__(self, settings: Settings) -> None:
|
||||
self.settings = settings
|
||||
self.enabled = bool(settings.openrouter_api_key)
|
||||
self.client = httpx.Client(timeout=60)
|
||||
|
||||
def close(self) -> None:
|
||||
self.client.close()
|
||||
|
||||
def analyze(self, item: RawItem) -> dict[str, Any]:
|
||||
if not self.enabled:
|
||||
raise MissingOpenRouterKey("OPENROUTER_API_KEY is not configured")
|
||||
|
||||
payload = {
|
||||
"model": self.settings.openrouter_model,
|
||||
"messages": [
|
||||
{
|
||||
"role": "system",
|
||||
"content": (
|
||||
"你是独立游戏《帝国幻想乡~TOHOTOPIA》的社区运营助手。"
|
||||
"请判断 Steam、Twitter/X 等社区内容的情绪、是否包含具体可处理反馈、"
|
||||
"以及是否建议制作人回复。summary、reason、reply_suggestion 必须使用中文。"
|
||||
"只输出符合 JSON Schema 的 JSON。"
|
||||
),
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": self._prompt(item),
|
||||
},
|
||||
],
|
||||
"temperature": 0.1,
|
||||
"response_format": {
|
||||
"type": "json_schema",
|
||||
"json_schema": {
|
||||
"name": "community_item_analysis",
|
||||
"strict": True,
|
||||
"schema": SCHEMA,
|
||||
},
|
||||
},
|
||||
}
|
||||
headers = {
|
||||
"Authorization": f"Bearer {self.settings.openrouter_api_key}",
|
||||
"HTTP-Referer": self.settings.openrouter_referer,
|
||||
"X-Title": self.settings.openrouter_title,
|
||||
}
|
||||
response = self.client.post(
|
||||
"https://openrouter.ai/api/v1/chat/completions",
|
||||
headers=headers,
|
||||
json=payload,
|
||||
)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
content = data["choices"][0]["message"]["content"]
|
||||
parsed = self._parse_json(content)
|
||||
return self._normalize(parsed)
|
||||
|
||||
def translate_to_chinese(self, content: str) -> str:
|
||||
if not self.enabled:
|
||||
raise MissingOpenRouterKey("OPENROUTER_API_KEY is not configured")
|
||||
|
||||
payload = {
|
||||
"model": self.settings.openrouter_model,
|
||||
"messages": [
|
||||
{
|
||||
"role": "system",
|
||||
"content": (
|
||||
"你是独立游戏社区运营翻译助手。"
|
||||
"把用户提供的社区内容准确翻译成简体中文,保留原意、语气、问题细节、游戏术语、链接和编号。"
|
||||
"不要添加解释。只输出符合 JSON Schema 的 JSON。"
|
||||
),
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": content[:6000],
|
||||
},
|
||||
],
|
||||
"temperature": 0,
|
||||
"response_format": {
|
||||
"type": "json_schema",
|
||||
"json_schema": {
|
||||
"name": "manual_item_translation",
|
||||
"strict": True,
|
||||
"schema": TRANSLATION_SCHEMA,
|
||||
},
|
||||
},
|
||||
}
|
||||
headers = {
|
||||
"Authorization": f"Bearer {self.settings.openrouter_api_key}",
|
||||
"HTTP-Referer": self.settings.openrouter_referer,
|
||||
"X-Title": self.settings.openrouter_title,
|
||||
}
|
||||
response = self.client.post(
|
||||
"https://openrouter.ai/api/v1/chat/completions",
|
||||
headers=headers,
|
||||
json=payload,
|
||||
)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
parsed = self._parse_json(data["choices"][0]["message"]["content"])
|
||||
translated = str(parsed.get("translated_content") or "").strip()
|
||||
return translated or content
|
||||
|
||||
def _prompt(self, item: RawItem) -> str:
|
||||
metadata = {
|
||||
"source": item.source,
|
||||
"content_type": item.content_type,
|
||||
"source_url": item.source_url,
|
||||
"author": item.author_name,
|
||||
"title": item.title,
|
||||
"steam_review_voted_up": item.raw.get("voted_up"),
|
||||
"language": item.raw.get("language"),
|
||||
"in_reply_to": item.raw.get("parent_url") or item.raw.get("in_reply_to"),
|
||||
"likes": item.raw.get("likes"),
|
||||
"replies": item.raw.get("replies"),
|
||||
"retweets": item.raw.get("retweets"),
|
||||
"views": item.raw.get("views"),
|
||||
}
|
||||
return (
|
||||
"请分析以下社区内容。\n\n"
|
||||
f"元数据:{json.dumps(metadata, ensure_ascii=False)}\n\n"
|
||||
f"正文:\n{item.content[:6000]}"
|
||||
)
|
||||
|
||||
    def _parse_json(self, content: str) -> dict[str, Any]:
        # Some models wrap the JSON in prose or code fences; fall back to the
        # outermost {...} span when the whole reply is not valid JSON.
        try:
            return json.loads(content)
        except json.JSONDecodeError:
            match = re.search(r"\{.*\}", content, re.S)
            if not match:
                raise
            return json.loads(match.group(0))

    def _normalize(self, value: dict[str, Any]) -> dict[str, Any]:
        # Start from the defaults so missing keys never break downstream code.
        result = dict(DEFAULT_ANALYSIS)
        result.update(value)
        result["feedback_types"] = list(result.get("feedback_types") or [])
        result["is_positive"] = bool(result.get("is_positive"))
        result["is_negative"] = bool(result.get("is_negative"))
        result["has_actionable_feedback"] = bool(result.get("has_actionable_feedback"))
        result["reply_recommended"] = bool(result.get("reply_recommended"))
        try:
            confidence = float(result.get("confidence", 0.0))
        except (TypeError, ValueError):
            confidence = 0.0
        # Clamp to the [0, 1] range declared in SCHEMA.
        result["confidence"] = min(1.0, max(0.0, confidence))
        return result
|
||||
|
||||
|
||||
class MissingOpenRouterKey(RuntimeError):
|
||||
pass
|
||||
321
app/steam.py
Normal file
@ -0,0 +1,321 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from hashlib import sha1
|
||||
import re
|
||||
import time
|
||||
from typing import Any, Iterable
|
||||
from urllib.parse import parse_qs, urljoin, urlparse
|
||||
|
||||
from bs4 import BeautifulSoup
|
||||
import httpx
|
||||
|
||||
from .models import RawItem
|
||||
|
||||
|
||||
STEAM_STORE = "https://store.steampowered.com"
|
||||
STEAM_COMMUNITY = "https://steamcommunity.com"
|
||||
|
||||
|
||||
HEADERS = {
|
||||
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
|
||||
"(KHTML, like Gecko) Chrome/125.0 Safari/537.36",
|
||||
"Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,ja;q=0.7",
|
||||
}
|
||||
|
||||
|
||||
def content_hash(text: str) -> str:
|
||||
return sha1(text.encode("utf-8", errors="ignore")).hexdigest()
|
||||
|
||||
|
||||
def _text(node: Any) -> str:
|
||||
return node.get_text(separator="\n", strip=True) if node else ""
|
||||
|
||||
|
||||
def _abs_url(url: str) -> str:
|
||||
return urljoin(STEAM_COMMUNITY, url)
|
||||
|
||||
|
||||
def _topic_id_from_url(url: str) -> str:
|
||||
match = re.search(r"/discussions/[^/]+/(\d+)", url)
|
||||
if match:
|
||||
return match.group(1)
|
||||
return content_hash(url)
|
||||
|
||||
|
||||
def _reply_id(comment: Any, topic_id: str, author: str, timestamp: str, text: str) -> str:
|
||||
node_id = comment.get("id", "")
|
||||
if node_id:
|
||||
return node_id
|
||||
data_id = comment.get("data-commentid", "")
|
||||
if data_id:
|
||||
return data_id
|
||||
return f"{topic_id}:{content_hash(author + timestamp + text)}"
|
||||
|
||||
|
||||
def parse_steam_time(text: str | None, now: int | None = None) -> int | None:
    """Convert a Steam discussion timestamp string to a Unix timestamp, or None if unrecognized."""
    if not text:
        return None
    value = text.strip()
    # Compare against None so that an explicit now=0 is honored.
    now_ts = int(time.time()) if now is None else now

    # Relative form, e.g. "3 小时以前" or "2 hours ago".
    relative = re.match(r"^(\d+)\s*(分钟|小时|天|minute|minutes|hour|hours|day|days)\s*(以前|ago)?$", value, re.I)
    if relative:
        amount = int(relative.group(1))
        unit = relative.group(2).lower()
        seconds = {
            "分钟": 60,
            "minute": 60,
            "minutes": 60,
            "小时": 3600,
            "hour": 3600,
            "hours": 3600,
            "天": 86400,
            "day": 86400,
            "days": 86400,
        }[unit]
        return now_ts - amount * seconds

    # Absolute form without a year, e.g. "3 月 7 日 下午 4:52"; the year of `now` is assumed.
    absolute = re.match(
        r"^(\d{1,2})\s*月\s*(\d{1,2})\s*日\s*(上午|下午)\s*(\d{1,2}):(\d{2})$",
        value,
    )
    if absolute:
        current = time.localtime(now_ts)
        return _make_ts(
            current.tm_year,
            int(absolute.group(1)),
            int(absolute.group(2)),
            absolute.group(3),
            int(absolute.group(4)),
            int(absolute.group(5)),
        )

    # Absolute form with a year, e.g. "2025 年 8 月 9 日 下午 3:29".
    absolute_with_year = re.match(
        r"^(\d{4})\s*年\s*(\d{1,2})\s*月\s*(\d{1,2})\s*日\s*(上午|下午)\s*(\d{1,2}):(\d{2})$",
        value,
    )
    if absolute_with_year:
        return _make_ts(
            int(absolute_with_year.group(1)),
            int(absolute_with_year.group(2)),
            int(absolute_with_year.group(3)),
            absolute_with_year.group(4),
            int(absolute_with_year.group(5)),
            int(absolute_with_year.group(6)),
        )
    return None


def _make_ts(year: int, month: int, day: int, ampm: str, hour: int, minute: int) -> int:
    # Convert the 12-hour clock to 24-hour: 下午 (PM) adds 12 except at 12 o'clock,
    # and 12 o'clock 上午 (AM) is midnight.
    if ampm == "下午" and hour != 12:
        hour += 12
    if ampm == "上午" and hour == 12:
        hour = 0
    # tm_isdst=-1 lets mktime decide whether DST applies in the local time zone.
    return int(time.mktime((year, month, day, hour, minute, 0, -1, -1, -1)))
|
||||
|
||||
|
||||
class SteamClient:
|
||||
def __init__(self, app_id: str) -> None:
|
||||
self.app_id = app_id
|
||||
self.client = httpx.Client(headers=HEADERS, timeout=30, follow_redirects=True)
|
||||
self.client.cookies.set("birthtime", "568022401", domain="steamcommunity.com")
|
||||
|
||||
def close(self) -> None:
|
||||
self.client.close()
|
||||
|
||||
def fetch_reviews(self, max_pages: int | None = None) -> list[RawItem]:
|
||||
cursor = "*"
|
||||
page = 0
|
||||
items: list[RawItem] = []
|
||||
while True:
|
||||
params = {
|
||||
"json": "1",
|
||||
"num_per_page": "100",
|
||||
"language": "all",
|
||||
"filter": "recent",
|
||||
"purchase_type": "all",
|
||||
"cursor": cursor,
|
||||
}
|
||||
response = self.client.get(f"{STEAM_STORE}/appreviews/{self.app_id}", params=params)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
reviews = data.get("reviews") or []
|
||||
if not reviews:
|
||||
break
|
||||
for review in reviews:
|
||||
items.append(self._review_to_item(review))
|
||||
new_cursor = data.get("cursor") or cursor
|
||||
page += 1
|
||||
if new_cursor == cursor:
|
||||
break
|
||||
if max_pages and page >= max_pages:
|
||||
break
|
||||
cursor = new_cursor
|
||||
time.sleep(0.25)
|
||||
return items
|
||||
|
||||
def fetch_discussions(self, full: bool, max_pages: int, time_limit_seconds: int) -> list[RawItem]:
|
||||
started = time.monotonic()
|
||||
topic_urls: list[str] = []
|
||||
seen_urls: set[str] = set()
|
||||
for page in range(1, max_pages + 1):
|
||||
if time.monotonic() - started > time_limit_seconds:
|
||||
break
|
||||
url = f"{STEAM_COMMUNITY}/app/{self.app_id}/discussions/"
|
||||
if page > 1:
|
||||
url = f"{url}?fp={page}"
|
||||
html = self._get_text(url)
|
||||
urls = self._extract_topic_urls(html)
|
||||
new_urls = [u for u in urls if u not in seen_urls]
|
||||
if not new_urls:
|
||||
break
|
||||
topic_urls.extend(new_urls)
|
||||
seen_urls.update(new_urls)
|
||||
if not full and page >= max_pages:
|
||||
break
|
||||
time.sleep(0.25)
|
||||
|
||||
items: list[RawItem] = []
|
||||
for url in topic_urls:
|
||||
if time.monotonic() - started > time_limit_seconds:
|
||||
break
|
||||
items.extend(self.fetch_discussion_topic(url))
|
||||
time.sleep(0.35)
|
||||
return items
|
||||
|
||||
def fetch_discussion_topic(self, url: str) -> list[RawItem]:
|
||||
html = self._get_text(url)
|
||||
soup = BeautifulSoup(html, "html.parser")
|
||||
topic_id = _topic_id_from_url(url)
|
||||
title = _text(soup.select_one("div.topic")) or _text(soup.select_one(".forum_topic_name"))
|
||||
items: list[RawItem] = []
|
||||
|
||||
op = soup.select_one(".forum_op")
|
||||
if op:
|
||||
author_el = op.select_one(".authorline a")
|
||||
date_el = op.select_one(".date")
|
||||
date_text = _text(date_el)
|
||||
content_el = op.select_one(".content")
|
||||
author = _text(author_el)
|
||||
content = _text(content_el)
|
||||
source_url = url
|
||||
if content:
|
||||
items.append(
|
||||
RawItem(
|
||||
source="steam_discussions",
|
||||
source_item_id=f"topic:{topic_id}",
|
||||
source_url=source_url,
|
||||
content_type="discussion_topic",
|
||||
author_id=self._steam_id_from_author(author_el),
|
||||
author_name=author,
|
||||
title=title,
|
||||
published_at=parse_steam_time(date_text),
|
||||
published_at_text=date_text,
|
||||
updated_at_source=None,
|
||||
content=content,
|
||||
raw={
|
||||
"topic_id": topic_id,
|
||||
"topic_url": url,
|
||||
"title": title,
|
||||
"author": author,
|
||||
"date": date_text,
|
||||
"content": content,
|
||||
},
|
||||
)
|
||||
)
|
||||
|
||||
for comment in soup.select(".commentthread_comment"):
|
||||
author_el = comment.select_one(".commentthread_author_link")
|
||||
date_el = comment.select_one(".commentthread_comment_timestamp")
|
||||
text_el = comment.select_one(".commentthread_comment_text")
|
||||
text = _text(text_el)
|
||||
if not text:
|
||||
continue
|
||||
author = _text(author_el)
|
||||
timestamp = _text(date_el)
|
||||
reply_id = _reply_id(comment, topic_id, author, timestamp, text)
|
||||
reply_url = f"{url}#{reply_id}" if reply_id else url
|
||||
items.append(
|
||||
RawItem(
|
||||
source="steam_discussions",
|
||||
source_item_id=f"reply:{topic_id}:{reply_id}",
|
||||
source_url=reply_url,
|
||||
content_type="discussion_reply",
|
||||
author_id=self._steam_id_from_author(author_el),
|
||||
author_name=author,
|
||||
title=title,
|
||||
published_at=parse_steam_time(timestamp),
|
||||
published_at_text=timestamp,
|
||||
updated_at_source=None,
|
||||
content=text,
|
||||
raw={
|
||||
"topic_id": topic_id,
|
||||
"topic_url": url,
|
||||
"reply_id": reply_id,
|
||||
"reply_url": reply_url,
|
||||
"title": title,
|
||||
"reply_author": author,
|
||||
"reply_time_text": timestamp,
|
||||
"reply_content": text,
|
||||
},
|
||||
)
|
||||
)
|
||||
return items
|
||||
|
||||
def _review_to_item(self, review: dict[str, Any]) -> RawItem:
|
||||
author = review.get("author") or {}
|
||||
steam_id = str(author.get("steamid") or "")
|
||||
recommendation_id = str(review.get("recommendationid"))
|
||||
source_url = f"{STEAM_COMMUNITY}/profiles/{steam_id}/recommended/{self.app_id}/"
|
||||
raw = dict(review)
|
||||
raw["source_url"] = source_url
|
||||
return RawItem(
|
||||
source="steam_reviews",
|
||||
source_item_id=f"review:{recommendation_id}",
|
||||
source_url=source_url,
|
||||
content_type="review",
|
||||
author_id=steam_id or None,
|
||||
author_name=author.get("personaname"),
|
||||
title=None,
|
||||
published_at=review.get("timestamp_created"),
|
||||
published_at_text=None,
|
||||
updated_at_source=review.get("timestamp_updated"),
|
||||
content=review.get("review") or "",
|
||||
raw=raw,
|
||||
)
|
||||
|
||||
def _get_text(self, url: str) -> str:
|
||||
response = self.client.get(url)
|
||||
response.raise_for_status()
|
||||
response.encoding = "utf-8"
|
||||
return response.text
|
||||
|
||||
def _extract_topic_urls(self, html: str) -> list[str]:
|
||||
soup = BeautifulSoup(html, "html.parser")
|
||||
urls: list[str] = []
|
||||
for link in soup.select("a.forum_topic_overlay, a.forum_topic_name"):
|
||||
href = link.get("href")
|
||||
if not href:
|
||||
continue
|
||||
url = _abs_url(href).split("?")[0]
|
||||
if f"/app/{self.app_id}/discussions/" in url and url not in urls:
|
||||
urls.append(url)
|
||||
return urls
|
||||
|
||||
    def _steam_id_from_author(self, author_el: Any) -> str | None:
        if not author_el:
            return None
        href = author_el.get("href") or ""
        parsed = urlparse(href)
        # Both /profiles/<steamid64> and /id/<vanity name> end with the identifier.
        if "/profiles/" in parsed.path or "/id/" in parsed.path:
            return parsed.path.rstrip("/").split("/")[-1]
        query = parse_qs(parsed.query)
        steam_id = query.get("steamid")
        return steam_id[0] if steam_id else None
|
||||
|
||||
|
||||
def iter_nonempty(items: Iterable[RawItem]) -> Iterable[RawItem]:
|
||||
for item in items:
|
||||
if item.content.strip():
|
||||
yield item
|
||||
366
app/sync.py
Normal file
@ -0,0 +1,366 @@
from __future__ import annotations

from collections import Counter
from hashlib import sha1
import sqlite3
import time
from typing import Any

from .config import Settings
from .db import decode_json, encode_json, init_db
from .models import RawItem
from .openrouter import OpenRouterClient
from .steam import SteamClient, iter_nonempty
from .twitter import TwitterClient, TwitterScrapeOptions


def _now() -> int:
    return int(time.time())


def _hash(text: str) -> str:
    return sha1(text.encode("utf-8", errors="ignore")).hexdigest()


def upsert_raw_item(conn: sqlite3.Connection, item: RawItem) -> tuple[int, bool]:
    """Insert or refresh one item; returns (raw_item_id, inserted)."""
    now = _now()
    item_hash = _hash(item.content)
    existing = conn.execute(
        "SELECT id, content_hash FROM raw_items WHERE source = ? AND source_item_id = ?",
        (item.source, item.source_item_id),
    ).fetchone()
    if existing:
        if existing["content_hash"] != item_hash:
            # Content changed: reset to 'pending' so analyze_pending re-runs the model.
            conn.execute(
                """
                UPDATE raw_items
                SET source_url = ?, author_id = ?, author_name = ?, title = ?,
                    published_at = ?, published_at_text = ?, updated_at_source = ?,
                    content = ?, raw_json = ?, content_hash = ?, analysis_status = 'pending',
                    collected_at = ?
                WHERE id = ?
                """,
                (
                    item.source_url,
                    item.author_id,
                    item.author_name,
                    item.title,
                    item.published_at,
                    item.published_at_text,
                    item.updated_at_source,
                    item.content,
                    encode_json(item.raw),
                    item_hash,
                    now,
                    existing["id"],
                ),
            )
        return int(existing["id"]), False

    cursor = conn.execute(
        """
        INSERT INTO raw_items (
            source, source_item_id, source_url, content_type, author_id, author_name,
            title, published_at, published_at_text, collected_at, updated_at_source,
            content, raw_json, content_hash, analysis_status
        )
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 'pending')
        """,
        (
            item.source,
            item.source_item_id,
            item.source_url,
            item.content_type,
            item.author_id,
            item.author_name,
            item.title,
            item.published_at,
            item.published_at_text,
            now,
            item.updated_at_source,
            item.content,
            encode_json(item.raw),
            item_hash,
        ),
    )
    raw_item_id = int(cursor.lastrowid)
    # Every new raw item gets a matching work item in the initial 'new' state.
    conn.execute(
        """
        INSERT INTO work_items (raw_item_id, status, owner, notes, created_at, updated_at)
        VALUES (?, 'new', '', '', ?, ?)
        """,
        (raw_item_id, now, now),
    )
    return raw_item_id, True


def save_analysis(
    conn: sqlite3.Connection,
    raw_item_id: int,
    model: str,
    analysis: dict[str, Any],
) -> None:
    now = _now()
    conn.execute(
        """
        INSERT INTO analysis_results (
            raw_item_id, model, sentiment, is_positive, is_negative,
            has_actionable_feedback, feedback_types, reply_recommended, reply_priority,
            reply_suggestion, summary, priority, confidence, reason, model_json, analyzed_at
        )
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(raw_item_id) DO UPDATE SET
            model = excluded.model,
            sentiment = excluded.sentiment,
            is_positive = excluded.is_positive,
            is_negative = excluded.is_negative,
            has_actionable_feedback = excluded.has_actionable_feedback,
            feedback_types = excluded.feedback_types,
            reply_recommended = excluded.reply_recommended,
            reply_priority = excluded.reply_priority,
            reply_suggestion = excluded.reply_suggestion,
            summary = excluded.summary,
            priority = excluded.priority,
            confidence = excluded.confidence,
            reason = excluded.reason,
            model_json = excluded.model_json,
            analyzed_at = excluded.analyzed_at
        """,
        (
            raw_item_id,
            model,
            analysis["sentiment"],
            int(analysis["is_positive"]),
            int(analysis["is_negative"]),
            int(analysis["has_actionable_feedback"]),
            encode_json(analysis["feedback_types"]),
            int(analysis["reply_recommended"]),
            analysis["reply_priority"],
            analysis["reply_suggestion"],
            analysis["summary"],
            analysis["priority"],
            analysis["confidence"],
            analysis["reason"],
            encode_json(analysis),
            now,
        ),
    )
    conn.execute("UPDATE raw_items SET analysis_status = 'done' WHERE id = ?", (raw_item_id,))


def _twitter_high_watermark_ts(conn: sqlite3.Connection) -> int | None:
    row = conn.execute(
        """
        SELECT MAX(COALESCE(published_at, collected_at)) AS watermark
        FROM raw_items
        WHERE source IN ('twitter_posts', 'twitter_replies')
        """
    ).fetchone()
    if row and row["watermark"]:
        return int(row["watermark"])
    return None


def _recent_twitter_post_urls(conn: sqlite3.Connection, limit: int) -> list[str]:
    if limit <= 0:
        return []
    rows = conn.execute(
        """
        SELECT source_url
        FROM raw_items
        WHERE source = 'twitter_posts'
        ORDER BY COALESCE(published_at, collected_at) DESC, collected_at DESC
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
    return [str(row["source_url"]) for row in rows if row["source_url"]]


def _twitter_options(settings: Settings) -> TwitterScrapeOptions:
    return TwitterScrapeOptions(
        username=settings.twitter_username,
        scraper_path=settings.twitter_scraper_path,
        output_dir=settings.twitter_output_dir,
        browser_provider=settings.twitter_browser_provider,
        full_max_no_new=settings.twitter_full_max_no_new,
        incremental_max_no_new=settings.twitter_incremental_max_no_new,
        thread_max_no_new=settings.twitter_thread_max_no_new,
        command_timeout_seconds=settings.twitter_command_timeout_seconds,
        full_reply_post_limit=settings.twitter_full_reply_post_limit,
        incremental_reply_parent_limit=settings.twitter_incremental_reply_parent_limit,
    )


def run_sync(
    conn: sqlite3.Connection,
    settings: Settings,
    full: bool = False,
    platforms: list[str] | None = None,
) -> dict[str, Any]:
    init_db(conn)
    started = _now()
    mode = "full" if full else "incremental"
    run_id = conn.execute(
        "INSERT INTO sync_runs (started_at, mode, status) VALUES (?, ?, 'running')",
        (started, mode),
    ).lastrowid
    conn.commit()

    stats: Counter[str] = Counter()
    messages: list[str] = []
    try:
        enabled_platforms = platforms or ["steam", "twitter"]
        if "twitter" in enabled_platforms and not settings.twitter_enabled:
            stats["twitter_skipped"] += 1
        raw_items: list[RawItem] = []
        if "steam" in enabled_platforms:
            steam = SteamClient(settings.app_id)
            try:
                # Incremental runs only look at the two most recent review pages.
                review_pages = None if full else 2
                review_items = steam.fetch_reviews(max_pages=review_pages)
                discussion_pages = (
                    settings.discussion_full_scan_max_pages
                    if full
                    else settings.discussion_incremental_max_pages
                )
                discussion_items = steam.fetch_discussions(
                    full=full,
                    max_pages=discussion_pages,
                    time_limit_seconds=settings.full_scan_time_limit_seconds,
                )
                steam_items = list(iter_nonempty([*review_items, *discussion_items]))
                raw_items.extend(steam_items)
                stats["steam_fetched"] = len(steam_items)
            finally:
                steam.close()

        if "twitter" in enabled_platforms and settings.twitter_enabled:
            try:
                since_ts = None if full else _twitter_high_watermark_ts(conn)
                existing_urls = _recent_twitter_post_urls(
                    conn,
                    settings.twitter_incremental_reply_parent_limit,
                )
                twitter = TwitterClient(_twitter_options(settings))
                twitter_items = twitter.fetch_items(
                    full=full,
                    since_ts=since_ts,
                    existing_post_urls=existing_urls,
                )
                raw_items.extend(twitter_items)
                stats["twitter_fetched"] = len(twitter_items)
            except Exception as exc:  # noqa: BLE001 - keep Steam and old Twitter data intact
                stats["twitter_errors"] += 1
                stats[f"twitter_error:{type(exc).__name__}"] += 1
                messages.append(f"twitter: {exc}")

        stats["fetched"] = len(raw_items)
        analyzer = OpenRouterClient(settings)
        try:
            for item in raw_items:
                raw_item_id, inserted = upsert_raw_item(conn, item)
                prefix = item.source.split("_", 1)[0]
                stats["inserted" if inserted else "seen"] += 1
                stats[f"{prefix}_{'inserted' if inserted else 'seen'}"] += 1
                if inserted:
                    try:
                        analysis = analyzer.analyze(item)
                        save_analysis(conn, raw_item_id, settings.openrouter_model, analysis)
                        stats["analyzed"] += 1
                    except Exception as exc:  # noqa: BLE001 - mark as 'error' so analyze_pending can retry
                        conn.execute(
                            "UPDATE raw_items SET analysis_status = 'error' WHERE id = ?",
                            (raw_item_id,),
                        )
                        stats["analysis_errors"] += 1
                        stats[f"analysis_error:{type(exc).__name__}"] += 1
                conn.commit()
        finally:
            analyzer.close()

        finished = _now()
        status = "partial" if messages else "success"
        conn.execute(
            """
            UPDATE sync_runs
            SET finished_at = ?, status = ?, message = ?, stats_json = ?
            WHERE id = ?
            """,
            (finished, status, "\n".join(messages), encode_json(dict(stats)), run_id),
        )
        if status == "success":
            conn.execute(
                """
                INSERT INTO sync_state (key, value, updated_at)
                VALUES ('last_sync_mode', ?, ?)
                ON CONFLICT(key) DO UPDATE SET value = excluded.value, updated_at = excluded.updated_at
                """,
                (mode, finished),
            )
        # Persist the final run status here rather than relying on the caller to commit.
        conn.commit()
        return dict(stats)
    except Exception as exc:
        finished = _now()
        conn.execute(
            """
            UPDATE sync_runs
            SET finished_at = ?, status = 'failed', message = ?, stats_json = ?
            WHERE id = ?
            """,
            (finished, str(exc), encode_json(dict(stats)), run_id),
        )
        # Commit the 'failed' status before re-raising so the run does not stay 'running'.
        conn.commit()
        raise


def analyze_pending(
    conn: sqlite3.Connection,
    settings: Settings,
    limit: int = 50,
    since_ts: int | None = None,
) -> dict[str, Any]:
    init_db(conn)
    analyzer = OpenRouterClient(settings)
    stats: Counter[str] = Counter()
    try:
        params: list[Any] = []
        since_clause = ""
        if since_ts is not None:
            since_clause = "AND COALESCE(published_at, collected_at) >= ?"
            params.append(since_ts)
        params.append(limit)
        rows = conn.execute(
            f"""
            SELECT * FROM raw_items
            WHERE analysis_status IN ('pending', 'error')
            {since_clause}
            ORDER BY COALESCE(published_at, collected_at) DESC, collected_at DESC, id DESC
            LIMIT ?
            """,
            params,
        ).fetchall()
        for row in rows:
            item = RawItem(
                source=row["source"],
                source_item_id=row["source_item_id"],
                source_url=row["source_url"],
                content_type=row["content_type"],
                author_id=row["author_id"],
                author_name=row["author_name"],
                title=row["title"],
                published_at=row["published_at"],
                published_at_text=row["published_at_text"],
                updated_at_source=row["updated_at_source"],
                content=row["content"],
                raw=decode_json(row["raw_json"], {}),
            )
            try:
                analysis = analyzer.analyze(item)
                save_analysis(conn, int(row["id"]), settings.openrouter_model, analysis)
                stats["analyzed"] += 1
                conn.commit()
            except Exception as exc:  # noqa: BLE001
                stats["analysis_errors"] += 1
                stats[f"analysis_error:{type(exc).__name__}"] += 1
        return dict(stats)
    finally:
        analyzer.close()
246
app/twitter.py
Normal file
@@ -0,0 +1,246 @@
from __future__ import annotations

from dataclasses import dataclass
import calendar
import json
from pathlib import Path
import re
import subprocess
import sys
import time
from typing import Any, Iterable

from .models import RawItem


TWITTER_EPOCH_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"
NORMALIZED_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


@dataclass(frozen=True)
class TwitterScrapeOptions:
    username: str
    scraper_path: Path
    output_dir: Path
    browser_provider: str
    full_max_no_new: int
    incremental_max_no_new: int
    thread_max_no_new: int
    command_timeout_seconds: int
    full_reply_post_limit: int
    incremental_reply_parent_limit: int


def parse_twitter_time(value: str | None) -> int | None:
    if not value:
        return None
    text = value.strip()
    for fmt in (NORMALIZED_DATE_FORMAT, TWITTER_EPOCH_FORMAT):
        try:
            parsed = time.strptime(text, fmt)
            return calendar.timegm(parsed)
        except ValueError:
            continue
    return None


def _author_from_url(url: str | None) -> str | None:
    if not url:
        return None
    match = re.search(r"(?:x\.com|twitter\.com)/([^/?#]+)/status/\d+", url)
    if not match:
        return None
    value = match.group(1)
    return value if value and value.lower() != "i" else None


def _tweet_id_from_item(item: dict[str, Any]) -> str | None:
    value = item.get("id")
    if value:
        return str(value)
    url = str(item.get("url") or "")
    match = re.search(r"/status/(\d+)", url)
    return match.group(1) if match else None


def _tweet_url(username: str, tweet_id: str) -> str:
    return f"https://x.com/{username}/status/{tweet_id}"


def _is_original_post(item: dict[str, Any]) -> bool:
    return not bool(item.get("is_retweet"))


class TwitterClient:
    def __init__(self, options: TwitterScrapeOptions) -> None:
        self.options = options

    def fetch_items(
        self,
        *,
        full: bool,
        since_ts: int | None,
        existing_post_urls: Iterable[str] = (),
    ) -> list[RawItem]:
        run_dir = self._new_run_dir()
        timeline = self._fetch_timeline(run_dir, full=full)
        timeline_items = [
            self._post_to_item(item)
            for item in timeline
            if self._include_by_time(item, since_ts)
        ]

        reply_parent_urls = self._reply_parent_urls(
            timeline=timeline,
            full=full,
            existing_post_urls=existing_post_urls,
        )
        reply_items: list[RawItem] = []
        for parent_url in reply_parent_urls:
            thread = self._fetch_thread(run_dir, parent_url)
            # "main_tweet" can be present but None (see _fetch_thread), so guard
            # with `or {}` instead of relying on the .get() default.
            main_tweet = thread.get("main_tweet") or {}
            parent_id = str(main_tweet.get("id") or self._id_from_url(parent_url) or "")
            for reply in thread.get("replies") or []:
                if self._include_by_time(reply, since_ts):
                    reply_items.append(self._reply_to_item(reply, parent_id=parent_id, parent_url=parent_url))

        return [item for item in [*timeline_items, *reply_items] if item.content.strip()]

    def _new_run_dir(self) -> Path:
        path = self.options.output_dir / time.strftime("%Y%m%d_%H%M%S")
        path.mkdir(parents=True, exist_ok=True)
        return path

    def _fetch_timeline(self, run_dir: Path, *, full: bool) -> list[dict[str, Any]]:
        max_no_new = self.options.full_max_no_new if full else self.options.incremental_max_no_new
        self._run_scraper(self.options.username, run_dir, max_no_new=max_no_new)
        path = run_dir / f"{self.options.username}_posts.json"
        return self._read_json(path, expected="timeline posts")

    def _fetch_thread(self, run_dir: Path, parent_url: str) -> dict[str, Any]:
        tweet_id = self._id_from_url(parent_url)
        if not tweet_id:
            return {"main_tweet": None, "replies": [], "total_replies": 0}
        self._run_scraper(parent_url, run_dir, max_no_new=self.options.thread_max_no_new)
        path = run_dir / f"thread_{tweet_id}.json"
        return self._read_json(path, expected=f"thread {tweet_id}")

    def _run_scraper(self, target: str, run_dir: Path, *, max_no_new: int) -> None:
        command = [
            sys.executable,
            str(self.options.scraper_path),
            target,
            "--max-no-new",
            str(max_no_new),
            "--output-dir",
            str(run_dir),
            "--browser-provider",
            self.options.browser_provider,
        ]
        result = subprocess.run(
            command,
            cwd=Path.cwd(),
            capture_output=True,
            text=True,
            encoding="utf-8",
            errors="replace",
            timeout=self.options.command_timeout_seconds,
        )
        output = "\n".join(part for part in [result.stdout, result.stderr] if part).strip()
        if result.returncode != 0:
            raise RuntimeError(f"Twitter scraper failed for {target}: {output[-1200:]}")
        if "登录提示" in output or "未登录" in output or "login" in output.lower():
            raise RuntimeError(
                "Twitter scraper requires an authenticated X.com browser profile. "
                "Run the configured social-media-scraper once with --keep-browser-open, "
                "log in to X.com, then retry."
            )

    def _read_json(self, path: Path, *, expected: str) -> Any:
        if not path.exists():
            raise RuntimeError(f"Twitter scraper did not produce {expected}: {path}")
        return json.loads(path.read_text(encoding="utf-8"))

    def _reply_parent_urls(
        self,
        *,
        timeline: list[dict[str, Any]],
        full: bool,
        existing_post_urls: Iterable[str],
    ) -> list[str]:
        urls: list[str] = []
        for item in timeline:
            tweet_id = _tweet_id_from_item(item)
            url = item.get("url") or (_tweet_url(self.options.username, tweet_id) if tweet_id else "")
            if url and _is_original_post(item):
                urls.append(str(url))

        if not full:
            urls.extend(str(url) for url in existing_post_urls if url)

        seen: set[str] = set()
        unique_urls: list[str] = []
        for url in urls:
            if url not in seen:
                seen.add(url)
                unique_urls.append(url)

        limit = self.options.full_reply_post_limit if full else self.options.incremental_reply_parent_limit
        if limit > 0:
            return unique_urls[:limit]
        return unique_urls

    def _post_to_item(self, item: dict[str, Any]) -> RawItem:
        tweet_id = _tweet_id_from_item(item) or ""
        url = item.get("url") or _tweet_url(self.options.username, tweet_id)
        author = _author_from_url(str(url)) or self.options.username
        raw = dict(item)
        raw["source_url"] = url
        return RawItem(
            source="twitter_posts",
            source_item_id=f"post:{tweet_id}",
            source_url=str(url),
            content_type="twitter_post",
            author_id=author,
            author_name=author,
            title=None,
            published_at=parse_twitter_time(item.get("date")),
            published_at_text=item.get("date"),
            updated_at_source=None,
            content=str(item.get("text") or ""),
            raw=raw,
        )

    def _reply_to_item(self, item: dict[str, Any], *, parent_id: str, parent_url: str) -> RawItem:
        tweet_id = _tweet_id_from_item(item) or ""
        url = item.get("url") or _tweet_url(_author_from_url(parent_url) or self.options.username, tweet_id)
        author = _author_from_url(str(url)) or str(item.get("in_reply_to") or "")
        raw = dict(item)
        raw["parent_tweet_id"] = parent_id
        raw["parent_url"] = parent_url
        raw["source_url"] = url
        return RawItem(
            source="twitter_replies",
            source_item_id=f"reply:{tweet_id}",
            source_url=str(url),
            content_type="twitter_reply",
            author_id=author or None,
            author_name=author or None,
            title=f"Reply to {parent_id}" if parent_id else None,
            published_at=parse_twitter_time(item.get("date")),
            published_at_text=item.get("date"),
            updated_at_source=None,
            content=str(item.get("text") or ""),
            raw=raw,
        )

    def _include_by_time(self, item: dict[str, Any], since_ts: int | None) -> bool:
        if since_ts is None:
            return True
        published_at = parse_twitter_time(item.get("date"))
        if published_at is None:
            return True
        return published_at >= since_ts

    def _id_from_url(self, url: str) -> str | None:
        match = re.search(r"/status/(\d+)", url)
        return match.group(1) if match else None
8
requirements.txt
Normal file
@@ -0,0 +1,8 @@
beautifulsoup4==4.12.3
fastapi==0.115.6
httpx==0.28.1
python-multipart==0.0.20
python-dotenv==1.0.1
playwright==1.56.0
requests==2.31.0
uvicorn==0.34.0
307
任务/方案/steam社区监控一期计划.md
Normal file
@@ -0,0 +1,307 @@
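# Steam Community Monitoring: Phase 1 Plan

## Goal

Phase 1 starts with two Steam sources:

1. Steam reviews
2. Steam discussion forum: `https://steamcommunity.com/app/3774440/discussions`

The system refreshes every 30 minutes. The first run does a full crawl of Steam reviews, discussion topics, and discussion replies; later runs are incremental only. Every new item is sent to OpenRouter's `deepseek/deepseek-v4-pro` for classification and an assessment of whether a reply is needed. The dashboard displays, filters, and highlights the items and tracks their manual handling status.

## Confirmed Facts

| Finding | Type | Evidence | Impact on decisions |
|---|---|---|---|
| The Steam reviews API currently returns data for AppID `3774440` | Current fact | A local request to `https://store.steampowered.com/appreviews/3774440?...` succeeded and returned `total_reviews=130`, `review_score_desc=Very Positive` | Phase 1 can use the reviews API directly |
| The Steam discussion pages are currently accessible | Current fact | A local request to `https://steamcommunity.com/app/3774440/discussions/` returned HTTP 200 and the page contains forum/topic content | Phase 1 can crawl discussions with HTTP + HTML parsing |
| `deepseek/deepseek-v4-pro` is currently in the OpenRouter model list | Current fact | A local request to the OpenRouter models API returned the model, with support for `response_format` and `structured_outputs` | Phase 1 can be designed around structured JSON classification |
| Steam review counts can differ depending on how they are measured | Empirical fact | User-level experience note: Steam `appreviews` is affected by caching, language, purchase type, and indexing delay | Statistics must not rely on a single request |

## Phase 1 Scope

### In scope

- Refresh Steam reviews and Steam discussions every 30 minutes.
- Full crawl on the first run; afterwards, incremental crawl of new or updated content.
- Deduplicate and store Steam reviews, discussion topics, and discussion replies separately.
- Call the OpenRouter model to produce structured classification results.
- Dashboard shows the list of reviews/topics/replies, classification results, original links, reply suggestions, and manual handling status.
- Runs locally; the architecture leaves room for server deployment.

### Out of scope for now

- No communities other than Steam.
- No complex account/permission system; add an authentication scheme before server deployment.
- No automatic replies to players; only discovery, classification, and handling tracking.
- No language filtering; all languages go through collection and model evaluation.

## Collection Flow

### Steam Reviews

Use the Steam Store Reviews API:

```text
GET https://store.steampowered.com/appreviews/3774440
```

Base parameters:

- `json=1`
- `num_per_page=100`
- `language=all`
- `filter=recent`
- `purchase_type=all`
- Start with `cursor=*`, then page using the cursor returned in each response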
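
The cursor-based paging described above could be sketched roughly as follows. This is a minimal illustration, not the project's `SteamClient.fetch_reviews`: the helper names `build_reviews_url` and `iter_reviews`, the injected `fetch_json` callable, and the "stop when the cursor repeats" guard are all assumptions made for the example.

```python
from typing import Any, Callable, Iterator, Optional
from urllib.parse import urlencode

REVIEWS_URL = "https://store.steampowered.com/appreviews/3774440"


def build_reviews_url(cursor: str = "*") -> str:
    """Build one page request from the base parameters listed above."""
    params = {
        "json": 1,
        "num_per_page": 100,
        "language": "all",
        "filter": "recent",
        "purchase_type": "all",
        "cursor": cursor,  # urlencode percent-encodes the cursor value
    }
    return f"{REVIEWS_URL}?{urlencode(params)}"


def iter_reviews(
    fetch_json: Callable[[str], dict],
    max_pages: Optional[int] = None,
) -> Iterator[Any]:
    """Yield reviews page by page (hypothetical helper).

    Stops on an empty page, a missing cursor, a repeated cursor,
    or after `max_pages` pages.
    """
    cursor = "*"
    seen = {cursor}
    pages = 0
    while max_pages is None or pages < max_pages:
        payload = fetch_json(build_reviews_url(cursor))
        reviews = payload.get("reviews") or []
        if not reviews:
            break
        yield from reviews
        pages += 1
        next_cursor = payload.get("cursor")
        if not next_cursor or next_cursor in seen:
            break
        seen.add(next_cursor)
        cursor = next_cursor
```

Passing the HTTP call in as `fetch_json` keeps the paging logic testable without network access; a real caller would wrap an `httpx` or `requests` GET that returns the decoded JSON body.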

Review dedupe key:

- `steam_review:{recommendationid}`

Recommended review fields to keep:

- `recommendationid`
- `voted_up`
- `review`
- `language`
- `timestamp_created`
- `timestamp_updated`
- `author.steamid`
- `author.personaname`
- `author.profile_url`
- `author.playtime_forever`
- `votes_up`
- `comment_count`
- `steam_purchase`
- `received_for_free`
- `source_url`

The review link can be built from the author's `steamid` and the AppID:

```text
https://steamcommunity.com/profiles/{steamid}/recommended/3774440/#developer_response
```

If the user's profile URL is available, also keep the original `profile_url` as an auxiliary field for traceability.

### Steam Discussions

Request the discussion list page over HTTP:

```text
https://steamcommunity.com/app/3774440/discussions/
```

Pagination parameters:

```text
?fp=2
?fp=3
```

The first run crawls every accessible discussion page and every accessible reply. On later incremental refreshes, start from the newest list page and page backwards until reaching a topic that already exists locally and has not been updated. If the Steam page does not allow the update time to be determined reliably, use the most recent few pages as the incremental window and keep a manual full-rescan entry point.
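
The incremental stop rule in the paragraph above can be illustrated with a small sketch. It is written under assumptions: `list_page` is a hypothetical callable that returns `(topic_id, last_activity_text)` pairs for one list page, and `known` is a hypothetical mapping of what was stored on the previous run; neither name comes from the project code.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def collect_incremental_topics(
    list_page: Callable[[int], Iterable[Tuple[str, str]]],
    known: Dict[str, str],
    max_pages: int = 3,
) -> List[str]:
    """Return ids of topics that are new or changed, in list order.

    Scanning stops at the first topic whose stored activity marker is
    unchanged, or after `max_pages` pages (the fallback window).
    """
    changed: List[str] = []
    for page in range(1, max_pages + 1):
        rows = list(list_page(page))
        if not rows:
            break
        hit_unchanged = False
        for topic_id, activity in rows:
            if known.get(topic_id) == activity:
                hit_unchanged = True
                break
            changed.append(topic_id)
        if hit_unchanged:
            break
    return changed
```

One caveat with this stop rule: if pinned topics are listed ahead of newer ones, an unchanged pinned topic would end the scan too early, so a real implementation would probably skip pinned rows or always scan the full `max_pages` window.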

Discussion dedupe keys:

- Topic: `steam_discussion_topic:{topic_id}`
- Reply: `steam_discussion_reply:{topic_id}:{reply_id}`; if the page does not expose a stable reply id, use `topic_id + author + timestamp + content_hash`
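
A possible way to build these keys, including the fallback for replies without a stable id, is sketched below. The function names, the `fallback:` marker, and the 16-character fingerprint length are illustrative assumptions, not the project's actual key format.

```python
from hashlib import sha1


def topic_key(topic_id: str) -> str:
    return f"steam_discussion_topic:{topic_id}"


def reply_key(
    topic_id: str,
    reply_id: str = "",
    *,
    author: str = "",
    timestamp_text: str = "",
    content: str = "",
) -> str:
    """Prefer the stable reply id; otherwise fingerprint author + time + content."""
    if reply_id:
        return f"steam_discussion_reply:{topic_id}:{reply_id}"
    content_hash = sha1(content.encode("utf-8")).hexdigest()
    # The unit separator keeps field boundaries unambiguous in the fingerprint.
    joined = "\x1f".join([author, timestamp_text, content_hash])
    fingerprint = sha1(joined.encode("utf-8")).hexdigest()[:16]
    return f"steam_discussion_reply:{topic_id}:fallback:{fingerprint}"
```

If the fallback is used, the timestamp fed into it should be an absolute value: a relative text such as `3 小时以前` changes between crawls and would produce a different key for the same reply.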

Recommended discussion fields to keep:

- `topic_id`
- `topic_url`
- `title`
- `author`
- `published_at_text`
- `content`
- `reply_count`
- `reply_author`
- `reply_time_text`
- `reply_content`
- `reply_url`
- `source_url`

## Data Model

Start with SQLite to get the local version working; it can be migrated to PostgreSQL when deploying to a server.

The core tables can initially be reduced to three:

### `raw_items`

Stores the raw community content and its source information.

Key fields:

- `id`
- `source`
- `source_item_id`
- `source_url`
- `content_type`
- `author_id`
- `author_name`
- `published_at`
- `collected_at`
- `content`
- `raw_json`
- `content_hash`

### `analysis_results`

Stores the model classification results.

Key fields:

- `raw_item_id`
- `model`
- `sentiment`
- `is_positive`
- `is_negative`
- `has_actionable_feedback`
- `feedback_types`
- `reply_recommended`
- `reply_priority`
- `reply_suggestion`
- `summary`
- `priority`
- `confidence`
- `model_json`
- `analyzed_at`

### `work_items`

Stores the manual handling status.

Key fields:

- `raw_item_id`
- `status`
- `owner`
- `notes`
- `last_handled_at`
- `created_at`
- `updated_at`

Recommended status values:

- `new`
- `read`
- `needs_reply`
- `replied`
- `needs_fix`
- `archived`
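
As an illustration, the three tables could be declared in SQLite roughly like this. The column types, the `UNIQUE (source, source_item_id)` constraint, and the `CHECK` on `status` are assumptions derived from the field lists above; the real schema lives in `app/db.py` and may differ (for example, it also carries columns such as `analysis_status`).

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS raw_items (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source TEXT NOT NULL,
    source_item_id TEXT NOT NULL,
    source_url TEXT,
    content_type TEXT NOT NULL,
    author_id TEXT,
    author_name TEXT,
    published_at INTEGER,
    collected_at INTEGER NOT NULL,
    content TEXT NOT NULL,
    raw_json TEXT,
    content_hash TEXT NOT NULL,
    UNIQUE (source, source_item_id)
);
CREATE TABLE IF NOT EXISTS analysis_results (
    raw_item_id INTEGER PRIMARY KEY REFERENCES raw_items(id),
    model TEXT NOT NULL,
    sentiment TEXT,
    is_positive INTEGER,
    is_negative INTEGER,
    has_actionable_feedback INTEGER,
    feedback_types TEXT,
    reply_recommended INTEGER,
    reply_priority TEXT,
    reply_suggestion TEXT,
    summary TEXT,
    priority TEXT,
    confidence REAL,
    model_json TEXT,
    analyzed_at INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS work_items (
    raw_item_id INTEGER PRIMARY KEY REFERENCES raw_items(id),
    status TEXT NOT NULL DEFAULT 'new'
        CHECK (status IN ('new', 'read', 'needs_reply', 'replied', 'needs_fix', 'archived')),
    owner TEXT NOT NULL DEFAULT '',
    notes TEXT NOT NULL DEFAULT '',
    last_handled_at INTEGER,
    created_at INTEGER NOT NULL,
    updated_at INTEGER NOT NULL
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a connection with name-based row access and create the tables."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.executescript(SCHEMA)
    return conn
```

Using `raw_item_id` as the primary key of both `analysis_results` and `work_items` keeps each of them at most one row per raw item, which is what an `ON CONFLICT(raw_item_id)` upsert relies on.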

## OpenRouter Classification

Model:

```text
deepseek/deepseek-v4-pro
```

OpenRouter key:

- Both local and server setups read the key from `.env` / environment variables; it is never stored in plain text in project files.
- The user-level `auth.json` is only a source for migrating the key during local development; it is not a runtime dependency of the project.
- Recommended variable name: `OPENROUTER_API_KEY`.

Target output JSON:

```json
{
  "sentiment": "positive | negative | mixed | neutral",
  "is_positive": true,
  "is_negative": false,
  "has_actionable_feedback": true,
  "feedback_types": ["bug", "suggestion", "balance", "ui", "localization", "performance", "pricing", "content", "question", "other"],
  "reply_recommended": true,
  "reply_priority": "none | low | medium | high",
  "reply_suggestion": "How operations or development should reply; empty string when no reply is needed",
  "summary": "One-sentence summary",
  "priority": "low | medium | high",
  "confidence": 0.0,
  "reason": "Brief basis for the classification"
}
```

Classification rules:

- `is_positive` / `is_negative` correspond to the positive/negative display the user asked for.
- `has_actionable_feedback=true` means the content contains actionable information such as specific suggestions, problem reports, bugs, balance, UI, translation, localization, performance, pricing, or amount of content.
- `reply_recommended=true` means a manual reply or handling is recommended; high-priority content must be highlighted in the dashboard.
- Both discussion topics and replies must go through model evaluation; evaluating only the opening post is not enough.
- The review's own `voted_up` is a strong signal but must not override the text-based judgment; for example, a recommended review may still contain specific complaints.
- Every result must keep `source_url` so the dashboard can jump directly to the original review or discussion post.
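
Because the model returns free-form text, it may help to validate the JSON against the target shape before saving it. The sketch below is one hedged way to do that: the function name `parse_analysis` is hypothetical, and the `0..1` range for `confidence` is an assumption (the plan only shows `0.0` as a sample value).

```python
import json
from typing import Any

SENTIMENTS = {"positive", "negative", "mixed", "neutral"}
REPLY_PRIORITIES = {"none", "low", "medium", "high"}
PRIORITIES = {"low", "medium", "high"}
FEEDBACK_TYPES = {
    "bug", "suggestion", "balance", "ui", "localization",
    "performance", "pricing", "content", "question", "other",
}
BOOL_KEYS = ("is_positive", "is_negative", "has_actionable_feedback", "reply_recommended")
TEXT_KEYS = ("reply_suggestion", "summary", "reason")


def parse_analysis(text: str) -> dict:
    """Parse and validate one model response; raises ValueError on any mismatch."""
    data: Any = json.loads(text)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("analysis must be a JSON object")
    if data.get("sentiment") not in SENTIMENTS:
        raise ValueError("invalid sentiment")
    if data.get("reply_priority") not in REPLY_PRIORITIES:
        raise ValueError("invalid reply_priority")
    if data.get("priority") not in PRIORITIES:
        raise ValueError("invalid priority")
    for key in BOOL_KEYS:
        if not isinstance(data.get(key), bool):
            raise ValueError(f"{key} must be a boolean")
    types = data.get("feedback_types")
    if not isinstance(types, list) or not all(
        isinstance(t, str) and t in FEEDBACK_TYPES for t in types
    ):
        raise ValueError("invalid feedback_types")
    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0 <= confidence <= 1:  # assumed range, see lead-in
        raise ValueError("confidence out of range")
    for key in TEXT_KEYS:
        if not isinstance(data.get(key), str):
            raise ValueError(f"{key} must be a string")
    return data
```

Raising a single exception type makes it straightforward for the caller to store the raw output and route the item to review whenever validation fails.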
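
## Dashboard Phase 1 Page

The first version of the page does not aim for sophistication; the focus is on handling efficiency for the operations team.

Recommended views:

- Overview metrics: new items, unhandled items, negative items, items with actionable feedback, high-priority items, analyzed items, items pending re-analysis, last updated time.
- Content list: source, content type, time, author, summary, sentiment, feedback types, priority, reply recommended, handling status, original link.
- Filters: source, content type, sentiment, has actionable feedback, reply recommended, feedback type, handling status, time range.
- Highlight: topics/replies with `reply_recommended=true` or `priority=high`.
- Detail: original text, model classification, reply suggestion, original link, notes, owner, status changes.
- Sorting: reply-recommended items first; within each group, newest published first.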
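
The sorting rule can be expressed as a two-part sort key. This is only a sketch over plain dictionaries; the fallback from `published_at` to `collected_at` mirrors the `COALESCE(published_at, collected_at)` ordering used in `app/sync.py`, but applying it to the dashboard is an assumption.

```python
from typing import Any, Dict, Tuple


def dashboard_sort_key(row: Dict[str, Any]) -> Tuple[int, int]:
    """Reply-recommended rows first, then newest first within each group."""
    recommended = bool(row.get("reply_recommended"))
    timestamp = row.get("published_at") or row.get("collected_at") or 0
    # Negating the timestamp sorts newer rows ahead of older ones.
    return (0 if recommended else 1, -int(timestamp))
```

Usage would be `rows.sort(key=dashboard_sort_key)`; the same ordering could also be pushed into SQL with `ORDER BY reply_recommended DESC, COALESCE(published_at, collected_at) DESC`.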
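
## Scheduling and Failure Handling

Scheduling:

- Run the collection job every 30 minutes by default.
- The first run is a full crawl; once it completes, record the sync cursor, seen topics, seen replies, and the review cursor/time watermark.
- The initial full crawl should support resuming: write progress after each discussion list page, each topic detail page, and each review page, and resume from the latest progress after a failure.
- Do not cap the initial full crawl at a small page count, since that would defeat the "crawl everything" goal; use a safety guard instead, for example at most 2 hours of continuous running or at most 500 pages per run, and allow the next run to continue.
- Locally, start with an in-app scheduler or manual command-line triggers for verification; choose systemd timer, cron, or a queue worker when deploying to a server.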
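
The "safety guard plus resume" idea from the scheduling list above might look like the following sketch. Everything here is hypothetical scaffolding: `fetch_page`, `load_checkpoint`, and `save_checkpoint` stand in for whatever the crawler and the `sync_state` storage actually provide, and the defaults simply reuse the 2-hour / 500-page figures mentioned in the plan.

```python
import time
from typing import Callable


def run_full_scan(
    fetch_page: Callable[[int], bool],
    load_checkpoint: Callable[[], int],
    save_checkpoint: Callable[[int], None],
    *,
    max_seconds: float = 2 * 60 * 60,
    max_pages: int = 500,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Crawl pages after the last checkpoint; return "done" or "paused".

    `fetch_page(n)` processes page n and returns True if more pages may follow.
    `load_checkpoint()` returns the last completed page (0 if none).
    """
    page = load_checkpoint() + 1
    deadline = clock() + max_seconds
    fetched = 0
    while True:
        if fetched >= max_pages or clock() >= deadline:
            return "paused"  # the next run resumes from the saved checkpoint
        has_more = fetch_page(page)
        save_checkpoint(page)  # progress is written after every completed page
        fetched += 1
        if not has_more:
            return "done"
        page += 1
```

Saving the checkpoint only after a page has been fully processed means a crash mid-page repeats that page on the next run, which is safe as long as writes are deduplicated.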
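
Failure handling:

- Steam request fails: record the error and retry on the next run; never delete existing data.
- OpenRouter request fails: keep the raw item and mark it for re-analysis (the current implementation sets `analysis_status` to `error`), then retry on the next run or via manual re-analysis.
- JSON parsing fails: save the model's raw output and put the item into a needs-review state.
- Duplicate collection: deduplicate by source item id and content hash.

## Deployment Prerequisites

Local MVP:

- Local database
- Local dashboard
- OpenRouter API key read from `.env`
- Manual or scheduled refresh

Before server deployment, the following must be added:

- Access authentication
- Persistent database location and backup strategy
- How background jobs are run
- Logging and error alerting
- OpenRouter call budget and rate control
- Steam crawl frequency and User-Agent strategy

## Decided Implementation Choices

- Key configuration: use `.env` / environment variables, with the variable name `OPENROUTER_API_KEY`.
- Initial crawl: full crawl with resume support; use run time or a high page threshold as the safety guard rather than a small page cap that would replace the full-crawl goal.
- Owner field: designed as a free-text producer/handler field for a small team; no user account system for now.

## Current Implementation Status

- Python/FastAPI + SQLite MVP implemented.
- Steam reviews API crawling implemented.
- Steam discussion topic and reply crawling implemented.
- OpenRouter `deepseek/deepseek-v4-pro` structured classification implemented.
- Dashboard, manual sync, background 30-minute incremental sync, and handling status updates implemented.
- LAN service listening on `0.0.0.0:8000` implemented.
- Chinese time parsing for Steam discussions implemented, supporting `x 小时以前`, `3 月 7 日 下午 4:52`, and `2025 年 8 月 9 日 下午 3:29`.
- AI analysis re-run completed for the 209 items published after 2026-05-01.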
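
For reference, the three time formats listed above could be parsed along these lines. This is an independent sketch, not the parser in `app/steam.py`: the function name, the extra `分钟` (minutes) unit, the "missing year means current year" rule, and the use of naive datetimes without a time zone are all assumptions.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

_RELATIVE = re.compile(r"(\d+)\s*(分钟|小时)以前")
_ABSOLUTE = re.compile(
    r"(?:(\d{4})\s*年\s*)?(\d{1,2})\s*月\s*(\d{1,2})\s*日\s*(上午|下午)\s*(\d{1,2}):(\d{2})"
)


def parse_steam_time(text: str, now: datetime) -> Optional[datetime]:
    """Parse a Chinese Steam timestamp relative to `now`; None if unrecognized."""
    text = text.strip()
    match = _RELATIVE.search(text)
    if match:
        amount = int(match.group(1))
        if match.group(2) == "分钟":
            return now - timedelta(minutes=amount)
        return now - timedelta(hours=amount)
    match = _ABSOLUTE.search(text)
    if not match:
        return None
    year_text, month, day, meridiem, hour, minute = match.groups()
    # 12-hour clock: 上午 12:xx is 00:xx, 下午 12:xx stays 12:xx.
    hour_24 = int(hour) % 12 + (12 if meridiem == "下午" else 0)
    year = int(year_text) if year_text else now.year  # assumed: no year means this year
    return datetime(year, int(month), int(day), hour_24, int(minute))
```

Passing `now` in explicitly keeps relative timestamps reproducible in tests; a production version would also need to decide which time zone the page text is rendered in before converting to a Unix timestamp.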
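
## Constraints for Adding Future Platforms

- Do not copy Steam-specific logic for a new platform; add a new platform collector that outputs the unified `RawItem`.
- New platforms keep reusing `raw_items`, `analysis_results`, and `work_items`.
- Every platform must define a stable dedupe key, the original link, publish-time parsing, and its initial full-crawl and subsequent incremental strategies.
- For platforms that need a logged-in session or browser automation, write a separate plan and verify the current facts first, then connect them to the sync pipeline.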
67
任务/方案/后续社区平台接入指南.md
Normal file
@@ -0,0 +1,67 @@
# Guide to Adding Future Community Platforms

## Current Architecture

The current MVP is Python/FastAPI + SQLite:

- `app/main.py`: dashboard, manual sync, re-run analysis, handling status updates, background 30-minute incremental sync.
- `app/steam.py`: collector for Steam reviews, discussion topics, and replies.
- `app/sync.py`: unified sync flow, deduplicated storage, model analysis calls, re-run analysis.
- `app/openrouter.py`: OpenRouter `deepseek/deepseek-v4-pro` structured classification.
- `app/db.py`: SQLite schema.
- `app/models.py`: the unified raw content object `RawItem`.
- `app/cli.py`: command-line entry point.

## Unified Data Flow

```text
platform collector -> RawItem -> raw_items -> OpenRouter -> analysis_results -> work_items -> dashboard
```

A new platform should not change the dashboard data structures directly. Prefer having the platform collector output `RawItem` and reuse the existing sync and analysis flow.

## RawItem Field Conventions

A new platform collector must provide at least:

- `source`: platform identifier, for example `steam_reviews` or `steam_discussions`.
- `source_item_id`: stable dedupe key; must include the platform and the content ID.
- `source_url`: a link that leads back to the original content.
- `content_type`: content type, for example `review`, `discussion_topic`, or `discussion_reply`.
- `author_id` / `author_name`: fill in as much as is available.
- `title`: post title; empty if there is none.
- `published_at`: Unix timestamp; provide it whenever possible.
- `published_at_text`: the platform's original time text.
- `updated_at_source`: the platform's original update time; empty if there is none.
- `content`: the body text sent to the model for analysis.
- `raw`: the platform's original fields as JSON.
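
To make the field conventions concrete, here is a sketch of what a `RawItem`-shaped dataclass and a collector mapping might look like. The dataclass below is a guess based on how `RawItem` is constructed in `app/sync.py`; the real definition is in `app/models.py`. The platform name `example_forum`, its URL, and the input keys (`id`, `body`, `user_id`, `user_name`, `created_ts`, `created_text`) are purely hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class RawItem:
    source: str
    source_item_id: str
    source_url: str
    content_type: str
    content: str
    author_id: Optional[str] = None
    author_name: Optional[str] = None
    title: Optional[str] = None
    published_at: Optional[int] = None
    published_at_text: Optional[str] = None
    updated_at_source: Optional[int] = None
    raw: Dict[str, Any] = field(default_factory=dict)


def forum_post_to_item(post: Dict[str, Any]) -> RawItem:
    """Map one post from a hypothetical forum API onto the unified RawItem."""
    post_id = str(post["id"])
    return RawItem(
        source="example_forum_posts",
        # The key carries both the platform and the content ID, per the conventions.
        source_item_id=f"example_forum_post:{post_id}",
        source_url=f"https://forum.example.com/posts/{post_id}",
        content_type="forum_post",
        content=str(post.get("body") or ""),
        author_id=post.get("user_id"),
        author_name=post.get("user_name"),
        title=post.get("title"),
        published_at=post.get("created_ts"),
        published_at_text=post.get("created_text"),
        raw=dict(post),
    )
```

Keeping the untouched platform payload in `raw` means fields that are not mapped today can still be recovered later without re-crawling.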
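
## Steps for Adding a New Platform

1. Verify the current facts: whether the page/API is accessible, whether a logged-in session is required, and whether there are rate limits.
2. Define the content types and dedupe keys.
3. Implement the platform collector so that it outputs `list[RawItem]`.
4. Connect the collector in `app/sync.py`, making sure a failure never deletes existing data.
5. Run a small-sample smoke test: crawl, dedupe, AI analysis, dashboard display.
6. Only then design the initial full-crawl strategy and the subsequent incremental strategy.

## Known Implementation Decisions

- AI model: OpenRouter `deepseek/deepseek-v4-pro`.
- Key: `.env` / environment variable `OPENROUTER_API_KEY`.
- Dashboard sorting: reply-recommended items first; within each group, newest published first.
- Re-run analysis: at most 20 items per batch, ordered by `published_at/collected_at` from newest to oldest.
- LAN service: `python -m uvicorn app.main:app --host 0.0.0.0 --port 8000`.
- There is currently no login authentication, so exposing the service on the LAN carries the risk that anyone can change handling statuses.

## Questions Every New Platform Plan Must Answer

- What is the operational purpose of monitoring this platform?
- Which content types are crawled?
- Is the first run a full crawl? What is the boundary of "full"?
- What makes a subsequent incremental run stop?
- How is the original link generated?
- Can the publish time be parsed? How are relative times handled?
- Should replies and nested comment threads be crawled?
- Is a logged-in session, cookie, API key, or browser automation required?
- How are failures, rate limiting, and duplicate collection handled?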