Add community monitoring MVP

jkboy 2026-05-30 23:30:55 +08:00
parent d7f8450123
commit 912057de0a
18 changed files with 2781 additions and 0 deletions


@ -0,0 +1,61 @@
# Steam Monitor MVP
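## Requirements
- Product: 《帝国幻想乡~TOHOTOPIA》
- Steam AppID: `3774440`
- Sources: Steam reviews, plus Steam discussion topics and replies
- Refresh: every 30 minutes; the first run is a full scan, later runs are incremental
- Classification model: OpenRouter `deepseek/deepseek-v4-pro`
- API key: `.env` / environment variable `OPENROUTER_API_KEY`
- Dashboard: shows classification, original link, whether a reply is recommended, handling status, and producer/handler notes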
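## Current plan
- [x] T1 Set up the Python/FastAPI + SQLite MVP.
- [x] T2 Implement Steam review fetching through the API.
- [x] T3 Implement fetching of Steam discussion topics and replies.
- [x] T4 Implement SQLite deduplication, handling status, and sync cursors.
- [x] T5 Implement structured classification through OpenRouter.
- [x] T6 Implement the dashboard, manual sync, and status updates.
- [x] T7 Run a local smoke test and start the LAN service.
- [ ] T8 Integrate the next community platform.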
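## Execution log
- 2026-05-16: Created the task record and started implementing the project skeleton.
- 2026-05-16: Completed the Python/FastAPI + SQLite MVP: fetching of Steam reviews, discussion topics, and replies; dashboard display; manual sync; background incremental sync every 30 minutes; handling-status updates.
- 2026-05-16: The local smoke test fetched 384 items: 132 reviews, 75 discussion topics, 177 replies. `OPENROUTER_API_KEY` was not configured, so model analysis ended in `error` as expected; it can be re-run once `.env` is configured.
- 2026-05-16: The service was started at `http://127.0.0.1:8000`.
- 2026-05-16: After the user filled in `.env`, the "补跑分析" (re-run analysis) button showed no visible response. The cause was that the old uvicorn process had not read the new `.env`, and the re-run endpoint waited synchronously on the model calls. The button now returns immediately, the re-run proceeds in the background in batches of 20, and the dashboard shows "已分析 / 待补跑" (analyzed / pending re-run).
- 2026-05-16: The service was switched to LAN listening on `0.0.0.0:8000`; the LAN address detected at the time was `http://10.27.16.17:8000`.
- 2026-05-16: Fixed the discussion sort order. The root cause was that `published_at` was not parsed for Steam discussions. The parser now supports `x 小时以前`, `3 月 7 日 下午 4:52`, and `2025 年 8 月 9 日 下午 3:29`, and 252 discussion records were backfilled.
- 2026-05-16: Re-ran analysis for content after 2026-05-01 at the user's request. 209 items in total: 132 reviews, 26 discussion topics, 51 discussion replies; all ended as `done`.
- 2026-05-16: Added "最近更新时间" (last updated) to the dashboard header. It prefers the finish time of the most recent successful sync and falls back to the latest collection time.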
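
The timestamp fix recorded in the log can be illustrated with a small parser. This is a minimal sketch, not the project's actual code: the function name `parse_steam_time`, the regular expressions, the handling of `分钟` (minutes), and the rule that a date without a year belongs to the current year are assumptions made for illustration. Only the three sample formats come from the log.

```python
from __future__ import annotations

import re
from datetime import datetime, timedelta

# "3 小时以前" (3 hours ago); minutes are an assumed extra unit.
_RELATIVE = re.compile(r"^\s*(\d+)\s*(分钟|小时)以前\s*$")
# "3 月 7 日 下午 4:52" or "2025 年 8 月 9 日 下午 3:29"; the year is optional.
_ABSOLUTE = re.compile(
    r"^\s*(?:(\d{4})\s*年\s*)?(\d{1,2})\s*月\s*(\d{1,2})\s*日\s*"
    r"(上午|下午)\s*(\d{1,2}):(\d{2})\s*$"
)


def parse_steam_time(text: str, now: datetime) -> datetime | None:
    """Parse a Chinese Steam discussion timestamp; return None if unrecognized."""
    match = _RELATIVE.match(text)
    if match:
        amount = int(match.group(1))
        if match.group(2) == "分钟":
            return now - timedelta(minutes=amount)
        return now - timedelta(hours=amount)

    match = _ABSOLUTE.match(text)
    if not match:
        return None
    # Assumption: a date without a year falls in the current year.
    year = int(match.group(1)) if match.group(1) else now.year
    month, day = int(match.group(2)), int(match.group(3))
    hour, minute = int(match.group(5)), int(match.group(6))
    if match.group(4) == "下午" and hour < 12:
        hour += 12  # PM: 4:52 becomes 16:52
    elif match.group(4) == "上午" and hour == 12:
        hour = 0  # 12:xx AM is just after midnight
    return datetime(year, month, day, hour, minute)
```

Passing `now` explicitly keeps the function deterministic, which makes a backfill over stored `published_at_text` values repeatable. A year-less date that lies in the future relative to `now` would need an extra rule, which this sketch does not attempt.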
## Resume entry points
- Plan document: `任务/方案/steam社区监控一期计划.md`
- README: `README.md`
- CLI: `python -m app.cli sync --full`, `python -m app.cli analyze-pending --since 2026-05-01 --limit 20`
- Dashboard: `python -m uvicorn app.main:app --host 0.0.0.0 --port 8000`
- Current service: listening on the LAN at `0.0.0.0:8000`
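## Current status
- The Steam phase-one MVP is complete.
- Current data file: `data/tohotopia_monitor.sqlite3`
- The dashboard currently has no login; anyone who can reach it on the LAN can view and modify handling status.
- Current sort order: reply-recommended items first; within each group, newest publish time first.
- Current background job: incremental sync every 30 minutes after FastAPI starts.
- Current OpenRouter key: read from `OPENROUTER_API_KEY` in `.env`.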
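## Entry point for the next phase
When adding another community platform:
- First read `AGENTS.md`, `README.md`, this task document, and `任务/方案/steam社区监控一期计划.md`.
- New platform collectors should output `app.models.RawItem`.
- Keep reusing `raw_items`, `analysis_results`, and `work_items`.
- Do not put platform-private fields directly into dashboard query conditions; store them in `raw_json` and the unified fields first.
- For platforms that need login state, an API, anti-scraping workarounds, or browser automation, verify the current facts before implementing.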
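
The rule that a new collector outputs `RawItem` and keeps private fields in `raw_json` might look like the sketch below. `app/models.py` is not part of this listing, so the dataclass here is a stand-in whose field names are copied from the `RawItem(...)` call in `app/main.py`; the platform name `example_forum`, the function `forum_post_to_raw_item`, and the input keys (`id`, `url`, `user_id`, and so on) are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class RawItem:
    """Stand-in for app.models.RawItem (fields as used in app/main.py)."""

    source: str
    source_item_id: str
    source_url: str
    content_type: str
    author_id: str | None
    author_name: str | None
    title: str | None
    published_at: int | None
    published_at_text: str | None
    updated_at_source: int | None
    content: str
    raw: dict[str, Any] = field(default_factory=dict)


def forum_post_to_raw_item(post: dict[str, Any]) -> RawItem:
    """Map one post from a hypothetical platform onto the unified fields."""
    return RawItem(
        source="example_forum",
        source_item_id=str(post["id"]),  # stable per-platform ID for deduplication
        source_url=post["url"],
        content_type="example_forum_post",
        author_id=post.get("user_id"),
        author_name=post.get("user_name"),
        title=post.get("subject"),
        published_at=post.get("created_ts"),
        published_at_text=None,
        updated_at_source=None,
        content=post.get("body", ""),
        raw=post,  # platform-private fields stay here, not in query columns
    )
```

Keeping the whole source payload in `raw` means a field that later turns out to matter can be promoted to a unified column without re-collecting.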


@ -0,0 +1,86 @@
# Twitter Monitor MVP
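Date: 2026-05-16
Status: completed
## Background
The user asked for X.com/Twitter player-feedback collection and handling on top of the completed Steam MVP. The target source is `https://x.com/Tohotopia`, and the scope is all posts and all replies, with a full first run and time-based incremental runs afterwards. The work keeps reusing the `RawItem -> raw_items -> OpenRouter -> analysis_results -> work_items -> dashboard` pipeline.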
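## Requirements
- In scope: collect the posts of the X.com/Twitter account `Tohotopia` and the replies under each post, normalize them to `RawItem`, and feed them into the existing sync, analysis, and dashboard pipeline.
- Out of scope: changing the dashboard's core data structures; promoting Twitter-private fields to dashboard query fields; faking empty results when not logged in.
- Success criteria: with a working local login state, the CLI/sync collects Twitter posts and replies and stores them with deduplication, and new content reaches OpenRouter analysis and the dashboard.
- Key constraints: X.com's current pages, API, and login state are dynamic facts, so verify them with a local smoke test first; a failed collection must not delete old data.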
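## Docs/code pre-read
- Project AGENTS: each new channel encapsulates its own collection, parsing, rate limiting, login state, and failure handling; operational judgments must be traceable to the platform, original link, collection time, or batch.
- Relevant docs: `README.md` and `任务/方案/后续社区平台接入指南.md` state that new platform collectors output `app.models.RawItem` and reuse the three-layer data model.
- Relevant code: `app/sync.py` currently collects only Steam; `RawItem` in `app/models.py` can hold Twitter data; `app/db.py` already has `raw_json` and a unique key on `(source, source_item_id)`.
- Confirmed facts: the existing `social-media-scraper` skill supports X.com user timelines and single-post replies. It intercepts API responses through a logged-in Chrome/CDP profile and emits JSON/CSV.
- Conflicts / ambiguities: the user was unsure whether login state is required; the local smoke test confirmed that the current profile hits the X.com login prompt.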
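## Terminology and conflicts
- Resolved terms: the Twitter/X platform uses `twitter` as the configuration prefix in code; the source types are `twitter_posts` and `twitter_replies`.
- Conflicts: none.
- Follow-up CONTEXT / glossary updates: there is no project-level `CONTEXT.md` yet, so the terms from this work are recorded in this task document.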
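## Current plan
- [x] T1 Pre-read the docs and the existing Steam pipeline code.
- [x] T2 Verify the X.com target page and the prerequisites of the available collection tools.
- [x] T3 Draft the Twitter integration plan and data mapping.
- [x] T4 Implement the collector and wire it into the sync flow.
- [x] T5 Update the CLI, configuration, docs, and task record.
- [x] T6 Run a smoke test to verify storage, analysis, and the dashboard.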
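## Key judgments and evidence
| Judgment | Type (stable principle / current fact / inference) | Evidence | Verified on | Unverified | Decision impact |
|----------|-----------------------------------------------------|----------|-------------|------------|-----------------|
| A new platform should output `RawItem` and then reuse the sync pipeline | Stable principle | README, the follow-up community platform integration guide, `app/models.py` | 2026-05-16 | None | Keeps the dashboard from depending directly on Twitter-private fields |
| X.com collection currently requires login state | Current fact | `social-media-scraper` not-logged-in prompt; a small sample after login fetched 18 items | 2026-05-16 | Volume and duration of a full reply scan | The implementation must handle the not-logged-in failure explicitly and state the prerequisite |
| Reusing the existing CDP collection script is safer than rewriting against the X.com API | Inference | The existing skill supports UserTweets/TweetDetail; after login the project sync entry point stored and analyzed items successfully | 2026-05-16 | Duration of a full scan | Add an in-project adapter layer that reads the JSON and converts it to RawItem |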
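## Execution log
- 14:00: Read `AGENTS.md`, `README.md`, `.codex/tasks/steam-monitor-mvp.md`, and `任务/方案/后续社区平台接入指南.md`; confirmed the rules for integrating a new platform.
- 14:05: Read the `social-media-scraper` skill; confirmed that X.com user timelines and single-post replies are supported and that the output location can be specified.
- 14:10: Ran `python C:\Users\jiajiankun\.codex\skills\social-media-scraper\scraper.py https://x.com/Tohotopia --max-no-new 1 --output-dir 任务/验证/twitter-smoke`; the result was that the current Chrome profile is not logged in to X.com.
- 14:25: Added `app/twitter.py`, which converts the timeline/thread JSON emitted by `social-media-scraper` into `RawItem`, with content types `twitter_post` / `twitter_reply` and sources `twitter_posts` / `twitter_replies`.
- 14:35: Extended `app/config.py`, `app/sync.py`, `app/cli.py`, and `app/main.py` to support `TWITTER_ENABLED`, platform-level sync, a Twitter-only CLI run, and dashboard type filters.
- 14:42: Updated `.env.example`, `README.md`, and `requirements.txt` with the Twitter login prerequisite, configuration, and dependencies.
- 14:50: Corrected the Twitter incremental high-water mark from "the time the most recent sync finished" to "the maximum publish/collection time of stored Twitter content", so that content published before the previous sync finished is not missed.
- 14:55: Verified that `python -m compileall app` passes. With the default configuration, `python -m app.cli sync --platform twitter` returns `twitter_skipped=1`. With Twitter temporarily enabled it returns `twitter_errors=1` and `sync_runs.status=partial`, and no empty Twitter data is inserted.
- 19:17: After the user logged in to X.com in the CDP Chrome profile, ran a small-sample verification with `social-media-scraper` and fetched 18 `Tohotopia` timeline items.
- 19:21: Ran a small-scope project sync: `TWITTER_ENABLED=true`, `TWITTER_INCREMENTAL_MAX_NO_NEW=1`, `TWITTER_THREAD_MAX_NO_NEW=1`, `TWITTER_INCREMENTAL_REPLY_PARENT_LIMIT=2`, `python -m app.cli sync --platform twitter`. Result: `twitter_fetched=26`, 22 new, 22 analyzed, 4 already seen.
- 19:25: Confirmed in the database that 18 `twitter_posts` and 4 `twitter_replies` were stored; the most recent sync, `id=12`, has status `success`.
- 19:24-19:51: After setting `TWITTER_ENABLED=true`, the user started `python -m app.cli sync --platform twitter --full`. After the user interrupted the command a process was still alive, but for 30 seconds there was no file growth and almost no CPU change, so it was judged to be making no further progress.
- 19:53: Stopped the leftover full-scan process `pid=81152`, changed `sync_runs id=13` from `running` to `partial`, and kept the stored data. The final Twitter data is 34 main posts and 139 replies, 173 items in total, all with analysis status `done`.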
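
The 14:50 high-water-mark change can be sketched as a single query. This is an illustration under assumptions, not the code in `app/sync.py` (which is not shown here): the function name `twitter_high_water_mark` is hypothetical, and combining the two timestamps with `COALESCE(published_at, collected_at)` is one plausible reading of "maximum publish/collection time". The demo table contains only the columns the query touches.

```python
from __future__ import annotations

import sqlite3


def twitter_high_water_mark(conn: sqlite3.Connection) -> int | None:
    """Newest publish time (or collection time when unpublished) of stored Twitter items."""
    row = conn.execute(
        """
        SELECT MAX(COALESCE(published_at, collected_at))
        FROM raw_items
        WHERE source IN ('twitter_posts', 'twitter_replies')
        """
    ).fetchone()
    return row[0]  # None when no Twitter rows exist yet


# Reduced demo table, not the full raw_items schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_items (source TEXT, published_at INTEGER, collected_at INTEGER NOT NULL)"
)
empty_mark = twitter_high_water_mark(conn)
conn.executemany(
    "INSERT INTO raw_items VALUES (?, ?, ?)",
    [
        ("twitter_posts", 100, 500),
        ("twitter_replies", None, 300),  # no publish time: falls back to collected_at
        ("other_platform", 900, 950),  # ignored: not a Twitter source
    ],
)
mark = twitter_high_water_mark(conn)
```

Deriving the mark from stored content rather than from the sync finish time means that an item published while a long sync was still running is still newer than the mark on the next run.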
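## Current status
- Done: docs/code pre-read; verification of the X.com login-state prerequisite; Twitter collection adapter, configuration, sync, CLI, dashboard copy, and doc updates; compilation, the not-logged-in failure path, and a small end-to-end verification after login.
- Blockers: none.
- Next step: to run the first full pass over "all posts and all replies", set `TWITTER_ENABLED=true` in `.env` and run `python -m app.cli sync --platform twitter --full`.
## Five-layer change candidates
- None.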
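## Resume entry points
Read these first when resuming:
- Key files: `app/twitter.py`, `app/sync.py`, `app/config.py`, `app/cli.py`, `app/main.py`.
- Current goal: bring the posts and replies of `https://x.com/Tohotopia` into the existing RawItem pipeline.
- Current status: the implementation is complete; the X.com login state is stored in the CDP profile; a small-scope sync succeeded; one full sync was interrupted, after which the leftover process was cleaned up and the 173 analyzed items were kept.
- Most recently completed: cleaned up the leftover full-scan process and marked `sync_runs id=13` as partial.
- Next step: to continue the full scan, run `python -m app.cli sync --platform twitter --full` again; the existing deduplication skips content that is already stored.
- Do not: treat a failure caused by not being logged in as "no data"; change the dashboard data model.
- Files changed: `.codex/tasks/twitter-monitor-mvp.md`, `app/twitter.py`, `app/config.py`, `app/sync.py`, `app/cli.py`, `app/main.py`, `.env.example`, `README.md`, `requirements.txt`.
- Verification results: `python -m compileall app` passes; with Twitter disabled by default the platform is skipped; a not-logged-in run ends as partial; after login the project sync succeeds; Twitter currently has 34 main posts and 139 replies, and all 173 items are `done`.
- Current blockers: none.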

22
.env.example Normal file

@ -0,0 +1,22 @@
OPENROUTER_API_KEY=
APP_ID=3774440
PRODUCT_NAME=帝国幻想乡~TOHOTOPIA
DATABASE_PATH=data/tohotopia_monitor.sqlite3
SYNC_INTERVAL_MINUTES=30
AUTO_SYNC_ENABLED=true
TWITTER_ENABLED=false
TWITTER_USERNAME=Tohotopia
TWITTER_BROWSER_PROVIDER=existing
TWITTER_OUTPUT_DIR=任务/社媒数据/twitter-monitor
TWITTER_FULL_MAX_NO_NEW=6
TWITTER_INCREMENTAL_MAX_NO_NEW=2
TWITTER_THREAD_MAX_NO_NEW=3
TWITTER_COMMAND_TIMEOUT_SECONDS=900
TWITTER_FULL_REPLY_POST_LIMIT=0
TWITTER_INCREMENTAL_REPLY_PARENT_LIMIT=20
DISCUSSION_FULL_SCAN_MAX_PAGES=500
DISCUSSION_INCREMENTAL_MAX_PAGES=5
FULL_SCAN_TIME_LIMIT_SECONDS=7200
OPENROUTER_MODEL=deepseek/deepseek-v4-pro
OPENROUTER_REFERER=http://localhost:8000
OPENROUTER_TITLE=TOHOTOPIA Steam Monitor

13
.gitignore vendored Normal file

@ -0,0 +1,13 @@
.env
.venv/
__pycache__/
*.pyc
*.pyo
.pytest_cache/
.mypy_cache/
data/
任务/社媒数据/
任务/验证/**/*.json
任务/验证/**/*.csv
任务/验证/**/*.log

33
AGENTS.md Normal file

@ -0,0 +1,33 @@
# AGENTS.md
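## Project positioning
This project is a community monitoring and handling platform for newly released indie games. It adds collection, organization, analysis, and handling capabilities for community channels in stages.
The goal is not to cover every channel at once, but to first get a verifiable operations loop working: discover community content → normalize and store or record it → analyze priority → turn it into actionable items → track the handling result.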
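## Domain boundaries
- The platform is about the community operations workflow, not just a collection of scraper scripts.
- Community content handling should distinguish raw content, normalized records, analysis conclusions, and manual handling status.
- Operational judgments must be traceable to the source platform, original link, collection time, or collection batch.
- When integrating a new channel, first state its role in operations: feedback collection, sentiment monitoring, player support, content opportunities, competitor watching, or release-impact tracking.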
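## Channel integration principles
- Each channel encapsulates its own collection, parsing, rate limiting, login state, and failure handling.
- Channel output is normalized to stable fields as far as possible, so that upper-layer logic does not depend directly on page structure or platform-private fields.
- The channel plan must state explicitly how it handles repeated collection of the same content, edits, deleted or hidden content, and permission changes.
- Where an external platform's current API, page structure, rate limits, or terms of service are involved, real-time verification results take precedence.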
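## Data priorities
Prefer keeping the information that supports operational decisions and traceability:
- Source platform and original link
- Author identifier
- Publish time and collection time
- Body text or summary
- Engagement metrics
- Topic, sentiment, issue type, or handling tags
- Current handling status and owner record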

1
app/__init__.py Normal file

@ -0,0 +1 @@
"""TOHOTOPIA community monitor."""

61
app/cli.py Normal file

@ -0,0 +1,61 @@
from __future__ import annotations

import argparse
import json
import time

from .config import get_settings
from .db import init_db, session
from .sync import analyze_pending, run_sync


def _platforms(value: str | None) -> list[str] | None:
    if not value:
        return None
    selected = [part.strip().lower() for part in value.split(",") if part.strip()]
    allowed = {"steam", "twitter"}
    unknown = sorted(set(selected) - allowed)
    if unknown:
        raise argparse.ArgumentTypeError(f"Unsupported platform(s): {', '.join(unknown)}")
    return selected


def main() -> None:
    parser = argparse.ArgumentParser(description="TOHOTOPIA community monitor")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("init-db", help="Initialize SQLite database")

    sync_parser = sub.add_parser("sync", help="Fetch community content and analyze new items")
    sync_parser.add_argument("--full", action="store_true", help="Run first full scan")
    sync_parser.add_argument(
        "--platform",
        type=_platforms,
        help="Comma-separated platform list: steam,twitter. Defaults to all enabled platforms.",
    )

    analyze_parser = sub.add_parser("analyze-pending", help="Analyze pending/error items")
    analyze_parser.add_argument("--limit", type=int, default=50)
    analyze_parser.add_argument("--since", help="Only analyze items since YYYY-MM-DD")

    args = parser.parse_args()
    settings = get_settings()

    with session(settings.database_path) as conn:
        init_db(conn)
        if args.command == "init-db":
            result = {"database": str(settings.database_path)}
        elif args.command == "sync":
            result = run_sync(conn, settings, full=args.full, platforms=args.platform)
        elif args.command == "analyze-pending":
            since_ts = None
            if args.since:
                # Interpreted as local midnight of the given date.
                parsed = time.strptime(args.since, "%Y-%m-%d")
                since_ts = int(time.mktime(parsed))
            result = analyze_pending(conn, settings, limit=args.limit, since_ts=since_ts)
        else:
            raise SystemExit(f"Unknown command: {args.command}")

    print(json.dumps(result, ensure_ascii=False, indent=2))


if __name__ == "__main__":
    main()
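
How the `--platform` option behaves can be seen in isolation. The following is a self-contained sketch that mirrors the validation in `_platforms`; the names `platforms`, `ALLOWED`, and the `demo` parser are illustrative and not part of the project.

```python
import argparse

ALLOWED = {"steam", "twitter"}


def platforms(value: str) -> list:
    """argparse `type=` callable: split, normalize, and validate a comma-separated list."""
    selected = [part.strip().lower() for part in value.split(",") if part.strip()]
    unknown = sorted(set(selected) - ALLOWED)
    if unknown:
        # argparse turns this into a usage error and exits with status 2.
        raise argparse.ArgumentTypeError(f"Unsupported platform(s): {', '.join(unknown)}")
    return selected


parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("--platform", type=platforms)

mixed_case = parser.parse_args(["--platform", "Steam, twitter"]).platform
omitted = parser.parse_args([]).platform
```

Here `mixed_case` is the normalized list and `omitted` is `None`, which is the value the CLI forwards to `run_sync` as `platforms` when the flag is not given.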

94
app/config.py Normal file

@ -0,0 +1,94 @@
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
import os

from dotenv import load_dotenv

ROOT_DIR = Path(__file__).resolve().parent.parent
load_dotenv(ROOT_DIR / ".env")


def _int_env(name: str, default: int) -> int:
    value = os.getenv(name)
    if not value:
        return default
    return int(value)


def _bool_env(name: str, default: bool) -> bool:
    value = os.getenv(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    app_id: str
    product_name: str
    database_path: Path
    sync_interval_minutes: int
    auto_sync_enabled: bool
    twitter_enabled: bool
    twitter_username: str
    twitter_scraper_path: Path
    twitter_output_dir: Path
    twitter_browser_provider: str
    twitter_full_max_no_new: int
    twitter_incremental_max_no_new: int
    twitter_thread_max_no_new: int
    twitter_command_timeout_seconds: int
    twitter_full_reply_post_limit: int
    twitter_incremental_reply_parent_limit: int
    discussion_full_scan_max_pages: int
    discussion_incremental_max_pages: int
    full_scan_time_limit_seconds: int
    openrouter_api_key: str | None
    openrouter_model: str
    openrouter_referer: str
    openrouter_title: str


def get_settings() -> Settings:
    database_path = Path(os.getenv("DATABASE_PATH", "data/tohotopia_monitor.sqlite3"))
    if not database_path.is_absolute():
        database_path = ROOT_DIR / database_path

    twitter_scraper_path = Path(
        os.getenv(
            "TWITTER_SCRAPER_PATH",
            str(Path.home() / ".codex" / "skills" / "social-media-scraper" / "scraper.py"),
        )
    )
    if not twitter_scraper_path.is_absolute():
        twitter_scraper_path = ROOT_DIR / twitter_scraper_path

    twitter_output_dir = Path(os.getenv("TWITTER_OUTPUT_DIR", "任务/社媒数据/twitter-monitor"))
    if not twitter_output_dir.is_absolute():
        twitter_output_dir = ROOT_DIR / twitter_output_dir

    return Settings(
        app_id=os.getenv("APP_ID", "3774440"),
        product_name=os.getenv("PRODUCT_NAME", "帝国幻想乡~TOHOTOPIA"),
        database_path=database_path,
        sync_interval_minutes=_int_env("SYNC_INTERVAL_MINUTES", 30),
        auto_sync_enabled=_bool_env("AUTO_SYNC_ENABLED", True),
        twitter_enabled=_bool_env("TWITTER_ENABLED", False),
        twitter_username=os.getenv("TWITTER_USERNAME", "Tohotopia"),
        twitter_scraper_path=twitter_scraper_path,
        twitter_output_dir=twitter_output_dir,
        twitter_browser_provider=os.getenv("TWITTER_BROWSER_PROVIDER", "existing"),
        twitter_full_max_no_new=_int_env("TWITTER_FULL_MAX_NO_NEW", 6),
        twitter_incremental_max_no_new=_int_env("TWITTER_INCREMENTAL_MAX_NO_NEW", 2),
        twitter_thread_max_no_new=_int_env("TWITTER_THREAD_MAX_NO_NEW", 3),
        twitter_command_timeout_seconds=_int_env("TWITTER_COMMAND_TIMEOUT_SECONDS", 900),
        twitter_full_reply_post_limit=_int_env("TWITTER_FULL_REPLY_POST_LIMIT", 0),
        twitter_incremental_reply_parent_limit=_int_env("TWITTER_INCREMENTAL_REPLY_PARENT_LIMIT", 20),
        discussion_full_scan_max_pages=_int_env("DISCUSSION_FULL_SCAN_MAX_PAGES", 500),
        discussion_incremental_max_pages=_int_env("DISCUSSION_INCREMENTAL_MAX_PAGES", 5),
        full_scan_time_limit_seconds=_int_env("FULL_SCAN_TIME_LIMIT_SECONDS", 7200),
        openrouter_api_key=os.getenv("OPENROUTER_API_KEY"),
        openrouter_model=os.getenv("OPENROUTER_MODEL", "deepseek/deepseek-v4-pro"),
        openrouter_referer=os.getenv("OPENROUTER_REFERER", "http://localhost:8000"),
        openrouter_title=os.getenv("OPENROUTER_TITLE", "TOHOTOPIA Steam Monitor"),
    )
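
One detail of the two environment helpers is easy to miss: a variable that is set but empty (as `KEY=` in a `.env` file produces) is treated differently by each. The sketch below copies the two helpers so it can run on its own; the variable names `DEMO_INTERVAL` and `DEMO_FLAG` are made up for the demonstration.

```python
import os


def _int_env(name: str, default: int) -> int:
    value = os.getenv(name)
    if not value:  # unset OR empty string: use the default
        return default
    return int(value)


def _bool_env(name: str, default: bool) -> bool:
    value = os.getenv(name)
    if value is None:  # only an unset variable uses the default
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


os.environ["DEMO_INTERVAL"] = ""
os.environ["DEMO_FLAG"] = ""
os.environ.pop("DEMO_UNSET", None)

interval = _int_env("DEMO_INTERVAL", 30)       # empty falls back to 30
flag_empty = _bool_env("DEMO_FLAG", True)      # empty is not a truthy word: False
flag_unset = _bool_env("DEMO_UNSET", True)     # unset: default True
```

In practice this means a line such as `AUTO_SYNC_ENABLED=` with nothing after the equals sign turns the flag off, even though its default is on.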

120
app/db.py Normal file

@ -0,0 +1,120 @@
from __future__ import annotations

from contextlib import contextmanager
from pathlib import Path
import json
import sqlite3
from typing import Any, Iterator


def connect(database_path: Path) -> sqlite3.Connection:
    database_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(database_path)
    conn.row_factory = sqlite3.Row
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA foreign_keys=ON")
    return conn


@contextmanager
def session(database_path: Path) -> Iterator[sqlite3.Connection]:
    conn = connect(database_path)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS raw_items (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            source TEXT NOT NULL,
            source_item_id TEXT NOT NULL,
            source_url TEXT NOT NULL,
            content_type TEXT NOT NULL,
            author_id TEXT,
            author_name TEXT,
            title TEXT,
            published_at INTEGER,
            published_at_text TEXT,
            collected_at INTEGER NOT NULL,
            updated_at_source INTEGER,
            content TEXT NOT NULL,
            raw_json TEXT NOT NULL,
            content_hash TEXT NOT NULL,
            analysis_status TEXT NOT NULL DEFAULT 'pending',
            UNIQUE(source, source_item_id)
        );

        CREATE TABLE IF NOT EXISTS analysis_results (
            raw_item_id INTEGER PRIMARY KEY,
            model TEXT NOT NULL,
            sentiment TEXT NOT NULL,
            is_positive INTEGER NOT NULL,
            is_negative INTEGER NOT NULL,
            has_actionable_feedback INTEGER NOT NULL,
            feedback_types TEXT NOT NULL,
            reply_recommended INTEGER NOT NULL,
            reply_priority TEXT NOT NULL,
            reply_suggestion TEXT NOT NULL,
            summary TEXT NOT NULL,
            priority TEXT NOT NULL,
            confidence REAL NOT NULL,
            reason TEXT NOT NULL,
            model_json TEXT NOT NULL,
            analyzed_at INTEGER NOT NULL,
            FOREIGN KEY(raw_item_id) REFERENCES raw_items(id) ON DELETE CASCADE
        );

        CREATE TABLE IF NOT EXISTS work_items (
            raw_item_id INTEGER PRIMARY KEY,
            status TEXT NOT NULL DEFAULT 'new',
            owner TEXT NOT NULL DEFAULT '',
            notes TEXT NOT NULL DEFAULT '',
            last_handled_at INTEGER,
            created_at INTEGER NOT NULL,
            updated_at INTEGER NOT NULL,
            FOREIGN KEY(raw_item_id) REFERENCES raw_items(id) ON DELETE CASCADE
        );

        CREATE TABLE IF NOT EXISTS sync_state (
            key TEXT PRIMARY KEY,
            value TEXT NOT NULL,
            updated_at INTEGER NOT NULL
        );

        CREATE TABLE IF NOT EXISTS sync_runs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            started_at INTEGER NOT NULL,
            finished_at INTEGER,
            mode TEXT NOT NULL,
            status TEXT NOT NULL,
            message TEXT NOT NULL DEFAULT '',
            stats_json TEXT NOT NULL DEFAULT '{}'
        );

        CREATE INDEX IF NOT EXISTS idx_raw_items_collected_at ON raw_items(collected_at DESC);
        CREATE INDEX IF NOT EXISTS idx_raw_items_content_type ON raw_items(content_type);
        CREATE INDEX IF NOT EXISTS idx_raw_items_analysis_status ON raw_items(analysis_status);
        CREATE INDEX IF NOT EXISTS idx_work_items_status ON work_items(status);
        """
    )


def encode_json(value: Any) -> str:
    return json.dumps(value, ensure_ascii=False, separators=(",", ":"))


def decode_json(value: str | None, default: Any = None) -> Any:
    if value is None:
        return default
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        return default
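
The `UNIQUE(source, source_item_id)` constraint is what lets a repeated sync skip items that are already stored. The sketch below shows one way to build on it. It is not the project's `upsert_raw_item` (that function lives in `app/sync.py`, which is not shown here): the name `insert_if_new`, the reduced table, and the choice of `INSERT OR IGNORE` are assumptions; only the `(id, inserted)` return shape follows the call site in `app/main.py`.

```python
from __future__ import annotations

import sqlite3
import time


def insert_if_new(
    conn: sqlite3.Connection, source: str, source_item_id: str, content: str
) -> tuple[int, bool]:
    """Insert a row unless (source, source_item_id) already exists; return (id, inserted)."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO raw_items (source, source_item_id, content, collected_at) "
        "VALUES (?, ?, ?, ?)",
        (source, source_item_id, content, int(time.time())),
    )
    inserted = cur.rowcount == 1  # 0 when the UNIQUE constraint suppressed the insert
    row = conn.execute(
        "SELECT id FROM raw_items WHERE source = ? AND source_item_id = ?",
        (source, source_item_id),
    ).fetchone()
    return row[0], inserted


# Reduced demo table: a subset of the raw_items columns defined in init_db.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE raw_items (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        source TEXT NOT NULL,
        source_item_id TEXT NOT NULL,
        content TEXT NOT NULL,
        collected_at INTEGER NOT NULL,
        UNIQUE(source, source_item_id)
    )
    """
)
first = insert_if_new(conn, "twitter_posts", "123", "hello")
second = insert_if_new(conn, "twitter_posts", "123", "hello again")
row_count = conn.execute("SELECT COUNT(*) FROM raw_items").fetchone()[0]
```

This variant leaves an existing row untouched. A collector that needs to pick up edits would instead compare `content_hash` and update the row, which this sketch does not cover.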

717
app/main.py Normal file

@ -0,0 +1,717 @@
from __future__ import annotations

from hashlib import sha1
from html import escape
import threading
import time
from typing import Any

from fastapi import FastAPI, Form, Query
from fastapi.responses import HTMLResponse, RedirectResponse

from .config import Settings, get_settings
from .db import decode_json, init_db, session
from .models import RawItem
from .openrouter import OpenRouterClient
from .sync import analyze_pending, run_sync, save_analysis, upsert_raw_item

app = FastAPI(title="TOHOTOPIA Steam Monitor")
sync_lock = threading.Lock()
analysis_lock = threading.Lock()
stop_event = threading.Event()


def current_settings() -> Settings:
    return get_settings()


def _fmt_ts(value: int | None) -> str:
    if not value:
        return ""
    return time.strftime("%Y-%m-%d %H:%M", time.localtime(int(value)))


def _badge(text: str, cls: str = "") -> str:
    return f'<span class="badge {cls}">{escape(text)}</span>'


def _manual_item_id(source_url: str, source_name: str, title: str, author_name: str, content: str) -> str:
    # Prefer the URL as the identity; otherwise hash the descriptive fields.
    seed = source_url.strip() or "\n".join(
        [source_name.strip(), title.strip(), author_name.strip(), content.strip()]
    )
    return sha1(seed.encode("utf-8", errors="ignore")).hexdigest()


def _looks_chinese(text: str) -> bool:
    letters = [char for char in text if char.isalpha()]
    if not letters:
        return True
    cjk_count = sum(1 for char in letters if "\u4e00" <= char <= "\u9fff")
    return cjk_count / len(letters) >= 0.2


def _query(filters: dict[str, str]) -> tuple[str, list[Any]]:
    where = []
    params: list[Any] = []
    if filters.get("content_type"):
        where.append("r.content_type = ?")
        params.append(filters["content_type"])
    if filters.get("sentiment"):
        where.append("a.sentiment = ?")
        params.append(filters["sentiment"])
    if filters.get("status"):
        where.append("w.status = ?")
        params.append(filters["status"])
    if filters.get("reply") == "1":
        where.append("a.reply_recommended = 1")
    if filters.get("actionable") == "1":
        where.append("a.has_actionable_feedback = 1")
    if filters.get("q"):
        where.append("(r.content LIKE ? OR r.title LIKE ? OR a.summary LIKE ?)")
        like = f"%{filters['q']}%"
        params.extend([like, like, like])
    clause = "WHERE " + " AND ".join(where) if where else ""
    return clause, params


@app.on_event("startup")
def startup() -> None:
    settings = current_settings()
    with session(settings.database_path) as conn:
        init_db(conn)
    if settings.auto_sync_enabled:
        thread = threading.Thread(target=_sync_loop, name="steam-sync-loop", daemon=True)
        thread.start()


@app.on_event("shutdown")
def shutdown() -> None:
    stop_event.set()


def _sync_loop() -> None:
    settings = current_settings()
    interval_seconds = max(settings.sync_interval_minutes, 1) * 60
    while not stop_event.wait(interval_seconds):
        if not sync_lock.acquire(blocking=False):
            continue
        try:
            with session(settings.database_path) as conn:
                run_sync(conn, settings, full=False)
        except Exception:
            # Sync failures are recorded in sync_runs by run_sync when possible.
            pass
        finally:
            sync_lock.release()


@app.get("/", response_class=HTMLResponse)
def index(
    content_type: str = Query(""),
    sentiment: str = Query(""),
    status: str = Query(""),
    reply: str = Query(""),
    actionable: str = Query(""),
    q: str = Query(""),
    manual: str = Query(""),
    notice: str = Query(""),
) -> str:
    settings = current_settings()
    filters = {
        "content_type": content_type,
        "sentiment": sentiment,
        "status": status,
        "reply": reply,
        "actionable": actionable,
        "q": q,
    }
    with session(settings.database_path) as conn:
        clause, params = _query(filters)
        rows = conn.execute(
            f"""
            SELECT r.*, a.sentiment, a.is_positive, a.is_negative,
                   a.has_actionable_feedback, a.feedback_types, a.reply_recommended,
                   a.reply_priority, a.reply_suggestion, a.summary, a.priority,
                   a.confidence, a.reason, w.status, w.owner, w.notes
            FROM raw_items r
            LEFT JOIN analysis_results a ON a.raw_item_id = r.id
            LEFT JOIN work_items w ON w.raw_item_id = r.id
            {clause}
            ORDER BY
                COALESCE(a.reply_recommended, 0) DESC,
                COALESCE(r.published_at, r.collected_at) DESC,
                r.collected_at DESC,
                r.id DESC
            LIMIT 200
            """,
            params,
        ).fetchall()
        metrics = conn.execute(
            """
            SELECT
                COUNT(*) AS total,
                SUM(CASE WHEN w.status = 'new' THEN 1 ELSE 0 END) AS new_count,
                SUM(CASE WHEN a.is_negative = 1 THEN 1 ELSE 0 END) AS negative_count,
                SUM(CASE WHEN a.has_actionable_feedback = 1 THEN 1 ELSE 0 END) AS actionable_count,
                SUM(CASE WHEN a.reply_recommended = 1 THEN 1 ELSE 0 END) AS reply_count,
                SUM(CASE WHEN a.priority = 'high' THEN 1 ELSE 0 END) AS high_count,
                SUM(CASE WHEN r.analysis_status = 'done' THEN 1 ELSE 0 END) AS analyzed_count,
                SUM(CASE WHEN r.analysis_status = 'pending' THEN 1 ELSE 0 END) AS pending_count,
                SUM(CASE WHEN r.analysis_status = 'error' THEN 1 ELSE 0 END) AS error_count
            FROM raw_items r
            LEFT JOIN analysis_results a ON a.raw_item_id = r.id
            LEFT JOIN work_items w ON w.raw_item_id = r.id
            """
        ).fetchone()
        last_runs = conn.execute(
            "SELECT * FROM sync_runs ORDER BY started_at DESC LIMIT 5"
        ).fetchall()
        last_success = conn.execute(
            """
            SELECT finished_at FROM sync_runs
            WHERE status = 'success' AND finished_at IS NOT NULL
            ORDER BY finished_at DESC
            LIMIT 1
            """
        ).fetchone()
        latest_collected = conn.execute(
            "SELECT MAX(collected_at) AS collected_at FROM raw_items"
        ).fetchone()

    items_html = "\n".join(_render_item(row) for row in rows)
    runs_html = "\n".join(
        f"<li>{_fmt_ts(run['started_at'])} {escape(run['mode'])} "
        f"{escape(run['status'])} {escape(run['stats_json'] or '')} {escape(run['message'] or '')}</li>"
        for run in last_runs
    )
    return f"""
<!doctype html>
<html lang="zh-CN">
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>{escape(settings.product_name)} 社区监控</title>
  <style>{CSS}</style>
</head>
<body>
  <header>
    <div>
      <h1>{escape(settings.product_name)} 社区监控</h1>
      <p>Steam 与社区平台内容 {settings.sync_interval_minutes} 分钟刷新</p>
      <p>最近更新时间{_last_update_text(last_success, latest_collected)}</p>
    </div>
    <div class="actions">
      <form method="post" action="/sync"><button>增量同步</button></form>
      <form method="post" action="/sync?full=1"><button class="secondary">全量同步</button></form>
      <form method="post" action="/analyze-pending"><button class="secondary">补跑分析</button></form>
      <a class="button secondary" href="/?manual=1">手动添加</a>
    </div>
  </header>
  <section class="metrics">
    {_metric("总内容", metrics["total"])}
    {_metric("未处理", metrics["new_count"])}
    {_metric("差评/负面", metrics["negative_count"])}
    {_metric("具体反馈", metrics["actionable_count"])}
    {_metric("建议回复", metrics["reply_count"])}
    {_metric("高优先级", metrics["high_count"])}
    {_metric("已分析", metrics["analyzed_count"])}
    {_metric("待补跑", (metrics["pending_count"] or 0) + (metrics["error_count"] or 0))}
  </section>
  {f'<div class="notice">{escape(notice)}</div>' if notice else ''}
  {_render_manual_form() if manual == '1' else ''}
  <form class="filters" method="get">
    {_select("content_type", content_type, {"": "全部类型", "review": "Steam 评测", "discussion_topic": "Steam 帖子", "discussion_reply": "Steam 回复", "twitter_post": "Twitter 帖子", "twitter_reply": "Twitter 回复", "manual_note": "手动添加"})}
    {_select("sentiment", sentiment, {"": "全部情绪", "positive": "正面", "negative": "负面", "mixed": "混合", "neutral": "中性"})}
    {_select("status", status, {"": "全部状态", "new": "未处理", "read": "已读", "needs_reply": "待回复", "replied": "已回复", "needs_fix": "待修复", "archived": "已归档"})}
    <label><input type="checkbox" name="reply" value="1" {'checked' if reply == '1' else ''}> 建议回复</label>
    <label><input type="checkbox" name="actionable" value="1" {'checked' if actionable == '1' else ''}> 具体反馈</label>
    <input name="q" placeholder="搜索正文/摘要" value="{escape(q)}">
    <button>筛选</button>
  </form>
  <main>{items_html or '<div class="empty">暂无数据。先运行同步。</div>'}</main>
  <aside>
    <h2>最近同步</h2>
    <ul>{runs_html or '<li>暂无同步记录</li>'}</ul>
  </aside>
</body>
</html>
"""


@app.post("/sync")
def sync(full: int = Query(0)) -> RedirectResponse:
    if sync_lock.acquire(blocking=False):
        # The background thread releases sync_lock when it finishes.
        thread = threading.Thread(target=_run_sync_background, args=(bool(full),), daemon=True)
        thread.start()
        return RedirectResponse("/?notice=同步已在后台开始,稍后刷新查看结果", status_code=303)
    return RedirectResponse("/?notice=已有同步任务正在运行", status_code=303)


@app.post("/analyze-pending")
def analyze() -> RedirectResponse:
    if analysis_lock.acquire(blocking=False):
        # The background thread releases analysis_lock when it finishes.
        thread = threading.Thread(target=_run_analysis_background, kwargs={"limit": 20}, daemon=True)
        thread.start()
        return RedirectResponse("/?notice=补跑分析已在后台开始,每批最多 20 条,稍后刷新查看结果", status_code=303)
    return RedirectResponse("/?notice=已有补跑分析正在运行", status_code=303)


@app.post("/manual-items")
def create_manual_item(
    source_name: str = Form(...),
    source_url: str = Form(""),
    title: str = Form(""),
    author_name: str = Form(""),
    published_at_text: str = Form(""),
    content: str = Form(...),
    status: str = Form("new"),
    owner: str = Form(""),
    notes: str = Form(""),
) -> RedirectResponse:
    source_name = source_name.strip()
    source_url = source_url.strip()
    title = title.strip()
    author_name = author_name.strip()
    published_at_text = published_at_text.strip()
    content = content.strip()
    status = status if status in _work_status_options() else "new"
    if not source_name or not content:
        return RedirectResponse("/?manual=1&notice=来源社群和正文不能为空", status_code=303)

    original_content = content
    translated = False
    analysis_error = ""
    settings = current_settings()
    analyzer = OpenRouterClient(settings)
    try:
        if not _looks_chinese(content):
            content = analyzer.translate_to_chinese(content)
            translated = content != original_content
    except Exception as exc:  # noqa: BLE001 - keep manual entry even if translation fails
        analysis_error = f"翻译失败,已保留原文并标记待补跑:{exc}"

    item = RawItem(
        source="manual",
        source_item_id=_manual_item_id(source_url, source_name, title, author_name, content),
        source_url=source_url,
        content_type="manual_note",
        author_id=None,
        author_name=author_name or source_name,
        title=title or f"{source_name} 手动信息",
        published_at=None,
        published_at_text=published_at_text,
        updated_at_source=None,
        content=content,
        raw={
            "source_name": source_name,
            "source_url": source_url,
            "title": title,
            "author_name": author_name,
            "published_at_text": published_at_text,
            "original_content": original_content,
            "translated_to_zh": translated,
            "manual": True,
        },
    )
    now = int(time.time())
    try:
        with session(settings.database_path) as conn:
            raw_item_id, inserted = upsert_raw_item(conn, item)
            conn.execute(
                """
                UPDATE work_items
                SET status = ?, owner = ?, notes = ?, updated_at = ?,
                    last_handled_at = CASE WHEN ? != 'new' THEN ? ELSE last_handled_at END
                WHERE raw_item_id = ?
                """,
                (status, owner.strip(), notes.strip(), now, status, now, raw_item_id),
            )
            if not analysis_error:
                try:
                    analysis = analyzer.analyze(item)
                    save_analysis(conn, raw_item_id, settings.openrouter_model, analysis)
                except Exception as exc:  # noqa: BLE001 - keep pending/error for analyze-pending
                    analysis_error = f"分析失败,已标记待补跑:{exc}"
                    conn.execute(
                        "UPDATE raw_items SET analysis_status = 'error' WHERE id = ?",
                        (raw_item_id,),
                    )
    finally:
        analyzer.close()

    parts = ["已添加手动信息" if inserted else "已更新同来源手动信息"]
    if translated:
        parts.append("已翻译成中文")
    if analysis_error:
        parts.append(analysis_error)
    else:
        parts.append("已生成是否回复和回复建议")
    notice = "".join(parts)
    return RedirectResponse(f"/?notice={notice}", status_code=303)


@app.post("/items/{raw_item_id}/work")
def update_work(
    raw_item_id: int,
    status: str = Form(...),
    owner: str = Form(""),
    notes: str = Form(""),
) -> RedirectResponse:
    settings = current_settings()
    now = int(time.time())
    with session(settings.database_path) as conn:
        conn.execute(
            """
            UPDATE work_items
            SET status = ?, owner = ?, notes = ?, updated_at = ?,
                last_handled_at = CASE WHEN ? != 'new' THEN ? ELSE last_handled_at END
            WHERE raw_item_id = ?
            """,
            (status, owner, notes, now, status, now, raw_item_id),
        )
    return RedirectResponse("/", status_code=303)


def _run_sync_background(full: bool) -> None:
    settings = current_settings()
    try:
        with session(settings.database_path) as conn:
            run_sync(conn, settings, full=full)
    finally:
        sync_lock.release()


def _run_analysis_background(limit: int) -> None:
    settings = current_settings()
    try:
        with session(settings.database_path) as conn:
            analyze_pending(conn, settings, limit=limit)
    finally:
        analysis_lock.release()


def _notice_text(stats: dict[str, Any]) -> str:
    if not stats:
        return "无待处理项目"
    return "".join(f"{key}={value}" for key, value in stats.items())


def _last_update_text(last_success: Any, latest_collected: Any) -> str:
    if last_success and last_success["finished_at"]:
        return _fmt_ts(last_success["finished_at"])
    if latest_collected and latest_collected["collected_at"]:
        return _fmt_ts(latest_collected["collected_at"])
    return "暂无"


def _metric(label: str, value: Any) -> str:
    return f'<div class="metric"><span>{escape(label)}</span><strong>{int(value or 0)}</strong></div>'


def _select(name: str, current: str, options: dict[str, str]) -> str:
    option_html = "".join(
        f'<option value="{escape(value)}" {"selected" if value == current else ""}>{escape(label)}</option>'
        for value, label in options.items()
    )
    return f'<select name="{escape(name)}">{option_html}</select>'


def _work_status_options() -> dict[str, str]:
    return {
        "new": "未处理",
        "read": "已读",
        "needs_reply": "待回复",
        "replied": "已回复",
        "needs_fix": "待修复",
        "archived": "已归档",
    }


def _render_manual_form() -> str:
    return f"""
<section class="manual-panel">
  <h2>手动添加社区信息</h2>
  <form class="manual-form" method="post" action="/manual-items">
    <input name="source_name" placeholder="来源社群/平台,例如 Discord、小红书、QQ群" required>
    <input name="source_url" placeholder="原始链接,可留空">
    <input name="title" placeholder="标题,可留空">
    <input name="author_name" placeholder="作者/昵称,可留空">
    <input name="published_at_text" placeholder="发布时间文本,可留空">
    <textarea name="content" placeholder="正文/摘要" required></textarea>
    {_select("status", "new", _work_status_options())}
    <input name="owner" placeholder="制作人/处理人">
    <input name="notes" placeholder="备注">
    <button>添加</button>
  </form>
</section>
"""


def _render_item(row: Any) -> str:
    feedback_types = ", ".join(decode_json(row["feedback_types"], [])) if row["feedback_types"] else ""
    cls = "item urgent" if row["reply_recommended"] or row["priority"] == "high" else "item"
    badges = [
        _badge(row["content_type"] or "", "type"),
        _badge(row["sentiment"] or "pending", row["sentiment"] or ""),
        _badge(row["priority"] or "low", "priority"),
    ]
    if row["has_actionable_feedback"]:
        badges.append(_badge("具体反馈", "action"))
    if row["reply_recommended"]:
        badges.append(_badge("建议回复", "reply"))
    # Truncate the raw text before escaping so the cut never splits an HTML entity.
    content = row["content"] or ""
    if len(content) > 900:
        content = content[:900] + "..."
    content = escape(content)
    return f"""
<article class="{cls}">
  <div class="item-head">
    <div>
      <h2>{escape(row['summary'] or row['title'] or '未分析')}</h2>
      <div class="meta">{' '.join(badges)} <span>{escape(row['author_name'] or '')}</span> <span>{_fmt_ts(row['published_at']) or escape(row['published_at_text'] or '')}</span></div>
    </div>
    {_source_link(row['source_url'])}
  </div>
  <p class="content">{content}</p>
  <p class="reason">{escape(row['reason'] or '')}</p>
  <p class="reply-suggestion">{escape(row['reply_suggestion'] or '')}</p>
  <p class="types">{escape(feedback_types)}</p>
  <form class="work" method="post" action="/items/{row['id']}/work">
    {_select("status", row["status"] or "new", _work_status_options())}
    <input name="owner" placeholder="制作人/处理人" value="{escape(row['owner'] or '')}">
    <input name="notes" placeholder="备注" value="{escape(row['notes'] or '')}">
    <button>保存</button>
  </form>
</article>
"""


def _source_link(source_url: str | None) -> str:
    if not source_url:
        return '<span class="source muted">无原始链接</span>'
    if not source_url.startswith(("http://", "https://")):
        return f'<span class="source muted">{escape(source_url)}</span>'
    return (
        f'<a class="source" href="{escape(source_url)}" target="_blank" '
        f'rel="noreferrer">原始链接</a>'
    )
CSS = """
:root {
color-scheme: light;
font-family: Inter, "Segoe UI", "Microsoft YaHei", sans-serif;
background: #f6f7f9;
color: #1f2933;
}
body {
margin: 0;
}
header {
display: flex;
justify-content: space-between;
gap: 24px;
align-items: center;
padding: 24px 32px;
background: #ffffff;
border-bottom: 1px solid #d9dee7;
}
h1 {
margin: 0 0 4px;
font-size: 24px;
}
p {
line-height: 1.5;
}
header p {
margin: 0;
color: #64748b;
}
.actions {
display: flex;
gap: 8px;
flex-wrap: wrap;
}
button, .button, select, input, textarea {
min-height: 36px;
border: 1px solid #cbd5e1;
border-radius: 6px;
padding: 0 12px;
background: #fff;
font: inherit;
}
button, .button {
display: inline-flex;
align-items: center;
background: #166534;
color: white;
border-color: #166534;
cursor: pointer;
text-decoration: none;
}
button.secondary, .button.secondary {
background: #334155;
border-color: #334155;
}
.metrics {
display: grid;
grid-template-columns: repeat(6, minmax(120px, 1fr));
gap: 12px;
padding: 18px 32px;
}
.metric {
background: #fff;
border: 1px solid #d9dee7;
border-radius: 8px;
padding: 14px;
}
.metric span {
display: block;
color: #64748b;
font-size: 13px;
}
.metric strong {
display: block;
font-size: 26px;
margin-top: 6px;
}
.filters {
display: flex;
gap: 10px;
flex-wrap: wrap;
align-items: center;
padding: 0 32px 18px;
}
.manual-panel {
margin: 0 32px 18px;
padding: 18px;
border: 1px solid #d9dee7;
border-radius: 8px;
background: #fff;
}
.manual-panel h2 {
margin: 0 0 12px;
font-size: 17px;
}
.manual-form {
display: grid;
grid-template-columns: repeat(3, minmax(160px, 1fr));
gap: 10px;
}
.manual-form textarea {
grid-column: 1 / -1;
min-height: 120px;
padding: 10px 12px;
resize: vertical;
}
.notice {
margin: 0 32px 18px;
padding: 12px 14px;
border: 1px solid #86efac;
border-radius: 8px;
background: #f0fdf4;
color: #166534;
}
main {
display: grid;
gap: 14px;
padding: 0 32px 24px;
}
.item {
background: #fff;
border: 1px solid #d9dee7;
border-radius: 8px;
padding: 18px;
}
.item.urgent {
border-color: #dc2626;
box-shadow: inset 4px 0 0 #dc2626;
}
.item-head {
display: flex;
justify-content: space-between;
gap: 16px;
align-items: flex-start;
}
.item h2 {
margin: 0 0 8px;
font-size: 17px;
}
.meta {
display: flex;
gap: 8px;
align-items: center;
flex-wrap: wrap;
color: #64748b;
font-size: 13px;
}
.badge {
display: inline-flex;
align-items: center;
min-height: 24px;
padding: 0 8px;
border-radius: 999px;
background: #e2e8f0;
color: #334155;
}
.badge.negative, .badge.reply {
background: #fee2e2;
color: #991b1b;
}
.badge.positive {
background: #dcfce7;
color: #166534;
}
.badge.action {
background: #fef3c7;
color: #92400e;
}
.source {
color: #166534;
white-space: nowrap;
}
.source.muted {
color: #64748b;
}
.content {
white-space: pre-wrap;
}
.reason, .reply-suggestion, .types {
color: #475569;
margin: 8px 0;
}
.reply-suggestion {
font-weight: 600;
}
.work {
display: grid;
grid-template-columns: 150px minmax(140px, 220px) 1fr 80px;
gap: 8px;
margin-top: 12px;
}
aside {
padding: 0 32px 32px;
color: #475569;
}
.empty {
background: #fff;
border: 1px solid #d9dee7;
border-radius: 8px;
padding: 32px;
}
@media (max-width: 900px) {
header, .item-head {
flex-direction: column;
}
.metrics {
grid-template-columns: repeat(2, minmax(120px, 1fr));
}
.work {
grid-template-columns: 1fr;
}
.manual-form {
grid-template-columns: 1fr;
}
}
"""

20
app/models.py Normal file

@ -0,0 +1,20 @@
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class RawItem:
    source: str
    source_item_id: str
    source_url: str
    content_type: str
    author_id: str | None
    author_name: str | None
    title: str | None
    published_at: int | None
    published_at_text: str | None
    updated_at_source: int | None
    content: str
    raw: dict[str, Any]
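
Review note on `RawItem`: because the dataclass is frozen, scrapers cannot mutate an item after building it. The sketch below repeats the class definition only so that it runs standalone; the field values (ids, URL, review text) are made up for illustration. It shows that assignment fails and that `dataclasses.replace` is one way to derive a modified copy.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Any


@dataclass(frozen=True)
class RawItem:
    source: str
    source_item_id: str
    source_url: str
    content_type: str
    author_id: str | None
    author_name: str | None
    title: str | None
    published_at: int | None
    published_at_text: str | None
    updated_at_source: int | None
    content: str
    raw: dict[str, Any]


# Illustrative values only; none of them come from real Steam data.
review = RawItem(
    source="steam_reviews",
    source_item_id="review:123",
    source_url="https://steamcommunity.com/profiles/0/recommended/3774440/",
    content_type="review",
    author_id=None,
    author_name=None,
    title=None,
    published_at=None,
    published_at_text=None,
    updated_at_source=None,
    content="Great game",
    raw={"voted_up": True},
)

try:
    review.content = "edited"  # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False

edited = replace(review, content="edited")  # builds a new instance instead
```

Note that `raw` is still a mutable dict, so the immutability only covers attribute rebinding, not the contents of `raw`.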

238
app/openrouter.py Normal file

@ -0,0 +1,238 @@
from __future__ import annotations
import json
import re
from typing import Any
import httpx
from .config import Settings
from .models import RawItem
DEFAULT_ANALYSIS = {
"sentiment": "neutral",
"is_positive": False,
"is_negative": False,
"has_actionable_feedback": False,
"feedback_types": [],
"reply_recommended": False,
"reply_priority": "none",
"reply_suggestion": "",
"summary": "",
"priority": "low",
"confidence": 0.0,
"reason": "",
}
TRANSLATION_SCHEMA = {
"type": "object",
"properties": {
"translated_content": {"type": "string"},
},
"required": ["translated_content"],
"additionalProperties": False,
}
SCHEMA = {
"type": "object",
"properties": {
"sentiment": {"type": "string", "enum": ["positive", "negative", "mixed", "neutral"]},
"is_positive": {"type": "boolean"},
"is_negative": {"type": "boolean"},
"has_actionable_feedback": {"type": "boolean"},
"feedback_types": {
"type": "array",
"items": {
"type": "string",
"enum": [
"bug",
"suggestion",
"balance",
"ui",
"localization",
"performance",
"pricing",
"content",
"question",
"other",
],
},
},
"reply_recommended": {"type": "boolean"},
"reply_priority": {"type": "string", "enum": ["none", "low", "medium", "high"]},
"reply_suggestion": {"type": "string"},
"summary": {"type": "string"},
"priority": {"type": "string", "enum": ["low", "medium", "high"]},
"confidence": {"type": "number", "minimum": 0, "maximum": 1},
"reason": {"type": "string"},
},
"required": [
"sentiment",
"is_positive",
"is_negative",
"has_actionable_feedback",
"feedback_types",
"reply_recommended",
"reply_priority",
"reply_suggestion",
"summary",
"priority",
"confidence",
"reason",
],
"additionalProperties": False,
}
class OpenRouterClient:
def __init__(self, settings: Settings) -> None:
self.settings = settings
self.enabled = bool(settings.openrouter_api_key)
self.client = httpx.Client(timeout=60)
def close(self) -> None:
self.client.close()
def analyze(self, item: RawItem) -> dict[str, Any]:
if not self.enabled:
raise MissingOpenRouterKey("OPENROUTER_API_KEY is not configured")
payload = {
"model": self.settings.openrouter_model,
"messages": [
{
"role": "system",
"content": (
"你是独立游戏《帝国幻想乡~TOHOTOPIA》的社区运营助手。"
"请判断 Steam、Twitter/X 等社区内容的情绪、是否包含具体可处理反馈、"
"以及是否建议制作人回复。summary、reason、reply_suggestion 必须使用中文。"
"只输出符合 JSON Schema 的 JSON。"
),
},
{
"role": "user",
"content": self._prompt(item),
},
],
"temperature": 0.1,
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "community_item_analysis",
"strict": True,
"schema": SCHEMA,
},
},
}
headers = {
"Authorization": f"Bearer {self.settings.openrouter_api_key}",
"HTTP-Referer": self.settings.openrouter_referer,
"X-Title": self.settings.openrouter_title,
}
response = self.client.post(
"https://openrouter.ai/api/v1/chat/completions",
headers=headers,
json=payload,
)
response.raise_for_status()
data = response.json()
content = data["choices"][0]["message"]["content"]
parsed = self._parse_json(content)
return self._normalize(parsed)
def translate_to_chinese(self, content: str) -> str:
if not self.enabled:
raise MissingOpenRouterKey("OPENROUTER_API_KEY is not configured")
payload = {
"model": self.settings.openrouter_model,
"messages": [
{
"role": "system",
"content": (
"你是独立游戏社区运营翻译助手。"
"把用户提供的社区内容准确翻译成简体中文,保留原意、语气、问题细节、游戏术语、链接和编号。"
"不要添加解释。只输出符合 JSON Schema 的 JSON。"
),
},
{
"role": "user",
"content": content[:6000],
},
],
"temperature": 0,
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "manual_item_translation",
"strict": True,
"schema": TRANSLATION_SCHEMA,
},
},
}
headers = {
"Authorization": f"Bearer {self.settings.openrouter_api_key}",
"HTTP-Referer": self.settings.openrouter_referer,
"X-Title": self.settings.openrouter_title,
}
response = self.client.post(
"https://openrouter.ai/api/v1/chat/completions",
headers=headers,
json=payload,
)
response.raise_for_status()
data = response.json()
parsed = self._parse_json(data["choices"][0]["message"]["content"])
translated = str(parsed.get("translated_content") or "").strip()
return translated or content
def _prompt(self, item: RawItem) -> str:
metadata = {
"source": item.source,
"content_type": item.content_type,
"source_url": item.source_url,
"author": item.author_name,
"title": item.title,
"steam_review_voted_up": item.raw.get("voted_up"),
"language": item.raw.get("language"),
"in_reply_to": item.raw.get("parent_url") or item.raw.get("in_reply_to"),
"likes": item.raw.get("likes"),
"replies": item.raw.get("replies"),
"retweets": item.raw.get("retweets"),
"views": item.raw.get("views"),
}
return (
"请分析以下社区内容。\n\n"
f"元数据:{json.dumps(metadata, ensure_ascii=False)}\n\n"
f"正文:\n{item.content[:6000]}"
)
def _parse_json(self, content: str) -> dict[str, Any]:
try:
return json.loads(content)
except json.JSONDecodeError:
match = re.search(r"\{.*\}", content, re.S)
if not match:
raise
return json.loads(match.group(0))
def _normalize(self, value: dict[str, Any]) -> dict[str, Any]:
result = dict(DEFAULT_ANALYSIS)
result.update(value)
result["feedback_types"] = list(result.get("feedback_types") or [])
result["is_positive"] = bool(result.get("is_positive"))
result["is_negative"] = bool(result.get("is_negative"))
result["has_actionable_feedback"] = bool(result.get("has_actionable_feedback"))
result["reply_recommended"] = bool(result.get("reply_recommended"))
try:
result["confidence"] = float(result.get("confidence", 0.0))
except (TypeError, ValueError):
result["confidence"] = 0.0
return result
class MissingOpenRouterKey(RuntimeError):
pass
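
Review note on `_parse_json`: the fallback matters when a model wraps its JSON in extra text despite the schema request. The sketch below is a standalone copy of the same idea under a hypothetical name (`parse_model_json`), with a made-up model reply. It tries strict parsing first and then falls back to the outermost `{...}` span.

```python
from __future__ import annotations

import json
import re
from typing import Any


def parse_model_json(content: str) -> dict[str, Any]:
    """Parse strict JSON, or fall back to the outermost {...} span in the text."""
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        # Greedy match with re.S: from the first "{" to the last "}", across newlines.
        match = re.search(r"\{.*\}", content, re.S)
        if not match:
            raise  # nothing that looks like an object; surface the original error
        return json.loads(match.group(0))


# Made-up reply in which the model added prose around the object.
wrapped = 'Here is the result: {"sentiment": "negative"} Hope it helps.'
parsed = parse_model_json(wrapped)
```

Because the match is greedy, a reply containing two separate objects would still fail to parse; in that case the second `json.loads` raises and the item stays available for a retry.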

321
app/steam.py Normal file

@ -0,0 +1,321 @@
from __future__ import annotations
from hashlib import sha1
import re
import time
from typing import Any, Iterable
from urllib.parse import parse_qs, quote, urljoin, urlparse
from bs4 import BeautifulSoup
import httpx
from .models import RawItem
STEAM_STORE = "https://store.steampowered.com"
STEAM_COMMUNITY = "https://steamcommunity.com"
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/125.0 Safari/537.36",
"Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,ja;q=0.7",
}
def content_hash(text: str) -> str:
return sha1(text.encode("utf-8", errors="ignore")).hexdigest()
def _text(node: Any) -> str:
return node.get_text(separator="\n", strip=True) if node else ""
def _abs_url(url: str) -> str:
return urljoin(STEAM_COMMUNITY, url)
def _topic_id_from_url(url: str) -> str:
match = re.search(r"/discussions/[^/]+/(\d+)", url)
if match:
return match.group(1)
return content_hash(url)
def _reply_id(comment: Any, topic_id: str, author: str, timestamp: str, text: str) -> str:
node_id = comment.get("id", "")
if node_id:
return node_id
data_id = comment.get("data-commentid", "")
if data_id:
return data_id
return f"{topic_id}:{content_hash(author + timestamp + text)}"
def parse_steam_time(text: str | None, now: int | None = None) -> int | None:
    """Parse a Steam discussion timestamp (relative or Chinese absolute) into a Unix timestamp."""
    if not text:
        return None
    value = text.strip()
    # Compare against None so that an explicit now=0 is respected.
    now_ts = now if now is not None else int(time.time())
    # Relative times such as "3 小时以前" or "2 hours ago".
    relative = re.match(r"^(\d+)\s*(分钟|小时|天|minute|minutes|hour|hours|day|days)\s*(以前|ago)?$", value, re.I)
    if relative:
        amount = int(relative.group(1))
        unit = relative.group(2).lower()
        seconds = {
            "分钟": 60,
            "minute": 60,
            "minutes": 60,
            "小时": 3600,
            "hour": 3600,
            "hours": 3600,
            "天": 86400,
            "day": 86400,
            "days": 86400,
        }[unit]
        return now_ts - amount * seconds
    # Absolute time without a year, e.g. "3 月 7 日 下午 4:52"; the current local year is used.
    absolute = re.match(
        r"^(\d{1,2})\s*月\s*(\d{1,2})\s*日\s*(上午|下午)\s*(\d{1,2}):(\d{2})$",
        value,
    )
    if absolute:
        current = time.localtime(now_ts)
        return _make_ts(
            current.tm_year,
            int(absolute.group(1)),
            int(absolute.group(2)),
            absolute.group(3),
            int(absolute.group(4)),
            int(absolute.group(5)),
        )
    # Absolute time with a year, e.g. "2025 年 8 月 9 日 下午 3:29".
    absolute_with_year = re.match(
        r"^(\d{4})\s*年\s*(\d{1,2})\s*月\s*(\d{1,2})\s*日\s*(上午|下午)\s*(\d{1,2}):(\d{2})$",
        value,
    )
    if absolute_with_year:
        return _make_ts(
            int(absolute_with_year.group(1)),
            int(absolute_with_year.group(2)),
            int(absolute_with_year.group(3)),
            absolute_with_year.group(4),
            int(absolute_with_year.group(5)),
            int(absolute_with_year.group(6)),
        )
    return None


def _make_ts(year: int, month: int, day: int, ampm: str, hour: int, minute: int) -> int:
    # Convert the 12-hour clock (上午 = AM, 下午 = PM) to 24-hour before building the timestamp.
    if ampm == "下午" and hour != 12:
        hour += 12
    if ampm == "上午" and hour == 12:
        hour = 0
    # time.mktime interprets the tuple as local time; -1 lets it determine DST.
    return int(time.mktime((year, month, day, hour, minute, 0, -1, -1, -1)))
class SteamClient:
def __init__(self, app_id: str) -> None:
self.app_id = app_id
self.client = httpx.Client(headers=HEADERS, timeout=30, follow_redirects=True)
self.client.cookies.set("birthtime", "568022401", domain="steamcommunity.com")
def close(self) -> None:
self.client.close()
def fetch_reviews(self, max_pages: int | None = None) -> list[RawItem]:
cursor = "*"
page = 0
items: list[RawItem] = []
while True:
params = {
"json": "1",
"num_per_page": "100",
"language": "all",
"filter": "recent",
"purchase_type": "all",
"cursor": cursor,
}
response = self.client.get(f"{STEAM_STORE}/appreviews/{self.app_id}", params=params)
response.raise_for_status()
data = response.json()
reviews = data.get("reviews") or []
if not reviews:
break
for review in reviews:
items.append(self._review_to_item(review))
new_cursor = data.get("cursor") or cursor
page += 1
if new_cursor == cursor:
break
if max_pages and page >= max_pages:
break
cursor = new_cursor
time.sleep(0.25)
return items
def fetch_discussions(self, full: bool, max_pages: int, time_limit_seconds: int) -> list[RawItem]:
started = time.monotonic()
topic_urls: list[str] = []
seen_urls: set[str] = set()
for page in range(1, max_pages + 1):
if time.monotonic() - started > time_limit_seconds:
break
url = f"{STEAM_COMMUNITY}/app/{self.app_id}/discussions/"
if page > 1:
url = f"{url}?fp={page}"
html = self._get_text(url)
urls = self._extract_topic_urls(html)
new_urls = [u for u in urls if u not in seen_urls]
if not new_urls:
break
topic_urls.extend(new_urls)
seen_urls.update(new_urls)
if not full and page >= max_pages:
break
time.sleep(0.25)
items: list[RawItem] = []
for url in topic_urls:
if time.monotonic() - started > time_limit_seconds:
break
items.extend(self.fetch_discussion_topic(url))
time.sleep(0.35)
return items
def fetch_discussion_topic(self, url: str) -> list[RawItem]:
html = self._get_text(url)
soup = BeautifulSoup(html, "html.parser")
topic_id = _topic_id_from_url(url)
title = _text(soup.select_one("div.topic")) or _text(soup.select_one(".forum_topic_name"))
items: list[RawItem] = []
op = soup.select_one(".forum_op")
if op:
author_el = op.select_one(".authorline a")
date_el = op.select_one(".date")
date_text = _text(date_el)
content_el = op.select_one(".content")
author = _text(author_el)
content = _text(content_el)
source_url = url
if content:
items.append(
RawItem(
source="steam_discussions",
source_item_id=f"topic:{topic_id}",
source_url=source_url,
content_type="discussion_topic",
author_id=self._steam_id_from_author(author_el),
author_name=author,
title=title,
published_at=parse_steam_time(date_text),
published_at_text=date_text,
updated_at_source=None,
content=content,
raw={
"topic_id": topic_id,
"topic_url": url,
"title": title,
"author": author,
"date": date_text,
"content": content,
},
)
)
for comment in soup.select(".commentthread_comment"):
author_el = comment.select_one(".commentthread_author_link")
date_el = comment.select_one(".commentthread_comment_timestamp")
text_el = comment.select_one(".commentthread_comment_text")
text = _text(text_el)
if not text:
continue
author = _text(author_el)
timestamp = _text(date_el)
reply_id = _reply_id(comment, topic_id, author, timestamp, text)
reply_url = f"{url}#{reply_id}" if reply_id else url
items.append(
RawItem(
source="steam_discussions",
source_item_id=f"reply:{topic_id}:{reply_id}",
source_url=reply_url,
content_type="discussion_reply",
author_id=self._steam_id_from_author(author_el),
author_name=author,
title=title,
published_at=parse_steam_time(timestamp),
published_at_text=timestamp,
updated_at_source=None,
content=text,
raw={
"topic_id": topic_id,
"topic_url": url,
"reply_id": reply_id,
"reply_url": reply_url,
"title": title,
"reply_author": author,
"reply_time_text": timestamp,
"reply_content": text,
},
)
)
return items
def _review_to_item(self, review: dict[str, Any]) -> RawItem:
author = review.get("author") or {}
steam_id = str(author.get("steamid") or "")
recommendation_id = str(review.get("recommendationid"))
source_url = f"{STEAM_COMMUNITY}/profiles/{steam_id}/recommended/{self.app_id}/"
raw = dict(review)
raw["source_url"] = source_url
return RawItem(
source="steam_reviews",
source_item_id=f"review:{recommendation_id}",
source_url=source_url,
content_type="review",
author_id=steam_id or None,
author_name=author.get("personaname"),
title=None,
published_at=review.get("timestamp_created"),
published_at_text=None,
updated_at_source=review.get("timestamp_updated"),
content=review.get("review") or "",
raw=raw,
)
def _get_text(self, url: str) -> str:
response = self.client.get(url)
response.raise_for_status()
response.encoding = "utf-8"
return response.text
def _extract_topic_urls(self, html: str) -> list[str]:
soup = BeautifulSoup(html, "html.parser")
urls: list[str] = []
for link in soup.select("a.forum_topic_overlay, a.forum_topic_name"):
href = link.get("href")
if not href:
continue
url = _abs_url(href).split("?")[0]
if f"/app/{self.app_id}/discussions/" in url and url not in urls:
urls.append(url)
return urls
def _steam_id_from_author(self, author_el: Any) -> str | None:
if not author_el:
return None
href = author_el.get("href") or ""
parsed = urlparse(href)
if "/profiles/" in parsed.path:
return parsed.path.rstrip("/").split("/")[-1]
if "/id/" in parsed.path:
return parsed.path.rstrip("/").split("/")[-1]
query = parse_qs(parsed.query)
steam_id = query.get("steamid")
return steam_id[0] if steam_id else None
def iter_nonempty(items: Iterable[RawItem]) -> Iterable[RawItem]:
for item in items:
if item.content.strip():
yield item
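
Review note on `parse_steam_time` and `_make_ts`: the Chinese absolute format uses a 12-hour clock with 上午/下午 markers. The sketch below isolates that conversion in a timezone-independent form. The names `to_24h`, `parse_fields` and `PATTERN` are hypothetical and exist only for this illustration; the regular expression is the year-less pattern from the source.

```python
from __future__ import annotations

import re

# Year-less absolute format, e.g. "3 月 7 日 下午 4:52".
PATTERN = re.compile(r"^(\d{1,2})\s*月\s*(\d{1,2})\s*日\s*(上午|下午)\s*(\d{1,2}):(\d{2})$")


def to_24h(ampm: str, hour: int) -> int:
    """Convert a 12-hour clock value to 24-hour (上午 = AM, 下午 = PM)."""
    if ampm == "下午" and hour != 12:
        return hour + 12
    if ampm == "上午" and hour == 12:
        return 0  # 12 AM is midnight
    return hour


def parse_fields(text: str) -> tuple[int, int, int, int] | None:
    """Return (month, day, hour_24, minute), or None if the text does not match."""
    match = PATTERN.match(text.strip())
    if not match:
        return None
    month, day, ampm, hour, minute = match.groups()
    return int(month), int(day), to_24h(ampm, int(hour)), int(minute)


fields = parse_fields("3 月 7 日 下午 4:52")
```

Returning plain fields instead of a timestamp keeps the example independent of the machine's local timezone, which `time.mktime` in the real code depends on.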

366
app/sync.py Normal file

@ -0,0 +1,366 @@
from __future__ import annotations
from collections import Counter
from hashlib import sha1
import sqlite3
import time
from typing import Any
from .config import Settings
from .db import decode_json, encode_json, init_db
from .models import RawItem
from .openrouter import OpenRouterClient
from .steam import SteamClient, iter_nonempty
from .twitter import TwitterClient, TwitterScrapeOptions
def _now() -> int:
return int(time.time())
def _hash(text: str) -> str:
return sha1(text.encode("utf-8", errors="ignore")).hexdigest()
def upsert_raw_item(conn: sqlite3.Connection, item: RawItem) -> tuple[int, bool]:
now = _now()
item_hash = _hash(item.content)
existing = conn.execute(
"SELECT id, content_hash FROM raw_items WHERE source = ? AND source_item_id = ?",
(item.source, item.source_item_id),
).fetchone()
if existing:
if existing["content_hash"] != item_hash:
conn.execute(
"""
UPDATE raw_items
SET source_url = ?, author_id = ?, author_name = ?, title = ?,
published_at = ?, published_at_text = ?, updated_at_source = ?,
content = ?, raw_json = ?, content_hash = ?, analysis_status = 'pending',
collected_at = ?
WHERE id = ?
""",
(
item.source_url,
item.author_id,
item.author_name,
item.title,
item.published_at,
item.published_at_text,
item.updated_at_source,
item.content,
encode_json(item.raw),
item_hash,
now,
existing["id"],
),
)
return int(existing["id"]), False
cursor = conn.execute(
"""
INSERT INTO raw_items (
source, source_item_id, source_url, content_type, author_id, author_name,
title, published_at, published_at_text, collected_at, updated_at_source,
content, raw_json, content_hash, analysis_status
)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 'pending')
""",
(
item.source,
item.source_item_id,
item.source_url,
item.content_type,
item.author_id,
item.author_name,
item.title,
item.published_at,
item.published_at_text,
now,
item.updated_at_source,
item.content,
encode_json(item.raw),
item_hash,
),
)
raw_item_id = int(cursor.lastrowid)
conn.execute(
"""
INSERT INTO work_items (raw_item_id, status, owner, notes, created_at, updated_at)
VALUES (?, 'new', '', '', ?, ?)
""",
(raw_item_id, now, now),
)
return raw_item_id, True
def save_analysis(
conn: sqlite3.Connection,
raw_item_id: int,
model: str,
analysis: dict[str, Any],
) -> None:
now = _now()
conn.execute(
"""
INSERT INTO analysis_results (
raw_item_id, model, sentiment, is_positive, is_negative,
has_actionable_feedback, feedback_types, reply_recommended, reply_priority,
reply_suggestion, summary, priority, confidence, reason, model_json, analyzed_at
)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(raw_item_id) DO UPDATE SET
model = excluded.model,
sentiment = excluded.sentiment,
is_positive = excluded.is_positive,
is_negative = excluded.is_negative,
has_actionable_feedback = excluded.has_actionable_feedback,
feedback_types = excluded.feedback_types,
reply_recommended = excluded.reply_recommended,
reply_priority = excluded.reply_priority,
reply_suggestion = excluded.reply_suggestion,
summary = excluded.summary,
priority = excluded.priority,
confidence = excluded.confidence,
reason = excluded.reason,
model_json = excluded.model_json,
analyzed_at = excluded.analyzed_at
""",
(
raw_item_id,
model,
analysis["sentiment"],
int(analysis["is_positive"]),
int(analysis["is_negative"]),
int(analysis["has_actionable_feedback"]),
encode_json(analysis["feedback_types"]),
int(analysis["reply_recommended"]),
analysis["reply_priority"],
analysis["reply_suggestion"],
analysis["summary"],
analysis["priority"],
analysis["confidence"],
analysis["reason"],
encode_json(analysis),
now,
),
)
conn.execute("UPDATE raw_items SET analysis_status = 'done' WHERE id = ?", (raw_item_id,))
def _twitter_high_watermark_ts(conn: sqlite3.Connection) -> int | None:
row = conn.execute(
"""
SELECT MAX(COALESCE(published_at, collected_at)) AS watermark
FROM raw_items
WHERE source IN ('twitter_posts', 'twitter_replies')
"""
).fetchone()
if row and row["watermark"]:
return int(row["watermark"])
return None
def _recent_twitter_post_urls(conn: sqlite3.Connection, limit: int) -> list[str]:
if limit <= 0:
return []
rows = conn.execute(
"""
SELECT source_url
FROM raw_items
WHERE source = 'twitter_posts'
ORDER BY COALESCE(published_at, collected_at) DESC, collected_at DESC
LIMIT ?
""",
(limit,),
).fetchall()
return [str(row["source_url"]) for row in rows if row["source_url"]]
def _twitter_options(settings: Settings) -> TwitterScrapeOptions:
return TwitterScrapeOptions(
username=settings.twitter_username,
scraper_path=settings.twitter_scraper_path,
output_dir=settings.twitter_output_dir,
browser_provider=settings.twitter_browser_provider,
full_max_no_new=settings.twitter_full_max_no_new,
incremental_max_no_new=settings.twitter_incremental_max_no_new,
thread_max_no_new=settings.twitter_thread_max_no_new,
command_timeout_seconds=settings.twitter_command_timeout_seconds,
full_reply_post_limit=settings.twitter_full_reply_post_limit,
incremental_reply_parent_limit=settings.twitter_incremental_reply_parent_limit,
)
def run_sync(
conn: sqlite3.Connection,
settings: Settings,
full: bool = False,
platforms: list[str] | None = None,
) -> dict[str, Any]:
init_db(conn)
started = _now()
mode = "full" if full else "incremental"
run_id = conn.execute(
"INSERT INTO sync_runs (started_at, mode, status) VALUES (?, ?, 'running')",
(started, mode),
).lastrowid
conn.commit()
stats: Counter[str] = Counter()
messages: list[str] = []
try:
enabled_platforms = platforms or ["steam", "twitter"]
if "twitter" in enabled_platforms and not settings.twitter_enabled:
stats["twitter_skipped"] += 1
raw_items: list[RawItem] = []
if "steam" in enabled_platforms:
steam = SteamClient(settings.app_id)
try:
review_pages = None if full else 2
review_items = steam.fetch_reviews(max_pages=review_pages)
discussion_pages = (
settings.discussion_full_scan_max_pages
if full
else settings.discussion_incremental_max_pages
)
discussion_items = steam.fetch_discussions(
full=full,
max_pages=discussion_pages,
time_limit_seconds=settings.full_scan_time_limit_seconds,
)
steam_items = list(iter_nonempty([*review_items, *discussion_items]))
raw_items.extend(steam_items)
stats["steam_fetched"] = len(steam_items)
finally:
steam.close()
if "twitter" in enabled_platforms and settings.twitter_enabled:
try:
since_ts = None if full else _twitter_high_watermark_ts(conn)
existing_urls = _recent_twitter_post_urls(
conn,
settings.twitter_incremental_reply_parent_limit,
)
twitter = TwitterClient(_twitter_options(settings))
twitter_items = twitter.fetch_items(
full=full,
since_ts=since_ts,
existing_post_urls=existing_urls,
)
raw_items.extend(twitter_items)
stats["twitter_fetched"] = len(twitter_items)
except Exception as exc: # noqa: BLE001 - keep Steam and old Twitter data intact
stats["twitter_errors"] += 1
stats[f"twitter_error:{type(exc).__name__}"] += 1
messages.append(f"twitter: {exc}")
stats["fetched"] = len(raw_items)
analyzer = OpenRouterClient(settings)
try:
for item in raw_items:
raw_item_id, inserted = upsert_raw_item(conn, item)
prefix = item.source.split("_", 1)[0]
stats["inserted" if inserted else "seen"] += 1
stats[f"{prefix}_{'inserted' if inserted else 'seen'}"] += 1
if inserted:
try:
analysis = analyzer.analyze(item)
save_analysis(conn, raw_item_id, settings.openrouter_model, analysis)
stats["analyzed"] += 1
except Exception as exc: # noqa: BLE001 - keep item pending for retry
conn.execute(
"UPDATE raw_items SET analysis_status = 'error' WHERE id = ?",
(raw_item_id,),
)
stats["analysis_errors"] += 1
stats[f"analysis_error:{type(exc).__name__}"] += 1
conn.commit()
finally:
analyzer.close()
finished = _now()
status = "partial" if messages else "success"
conn.execute(
"""
UPDATE sync_runs
SET finished_at = ?, status = ?, message = ?, stats_json = ?
WHERE id = ?
""",
(finished, status, "\n".join(messages), encode_json(dict(stats)), run_id),
)
if status == "success":
conn.execute(
"""
INSERT INTO sync_state (key, value, updated_at)
VALUES ('last_sync_mode', ?, ?)
ON CONFLICT(key) DO UPDATE SET value = excluded.value, updated_at = excluded.updated_at
""",
(mode, finished),
)
return dict(stats)
except Exception as exc:
finished = _now()
conn.execute(
"""
UPDATE sync_runs
SET finished_at = ?, status = 'failed', message = ?, stats_json = ?
WHERE id = ?
""",
(finished, str(exc), encode_json(dict(stats)), run_id),
)
raise
def analyze_pending(
conn: sqlite3.Connection,
settings: Settings,
limit: int = 50,
since_ts: int | None = None,
) -> dict[str, Any]:
init_db(conn)
analyzer = OpenRouterClient(settings)
stats: Counter[str] = Counter()
try:
params: list[Any] = []
since_clause = ""
if since_ts is not None:
since_clause = "AND COALESCE(published_at, collected_at) >= ?"
params.append(since_ts)
params.append(limit)
rows = conn.execute(
f"""
SELECT * FROM raw_items
WHERE analysis_status IN ('pending', 'error')
{since_clause}
ORDER BY COALESCE(published_at, collected_at) DESC, collected_at DESC, id DESC
LIMIT ?
""",
params,
).fetchall()
for row in rows:
item = RawItem(
source=row["source"],
source_item_id=row["source_item_id"],
source_url=row["source_url"],
content_type=row["content_type"],
author_id=row["author_id"],
author_name=row["author_name"],
title=row["title"],
published_at=row["published_at"],
published_at_text=row["published_at_text"],
updated_at_source=row["updated_at_source"],
content=row["content"],
raw=decode_json(row["raw_json"], {}),
)
try:
analysis = analyzer.analyze(item)
save_analysis(conn, int(row["id"]), settings.openrouter_model, analysis)
stats["analyzed"] += 1
conn.commit()
except Exception as exc: # noqa: BLE001
stats["analysis_errors"] += 1
stats[f"analysis_error:{type(exc).__name__}"] += 1
return dict(stats)
finally:
analyzer.close()
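
Review note on `upsert_raw_item`: deduplication rests on the `(source, source_item_id)` pair plus a content hash. The sketch below reproduces that idea against an in-memory database. The table `items` and its reduced column set are assumptions made for the example and do not match the project's real `raw_items` schema, which is defined in `app/db.py`.

```python
from __future__ import annotations

import sqlite3
from hashlib import sha1


def content_hash(text: str) -> str:
    return sha1(text.encode("utf-8", errors="ignore")).hexdigest()


def upsert(conn: sqlite3.Connection, source: str, item_id: str, content: str) -> str:
    """Return 'inserted', 'updated' or 'unchanged' for the given item."""
    digest = content_hash(content)
    row = conn.execute(
        "SELECT id, content_hash FROM items WHERE source = ? AND source_item_id = ?",
        (source, item_id),
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO items (source, source_item_id, content, content_hash, analysis_status) "
            "VALUES (?, ?, ?, ?, 'pending')",
            (source, item_id, content, digest),
        )
        return "inserted"
    if row["content_hash"] != digest:
        # Edited content: store it and queue the item for analysis again.
        conn.execute(
            "UPDATE items SET content = ?, content_hash = ?, analysis_status = 'pending' WHERE id = ?",
            (content, digest, row["id"]),
        )
        return "updated"
    return "unchanged"


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows row["content_hash"] access, as in the source
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, source TEXT NOT NULL, "
    "source_item_id TEXT NOT NULL, content TEXT NOT NULL, content_hash TEXT NOT NULL, "
    "analysis_status TEXT NOT NULL, UNIQUE (source, source_item_id))"
)
results = [
    upsert(conn, "steam_reviews", "review:1", "v1"),
    upsert(conn, "steam_reviews", "review:1", "v1"),
    upsert(conn, "steam_reviews", "review:1", "v2"),
]
stored = conn.execute("SELECT content, analysis_status FROM items").fetchone()
```

In `run_sync` as written, only newly inserted items are analyzed immediately; an edited item is reset to `pending` and waits for a later `analyze_pending` run.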

246
app/twitter.py Normal file

@ -0,0 +1,246 @@
from __future__ import annotations
from dataclasses import dataclass
import calendar
import json
from pathlib import Path
import re
import subprocess
import sys
import time
from typing import Any, Iterable
from .models import RawItem
TWITTER_EPOCH_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"
NORMALIZED_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
@dataclass(frozen=True)
class TwitterScrapeOptions:
username: str
scraper_path: Path
output_dir: Path
browser_provider: str
full_max_no_new: int
incremental_max_no_new: int
thread_max_no_new: int
command_timeout_seconds: int
full_reply_post_limit: int
incremental_reply_parent_limit: int
def parse_twitter_time(value: str | None) -> int | None:
if not value:
return None
text = value.strip()
for fmt in (NORMALIZED_DATE_FORMAT, TWITTER_EPOCH_FORMAT):
try:
parsed = time.strptime(text, fmt)
return calendar.timegm(parsed)
except ValueError:
continue
return None
def _author_from_url(url: str | None) -> str | None:
if not url:
return None
match = re.search(r"(?:x\.com|twitter\.com)/([^/?#]+)/status/\d+", url)
if not match:
return None
value = match.group(1)
return value if value and value.lower() != "i" else None
def _tweet_id_from_item(item: dict[str, Any]) -> str | None:
value = item.get("id")
if value:
return str(value)
url = str(item.get("url") or "")
match = re.search(r"/status/(\d+)", url)
return match.group(1) if match else None
def _tweet_url(username: str, tweet_id: str) -> str:
return f"https://x.com/{username}/status/{tweet_id}"
def _is_original_post(item: dict[str, Any]) -> bool:
return not bool(item.get("is_retweet"))
class TwitterClient:
def __init__(self, options: TwitterScrapeOptions) -> None:
self.options = options
def fetch_items(
self,
*,
full: bool,
since_ts: int | None,
existing_post_urls: Iterable[str] = (),
) -> list[RawItem]:
run_dir = self._new_run_dir()
timeline = self._fetch_timeline(run_dir, full=full)
timeline_items = [
self._post_to_item(item)
for item in timeline
if self._include_by_time(item, since_ts)
]
reply_parent_urls = self._reply_parent_urls(
timeline=timeline,
full=full,
existing_post_urls=existing_post_urls,
)
reply_items: list[RawItem] = []
for parent_url in reply_parent_urls:
thread = self._fetch_thread(run_dir, parent_url)
# main_tweet can be None (see the fallback in _fetch_thread), so guard before calling .get().
parent_id = str((thread.get("main_tweet") or {}).get("id") or self._id_from_url(parent_url) or "")
for reply in thread.get("replies") or []:
if self._include_by_time(reply, since_ts):
reply_items.append(self._reply_to_item(reply, parent_id=parent_id, parent_url=parent_url))
return [item for item in [*timeline_items, *reply_items] if item.content.strip()]
def _new_run_dir(self) -> Path:
path = self.options.output_dir / time.strftime("%Y%m%d_%H%M%S")
path.mkdir(parents=True, exist_ok=True)
return path
def _fetch_timeline(self, run_dir: Path, *, full: bool) -> list[dict[str, Any]]:
max_no_new = self.options.full_max_no_new if full else self.options.incremental_max_no_new
self._run_scraper(self.options.username, run_dir, max_no_new=max_no_new)
path = run_dir / f"{self.options.username}_posts.json"
return self._read_json(path, expected="timeline posts")
def _fetch_thread(self, run_dir: Path, parent_url: str) -> dict[str, Any]:
tweet_id = self._id_from_url(parent_url)
if not tweet_id:
return {"main_tweet": None, "replies": [], "total_replies": 0}
self._run_scraper(parent_url, run_dir, max_no_new=self.options.thread_max_no_new)
path = run_dir / f"thread_{tweet_id}.json"
return self._read_json(path, expected=f"thread {tweet_id}")

    def _run_scraper(self, target: str, run_dir: Path, *, max_no_new: int) -> None:
        command = [
            sys.executable,
            str(self.options.scraper_path),
            target,
            "--max-no-new",
            str(max_no_new),
            "--output-dir",
            str(run_dir),
            "--browser-provider",
            self.options.browser_provider,
        ]
        result = subprocess.run(
            command,
            cwd=Path.cwd(),
            capture_output=True,
            text=True,
            encoding="utf-8",
            errors="replace",
            timeout=self.options.command_timeout_seconds,
        )
        output = "\n".join(part for part in [result.stdout, result.stderr] if part).strip()
        if result.returncode != 0:
            raise RuntimeError(f"Twitter scraper failed for {target}: {output[-1200:]}")
        if "登录提示" in output or "未登录" in output or "login" in output.lower():
            raise RuntimeError(
                "Twitter scraper requires an authenticated X.com browser profile. "
                "Run the configured social-media-scraper once with --keep-browser-open, "
                "log in to X.com, then retry."
            )

    def _read_json(self, path: Path, *, expected: str) -> Any:
        if not path.exists():
            raise RuntimeError(f"Twitter scraper did not produce {expected}: {path}")
        return json.loads(path.read_text(encoding="utf-8"))

    def _reply_parent_urls(
        self,
        *,
        timeline: list[dict[str, Any]],
        full: bool,
        existing_post_urls: Iterable[str],
    ) -> list[str]:
        urls: list[str] = []
        for item in timeline:
            tweet_id = _tweet_id_from_item(item)
            url = item.get("url") or (_tweet_url(self.options.username, tweet_id) if tweet_id else "")
            if url and _is_original_post(item):
                urls.append(str(url))
        if not full:
            urls.extend(str(url) for url in existing_post_urls if url)
        seen: set[str] = set()
        unique_urls: list[str] = []
        for url in urls:
            if url not in seen:
                seen.add(url)
                unique_urls.append(url)
        limit = self.options.full_reply_post_limit if full else self.options.incremental_reply_parent_limit
        if limit > 0:
            return unique_urls[:limit]
        return unique_urls

    def _post_to_item(self, item: dict[str, Any]) -> RawItem:
        tweet_id = _tweet_id_from_item(item) or ""
        url = item.get("url") or _tweet_url(self.options.username, tweet_id)
        author = _author_from_url(str(url)) or self.options.username
        raw = dict(item)
        raw["source_url"] = url
        return RawItem(
            source="twitter_posts",
            source_item_id=f"post:{tweet_id}",
            source_url=str(url),
            content_type="twitter_post",
            author_id=author,
            author_name=author,
            title=None,
            published_at=parse_twitter_time(item.get("date")),
            published_at_text=item.get("date"),
            updated_at_source=None,
            content=str(item.get("text") or ""),
            raw=raw,
        )

    def _reply_to_item(self, item: dict[str, Any], *, parent_id: str, parent_url: str) -> RawItem:
        tweet_id = _tweet_id_from_item(item) or ""
        url = item.get("url") or _tweet_url(_author_from_url(parent_url) or self.options.username, tweet_id)
        author = _author_from_url(str(url)) or str(item.get("in_reply_to") or "")
        raw = dict(item)
        raw["parent_tweet_id"] = parent_id
        raw["parent_url"] = parent_url
        raw["source_url"] = url
        return RawItem(
            source="twitter_replies",
            source_item_id=f"reply:{tweet_id}",
            source_url=str(url),
            content_type="twitter_reply",
            author_id=author or None,
            author_name=author or None,
            title=f"Reply to {parent_id}" if parent_id else None,
            published_at=parse_twitter_time(item.get("date")),
            published_at_text=item.get("date"),
            updated_at_source=None,
            content=str(item.get("text") or ""),
            raw=raw,
        )

    def _include_by_time(self, item: dict[str, Any], since_ts: int | None) -> bool:
        if since_ts is None:
            return True
        published_at = parse_twitter_time(item.get("date"))
        if published_at is None:
            return True
        return published_at >= since_ts

    def _id_from_url(self, url: str) -> str | None:
        match = re.search(r"/status/(\d+)", url)
        return match.group(1) if match else None

requirements.txt Normal file

@ -0,0 +1,8 @@
beautifulsoup4==4.12.3
fastapi==0.115.6
httpx==0.28.1
python-multipart==0.0.20
python-dotenv==1.0.1
playwright==1.56.0
requests==2.31.0
uvicorn==0.34.0


@ -0,0 +1,307 @@
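# Steam Community Monitoring: Phase 1 Plan
## Goal
Phase 1 integrates two Steam sources:
1. Steam reviews
2. Steam discussions: `https://steamcommunity.com/app/3774440/discussions`
The system refreshes every 30 minutes. The first run fetches all Steam reviews, discussion topics, and discussion replies; later runs are incremental. Every new item is sent to OpenRouter's `deepseek/deepseek-v4-pro` for classification and a reply-necessity assessment, and the dashboard displays, filters, and highlights the results and tracks manual handling status.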
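## Confirmed Facts
| Claim | Type | Evidence | Impact on decisions |
|---|---|---|---|
| The Steam review API currently returns data for AppID `3774440` | Current fact | A local request to `https://store.steampowered.com/appreviews/3774440?...` succeeded and returned `total_reviews=130`, `review_score_desc=Very Positive` | Phase 1 can integrate the review API directly |
| The Steam discussions page is currently reachable | Current fact | A local request to `https://steamcommunity.com/app/3774440/discussions/` returned HTTP 200, and the page contains forum/topic content | Phase 1 can scrape discussions with HTTP + HTML parsing |
| `deepseek/deepseek-v4-pro` is currently in the OpenRouter model list | Current fact | A local request to the OpenRouter models API returned the model, with support for `response_format` and `structured_outputs` | Phase 1 can be designed around structured JSON classification |
| Steam review counts may differ depending on how they are measured | Empirical fact | User-level experience notes: Steam `appreviews` is affected by caching, language, purchase type, and indexing delay | Statistics must not rely on a single request |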
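## Phase 1 Scope
### In scope
- Refresh Steam reviews and Steam discussions every 30 minutes.
- Fetch everything on the first run; later runs fetch only new or updated content.
- Deduplicate and store Steam reviews, discussion topics, and discussion replies separately.
- Call the OpenRouter model to produce structured classification results.
- The dashboard shows the review/topic/reply list, classification results, original links, reply suggestions, and manual handling status.
- Support running locally, with the architecture leaving room for server deployment.
### Out of scope for now
- No communities other than Steam.
- No complex account or permission system; an authentication scheme will be added before server deployment.
- No automatic replies to players; only discovery, classification, and handling tracking.
- No language filtering; all languages go through collection and model evaluation.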
## Collection Flow
### Steam reviews
Use the Steam Store Reviews API:
```text
GET https://store.steampowered.com/appreviews/3774440
```
Base parameters:
- `json=1`
- `num_per_page=100`
- `language=all`
- `filter=recent`
- `purchase_type=all`
- Start with `cursor=*`, then page using the cursor returned in each response
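The cursor loop implied by these parameters can be sketched as follows. This is a minimal sketch, not the project's `app/steam.py`: `iter_reviews`, `fetch_page`, and `max_pages` are hypothetical names, and it assumes each response carries a `reviews` list and a `cursor` string. The fetch function is injected so the loop can be exercised without network access.

```python
from typing import Any, Callable, Iterator

REVIEWS_URL = "https://store.steampowered.com/appreviews/3774440"

# Base parameters from the list above; the cursor is added per request.
BASE_PARAMS = {
    "json": 1,
    "num_per_page": 100,
    "language": "all",
    "filter": "recent",
    "purchase_type": "all",
}


def iter_reviews(
    fetch_page: Callable[[dict[str, Any]], dict[str, Any]],
    max_pages: int = 500,  # assumed safety cap, not a value from the plan's API notes
) -> Iterator[dict[str, Any]]:
    """Yield reviews page by page, following the cursor from each response."""
    cursor = "*"
    seen_cursors: set[str] = set()
    for _ in range(max_pages):
        if cursor in seen_cursors:
            break  # a repeated cursor means there is nothing further to read
        seen_cursors.add(cursor)
        payload = fetch_page({**BASE_PARAMS, "cursor": cursor})
        reviews = payload.get("reviews") or []
        if not reviews:
            break
        yield from reviews
        cursor = payload.get("cursor") or ""
        if not cursor:
            break
```

In real use, `fetch_page` could be something like `lambda params: requests.get(REVIEWS_URL, params=params, timeout=30).json()`; stopping on an empty page or a repeated cursor is a defensive choice rather than documented API behavior.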
Review deduplication key:
- `steam_review:{recommendationid}`
Recommended review fields to keep:
- `recommendationid`
- `voted_up`
- `review`
- `language`
- `timestamp_created`
- `timestamp_updated`
- `author.steamid`
- `author.personaname`
- `author.profile_url`
- `author.playtime_forever`
- `votes_up`
- `comment_count`
- `steam_purchase`
- `received_for_free`
- `source_url`
The review link can be built from the author's `steamid`:
```text
https://steamcommunity.com/profiles/{steamid}/recommended/3774440/#developer_response
```
If the user's profile URL is available, also keep the original `profile_url` as a secondary field for tracing.
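As a small illustration of the key and link rules above, the sketch below derives both from one review record. `review_key` and `review_url` are hypothetical helper names; the input shape (a `recommendationid` plus a nested `author.steamid`) follows the field list in this section.

```python
from typing import Any, Optional

APP_ID = 3774440


def review_key(review: dict[str, Any]) -> str:
    """Build the deduplication key steam_review:{recommendationid}."""
    return f"steam_review:{review['recommendationid']}"


def review_url(review: dict[str, Any]) -> Optional[str]:
    """Build the review link from the author's steamid, or None if it is missing."""
    steamid = (review.get("author") or {}).get("steamid")
    if not steamid:
        return None
    return (
        f"https://steamcommunity.com/profiles/{steamid}"
        f"/recommended/{APP_ID}/#developer_response"
    )
```

Returning `None` instead of a guessed link keeps an unusable URL out of `source_url`; the caller can then fall back to `profile_url`.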
### Steam discussions
Request the discussion list page over HTTP:
```text
https://steamcommunity.com/app/3774440/discussions/
```
Pagination parameters:
```text
?fp=2
?fp=3
```
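The first run fetches every accessible discussion page and every accessible reply. Later incremental refreshes start from the newest list page and page backward until they reach a topic that already exists locally and has not been updated. If the Steam page does not expose update times reliably, use the most recent few pages as the incremental window and keep a manual full-rescan entry point.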
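The incremental stop rule can be sketched as below. This is only an illustration under stated assumptions: the list is ordered by most recent activity, a changed `reply_count` stands in for "updated", and `collect_changed_topics`, `fetch_list_page`, and `known_reply_counts` are hypothetical names. The `max_pages` argument doubles as the fallback window of recent pages.

```python
from typing import Any, Callable


def collect_changed_topics(
    fetch_list_page: Callable[[int], list[dict[str, Any]]],
    known_reply_counts: dict[str, int],
    max_pages: int = 5,  # assumed size of the fallback window
) -> list[dict[str, Any]]:
    """Walk list pages from newest to oldest and stop at the first unchanged topic."""
    changed: list[dict[str, Any]] = []
    for page in range(1, max_pages + 1):
        topics = fetch_list_page(page)
        if not topics:
            break  # ran past the last list page
        for topic in topics:
            known = known_reply_counts.get(topic["topic_id"])
            if known is not None and known == topic["reply_count"]:
                # Already stored and not updated: everything older is assumed unchanged.
                return changed
            changed.append(topic)
    return changed
```

Pinned topics would break the ordering assumption, so a real collector would probably skip them before applying the stop rule.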
Discussion deduplication keys:
- Topic: `steam_discussion_topic:{topic_id}`
- Reply: `steam_discussion_reply:{topic_id}:{reply_id}`; if the page does not expose a stable reply id, use `topic_id + author + timestamp + content_hash`
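A minimal sketch of the reply key with its fallback, assuming SHA-256 over whitespace-normalized content as the `content_hash` (the plan does not name a hash function) and a hypothetical helper name `reply_key`:

```python
import hashlib
from typing import Optional


def reply_key(
    topic_id: str,
    reply_id: Optional[str],
    author: str,
    timestamp: str,
    content: str,
) -> str:
    """Prefer the stable reply id; otherwise fall back to a composite key."""
    if reply_id:
        return f"steam_discussion_reply:{topic_id}:{reply_id}"
    # Collapse whitespace so trivial formatting differences do not change the hash.
    normalized = " ".join(content.split())
    content_hash = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    return f"steam_discussion_reply:{topic_id}:{author}:{timestamp}:{content_hash}"
```

The fallback is only stable if `timestamp` is an absolute value; a relative text such as "x hours ago" changes between runs and would produce a new key for the same reply.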
Recommended discussion fields to keep:
- `topic_id`
- `topic_url`
- `title`
- `author`
- `published_at_text`
- `content`
- `reply_count`
- `reply_author`
- `reply_time_text`
- `reply_content`
- `reply_url`
- `source_url`
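## Data Model
Start with SQLite to get the local version working; the data can move to PostgreSQL when deploying to a server.
The core tables can start as three kinds:
### `raw_items`
Stores the original community content and its source information.
Key fields: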
- `id`
- `source`
- `source_item_id`
- `source_url`
- `content_type`
- `author_id`
- `author_name`
- `published_at`
- `collected_at`
- `content`
- `raw_json`
- `content_hash`
### `analysis_results`
Stores model classification results.
Key fields:
- `raw_item_id`
- `model`
- `sentiment`
- `is_positive`
- `is_negative`
- `has_actionable_feedback`
- `feedback_types`
- `reply_recommended`
- `reply_priority`
- `reply_suggestion`
- `summary`
- `priority`
- `confidence`
- `model_json`
- `analyzed_at`
### `work_items`
Stores manual handling status.
Key fields:
- `raw_item_id`
- `status`
- `owner`
- `notes`
- `last_handled_at`
- `created_at`
- `updated_at`
Suggested status values:
- `new`
- `read`
- `needs_reply`
- `replied`
- `needs_fix`
- `archived`
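The three tables could be sketched in SQLite as below. Column names come from the field lists above; the column types, the `UNIQUE (source, source_item_id)` constraint, and the `CHECK` on `status` are assumptions for illustration, and the actual schema in `app/db.py` may differ.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS raw_items (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source TEXT NOT NULL,
    source_item_id TEXT NOT NULL,
    source_url TEXT NOT NULL,
    content_type TEXT NOT NULL,
    author_id TEXT,
    author_name TEXT,
    published_at INTEGER,
    collected_at INTEGER NOT NULL,
    content TEXT NOT NULL,
    raw_json TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    UNIQUE (source, source_item_id)
);
CREATE TABLE IF NOT EXISTS analysis_results (
    raw_item_id INTEGER PRIMARY KEY REFERENCES raw_items(id),
    model TEXT NOT NULL,
    sentiment TEXT,
    is_positive INTEGER,
    is_negative INTEGER,
    has_actionable_feedback INTEGER,
    feedback_types TEXT,
    reply_recommended INTEGER,
    reply_priority TEXT,
    reply_suggestion TEXT,
    summary TEXT,
    priority TEXT,
    confidence REAL,
    model_json TEXT,
    analyzed_at INTEGER
);
CREATE TABLE IF NOT EXISTS work_items (
    raw_item_id INTEGER PRIMARY KEY REFERENCES raw_items(id),
    status TEXT NOT NULL DEFAULT 'new'
        CHECK (status IN ('new', 'read', 'needs_reply', 'replied', 'needs_fix', 'archived')),
    owner TEXT,
    notes TEXT,
    last_handled_at INTEGER,
    created_at INTEGER NOT NULL,
    updated_at INTEGER NOT NULL
);
"""


def init_db(conn: sqlite3.Connection) -> None:
    """Create the three core tables if they do not exist yet."""
    conn.executescript(SCHEMA)
```

With the unique constraint in place, `INSERT OR IGNORE INTO raw_items ...` is one simple way to make repeated collection idempotent.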
## OpenRouter Classification Plan
Model:
```text
deepseek/deepseek-v4-pro
```
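OpenRouter key:
- Both local and server runs read it from `.env` / environment variables; it is never stored in plain text in project files.
- The user-level `auth.json` is only a source for migrating the key during local development, not a runtime dependency of the project.
- Recommended variable name: `OPENROUTER_API_KEY`
Target output JSON: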
```json
{
  "sentiment": "positive | negative | mixed | neutral",
  "is_positive": true,
  "is_negative": false,
  "has_actionable_feedback": true,
  "feedback_types": ["bug", "suggestion", "balance", "ui", "localization", "performance", "pricing", "content", "question", "other"],
  "reply_recommended": true,
  "reply_priority": "none | low | medium | high",
  "reply_suggestion": "How operations or developers should reply; an empty string when no reply is needed",
  "summary": "One-sentence summary",
  "priority": "low | medium | high",
  "confidence": 0.0,
  "reason": "Short rationale for the classification"
}
```
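Since the model output feeds the dashboard directly, it helps to validate it before storing. The sketch below checks a response against the target shape; `parse_classification` is a hypothetical helper, and the `0..1` range for `confidence` is an assumption (the plan only shows `0.0`).

```python
import json
from typing import Any

SENTIMENTS = {"positive", "negative", "mixed", "neutral"}
REPLY_PRIORITIES = {"none", "low", "medium", "high"}
PRIORITIES = {"low", "medium", "high"}
FEEDBACK_TYPES = {
    "bug", "suggestion", "balance", "ui", "localization",
    "performance", "pricing", "content", "question", "other",
}
BOOL_FIELDS = ("is_positive", "is_negative", "has_actionable_feedback", "reply_recommended")


def _is_one_of(value: Any, allowed: set[str]) -> bool:
    return isinstance(value, str) and value in allowed


def parse_classification(text: str) -> dict[str, Any]:
    """Parse model output and raise ValueError if it does not match the target shape."""
    data = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("classification must be a JSON object")
    errors: list[str] = []
    if not _is_one_of(data.get("sentiment"), SENTIMENTS):
        errors.append("sentiment")
    if not _is_one_of(data.get("reply_priority"), REPLY_PRIORITIES):
        errors.append("reply_priority")
    if not _is_one_of(data.get("priority"), PRIORITIES):
        errors.append("priority")
    for key in BOOL_FIELDS:
        if not isinstance(data.get(key), bool):
            errors.append(key)
    types = data.get("feedback_types")
    if not isinstance(types, list) or not all(_is_one_of(t, FEEDBACK_TYPES) for t in types):
        errors.append("feedback_types")
    confidence = data.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        errors.append("confidence")
    if errors:
        raise ValueError("invalid fields: " + ", ".join(errors))
    return data
```

A caller can catch `ValueError` for both malformed JSON and schema mismatches, store the raw model output, and route the item to review.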
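Classification rules:
- `is_positive` / `is_negative` map to the positive/negative review display the user asked for.
- `has_actionable_feedback=true` means the item contains actionable information: concrete suggestions, problem reports, bugs, balance, UI, translation, localization, performance, pricing, amount of content, and so on.
- `reply_recommended=true` means a manual reply or follow-up is recommended; high-priority items must be highlighted in the dashboard.
- Both discussion topics and replies must go through model evaluation; evaluating only the opening post is not enough.
- The review's own `voted_up` is a strong signal but must not override the text-based judgment; a recommended review can still contain specific complaints.
- Every result must keep `source_url` so the dashboard can link directly to the original review or discussion post.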
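## Dashboard Phase 1 Pages
The first version does not aim for complexity; the focus is operational handling efficiency.
Suggested views:
- Overview metrics: new items, unhandled items, negative reviews, items with actionable feedback, high-priority items, analyzed items, items pending re-analysis, last updated time.
- Content list: source, content type, time, author, summary, sentiment, feedback types, priority, reply recommended, handling status, original link.
- Filters: source, content type, sentiment, has actionable feedback, reply recommended, feedback type, handling status, time range.
- Highlight: posts/replies with `reply_recommended=true` or `priority=high`.
- Detail: original text, model classification, reply suggestion, original link, notes, owner, status changes.
- Sorting: reply-recommended items first; within each group, newest published first.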
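The sorting rule can be expressed as a two-part sort key. This is a sketch over plain dictionaries; `dashboard_sort_key` and `sort_for_dashboard` are hypothetical names, and treating a missing `published_at` as `0` (oldest) is an assumption.

```python
from typing import Any


def dashboard_sort_key(row: dict[str, Any]) -> tuple[int, int]:
    """Reply-recommended rows first; within each group, newest published first."""
    recommended_rank = 0 if row.get("reply_recommended") else 1
    published_at = row.get("published_at") or 0  # missing time sorts as oldest
    return (recommended_rank, -published_at)


def sort_for_dashboard(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    return sorted(rows, key=dashboard_sort_key)
```

The same ordering could equally be pushed into SQL with `ORDER BY reply_recommended DESC, published_at DESC`.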
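## Scheduling and Failure Handling
Scheduling:
- Run the collection job every 30 minutes by default.
- The first run is a full fetch; once it completes, record the sync cursor, seen topics, seen replies, and the review cursor/time watermark.
- The first full run should support resuming: write progress after each discussion list page, each topic detail page, and each review page, and resume from the latest progress after a failure.
- The first full run should not use a small page cap, which would defeat the "fetch everything" goal; use a safety guard instead, for example at most 2 hours of continuous running or 500 pages per run, and allow the next run to continue.
- Locally, validate with an in-app scheduler or manual command-line triggers first; choose systemd timer, cron, or a queue worker at server deployment time.
Failure handling:
- Steam request failure: log the error and retry on the next run; do not delete old data.
- OpenRouter request failure: keep the raw item, mark it `analysis_pending`, and re-run on the next cycle or manually.
- JSON parse failure: save the model's raw output and move the item to a pending-review state.
- Duplicate collection: deduplicate by source item id and content hash.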
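## Deployment Prerequisites
Local MVP:
- Local database
- Local dashboard
- Read the OpenRouter API key from `.env`
- Manual or scheduled refresh
Before server deployment, add:
- Access authentication
- Persistent database location and backup strategy
- How background jobs run
- Logging and error alerting
- OpenRouter call budget and rate control
- Steam fetch frequency and User-Agent policy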
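## Settled Implementation Decisions
- Key configuration: use `.env` / environment variables, variable name `OPENROUTER_API_KEY`.
- First fetch: full fetch with resume support; use a run-time limit or a high page threshold as the safety guard, not a small page cap in place of the full-fetch goal.
- Owner field: a free-text producer/handler field designed for a small team; no user account system for now.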
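## Current Implementation Status
- Python/FastAPI + SQLite MVP implemented.
- Steam review API fetching implemented.
- Steam discussion topic and reply fetching implemented.
- OpenRouter `deepseek/deepseek-v4-pro` structured classification implemented.
- Dashboard, manual sync, 30-minute background incremental sync, and handling status updates implemented.
- LAN service listening on `0.0.0.0:8000` implemented.
- Chinese time parsing for Steam discussions implemented, supporting `x 小时以前` ("x hours ago"), `3 月 7 日 下午 4:52`, and `2025 年 8 月 9 日 下午 3:29`.
- AI analysis re-run completed for the 209 items published after 2026-05-01.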
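For reference, parsing the three Chinese time formats named in the list above could look like the sketch below. It is not the project's parser: `parse_steam_time` is a hypothetical name, the `分钟以前` ("minutes ago") branch is an added assumption, a missing year is assumed to mean the current year, and the result is a naive `datetime` with no timezone handling.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

_RELATIVE = re.compile(r"(\d+)\s*(分钟|小时)以前")
_ABSOLUTE = re.compile(
    r"(?:(\d{4})\s*年\s*)?(\d{1,2})\s*月\s*(\d{1,2})\s*日\s*(上午|下午)\s*(\d{1,2}):(\d{2})"
)


def parse_steam_time(text: str, now: datetime) -> Optional[datetime]:
    """Parse relative and absolute Chinese time text; return None if unrecognized."""
    match = _RELATIVE.search(text)
    if match:
        amount = int(match.group(1))
        if match.group(2) == "小时":
            return now - timedelta(hours=amount)
        return now - timedelta(minutes=amount)
    match = _ABSOLUTE.search(text)
    if not match:
        return None
    year, month, day, meridiem, hour, minute = match.groups()
    # 12-hour clock: 上午 12 is 0, 下午 12 is 12, 下午 4 is 16.
    hour_24 = int(hour) % 12 + (12 if meridiem == "下午" else 0)
    return datetime(
        int(year) if year else now.year,
        int(month),
        int(day),
        hour_24,
        int(minute),
    )
```

Passing `now` explicitly keeps relative times reproducible, which also makes backfilling stored records deterministic.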
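## Constraints for Adding Later Platforms
- Do not copy Steam-specific logic for a new platform; add a new platform collector that outputs the unified `RawItem`.
- New platforms keep reusing `raw_items`, `analysis_results`, and `work_items`.
- Each platform must define a stable deduplication key, the original link, publish-time parsing, and its first-run full and later incremental strategies.
- For platforms that need a logged-in session or browser automation, write a separate plan and verify current facts before wiring them into the sync pipeline.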


@ -0,0 +1,67 @@
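# Guide to Adding Later Community Platforms
## Current Architecture
The current MVP is Python/FastAPI + SQLite:
- `app/main.py`: dashboard, manual sync, analysis re-run, handling status updates, 30-minute background incremental sync.
- `app/steam.py`: collector for Steam reviews, discussion topics, and replies.
- `app/sync.py`: unified sync flow, deduplicated storage, model analysis calls, analysis re-run.
- `app/openrouter.py`: OpenRouter `deepseek/deepseek-v4-pro` structured classification.
- `app/db.py`: SQLite schema.
- `app/models.py`: the unified raw content object `RawItem`.
- `app/cli.py`: command-line entry point.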
## Unified Data Flow
```text
platform collector -> RawItem -> raw_items -> OpenRouter -> analysis_results -> work_items -> dashboard
```
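A new platform should not change the dashboard data structures directly. Prefer having the platform collector output `RawItem` and reuse the existing sync and analysis flow.
## RawItem Field Conventions
A new platform collector must provide at least: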
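- `source`: platform identifier, e.g. `steam_reviews`, `steam_discussions`.
- `source_item_id`: stable deduplication key; must include the platform and the content ID.
- `source_url`: a link back to the original content.
- `content_type`: content type, e.g. `review`, `discussion_topic`, `discussion_reply`.
- `author_id` / `author_name`: fill in whatever is available.
- `title`: post title; empty if there is none.
- `published_at`: Unix timestamp; provide it whenever possible.
- `published_at_text`: the platform's original time text.
- `updated_at_source`: the platform's original update time; empty if there is none.
- `content`: the body text sent to the model for analysis.
- `raw`: the platform's original fields as JSON.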
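The field conventions above map naturally onto a dataclass. The sketch below is illustrative only: the real definition lives in `app/models.py`, is not shown here, and may differ. The field names follow the list; the types, defaults, and the `problems` helper are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class RawItem:
    source: str
    source_item_id: str
    source_url: str
    content_type: str
    content: str
    author_id: Optional[str] = None
    author_name: Optional[str] = None
    title: Optional[str] = None
    published_at: Optional[int] = None  # Unix timestamp
    published_at_text: Optional[str] = None
    updated_at_source: Optional[str] = None
    raw: dict[str, Any] = field(default_factory=dict)

    def problems(self) -> list[str]:
        """Return the names of required fields that are empty or whitespace only."""
        required = ("source", "source_item_id", "source_url", "content_type", "content")
        return [name for name in required if not str(getattr(self, name) or "").strip()]
```

A collector could call `problems()` before handing items to the sync flow, so that items without a stable key or usable text are caught early.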
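## Steps for Adding a New Platform
1. Verify current facts: whether the page/API is reachable, whether a logged-in session is required, and whether rate limits apply.
2. Define content types and the deduplication key.
3. Implement the platform collector, returning `list[RawItem]`.
4. Wire the collector into `app/sync.py`, keeping the rule that failures never delete old data.
5. Run a small-sample smoke test: fetch, deduplicate, AI analysis, dashboard display.
6. Then design the first-run full strategy and the later incremental strategy.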
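## Known Implementation Decisions
- AI model: OpenRouter `deepseek/deepseek-v4-pro`.
- Key: `.env` / environment variable `OPENROUTER_API_KEY`.
- Dashboard sorting: reply-recommended items first; within each group, newest published first.
- Analysis re-run: at most 20 items per batch, ordered by `published_at/collected_at`, newest first.
- LAN service: `python -m uvicorn app.main:app --host 0.0.0.0 --port 8000`.
- There is currently no login authentication; exposing the service on the LAN carries the risk that anyone on the network can modify handling status.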
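## Questions a New Platform Plan Must Answer
- What is the operational purpose of monitoring this platform?
- Which content types are collected?
- Is the first run a full fetch? What are its boundaries?
- What stops the later incremental runs?
- How is the original link generated?
- Can the publish time be parsed? How are relative times handled?
- Should replies and nested comment threads be collected?
- Is a logged-in session, cookie, API key, or browser automation required?
- How are failures, rate limits, and duplicate collection handled?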