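✨ YouTube Transcript API ✨

This is a Python API that lets you retrieve the transcript/subtitles of a given YouTube video. It also works for automatically generated subtitles, supports translating subtitles, and does not require a headless browser, unlike Selenium-based solutions!

This project is maintained thanks to all its contributors and sponsors. If you would like to sponsor this project and have your avatar or company logo appear below, click here. 💖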

Installation

pip install youtube-transcript-api

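You can either integrate this module into an existing application or use it directly via the CLI.

API

The easiest way to get a transcript for a given video is to execute: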

from youtube_transcript_api import YouTubeTranscriptApi

ytt_api = YouTubeTranscriptApi()
ytt_api.fetch(video_id)

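Note: By default, this tries to fetch the English transcript of the video. If your video is in a different language, or you want a transcript in a different language, please read the section below.

Note: Pass in the video ID, NOT the video URL. For example, for a video with the URL https://www.youtube.com/watch?v=12345, the ID is 12345.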
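
Because the API expects an ID rather than a URL, it can be handy to extract the ID first. The following is a minimal sketch using only the standard library. The helper name extract_video_id is hypothetical and not part of this module, and it only handles URLs of the watch?v=... form shown above.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url: str) -> str:
    """Return the value of the `v` query parameter of a YouTube watch URL."""
    # parse_qs maps each query parameter name to a list of its values
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    if not values:
        raise ValueError(f"no 'v' query parameter in URL: {url!r}")
    return values[0]


print(extract_video_id("https://www.youtube.com/watch?v=12345"))
```

The returned string is what you would pass to fetch() as video_id. Other URL shapes (for example shortened links) are deliberately not covered by this sketch.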

This returns a FetchedTranscript object that looks roughly like this:

FetchedTranscript(
    snippets=[
        FetchedTranscriptSnippet(
            text="Hey there",
            start=0.0,
            duration=1.54,
        ),
        FetchedTranscriptSnippet(
            text="how are you",
            start=1.54,
            duration=4.16,
        ),
        # ...
    ],
    video_id="12345",
    language="English",
    language_code="en",
    is_generated=False,
)

This object implements most of the interface of a List:

ytt_api = YouTubeTranscriptApi()
fetched_transcript = ytt_api.fetch(video_id)

# is iterable
for snippet in fetched_transcript:
    print(snippet.text)

# indexable
last_snippet = fetched_transcript[-1]

# provides a length
snippet_count = len(fetched_transcript)

If you prefer to work with the raw transcript data, you can call fetched_transcript.to_raw_data(), which returns a list of dictionaries:

[
    {
        'text': 'Hey there',
        'start': 0.0,
        'duration': 1.54
    },
    {
        'text': 'how are you',
        'start': 1.54,
        'duration': 4.16
    },
    # ...
]
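
Since the raw data is just a list of dictionaries, it can be post-processed with plain Python. The sketch below works on a hand-written sample in the shape shown above, so it runs without network access. The helper names format_timestamp and to_timestamped_lines are illustrative and not part of this module.

```python
# Sample data in the shape returned by to_raw_data()
raw_data = [
    {"text": "Hey there", "start": 0.0, "duration": 1.54},
    {"text": "how are you", "start": 1.54, "duration": 4.16},
]


def format_timestamp(seconds: float) -> str:
    """Render a start time in seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def to_timestamped_lines(snippets):
    """Prefix each snippet's text with its formatted start time."""
    return [f"[{format_timestamp(s['start'])}] {s['text']}" for s in snippets]


# Join all snippet texts into one continuous string
full_text = " ".join(s["text"] for s in raw_data)

print(full_text)
for line in to_timestamped_lines(raw_data):
    print(line)
```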

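Retrieving different languages

To make sure a transcript is retrieved in your desired language (the default is English), add the languages parameter: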

YouTubeTranscriptApi().fetch(video_id, languages=['de', 'en'])

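This is a list of language codes in descending priority. In this example, the module first tries to fetch the German transcript ('de') and falls back to the English transcript ('en') if that fails. To find out which languages are available, have a look at the list() method.

Even if you only want one language, the languages argument still needs to be formatted as a list: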

YouTubeTranscriptApi().fetch(video_id, languages=['de'])
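
The descending-priority rule can be illustrated with a small, library-independent sketch. This is not the module's actual implementation. pick_language is a hypothetical helper that only demonstrates the "first match in priority order wins" idea, and it returns None when nothing matches, which is an assumption made for this illustration rather than the module's behavior.

```python
def pick_language(available, preferred):
    """Return the first code in `preferred` that is also in `available`, else None."""
    for code in preferred:
        if code in available:
            return code
    return None


# German is preferred, but only English and French exist, so English is chosen
print(pick_language({"en", "fr"}, ["de", "en"]))
```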

Preserving formatting

To keep HTML formatting elements such as <i> (italics) and <b> (bold), add preserve_formatting=True:

YouTubeTranscriptApi().fetch(video_id, languages=['de', 'en'], preserve_formatting=True)

Listing available transcripts

To list all transcripts available for a given video, call:

ytt_api = YouTubeTranscriptApi()
transcript_list = ytt_api.list(video_id)

This returns a TranscriptList object, which is iterable and provides methods to filter the list of transcripts by language and type, for example:

transcript = transcript_list.find_transcript(['de', 'en'])

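By default, this module prefers manually created transcripts over automatically generated ones whenever a transcript in the requested language is available in both forms. The TranscriptList lets you bypass this default behavior by searching for a specific transcript type: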

# filter for manually created transcripts
transcript = transcript_list.find_manually_created_transcript(['de', 'en'])

# or automatically generated ones
transcript = transcript_list.find_generated_transcript(['de', 'en'])

The methods find_generated_transcript, find_manually_created_transcript, and find_transcript return Transcript objects, which contain metadata about the transcript:

print(
    transcript.video_id,
    transcript.language,
    transcript.language_code,
    # whether it has been manually created or generated by YouTube
    transcript.is_generated,
    # whether this transcript can be translated or not
    transcript.is_translatable,
    # a list of languages the transcript can be translated to
    transcript.translation_languages,
)

They also provide a method to fetch the actual transcript data:

transcript.fetch()

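This returns a FetchedTranscript object, just like YouTubeTranscriptApi().fetch() does.

Translating transcripts

YouTube has a feature that automatically translates subtitles, and this module makes it possible to access it. To that end, Transcript objects provide a translate() method, which returns a new, translated Transcript object: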

transcript = transcript_list.find_transcript(['en'])
translated_transcript = transcript.translate('de')
print(translated_transcript.fetch())

Example

from youtube_transcript_api import YouTubeTranscriptApi

ytt_api = YouTubeTranscriptApi()

# retrieve the available transcripts
transcript_list = ytt_api.list('video_id')

# iterate over all available transcripts
for transcript in transcript_list:

    # the Transcript object provides metadata properties
    print(
        transcript.video_id,
        transcript.language,
        transcript.language_code,
        # whether it has been manually created or generated by YouTube
        transcript.is_generated,
        # whether this transcript can be translated or not
        transcript.is_translatable,
        # a list of languages the transcript can be translated to
        transcript.translation_languages,
    )

    # fetch the actual transcript data
    print(transcript.fetch())

    # translating the transcript will return another transcript object
    print(transcript.translate('en').fetch())

# you can also directly filter for the language you are looking for, using the transcript list
transcript = transcript_list.find_transcript(['de', 'en'])  

# or just filter for manually created transcripts  
transcript = transcript_list.find_manually_created_transcript(['de', 'en'])  

# or automatically generated ones  
transcript = transcript_list.find_generated_transcript(['de', 'en'])

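Working around IP bans (`RequestBlocked` or `IpBlocked` exception)

YouTube has started blocking most IPs known to belong to cloud providers (such as AWS, Google Cloud Platform, Azure, etc.), so you will most likely run into RequestBlocked or IpBlocked exceptions when deploying your code to any cloud solution. The IP of a self-hosted solution can also be blocked if it sends too many requests. You can work around these IP bans using proxies. However, since YouTube bans static proxies after extended use, rotating residential proxies are the most reliable option.

Several providers offer rotating residential proxies. After testing a number of them, I found Webshare to be the most reliable and have therefore integrated it into this module to make setup as easy as possible.

Using Webshare

Once you have created a Webshare account and purchased a "Residential" proxy package that suits your workload (make sure NOT to purchase "Proxy Server" or "Static Residential"!), open the Webshare Proxy Settings to retrieve your "Proxy Username" and "Proxy Password". Use this information to initialize the YouTubeTranscriptApi as follows: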

from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.proxies import WebshareProxyConfig

ytt_api = YouTubeTranscriptApi(
    proxy_config=WebshareProxyConfig(
        proxy_username="<proxy-username>",
        proxy_password="<proxy-password>",
    )
)

# all requests done by ytt_api will now be proxied through Webshare
ytt_api.fetch(video_id)

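Using the WebshareProxyConfig defaults to rotating residential proxies and requires no further configuration.

Note that referral links are used here, and any purchases made through these links will support this open source project, which is very much appreciated! 💖😊🙏💖

Of course, you are free to integrate your own proxy solution using the GenericProxyConfig class, as described in the following section.

Using other proxy solutions

As an alternative to Webshare, you can set up any generic HTTP/HTTPS/SOCKS proxy using the GenericProxyConfig class: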

from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.proxies import GenericProxyConfig

ytt_api = YouTubeTranscriptApi(
    proxy_config=GenericProxyConfig(
        http_url="http://user:pass@domain:port",
        https_url="https://user:pass@domain:port",
    )
)

# all requests done by ytt_api will now be proxied using the defined proxy URLs
ytt_api.fetch(video_id)

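Be aware that using a proxy does not guarantee that you won't be blocked, as YouTube can always block the IP of your proxy! Therefore, if you want to maximize reliability, choose a solution that rotates through a pool of proxy addresses.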
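
If a proxy username or password contains characters such as @ or :, they generally need to be percent-encoded before being embedded in a proxy URL. A minimal sketch using the standard library follows. build_proxy_url and the host proxy.example.com are hypothetical names used for illustration only and are not part of this module.

```python
from urllib.parse import quote


def build_proxy_url(scheme: str, username: str, password: str, host: str, port: int) -> str:
    """Assemble scheme://user:pass@host:port with percent-encoded credentials."""
    # safe="" makes quote() escape every reserved character, including "@", ":" and "/"
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{host}:{port}"


proxy_url = build_proxy_url("http", "user", "p@ss:word", "proxy.example.com", 8080)
print(proxy_url)
```

The resulting string has the same shape as the http_url and https_url values passed to GenericProxyConfig above.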

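Overriding request defaults

When a YouTubeTranscriptApi object is initialized, it creates a requests.Session that is used for all HTTP(S) requests. This allows cookies to be cached across multiple requests. However, you can also pass your own requests.Session object to the constructor, for example to share cookies between different instances of YouTubeTranscriptApi, override defaults, set custom headers, or specify SSL certificates.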

from requests import Session

http_client = Session()

# set custom header
http_client.headers.update({"Accept-Encoding": "gzip, deflate"})

# set path to CA_BUNDLE file
http_client.verify = "/path/to/certfile"

ytt_api = YouTubeTranscriptApi(http_client=http_client)
ytt_api.fetch(video_id)

# share same Session between two instances of YouTubeTranscriptApi
ytt_api_2 = YouTubeTranscriptApi(http_client=http_client)
# now shares cookies with ytt_api
ytt_api_2.fetch(video_id)

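Cookie authentication

Some videos are age-restricted, so this module cannot access them without some form of authentication. Unfortunately, recent changes to the YouTube API have broken the current implementation of cookie-based authentication, so this feature is currently unavailable.

Using formatters

Formatters add an extra processing step to the transcript: they convert a FetchedTranscript object into a string in a given "format", such as plain text (.txt) or a well-defined format like JSON (.json), WebVTT (.vtt), SRT (.srt), comma-separated values (.csv), etc.

The formatters submodule provides a few basic formatters, which can be used as is or extended to suit your needs: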

JSONFormatter
PrettyPrintFormatter
TextFormatter
WebVTTFormatter
SRTFormatter

Here is how to import from the formatters module:

# the base class to inherit from when creating your own formatter.
from youtube_transcript_api.formatters import Formatter

# some provided subclasses, each outputs a different string format.
from youtube_transcript_api.formatters import JSONFormatter
from youtube_transcript_api.formatters import TextFormatter
from youtube_transcript_api.formatters import WebVTTFormatter
from youtube_transcript_api.formatters import SRTFormatter

Formatter example

Let's say we want to retrieve a transcript and store it in a JSON file. That would look like this:

# your_custom_script.py

from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import JSONFormatter

ytt_api = YouTubeTranscriptApi()
transcript = ytt_api.fetch(video_id)

formatter = JSONFormatter()

# .format_transcript(transcript) turns the transcript into a JSON string.
json_formatted = formatter.format_transcript(transcript)

# Now we can write it out to a file.
with open('your_filename.json', 'w', encoding='utf-8') as json_file:
    json_file.write(json_formatted)

# You should now have a new JSON file that you can easily read back into Python.
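
Reading such a file back into Python only needs the standard json module. The sketch below is self-contained: it first writes a small hand-made sample to a temporary directory, assuming the file holds a JSON list of snippet dictionaries shaped like the to_raw_data() output shown earlier, and then loads it again.

```python
import json
import os
import tempfile

# Hand-written sample standing in for the formatter output (an assumption)
sample = [
    {"text": "Hey there", "start": 0.0, "duration": 1.54},
    {"text": "how are you", "start": 1.54, "duration": 4.16},
]

# Write the sample to a throwaway location so the example does not touch real files
path = os.path.join(tempfile.mkdtemp(), "your_filename.json")
with open(path, "w", encoding="utf-8") as json_file:
    json_file.write(json.dumps(sample))

# Read the file back into a list of dictionaries
with open(path, encoding="utf-8") as json_file:
    snippets = json.load(json_file)

# Example of using the loaded data: sum of all snippet durations in seconds
total_duration = sum(s["duration"] for s in snippets)
print(len(snippets), round(total_duration, 2))
```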

Passing extra keyword arguments

Since JSONFormatter builds on json.dumps(), you can also pass keyword arguments to .format_transcript(transcript), for example indent=2 to make the output prettier:

json_formatted = JSONFormatter().format_transcript(transcript, indent=2)
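
To see what such a keyword argument does, its effect can be reproduced with json.dumps() directly. This sketch uses only the standard library and a hand-written sample list in the shape of to_raw_data(). It does not call JSONFormatter itself.

```python
import json

# Hand-written sample in the shape of to_raw_data()
snippets = [{"text": "Hey there", "start": 0.0, "duration": 1.54}]

# Default output: everything on a single line
compact = json.dumps(snippets)

# indent=2 puts every element and key on its own line, indented by two spaces
pretty = json.dumps(snippets, indent=2)

print(compact)
print(pretty)
```

Both strings decode to the same data, so the choice only affects readability and file size.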

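Custom formatter example

You can implement your own formatter class. Just inherit from the Formatter base class and make sure to implement the format_transcript(self, transcript: FetchedTranscript, **kwargs) -> str and format_transcripts(self, transcripts: List[FetchedTranscript], **kwargs) -> str methods, each of which should ultimately return a string.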

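from typing import List

from youtube_transcript_api import FetchedTranscript
from youtube_transcript_api.formatters import Formatter


class MyCustomFormatter(Formatter):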
    def format_transcript(self, transcript: FetchedTranscript, **kwargs) -> str:
        # Do your custom work in here, but return a string.
        return 'your processed output data as a string.'

    def format_transcripts(self, transcripts: List[FetchedTranscript], **kwargs) -> str:
        # Do your custom work in here to format a list of transcripts, but return a string.
        return 'your processed output data as a string.'
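
As an idea for what the body of format_transcript could do, here is a sketch that renders snippets as CSV using the standard csv module. To keep it runnable without the library, it uses a hypothetical stand-in class Snippet with the same text, start, and duration fields as FetchedTranscriptSnippet. snippets_to_csv is likewise an illustrative name, not something this module provides.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Snippet:
    """Hypothetical stand-in with the same fields as FetchedTranscriptSnippet."""
    text: str
    start: float
    duration: float


def snippets_to_csv(snippets) -> str:
    """Render an iterable of snippet-like objects as a CSV string with a header row."""
    buffer = io.StringIO()
    # csv.writer quotes fields that contain commas or quotes
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["start", "duration", "text"])
    for snippet in snippets:
        writer.writerow([snippet.start, snippet.duration, snippet.text])
    return buffer.getvalue()


csv_text = snippets_to_csv([
    Snippet("Hey there", 0.0, 1.54),
    Snippet("how, are you", 1.54, 4.16),
])
print(csv_text)
```

Because a FetchedTranscript is iterable over its snippets, a custom format_transcript could in principle pass the transcript straight to a helper like this and return the result.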

CLI

Execute the CLI script using the video IDs as parameters, and the results will be printed to the command line:

youtube_transcript_api <first_video_id> <second_video_id> ...

The CLI also lets you provide a list of preferred languages:

youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en

You can also specify whether to exclude automatically generated or manually created subtitles:

youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en --exclude-generated
youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en --exclude-manually-created

If you would rather write the results to a file or pipe them into another application, you can output them as JSON:

youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en --format json > transcripts.json

Translating transcripts via the CLI is also possible:

youtube_transcript_api <first_video_id> <second_video_id> ... --languages en --translate de

If you are not sure which languages are available for a given video, you can list all available transcripts by calling:

youtube_transcript_api --list-transcripts <first_video_id>

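If a video's ID starts with a hyphen, you need to escape the hyphen with \ to prevent the CLI from mistaking it for an argument name. For example, to get the transcript for the video with the ID -abc123, run: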

youtube_transcript_api "\-abc123"

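Working around IP bans using the CLI

If you run into RequestBlocked or IpBlocked errors because YouTube has blocked your IP, you can work around this using residential proxies, as explained in Working around IP bans. To use Webshare "Residential" proxies through the CLI, you need to create a Webshare account and purchase a "Residential" proxy package that suits your workload (make sure NOT to purchase "Proxy Server" or "Static Residential"!). Then use the "Proxy Username" and "Proxy Password" found in your Webshare Proxy Settings to run the following command: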

youtube_transcript_api <first_video_id> <second_video_id> --webshare-proxy-username "username" --webshare-proxy-password "password"

If you prefer a different proxy solution, you can set up a generic HTTP/HTTPS proxy using the following command:

youtube_transcript_api <first_video_id> <second_video_id> --http-proxy http://user:pass@domain:port --https-proxy https://user:pass@domain:port

Cookie authentication using the CLI

To authenticate using cookies through the CLI, as explained in Cookie authentication, run:

youtube_transcript_api <first_video_id> <second_video_id> --cookies /path/to/your/cookies.txt

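Warning

This code uses an undocumented part of the YouTube API, which is called by the YouTube web client. There is therefore no guarantee that it will keep working if YouTube changes how things work. If that happens, I will do my best to get things working again as soon as possible. So if it stops working, let me know!

Contributing

To set up the project locally, run the following (requires poetry to be installed):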

poetry install --with test,dev

There are poe tasks for running tests, coverage, the linter, and the formatter (all of which need to pass for the build to succeed):

poe test
poe coverage
poe format
poe lint

If you just want to make sure your code passes all the checks necessary for a green build, you can simply run:

poe precommit

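Donations

If this project makes you happy by reducing your development time, you can make me happy by buying me a coffee or becoming a sponsor of this project :)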