✨ YouTube Transcript API ✨

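This is a Python API that retrieves the subtitles/transcript of a given YouTube video. It also works with automatically generated subtitles and supports translating subtitles. Unlike Selenium-based solutions, it does not require a headless browser!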

Maintenance of this project is made possible by all of its contributors and sponsors. If you would like to sponsor this project and have your avatar or company logo appear below, click here. 💖

Install

It is recommended to install this module using pip:

pip install youtube-transcript-api

You can either integrate this module into an existing application or use it via the CLI.

API

The easiest way to get the transcript of a given video is to run:

from youtube_transcript_api import YouTubeTranscriptApi

ytt_api = YouTubeTranscriptApi()
ytt_api.fetch(video_id)

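Note: By default, this tries to access the English subtitles of the video. If the video is in a different language, or you want subtitles in another language, read the sections below.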

Note: Pass the video ID, NOT the video URL. For a video with the URL https://www.youtube.com/watch?v=12345, the ID is 12345.
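If your input is a full URL rather than an ID, the ID can be pulled out of the `v` query parameter before calling the API. The following is a minimal sketch using only the standard library; `extract_video_id` is a hypothetical helper (not part of this module), and it only handles URLs of the `watch?v=...` form shown above.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> str:
    """Return the value of the `v` query parameter of a watch URL.

    Hypothetical helper for illustration; not part of youtube_transcript_api.
    """
    values = parse_qs(urlparse(url).query).get("v")
    if not values:
        raise ValueError(f"no video ID found in URL: {url!r}")
    return values[0]


video_id = extract_video_id("https://www.youtube.com/watch?v=12345")
print(video_id)  # prints 12345
```

The resulting string is what you would pass on as `video_id`.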

Calling fetch() returns a FetchedTranscript object, which looks roughly like this:

FetchedTranscript(
    snippets=[
        FetchedTranscriptSnippet(
            text="Hey there",
            start=0.0,
            duration=1.54,
        ),
        FetchedTranscriptSnippet(
            text="how are you",
            start=1.54,
            duration=4.16,
        ),
        # ...
    ],
    video_id="12345",
    language="English",
    language_code="en",
    is_generated=False,
)

This object implements most of the interface of a List:

ytt_api = YouTubeTranscriptApi()
fetched_transcript = ytt_api.fetch(video_id)

# is iterable
for snippet in fetched_transcript:
    print(snippet.text)

# indexable
last_snippet = fetched_transcript[-1]

# provides a length
snippet_count = len(fetched_transcript)
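Because the object is iterable and each snippet carries `start` and `duration`, simple lookups can be written as plain loops. The sketch below finds the snippet that is being spoken at a given point in time. It does not import the library: `Snippet` is a stand-in assumed to have the same three fields as `FetchedTranscriptSnippet`, and `snippet_at` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Snippet:
    # Stand-in for FetchedTranscriptSnippet (assumption: same three fields).
    text: str
    start: float
    duration: float


def snippet_at(snippets: Iterable[Snippet], seconds: float) -> Optional[Snippet]:
    """Return the first snippet whose time span covers `seconds`, or None."""
    for snippet in snippets:
        if snippet.start <= seconds < snippet.start + snippet.duration:
            return snippet
    return None


sample = [Snippet("Hey there", 0.0, 1.54), Snippet("how are you", 1.54, 4.16)]
print(snippet_at(sample, 2.0).text)  # prints "how are you"
```

Since the function only iterates and reads attributes, a fetched transcript should be usable in place of `sample`.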

If you prefer to work with the raw transcript data, you can call fetched_transcript.to_raw_data(), which returns a list of dictionaries:

[
    {
        'text': 'Hey there',
        'start': 0.0,
        'duration': 1.54
    },
    {
        'text': 'how are you',
        'start': 1.54,
        'duration': 4.16
    },
    # ...
]
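A list of dictionaries in this shape is easy to post-process with ordinary Python. As a small sketch, the following joins the text of all entries and computes where the last entry ends. `raw_data` is hard-coded here to mirror the example above, and `full_text` and `end_time` are hypothetical helper names.

```python
# Sample data in the shape shown above (hard-coded for illustration).
raw_data = [
    {"text": "Hey there", "start": 0.0, "duration": 1.54},
    {"text": "how are you", "start": 1.54, "duration": 4.16},
]


def full_text(entries):
    """Join the text of all entries into a single string."""
    return " ".join(entry["text"] for entry in entries)


def end_time(entries):
    """Return the latest end time (start + duration) in seconds, or 0.0 if empty."""
    return max((e["start"] + e["duration"] for e in entries), default=0.0)


print(full_text(raw_data))  # prints "Hey there how are you"
print(round(end_time(raw_data), 2))  # prints 5.7
```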

Retrieve different languages

To make sure the transcript is retrieved in your desired language, you can add the languages parameter (it defaults to English).

YouTubeTranscriptApi().fetch(video_id, languages=['de', 'en'])

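This is a list of language codes in descending priority. In this example, the German transcript ('de') is tried first, and the English transcript ('en') is fetched if that fails. To find out which languages are available, see list().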

Even if you only want a single language, the languages argument still has to be passed as a list:

YouTubeTranscriptApi().fetch(video_id, languages=['de'])
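The meaning of "descending priority" can be illustrated with a few lines of plain Python. This is only a sketch of the selection rule described above, not the module's actual implementation; `pick_language` is a hypothetical function, and returning None when nothing matches is a simplification made for this sketch.

```python
from typing import Iterable, Optional, Sequence


def pick_language(preferred: Sequence[str], available: Iterable[str]) -> Optional[str]:
    """Return the first code in `preferred` that is also in `available`."""
    available_set = set(available)
    for code in preferred:
        if code in available_set:
            return code
    return None  # simplification: no preferred language is available


print(pick_language(["de", "en"], ["en", "fr"]))  # prints en
print(pick_language(["de", "en"], ["en", "de"]))  # prints de
```

Note that the order of `available` does not matter; only the order of the preferred list decides which language wins.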

Preserve formatting

If you want to keep HTML formatting elements such as <i> (italics) and <b> (bold), you can add preserve_formatting=True.

YouTubeTranscriptApi().fetch(video_id, languages=['de', 'en'], preserve_formatting=True)

List available transcripts

To list all transcripts that are available for a given video, call:

ytt_api = YouTubeTranscriptApi()
transcript_list = ytt_api.list(video_id)

This returns a TranscriptList object, which is iterable and provides methods to filter the list of transcripts by language and type:

transcript = transcript_list.find_transcript(['de', 'en'])

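By default, this module always chooses the manually created transcript when a transcript in the requested language is available both as manually created and as automatically generated. The TranscriptList lets you bypass this default behavior by searching for a specific transcript type: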

# filter for manually created transcripts
transcript = transcript_list.find_manually_created_transcript(['de', 'en'])

# or automatically generated ones
transcript = transcript_list.find_generated_transcript(['de', 'en'])
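The default preference described above (manually created before generated, within the requested language order) can be sketched with stand-in data. This is an illustration of the stated rule under assumptions, not the module's implementation: `TranscriptInfo` is a hypothetical record holding just two metadata fields, and `find_preferred` is a hypothetical function.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TranscriptInfo:
    # Hypothetical stand-in with two of the metadata fields of a transcript.
    language_code: str
    is_generated: bool


def find_preferred(
    transcripts: Sequence[TranscriptInfo], languages: Sequence[str]
) -> Optional[TranscriptInfo]:
    """Pick a transcript by language priority, preferring manually created ones."""
    for code in languages:
        matches = [t for t in transcripts if t.language_code == code]
        manual = [t for t in matches if not t.is_generated]
        if manual:
            return manual[0]
        if matches:
            return matches[0]
    return None  # simplification: nothing available in the requested languages


available = [
    TranscriptInfo("en", is_generated=True),
    TranscriptInfo("en", is_generated=False),
    TranscriptInfo("de", is_generated=True),
]
print(find_preferred(available, ["en"]).is_generated)  # prints False
```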

The methods find_generated_transcript, find_manually_created_transcript and find_transcript return Transcript objects. These contain metadata about the transcript:

print(
    transcript.video_id,
    transcript.language,
    transcript.language_code,
    # whether it has been manually created or generated by YouTube
    transcript.is_generated,
    # whether this transcript can be translated or not
    transcript.is_translatable,
    # a list of languages the transcript can be translated to
    transcript.translation_languages,
)

They also provide a method to fetch the actual transcript data:

transcript.fetch()

This returns a FetchedTranscript object, just like YouTubeTranscriptApi().fetch() does.

Translate transcript

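YouTube has a feature that automatically translates subtitles, and this module makes that feature accessible as well. To do so, Transcript objects provide a translate() method, which returns a new, translated Transcript object: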

transcript = transcript_list.find_transcript(['en'])
translated_transcript = transcript.translate('de')
print(translated_transcript.fetch())

By example

from youtube_transcript_api import YouTubeTranscriptApi

ytt_api = YouTubeTranscriptApi()

# retrieve the available transcripts
transcript_list = ytt_api.list('video_id')

# iterate over all available transcripts
for transcript in transcript_list:

    # the Transcript object provides metadata properties
    print(
        transcript.video_id,
        transcript.language,
        transcript.language_code,
        # whether it has been manually created or generated by YouTube
        transcript.is_generated,
        # whether this transcript can be translated or not
        transcript.is_translatable,
        # a list of languages the transcript can be translated to
        transcript.translation_languages,
    )

    # fetch the actual transcript data
    print(transcript.fetch())

    # translating the transcript will return another transcript object
    print(transcript.translate('en').fetch())

# you can also directly filter for the language you are looking for, using the transcript list
transcript = transcript_list.find_transcript(['de', 'en'])

# or just filter for manually created transcripts
transcript = transcript_list.find_manually_created_transcript(['de', 'en'])

# or automatically generated ones
transcript = transcript_list.find_generated_transcript(['de', 'en'])

Working around IP bans (`RequestBlocked` or `IpBlocked` exception)

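Unfortunately, YouTube has started blocking most IPs that are known to belong to cloud providers (such as AWS, Google Cloud Platform and Azure). This means you will most likely run into RequestBlocked or IpBlocked exceptions when you deploy your code to any cloud solution. The same can happen to the IP of a self-hosted solution if you send too many requests. You can work around these IP bans by using proxies. However, since YouTube bans static proxies after extended use, rotating residential proxies are the most reliable option.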

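Various providers offer rotating residential proxies, but after testing different offerings I have found Webshare to be the most reliable. I have therefore integrated it into this module to make setting it up as easy as possible.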

Using Webshare

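Once you have created a Webshare account and purchased a "Residential" proxy package that suits your workload (make sure NOT to purchase "Proxy Server" or "Static Residential"!), open the Webshare Proxy Settings to retrieve your "Proxy Username" and "Proxy Password". With this information you can initialize YouTubeTranscriptApi as follows: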

from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.proxies import WebshareProxyConfig

ytt_api = YouTubeTranscriptApi(
    proxy_config=WebshareProxyConfig(
        proxy_username="<proxy-username>",
        proxy_password="<proxy-password>",
    )
)

# all requests done by ytt_api will now be proxied through Webshare
ytt_api.fetch(video_id)

Using WebshareProxyConfig defaults to rotating residential proxies and requires no further configuration.

Note that referral links are used here. Any purchases made through these links support this open source project, which is very much appreciated! 💖😊🙏💖

You are of course free to integrate your own proxy solution using the GenericProxyConfig class, described in the following section, if you prefer another provider or want to implement your own solution.

Using other proxy solutions

As an alternative to Webshare, you can set up any generic HTTP/HTTPS/SOCKS proxy using the GenericProxyConfig class:

from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.proxies import GenericProxyConfig

ytt_api = YouTubeTranscriptApi(
    proxy_config=GenericProxyConfig(
        http_url="http://user:pass@domain:port",
        https_url="https://user:pass@domain:port",
    )
)

# all requests done by ytt_api will now be proxied using the defined proxy URLs
ytt_api.fetch(video_id)

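Be aware that using a proxy does not guarantee that you won't be blocked, since YouTube can always block the IP of your proxy. If you want to maximize reliability, you should therefore choose a solution that rotates through a pool of proxy addresses.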
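Rotating through a pool can be as simple as cycling over a list of proxy URLs and building a fresh configuration for each one. The sketch below only produces the keyword arguments and does not import the library. `rotating_proxy_urls` is a hypothetical helper, the proxy addresses are placeholders, and using the same URL for both `http_url` and `https_url` is an assumption that may not fit every provider.

```python
from itertools import cycle
from typing import Dict, Iterator, Sequence


def rotating_proxy_urls(proxy_urls: Sequence[str]) -> Iterator[Dict[str, str]]:
    """Return an endless iterator of proxy keyword arguments, one proxy at a time."""
    if not proxy_urls:
        raise ValueError("proxy pool must not be empty")
    return ({"http_url": url, "https_url": url} for url in cycle(proxy_urls))


# Placeholder addresses; replace them with the proxies of your provider.
pool = rotating_proxy_urls([
    "http://user:pass@proxy-a.example:8080",
    "http://user:pass@proxy-b.example:8080",
])

kwargs = next(pool)
print(kwargs["http_url"])  # prints http://user:pass@proxy-a.example:8080

# With the library installed, each batch of requests could use the next proxy:
# ytt_api = YouTubeTranscriptApi(proxy_config=GenericProxyConfig(**kwargs))
```

Calling `next(pool)` again yields the second proxy and then wraps around to the first.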

Overwriting request defaults

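When a YouTubeTranscriptApi object is initialized, it creates a requests.Session that is used for all HTTP(S) requests. This allows cookies to be cached across multiple requests. If you want to manually share cookies between different instances of YouTubeTranscriptApi, overwrite defaults, set custom headers or specify SSL certificates, you can pass your own requests.Session object into the constructor.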

from requests import Session
from youtube_transcript_api import YouTubeTranscriptApi

http_client = Session()

# set custom header
http_client.headers.update({"Accept-Encoding": "gzip, deflate"})

# set path to CA_BUNDLE file
http_client.verify = "/path/to/certfile"

ytt_api = YouTubeTranscriptApi(http_client=http_client)
ytt_api.fetch(video_id)

# share same Session between two instances of YouTubeTranscriptApi
ytt_api_2 = YouTubeTranscriptApi(http_client=http_client)
# now shares cookies with ytt_api
ytt_api_2.fetch(video_id)

Cookie Authentication

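Some videos are age restricted, so this module cannot access them without some form of authentication. Unfortunately, recent changes to the YouTube API have broken the current implementation of cookie-based authentication, so this feature is currently not available.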

Using Formatters

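Formatters are meant to be an additional layer of processing for the transcript you pass to them. The goal is to convert a FetchedTranscript object into a consistent string of a given "format", such as basic text (.txt) or a format with a defined specification, such as JSON (.json), WebVTT (.vtt), SRT (.srt) or comma-separated values (.csv).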

The formatters submodule provides a few basic formatters, which can be used as is or extended to fit your needs:

JSONFormatter
PrettyPrintFormatter
TextFormatter
WebVTTFormatter
SRTFormatter

Here is how to import from the formatters module.

# the base class to inherit from when creating your own formatter.
from youtube_transcript_api.formatters import Formatter

# some provided subclasses, each outputs a different string format.
from youtube_transcript_api.formatters import JSONFormatter
from youtube_transcript_api.formatters import TextFormatter
from youtube_transcript_api.formatters import WebVTTFormatter
from youtube_transcript_api.formatters import SRTFormatter

Formatter Example

Say you want to retrieve a transcript and store it in a JSON file. That would look something like this:

# your_custom_script.py

from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import JSONFormatter

ytt_api = YouTubeTranscriptApi()
transcript = ytt_api.fetch(video_id)

formatter = JSONFormatter()

# .format_transcript(transcript) turns the transcript into a JSON string.
json_formatted = formatter.format_transcript(transcript)

# Now we can write it out to a file.
with open('your_filename.json', 'w', encoding='utf-8') as json_file:
    json_file.write(json_formatted)

# You should now have a new JSON file that you can easily read back into Python.
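Reading such a file back is a matter of calling json.load(). The sketch below is self-contained and does not call the library: `json_formatted` is a stand-in string, under the assumption that the formatter output is a JSON array of snippet objects like the raw data shown earlier, and the file is written to a temporary directory rather than your working directory.

```python
import json
import os
import tempfile

# Stand-in for the formatter output (assumed shape: a JSON array of snippets).
json_formatted = json.dumps([
    {"text": "Hey there", "start": 0.0, "duration": 1.54},
    {"text": "how are you", "start": 1.54, "duration": 4.16},
])

path = os.path.join(tempfile.mkdtemp(), "your_filename.json")

# Write the JSON string out, as in the example above.
with open(path, "w", encoding="utf-8") as json_file:
    json_file.write(json_formatted)

# Read it back into Python data structures.
with open(path, encoding="utf-8") as json_file:
    snippets = json.load(json_file)

print(snippets[0]["text"])  # prints "Hey there"
```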

Passing extra keyword arguments

Since JSONFormatter uses json.dumps(), you can also forward keyword arguments to .format_transcript(transcript). For example, forwarding the indent=2 keyword argument makes the file output easier to read.

json_formatted = JSONFormatter().format_transcript(transcript, indent=2)

Custom Formatter Example

You can implement your own formatter class. Just inherit from the Formatter base class and implement the format_transcript(self, transcript: FetchedTranscript, **kwargs) -> str and format_transcripts(self, transcripts: List[FetchedTranscript], **kwargs) -> str methods. Both must ultimately return a string when called on your formatter instance.

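from typing import List

from youtube_transcript_api import FetchedTranscript
from youtube_transcript_api.formatters import Formatter


class MyCustomFormatter(Formatter):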
    def format_transcript(self, transcript: FetchedTranscript, **kwargs) -> str:
        # Do your custom work in here, but return a string.
        return 'your processed output data as a string.'

    def format_transcripts(self, transcripts: List[FetchedTranscript], **kwargs) -> str:
        # Do your custom work in here to format a list of transcripts, but return a string.
        return 'your processed output data as a string.'
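As an idea of what the "custom work" might be, the sketch below turns snippets into lines prefixed with a MM:SS timestamp. It is written as plain functions so it runs without the library; the logic of `format_with_timestamps` could be moved into the body of a format_transcript method. Both function names are hypothetical, and the sample objects are stand-ins that only mimic the `text` and `start` attributes of real snippets.

```python
from types import SimpleNamespace


def format_timestamp(seconds: float) -> str:
    """Format a number of seconds as MM:SS (minutes are not capped at 59)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def format_with_timestamps(transcript) -> str:
    """Return one '[MM:SS] text' line per snippet, joined by newlines."""
    return "\n".join(
        f"[{format_timestamp(snippet.start)}] {snippet.text}"
        for snippet in transcript
    )


# Stand-in snippets exposing only the attributes used above.
sample = [
    SimpleNamespace(text="Hey there", start=0.0, duration=1.54),
    SimpleNamespace(text="how are you", start=1.54, duration=4.16),
]

print(format_with_timestamps(sample))
# [00:00] Hey there
# [00:01] how are you
```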

CLI

Run the CLI script with the video IDs as parameters, and the results are printed to the command line:

youtube_transcript_api <first_video_id> <second_video_id> ...

The CLI also lets you provide a list of preferred languages:

youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en

You can also specify whether to exclude automatically generated or manually created subtitles:

youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en --exclude-generated
youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en --exclude-manually-created

If you would rather write the results to a file or pipe them into another application, you can output them as JSON using the following line:

youtube_transcript_api <first_video_id> <second_video_id> ... --languages de en --format json > transcripts.json

Translating transcripts using the CLI is also possible:

youtube_transcript_api <first_video_id> <second_video_id> ... --languages en --translate de

If you are not sure which languages are available for a given video, you can list all available transcripts by calling:

youtube_transcript_api --list-transcripts <first_video_id>

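If a video's ID starts with a hyphen, you have to mask the hyphen using \ to prevent the CLI from mistaking it for an argument name. For example, to get the transcript of the video with the ID -abc123, run: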

youtube_transcript_api "\-abc123"

Working around IP bans using the CLI

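If you are running into RequestBlocked or IpBlocked errors because YouTube blocks your IP, you can work around this by using rotating residential proxies, as explained in Working around IP bans. To use Webshare "Residential" proxies through the CLI, you have to create a Webshare account and purchase a "Residential" proxy package that suits your workload (make sure NOT to purchase "Proxy Server" or "Static Residential"!). Then run the following command with the "Proxy Username" and "Proxy Password" found in your Webshare Proxy Settings: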

youtube_transcript_api <first_video_id> <second_video_id> --webshare-proxy-username "username" --webshare-proxy-password "password"

If you prefer to use another proxy solution, you can set up a generic HTTP/HTTPS proxy using the following command:

youtube_transcript_api <first_video_id> <second_video_id> --http-proxy http://user:pass@domain:port --https-proxy https://user:pass@domain:port

Cookie Authentication using the CLI

To authenticate using cookies through the CLI, as explained in Cookie Authentication, run:

youtube_transcript_api <first_video_id> <second_video_id> --cookies /path/to/your/cookies.txt

Warning

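This code uses an undocumented part of the YouTube API, which is called by the YouTube web client. There is therefore no guarantee that it won't stop working tomorrow if they change how things work. If that happens, I will do my best to get it working again as soon as possible. So if it stops working, let me know!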

Contributing

To set up the project locally, run the following (requires poetry to be installed):

poetry install --with test,dev

There are poe tasks to run the tests, coverage, linter and formatter (all of them have to pass for the build to pass):

poe test
poe coverage
poe format
poe lint

If you just want to make sure that your code passes all the checks required for a green build, you can simply run:

poe precommit

Donations

If this project makes you happy by reducing your development time, you can make me happy by treating me to a cup of coffee, or by becoming a sponsor of this project :)