Vertex AI의 최신 멀티모달 모델인 Gemini 1.5 모델을 사용해 보고 100만 개의 토큰 컨텍스트 윈도우로 빌드할 수 있는 항목을 확인해 보세요. Vertex AI의 최신 멀티모달 모델인 Gemini 1.5 모델을 사용해 보고 100만 개의 토큰 컨텍스트 윈도우로 빌드할 수 있는 항목을 확인해 보세요.

긴 오디오 파일을 텍스트로 변환

이 페이지에서는 Speech-to-Text API 및 비동기 음성 인식을 사용하여 긴 오디오 파일(1분 이상)을 텍스트로 변환하는 방법을 설명합니다.

비동기 음성 인식 정보

비동기 음성 인식은 장기 실행 오디오 처리 작업을 시작합니다. 비동기 음성 인식을 사용하여 60초 이상의 오디오를 텍스트로 변환합니다. 이보다 짧은 오디오는 동기 음성 인식이 더 빠르고 더 간단합니다. 비동기 음성 인식의 상한값은 480분입니다.

Speech-to-Text 및 비동기 처리

오디오 콘텐츠는 비동기 처리를 위해 로컬 파일에서 Speech-to-Text로 직접 보낼 수 있습니다. 하지만 로컬 파일의 오디오 시간 제한은 60초입니다. 60초보다 긴 로컬 오디오 파일을 텍스트로 변환하려고 하면 오류가 발생합니다. 비동기식 음성 인식을 사용하여 60초보다 긴 오디오를 텍스트로 변환하려면 데이터를 Google Cloud Storage 버킷에 저장해야 합니다.

google.longrunning.Operations 메서드를 사용하여 작업 결과를 검색할 수 있습니다. 결과는 5일(120시간) 동안 검색할 수 있습니다. Google Cloud Storage 버킷에 결과를 직접 업로드할 수도 있습니다.

Google Cloud Storage 파일을 사용하여 긴 오디오 파일 텍스트 변환

이 샘플은 Cloud Storage 버킷을 사용하여 장기 실행 텍스트 변환 프로세스의 원시 오디오 입력을 저장합니다. 일반적인 longrunningrecognize 작업 응답 예시는 참조 문서를 참조하세요.

프로토콜

자세한 내용은 speech:longrunningrecognize API 엔드포인트를 참조하세요.

동기 음성 인식을 수행하려면 POST 요청을 하고 적절한 요청 본문을 제공합니다. 다음은 curl을 사용한 POST 요청의 예시입니다. 이 예시에서는 Google Cloud CLI를 사용하여 액세스 토큰을 생성합니다. gcloud CLI 설치에 대한 안내는 빠른 시작을 참조하세요.

curl -X POST \
     -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
     -H "Content-Type: application/json; charset=utf-8" \
     --data "{
  'config': {
    'language_code': 'en-US'
  },
  'audio':{
    'uri':'gs://cloud-samples-tests/speech/brooklyn.flac'
  }
}" "https://speech.googleapis.com/v1/speech:longrunningrecognize"

요청 본문 구성에 대한 자세한 내용은 RecognitionConfig 및 RecognitionAudio 참조 문서를 확인하세요.

요청이 성공하면 서버가 200 OK HTTP 상태 코드와 응답을 JSON 형식으로 반환합니다.

{
  "name": "7612202767953098924"
}

여기서 name는 요청에 대해 생성된 장기 실행 작업 이름입니다.

처리가 완료될 때까지 기다립니다. 처리 시간은 소스 오디오에 따라 다르며, 대부분의 경우에는 소스 오디오 길이의 절반으로 표시됩니다. 장기 실행 작업의 상태는 https://speech.googleapis.com/v1/operations/ 엔드포인트에 GET 요청을 실행하여 알아볼 수 있습니다. your-operation-name을 longrunningrecognize 요청으로부터 반환된 name으로 대체합니다.

curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
     -H "Content-Type: application/json; charset=utf-8" \
     "https://speech.googleapis.com/v1/operations/your-operation-name"

요청이 성공하면 서버가 200 OK HTTP 상태 코드와 응답을 JSON 형식으로 반환합니다.

{
  "name": "7612202767953098924",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "progressPercent": 100,
    "startTime": "2017-07-20T16:36:55.033650Z",
    "lastUpdateTime": "2017-07-20T16:37:17.158630Z"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
    "results": [
      {
        "alternatives": [
          {
            "transcript": "how old is the Brooklyn Bridge",
            "confidence": 0.96096134,
          }
        ]
      },
      {
        "alternatives": [
          {
            ...
          }
        ]
      }
    ]
  }
}

작업이 완료되지 않았으면 응답의 done 속성이 true가 될 때까지 GET 요청을 반복해서 엔드포인트를 폴링할 수 있습니다.

gcloud

자세한 내용은 recognize-long-running 명령어를 참조하세요.

비동기 음성 인식을 수행하려면 Google Cloud CLI를 사용하여 로컬 파일의 경로 또는 Google Cloud Storage URL을 제공합니다.

gcloud ml speech recognize-long-running \
    'gs://cloud-samples-tests/speech/brooklyn.flac' \
     --language-code='en-US' --async

요청이 성공하면 서버가 장기 실행 작업의 ID를 JSON 형식으로 반환합니다.

{
  "name": OPERATION_ID
}

그런 후 다음 명령어를 실행하여 작업에 대한 정보를 얻을 수 있습니다.

gcloud ml speech operations describe OPERATION_ID

또한 다음 명령어를 실행하여 작업이 완료될 때까지 작업을 폴링할 수 있습니다.

gcloud ml speech operations wait OPERATION_ID

작업이 완료되면 오디오 스크립트가 JSON 형식으로 반환됩니다.

{
  "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.9840146,
          "transcript": "how old is the Brooklyn Bridge"
        }
      ]
    }
  ]
}

Go

Speech-to-Text용 클라이언트 라이브러리를 설치하고 사용하는 방법은 Speech-to-Text 클라이언트 라이브러리를 참조하세요. 자세한 내용은 Speech-to-Text Go API 참조 문서를 확인하세요.

Speech-to-Text에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.


func sendGCS(w io.Writer, client *speech.Client, gcsURI string) error {
	ctx := context.Background()

	// Send the contents of the audio file with the encoding and
	// and sample rate information to be transcripted.
	req := &speechpb.LongRunningRecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:        speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz: 16000,
			LanguageCode:    "en-US",
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Uri{Uri: gcsURI},
		},
	}

	op, err := client.LongRunningRecognize(ctx, req)
	if err != nil {
		return err
	}
	resp, err := op.Wait(ctx)
	if err != nil {
		return err
	}

	// Print the results.
	for _, result := range resp.Results {
		for _, alt := range result.Alternatives {
			fmt.Fprintf(w, "\"%v\" (confidence=%3f)\n", alt.Transcript, alt.Confidence)
		}
	}
	return nil
}

Java

Speech-to-Text용 클라이언트 라이브러리를 설치하고 사용하는 방법은 Speech-to-Text 클라이언트 라이브러리를 참조하세요. 자세한 내용은 Speech-to-Text Java API 참조 문서를 확인하세요.

Speech-to-Text에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

/**
 * Performs non-blocking speech recognition on remote FLAC file and prints the transcription.
 *
 * @param gcsUri the path to the remote LINEAR16 audio file to transcribe.
 */
public static void asyncRecognizeGcs(String gcsUri) throws Exception {
  // Configure polling algorithm
  SpeechSettings.Builder speechSettings = SpeechSettings.newBuilder();
  TimedRetryAlgorithm timedRetryAlgorithm =
      OperationTimedPollAlgorithm.create(
          RetrySettings.newBuilder()
              .setInitialRetryDelay(Duration.ofMillis(500L))
              .setRetryDelayMultiplier(1.5)
              .setMaxRetryDelay(Duration.ofMillis(5000L))
              .setInitialRpcTimeout(Duration.ZERO) // ignored
              .setRpcTimeoutMultiplier(1.0) // ignored
              .setMaxRpcTimeout(Duration.ZERO) // ignored
              .setTotalTimeout(Duration.ofHours(24L)) // set polling timeout to 24 hours
              .build());
  speechSettings.longRunningRecognizeOperationSettings().setPollingAlgorithm(timedRetryAlgorithm);

  // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
  try (SpeechClient speech = SpeechClient.create(speechSettings.build())) {

    // Configure remote file request for FLAC
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.FLAC)
            .setLanguageCode("en-US")
            .setSampleRateHertz(16000)
            .build();
    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

    // Use non-blocking call for getting file transcription
    OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
        speech.longRunningRecognizeAsync(config, audio);
    while (!response.isDone()) {
      System.out.println("Waiting for response...");
      Thread.sleep(10000);
    }

    List<SpeechRecognitionResult> results = response.get().getResultsList();

    for (SpeechRecognitionResult result : results) {
      // There can be several alternative transcripts for a given chunk of speech. Just use the
      // first (most likely) one here.
      SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
      System.out.printf("Transcription: %s\n", alternative.getTranscript());
    }
  }
}

Node.js

Speech-to-Text용 클라이언트 라이브러리를 설치하고 사용하는 방법은 Speech-to-Text 클라이언트 라이브러리를 참조하세요. 자세한 내용은 Speech-to-Text Node.js API 참조 문서를 확인하세요.

Speech-to-Text에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

// Imports the Google Cloud client library
const speech = require('@google-cloud/speech');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const gcsUri = 'gs://my-bucket/audio.raw';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
};

const audio = {
  uri: gcsUri,
};

const request = {
  config: config,
  audio: audio,
};

// Detects speech in the audio file. This creates a recognition job that you
// can wait for now, or get its result later.
const [operation] = await client.longRunningRecognize(request);
// Get a Promise representation of the final result of the job
const [response] = await operation.promise();
const transcription = response.results
  .map(result => result.alternatives[0].transcript)
  .join('\n');
console.log(`Transcription: ${transcription}`);

Python

Speech-to-Text용 클라이언트 라이브러리를 설치하고 사용하는 방법은 Speech-to-Text 클라이언트 라이브러리를 참조하세요. 자세한 내용은 Speech-to-Text Python API 참조 문서를 확인하세요.

Speech-to-Text에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

def transcribe_gcs(gcs_uri: str) -> str:
    """Asynchronously transcribes the audio file specified by the gcs_uri.

    Args:
        gcs_uri: The Google Cloud Storage path to an audio file.

    Returns:
        The generated transcript from the audio file provided.
    """
    from google.cloud import speech

    client = speech.SpeechClient()

    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=44100,
        language_code="en-US",
    )

    operation = client.long_running_recognize(config=config, audio=audio)

    print("Waiting for operation to complete...")
    response = operation.result(timeout=90)

    transcript_builder = []
    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        transcript_builder.append(f"\nTranscript: {result.alternatives[0].transcript}")
        transcript_builder.append(f"\nConfidence: {result.alternatives[0].confidence}")

    transcript = "".join(transcript_builder)
    print(transcript)

    return transcript

추가 언어

C#: 클라이언트 라이브러리 페이지의 C# 설정 안내를 따른 다음 .NET용 Speech-to-Text 참조 문서를 참조하세요.

PHP: 클라이언트 라이브러리 페이지의 PHP 설정 안내를 따른 다음 PHP용 Speech-to-Text 참조 문서를 참조하세요.

Ruby: 클라이언트 라이브러리 페이지의 Ruby 설정 안내를 따른 다음 Ruby용 Speech-to-Text 참조 문서를 참조하세요.

스크립트 작성 결과를 Cloud Storage 버킷에 업로드

Speech-to-Text는 장기 실행 인식 결과를 Cloud Storage 버킷에 직접 업로드하는 기능을 지원합니다. Cloud Storage 트리거로 이 기능을 구현하면 Cloud Storage 업로드가 Cloud Functions를 호출하는 알림을 트리거할 수 있고 인식 결과에 Speech-to-Text를 폴링할 필요가 없습니다.

결과를 Cloud Storage 버킷에 업로드하려면 장기 실행 인식 요청에서 선택사항인 TranscriptOutputConfig 출력 구성을 제공합니다.

  message TranscriptOutputConfig {

    oneof output_type {
      // Specifies a Cloud Storage URI for the recognition results. Must be
      // specified in the format: `gs://bucket_name/object_name`
      string gcs_uri = 1;
    }
  }

프로토콜

자세한 내용은 longrunningrecognize API 엔드포인트를 참조하세요.

다음 예시에서는 curl을 사용하여 POST 요청을 보내는 방법을 보여줍니다. 여기서는 요청의 본문에서 Cloud Storage 버킷 경로를 지정합니다. 결과는 이 위치에 SpeechRecognitionResult를 저장하는 JSON 파일로 업로드됩니다.

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     --data "{
  'config': {...},
  'output_config': {
     'gcs_uri':'gs://bucket/result-output-path.json'
  },
  'audio': {
    'uri': 'gs://bucket/audio-path'
  }
}" "https://speech.googleapis.com/v1p1beta1/speech:longrunningrecognize"

LongRunningRecognizeResponse에는 업로드가 시도된 Cloud Storage 버킷에 대한 경로가 포함됩니다. 업로드에 실패하면 출력 오류가 반환됩니다. 이름이 같은 파일이 이미 있는 경우 업로드는 타임스탬프가 서픽스로 추가된 새 파일에 결과를 씁니다.

{
  ...
  "metadata": {
    ...
    "outputConfig": {...}
  },
  ...
  "response": {
    ...
    "results": [...],
    "outputConfig": {
      "gcs_uri":"gs://bucket/result-output-path"
    },
    "outputError": {...}
  }
}

직접 사용해 보기

Google Cloud를 처음 사용하는 경우 계정을 만들어 실제 시나리오에서 Speech-to-Text의 성능을 평가합니다. 신규 고객에게는 워크로드를 실행, 테스트, 배포하는 데 사용할 수 있는 $300의 무료 크레딧이 제공됩니다.

무료로 Speech-to-Text 사용해 보기