OpenAI Whisper audio transcription engine...

Started by Strider3000, April 11, 2023, 12:26:12 PM


Strider3000

...is insanely good:

He made arguments from Athanasius against Apollinarius.
He said, what God did not become, He did not save.
God became man and united Himself to humanity so that we might be united to God.
Therefore, if God really united Himself to a true humanity and became a human being,
then He truly assumed not only a body but also a soul and had a mind and a will.
And that became orthodoxy.
And those are readings in your units to read Gregory of Nazianzus' short letter
where he famously argues this against Apollinarius.
However, it is important because when we get to the next controversy in Nestorius,
which is the major Christological controversy in the ancient world,
Nestorius is in a way, he falls into error, but he is writing against Apollinarius.
He wants to safeguard the full reality of the humanity of Jesus.
Jesus is a fully human being, body and soul, having a mind and a will that are human.
Well, he also wants to safeguard against Arius that Jesus is truly God,
that the Son, the Logos, is truly God.
So, in a way, Nestorius is a good guy in his intentions, kind of.
But it is not enough to believe against Apollinarius that Jesus is truly human,
with a body and soul, and against Arius that Jesus is truly divine.
You also cannot get on the wrong side of the Mother of God.
And that is what Nestorius did. He got on the wrong side of the Mother of God.
They don't call her the Scepter of Orthodoxy for nothing.
Okay, well, anyway.
Anyway, so, the Nestorian controversy breaks out in 428.
So, we are like a hundred years later.
We have jumped a hundred years ahead.
And it breaks out because of the title that is being used by the people of God in the liturgy,
Theotokos, which I am sure you have heard.
T-H-E-O-T-O-K-O-S
Which means literally, She who bears God, or the Mother of God, the Bearer of God.
People of God are calling Mary the Theotokos, the Mother of God.
And Nestorius rejected. He was the Archbishop of Constantinople.
So, number two in the church, right? You've got Rome and you've got Constantinople.
And he is the Archbishop, the Patriarch.
And he rejects the use of this title in the liturgy, Mother of God.

From SoundCloud, at about minute 3:00.

Geremia

Did you do the transcription, or does SoundCloud do it?

Strider3000

Quote from: Geremia on April 30, 2023, 02:52:57 PM
Did you do the transcription, or does SoundCloud do it?

I did the transcription. I'm using the faster-whisper project from GitHub with an RTX3060 - transcribed a couple thousand hours of YouTube and other content so far. Let me know if there is something you would like me to transcribe.

Geremia

#3
Quote from: Strider3000 on June 28, 2023, 08:15:40 AM
faster-whisper project from GitHub with an RTX3060
I have a Quadro RTX 4000. I'll see if I can try faster-whisper in a Python virtual environment ("python3 -m venv") myself. Thanks.

Update: I've tested it out, and I'm impressed. I wrote a little script to run speech-to-text (STT) on all specified audio files in the current working directory:

Code (Python):
import glob
import os
import sys

if len(sys.argv) != 2:
    print("One arg required: glob pattern of filenames, e.g., '*.m4a'")
    sys.exit(1)

from faster_whisper import WhisperModel
from tqdm import tqdm
import pysubs2

model_size = "large-v2"

# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16", download_root="/tmp")

# or run on GPU with INT8
# model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_size, device="cpu", compute_type="int8")

pattern = sys.argv[1]  # a shell-style glob pattern, not a regular expression
audio_files = glob.glob(pattern)

for audio_file in audio_files:
    print(audio_file)
    base = os.path.splitext(audio_file)[0]  # filename without its extension
    segments, info = model.transcribe(audio_file, beam_size=5)
    print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
    results = []
    with open(base + '.txt', 'w') as txt_file:
        # segments is a generator: the transcription runs as it is iterated
        for s in tqdm(segments):
            toprint = "[%4.f → %4.f] %s" % (s.start, s.end, s.text)
            tqdm.write(toprint)
            print(toprint, file=txt_file)  # print to txt file
            results.append({'start': s.start, 'end': s.end, 'text': s.text})
    subs = pysubs2.load_from_whisper(results)
    subs.save(base + '.srt')  # save subtitle file
This generates a plaintext file and an SRT subtitles file.
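
For anyone curious what ends up in the SRT file: roughly, each segment's start and end times (in seconds) become `HH:MM:SS,mmm` timestamps around the text. Below is a stdlib-only sketch of that mapping. The helper names `srt_timestamp` and `segments_to_srt` are made up for illustration and are not part of pysubs2 or faster-whisper; the script above simply lets pysubs2 handle this.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, secs, ms)


def segments_to_srt(segments):
    """Build SRT text from dicts with 'start', 'end' and 'text' keys
    (the same shape as the `results` list in the script above)."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append("%d\n%s --> %s\n%s\n" % (
            index,
            srt_timestamp(seg["start"]),
            srt_timestamp(seg["end"]),
            seg["text"].strip(),
        ))
    # SRT cues are separated by a blank line
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 2.5, "text": " He made arguments from Athanasius"},
        {"start": 2.5, "end": 5.0, "text": " against Apollinarius."},
    ]
    print(segments_to_srt(demo))
```

This is only meant to show the file layout; pysubs2 remains the easier route for real use since it also handles other subtitle formats.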

Geremia

#4
Here's how I did neural machine translation (NMT) on a CUDA GPU with MarianMT + CTranslate2:
Code (Python):
#!/usr/bin/python3 -u

import sys

if len(sys.argv) != 2:
    print('One arg required: German text file to translate → English')
    sys.exit(1)

filename = sys.argv[1]
with open(filename, 'r') as file:
    str_to_encode = file.read()

import ctranslate2
import transformers
# 'opus-mt-de-en' is a local directory holding the model in CTranslate2 format
translator = ctranslate2.Translator('opus-mt-de-en', device='cuda')
tokenizer = transformers.MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-de-en')

# sentence-level segmentation:
import nltk
from nltk.tokenize import sent_tokenize
#nltk.download('punkt')  # no need if already downloaded
sentences = sent_tokenize(str_to_encode, language='german')

# check for sentences >512 tokens and split them
def split_long_sent(wordTokenIDs):
    segments = []
    seg = []
    for w in wordTokenIDs:
        if len(seg) < 512:
            seg.append(w)
        else:
            segments.append(seg)
            seg = [w]
    segments.append(seg) #add last one
    return segments

sentencesTokenIDs = []
for sent in sentences:
    encoded = tokenizer.encode(sent)  # list of token IDs for this sentence
    if len(encoded) > 512:
        sentencesTokenIDs += split_long_sent(encoded)
    else:
        sentencesTokenIDs += [encoded]

base = '.'.join(filename.split('.')[0:-1])
outfilename = base+' en.txt'

from tqdm import tqdm

with open(outfilename, 'w') as outfile:
    for sentTokenIDs in tqdm(sentencesTokenIDs):
        # CTranslate2 takes token strings, not IDs
        source = tokenizer.convert_ids_to_tokens(sentTokenIDs)
        results = translator.translate_batch([source], beam_size=5) #default beam_size = 2 (1 = greedy search)
        target = results[0].hypotheses[0]  # best hypothesis, as token strings
        toprint = tokenizer.decode(tokenizer.convert_tokens_to_ids(target))
        tqdm.write(toprint)
        print(toprint, file=outfile)

It translated Kaulen, S.J., Die Sprachverwirrung Babel: Linguistisch-Theologische Untersuchungen über Gen. XI., 1-9 fairly well, putting each tokenized sentence in its own ¶.

The Helsinki-NLP/opus-mt-de-en MarianMT model was trained on OPUS, the Open Parallel Corpus.
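
The long-sentence handling in the script boils down to cutting a list of token IDs into pieces of at most 512 entries. A shorter way to write the same idea, as a sketch: `chunk_ids` is a hypothetical name, and unlike `split_long_sent` it returns an empty list (rather than a list holding one empty list) for empty input.

```python
def chunk_ids(token_ids, max_len=512):
    """Split a list of token IDs into consecutive chunks of at most max_len."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


if __name__ == "__main__":
    ids = list(range(1200))                  # stand-in for tokenizer.encode(sent)
    print([len(c) for c in chunk_ids(ids)])  # chunk lengths
```

Either way the cut can land mid-clause, so the translation of a very long sentence may read a bit rough at the seams.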

2023-10-30 update:
La mujer cristiana by Schouppe, S.J., was translated using the Helsinki-NLP/opus-mt-es-en model and a Spanish-modified version of the above script.
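
The Spanish-modified script isn't shown here, so the following is only a guess at which values have to change between language pairs: the model name, the local CTranslate2 directory, and the NLTK sentence-tokenizer language. The function `settings_for` and its dictionary keys are hypothetical, chosen to mirror the names used in the German script.

```python
# Source-language code -> language name accepted by NLTK's sent_tokenize.
# Only the two languages mentioned in this thread are listed.
NLTK_LANGUAGE = {"de": "german", "es": "spanish"}


def settings_for(src, tgt="en"):
    """Return the per-language values the translation script would need."""
    if src not in NLTK_LANGUAGE:
        raise ValueError("no sentence-tokenizer language listed for %r" % src)
    pair = "%s-%s" % (src, tgt)
    return {
        "hf_model": "Helsinki-NLP/opus-mt-" + pair,  # tokenizer source
        "ct2_dir": "opus-mt-" + pair,                # local converted model directory
        "nltk_language": NLTK_LANGUAGE[src],         # for sent_tokenize(..., language=...)
        "out_suffix": " %s.txt" % tgt,               # appended to the input basename
    }


if __name__ == "__main__":
    for key, value in settings_for("es").items():
        print(key, "=", value)
```

With something like this, the German and Spanish variants could share one script that takes the source language as a second command-line argument.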

2024-03-16 update:
Concepción católica de la política by Julio Meinvielle was translated into English using the same Helsinki-NLP/opus-mt-es-en model as above.

2024-03-18 update:
Concepción católica de la economía by Julio Meinvielle was translated.