Machine learning datasets for large language model training

Brian Mendoza
Brian Mendoza Member Posts: 2
edited November 21 in English Forum

Hello, I am a machine learning enthusiast and a long-time user of Logos. I would love to be able to create closed-source machine learning datasets for the books I’ve purchased through Logos. They would need to be in raw text format. What would currently be the best approach for this task?

My eventual goal is to train a not-for-profit large language model (i.e. in the same category as ChatGPT) on the datasets in order to have conversations with it about biblical texts.

I would appreciate any help for this project. Thank you!

Comments

  • MJ. Smith
    MJ. Smith MVP Posts: 53,403

    I would appreciate any help for this project.

    Welcome to the forums. First, I would check with a lawyer to see if there are any legal pitfalls. Second, I would recognize that Logos resources are in a proprietary format that would need to be stripped down to raw text, to the point that I would look for a different source of data, one as close to raw text as possible, as a starting point. Third, if I still wanted to use Logos as a source, I would run a small proof of concept on a narrow topic to see if this is the best approach to the desired end.

    Orthodox Bishop Alfeyev: "To be a theologian means to have experience of a personal encounter with God through prayer and worship."; Orthodox proverb: "We know where the Church is, we do not know where it is not."

  • DMB
    DMB Member Posts: 13,613 ✭✭✭

    Adding to MJ:

    1. You didn't actually purchase the books; you purchased access, with the Faithlife user agreement as controlling. So, you'd want to look at that, specifically off-loading the text for input into an engine of your choice.

    2. But assuming that is permitted, your primary problem will be sample size: over-learning (overfitting) and memorization. Unless you're passing the data to external engines, in which case, see #1.
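
    One rough way to check for the memorization mentioned in point 2 is to measure how much of a model's output repeats the training text word for word. The sketch below is only an illustration: the function names and the choice of 5-word sequences are assumptions, not anything from Logos or a particular training library.

```python
def ngram_list(tokens, n):
    """Return every run of n consecutive tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def verbatim_overlap(training_text, generated_text, n=5):
    """Fraction of n-word sequences in generated_text that also occur
    verbatim in training_text (0.0 = none, 1.0 = all of them).

    Hypothetical helper for illustration; whitespace tokenization and
    n=5 are arbitrary assumptions.
    """
    train_ngrams = set(ngram_list(training_text.lower().split(), n))
    gen_ngrams = ngram_list(generated_text.lower().split(), n)
    if not gen_ngrams:
        return 0.0  # output shorter than n words: nothing to compare
    hits = sum(1 for g in gen_ngrams if g in train_ngrams)
    return hits / len(gen_ngrams)
```

    A score close to 1.0 on text the model produced from a fresh prompt would suggest it is reciting the corpus rather than generalizing from it.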

    This may be of interest: my Bible software has its own embedded neural nets for quick access by the user. It's primarily aimed at 'similarity' searching, style matching, and age estimation within various on-board texts (Hebrew, Greek, etc.).
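
    For readers unfamiliar with similarity searching, here is a minimal sketch of the general idea. It is not the software described above, and it uses plain word counts with cosine similarity rather than a neural net; the function names and the simple word tokenizer are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a passage into a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty passage matches nothing
    return dot / (norm_a * norm_b)


def most_similar(query, passages, top_k=3):
    """Return the top_k (score, passage) pairs, best match first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(p)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

    Calling `most_similar("my shepherd", passages, top_k=1)` on a list of plain-text passages returns the passage sharing the most vocabulary with the query. This needs no training at all, so it avoids the small-sample problem, though it only matches shared words and not meaning.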

    "If myth is ideology in narrative form, then scholarship is myth with footnotes." B. Lincoln 1999.

  • Brian Mendoza
    Brian Mendoza Member Posts: 2

    Thanks for the input. So legally, personal use would be out of the question?

    If not, similarity search would work well as an alternative. What Bible software are you referring to?

  • MJ. Smith
    MJ. Smith MVP Posts: 53,403

    What Bible software are you referring to?

    DMB writes her own to meet her own needs. I almost mentioned her as a source of information in my post but thought I'd let her decide if she wished to be involved in the thread.
