Open-source LLM Meta Llama 3.1 reaching GPT-4 levels of reasoning. Amazing. Full transcript.

The previously impossible has happened. Meta's open-source AI Llama 3.1 has reached approximately the level of GPT-4 and can be run on a powerful consumer-grade gaming PC, without expensive specialized hardware from Nvidia.

Meta has released three Llama 3.1 models: a small one with 8B parameters, a 70B model and, for the first time, a very large foundation model with 405B parameters.

I tested the medium-sized Llama 3.1 model with 70B parameters locally and I'm impressed with the results. For the first time, the AI was also able to solve the difficult test questions (see transcript below).

On the MMLU benchmark (0-shot, CoT) it scores 86 points and thus reaches the level of GPT-4 (85.4) for the first time. Amazing.

Previous local open-source AIs have only made it to GPT-3.5 level.

In the meantime, however, OpenAI has moved on with two newer models: GPT-4o (Omni) and a new model (o1-preview), which is supposed to be even more powerful.

What do you think? I am curious to see when we will see the first "o1"-like open-source LLM.

Open-Source LLMs. AI-generated illustration (Copilot Designer)

Meta Llama 3.1 70B Instruct

https://huggingface.co/bullerwins/Meta-Llama-3.1-70B-Instruct-GGUF

https://www.llama.com/llama3_1/

Quantization: 5-bit (Q5_K_M), Meta-Llama-3.1-70B-Instruct-Q5_K_M.gguf
Parameters: min_p
Prompting: zero-shot
RAM: 62 GB
VRAM: 12 GB
Runtime: about 1-2 minutes per prompt
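
To get a feeling for why this setup needs so much RAM, here is a rough back-of-the-envelope sketch in Python. It is only an estimate under stated assumptions: the average of about 5.5 bits per weight for Q5_K_M and the round figure of 70 billion parameters are my assumptions, not measured values, and the KV cache and runtime overhead are ignored. The function names are made up for illustration.

```python
def estimate_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def split_across_memory(weights_gb: float, vram_gb: float) -> tuple[float, float]:
    """Naive split: fill the VRAM first, the remainder stays in system RAM."""
    on_gpu = min(weights_gb, vram_gb)
    return on_gpu, weights_gb - on_gpu


# Assumptions (not from the model card): 70e9 parameters, ~5.5 bits per weight.
weights_gb = estimate_weights_gb(70e9, 5.5)
on_gpu, in_ram = split_across_memory(weights_gb, vram_gb=12)
print(f"weights: ~{weights_gb:.1f} GB, GPU: ~{on_gpu:.1f} GB, RAM: ~{in_ram:.1f} GB")
```

Under these assumptions the weights alone are far larger than 12 GB of VRAM, so most of the model has to sit in system RAM. That would fit the runtime of 1-2 minutes per prompt.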

Interface (oobabooga/text-generation-webui)

https://github.com/oobabooga/text-generation-webui

Llama 3.1 Test (Full-Transcript)

Q

What is the capital city of Austria? Answer only the name of the city.

A (Llama 3.1)

Vienna

Result

Correct.

Q

Is Bitcoin a security or commodity?

A (Llama 3.1)

That's a complex question with ongoing debate among regulators and experts. While some consider Bitcoin as a commodity, similar to gold, due to its store-of-value properties, others argue it exhibits characteristics of a security, such as a tradable asset with potential for investment returns. The SEC has stated that Bitcoin is not a security, but rather a commodity, while the CFTC has classified it as a commodity. However, this classification can vary depending on the jurisdiction and the specific use case. How can I help you navigate this ambiguity?

Result

Correct.

Q

What is the hive blockchain?

A (Llama 3.1)

The Hive blockchain is a decentralized, open-source blockchain network that was forked from the Steem blockchain in March 2020. It's designed to support a wide range of applications, including social media, blogging, and online communities. Hive allows users to create and curate content, and rewards them with cryptocurrency tokens, called Hive (HIVE), for contributing value to the network. Would you like to know more about its features or how it differs from Steem?

Q (Follow-up)

yes, please

A (Llama 3.1)

One of the key differences between Hive and Steem is the governance model. Hive is designed to be more decentralized and community-driven, with a stronger focus on consensus and collaboration. Hive also has a more streamlined architecture, which allows for faster transaction processing and lower fees. Additionally, Hive has introduced new features such as Communities, which enables users to create and participate in interest-based groups, and a more robust reward system that incentivizes high-quality content creation. Hive also has a more open and inclusive approach to development, with a strong focus on community engagement and feedback. Would you like to know more about the tokenomics of Hive or its potential use cases?

Q (Follow-up)

yes continue

A (Llama 3.1)

Let's dive deeper into the tokenomics of Hive. The Hive token (HIVE) is the native cryptocurrency of the network, and it serves several purposes. It's used as a reward for content creators, curators, and witnesses (validators) who contribute value to the network. HIVE is also used for governance, allowing holders to vote on proposals and participate in decision-making processes. In terms of token supply, the total supply of HIVE is capped at 1.5 billion, with a fixed inflation rate of 3.5% per annum. The token distribution is designed to be fair and equitable, with a focus on rewarding active users and contributors. As for potential use cases, Hive has a wide range of applications, including social media platforms, blogging, online communities, and even decentralized finance (DeFi) applications. The platform's flexibility and customizability make it an attractive option for developers and entrepreneurs looking to build decentralized applications. One of the most promising use cases is the development of decentralized social media platforms that prioritize user data ownership and monetization. Would you like to know more about the current state of the Hive ecosystem or its potential for scalability and adoption?

Result

The first part is correct. However, Steem has communities too, and the info about inflation is bogus.

Q

What is 2^8?

A (Llama 3.1)

The answer to 2^8 is 256. Would you like me to explain the calculation or is there something else I can help you with, perhaps a math-related problem or a different topic altogether?

Result

Correct.

Q

Can you compute the 8th root of 2^8?

A (Llama 3.1)

A clever question! The 8th root of 2^8 can be calculated as follows: (2^8)^(1/8) = 2^(8/8) = 2^1 = 2. So, the answer is simply 2! Would you like to know more about exponentiation or roots, or is there something else I can help you with?

Result

Correct.
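
Both math answers are easy to double-check. A minimal Python sketch (the helper name nth_root is made up for illustration; the root is computed in floating point, so it is compared with a tolerance instead of exact equality):

```python
import math


def nth_root(value: float, n: int) -> float:
    """n-th root via a fractional exponent (floating point)."""
    return value ** (1.0 / n)


power = 2 ** 8              # integer arithmetic, exact
root = nth_root(power, 8)   # floating point, may carry rounding error

print(power)
print(math.isclose(root, 2.0))
```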

Q

Solve the puzzle: When I was 6 my sister was half my age, now i'm 70. How old's my sister? Let's think step by step.

A (Llama 3.1)

Sounds like a fun puzzle! Let's break it down step by step. When you were 6, your sister was half your age, which means she was 3 years old at that time. To find out how old she is now, we need to calculate how many years have passed since then. If you're now 70, that means 70 - 6 = 64 years have passed since you were 6. Since your sister was 3 at that time, we add those 64 years to her age: 3 + 64 = 67. Therefore, your sister is 67 years old. Am I correct?

Result

Correct. There is also a hidden subtlety: strictly speaking the answer is 67 +/- 1 year, depending on the reference dates (birthdays).
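
The subtlety in the result note can be checked by brute force. The following Python sketch is my own illustration, assuming ages are counted in whole years and that the fractional part of each age (time since the last birthday) can be anything on a monthly grid. The function name and the grid size are made up.

```python
from fractions import Fraction
from math import floor


def possible_sister_ages(steps: int = 12) -> set[int]:
    """Collect the sister's whole-year age over all fractional-age combinations."""
    fracs = [Fraction(i, steps) for i in range(steps)]  # 0, 1/steps, ...
    ages = set()
    for my_frac_then in fracs:        # I was 6 + my_frac_then years old
        for sis_frac_then in fracs:   # she was 3 + sis_frac_then years old
            gap = (6 + my_frac_then) - (3 + sis_frac_then)  # exact age gap
            for my_frac_now in fracs:  # I am now 70 + my_frac_now years old
                ages.add(floor(70 + my_frac_now - gap))
    return ages


print(sorted(possible_sister_ages()))
```

If all fractional parts are zero (steps=1), only 67 remains, which is the answer the model gave.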

Q (Bar test)

Answer the following exam question based on US law:

A father lived with his son, who was an alcoholic. When drunk, the son often became violent and physically abused his father. As a result, the father always lived in fear. One night, the father heard his son on the front stoop making loud obscene remarks. The father was certain that his son was drunk and was terrified that he would be physically beaten again. In his fear, he bolted the front door and took out a revolver. When the son discovered that the door was bolted, he kicked it down. As the son burst through the front door, his father shot him four times in the chest, killing him. In fact, the son was not under the influence of alcohol or any drug and did not intend to harm his father. At trial, the father presented the above facts and asked the judge to instruct the jury on self-defense. How should the judge instruct the jury with respect to self-defense?

(A) Give the self-defense instruction, because it expresses the defense’s theory of the case.
(B) Give the self-defense instruction, because the evidence is sufficient to raise the defense.
(C) Deny the self-defense instruction, because the father was not in imminent danger from his son.
(D) Deny the self-defense instruction, because the father used excessive force

What is the correct answer?

A (Llama 3.1)

The correct answer is (B) Give the self-defense instruction, because the evidence is sufficient to raise the defense.

The father's belief that he was in imminent danger of bodily harm, although mistaken, was reasonable given the past violent behavior of his son when drunk and his perception that his son was drunk at the time of the incident. The judge should instruct the jury on self-defense because the evidence, although ultimately found to be incorrect, was sufficient to raise the defense. It is up to the jury to determine whether the father's belief was reasonable and whether the force used was proportionate to the threat.

Note: This question requires the application of the "reasonable person" standard in the context of self-defense. The focus is on whether the defendant's subjective belief of being in danger was reasonable, not whether the danger was objectively real.

Result

Correct. I am not a lawyer, but I assume the shooting could be justified in some US states under the stand-your-ground principle.

Q (GMAT test)

Select the best of the answer choices given.

The interview is an essential part of a successful hiring program because, with it job applicants who have personalities that are unsuited to the requirements of the job will be eliminated from consideration.

This argument logically depends on which of the following assumptions?

A: A Hiring program will be successful if it includes interviews.

B: The interview is a more important part of a successful hiring program than is the development of a job description.

C: Interviewers can accurately identify applicants whose personalities are unsuited to the requirements of the job.

D: The only purpose of an interview is to evaluate whether job applicants’ personalities are suited to the requirements of the job.

E: The fit of job applicants’ personalities to the requirements of the job was once the most important factor in making hiring decisions.

A (Llama 3.1)

The correct answer is C: Interviewers can accurately identify applicants whose personalities are unsuited to the requirements of the job. This assumption is essential for the argument to hold, as it assumes that the interview process can effectively eliminate candidates with unsuitable personalities. Without this assumption, the argument falls apart.

Result

Correct.

Q

Write a hello world program in Commodore 64 Basic that flickers the screen.

A (Llama 3.1)

Here's a simple Commodore 64 Basic program that flickers the screen:

10 POKE 646,0: REM set screen to black
20 POKE 646,1: REM set screen to normal
30 GOTO 10

This program uses the POKE command to change the value at memory address 646, which controls the screen mode. By rapidly switching between 0 (black screen) and 1 (normal screen), we create a flickering effect. Be careful not to run this program for too long, as it can be a bit disorienting! Would you like me to explain how this works or suggest some variations?

Result

Incorrect. The AI confused the border color (53280 or $D020) and the background color (53281 or $D021) with the cursor color (646 or $0286).
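
The addresses in the result note can be cross-checked with a few lines of Python, which also print what a corrected listing could look like. This is only a sketch: the listing is my own suggestion and not output from the model, I have not run it on real hardware or an emulator, and the names C64_COLOR_ADDRESSES and corrected_listing are made up for illustration.

```python
# Addresses as named in the result note above: (decimal, hexadecimal).
C64_COLOR_ADDRESSES = {
    "border color": (53280, 0xD020),
    "background color": (53281, 0xD021),
    "cursor color": (646, 0x0286),
}


def corrected_listing(color_a: int = 0, color_b: int = 1) -> list[str]:
    """Build a BASIC listing that toggles border and background colors."""
    border = C64_COLOR_ADDRESSES["border color"][0]
    background = C64_COLOR_ADDRESSES["background color"][0]
    return [
        '10 PRINT "HELLO WORLD"',
        f"20 POKE {border},{color_a}:POKE {background},{color_a}",
        f"30 POKE {border},{color_b}:POKE {background},{color_b}",
        "40 GOTO 20",
    ]


for name, (decimal, hexadecimal) in C64_COLOR_ADDRESSES.items():
    assert decimal == hexadecimal  # decimal and hex notation must agree
    print(f"{name}: {decimal} = ${hexadecimal:04X}")

print("\n".join(corrected_listing()))
```

The default colors 0 and 1 are black and white on the C64, so the suggested listing switches the whole screen between the two as fast as BASIC allows.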

Conclusion

Super impressive: Llama 3.1 answered almost all questions correctly. The exceptions are some hallucinations about the Hive blockchain, and it is not yet fluent in C64 BASIC.
