My Personal Experience with LLM Limitations: The Good, the Bad, and the Frustrating

Like many people, I was blown away when I first used a large language model. As I incorporated LLMs into my routines, however, the cracks steadily began to show. My strategy for overcoming the inaccuracies, hallucinations, refusals, and myriad other issues associated with this tech was to start using multiple models simultaneously. This is what led me to build Mindpool AI.

As someone who now regularly uses Mindpool AI for both work and personal tasks, I've been able to extensively compare the capabilities and limitations of the LLMs available through the tool. In this post, I'll share my personal assessments of these models: the good, the bad, and the frustrating. I hope this helps others find their own use cases for each of them.
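
If you're curious what "using multiple models simultaneously" looks like under the hood, here's a minimal sketch of the fan-out pattern the idea boils down to: send one prompt to every model in parallel and collect the answers side by side. The model names and the `query_model` helper are placeholders for illustration, not Mindpool AI's actual internals or any vendor's real API.

```python
from concurrent.futures import ThreadPoolExecutor

MODELS = ["chatgpt", "claude", "gemini", "llama", "mistral", "perplexity"]

def query_model(model: str, prompt: str) -> str:
    # Placeholder: swap in each provider's real SDK call here.
    return f"[{model}] response to: {prompt}"

def fan_out(prompt: str) -> dict[str, str]:
    # Query every model concurrently, so the total wait is set by the
    # slowest model rather than the sum of all of them.
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        futures = {m: pool.submit(query_model, m, prompt) for m in MODELS}
        return {m: f.result() for m, f in futures.items()}

answers = fan_out("Summarize the Treaty of Paris in two sentences.")
for model, answer in answers.items():
    print(f"{model}: {answer}")
```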

ChatGPT
Good: ChatGPT usually has succinct answers.

LLMs tend to be overly verbose, but thankfully, ChatGPT usually keeps its answers succinct.

Bad: ChatGPT will often only address one question at a time.

Too often, ChatGPT either conflates the questions I ask or answers only one of them.

In one test prompt, I provided all the LLMs with the same background information and instructions. I told them I would be asking for an email and a campaign description, but that I wanted to start with the email first. Frustratingly, ChatGPT conflated the two requests, providing a response that addressed neither the email nor the campaign description properly.

The other LLMs were able to properly separate the requests and respond to them individually.
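
If you run into the same behavior, one workaround I've had luck with is handing the model its multi-part task one request per message instead of all at once. Here's a rough sketch of that pattern; the `ask` helper is a stand-in for whatever chat client you actually use:

```python
def ask(history: list[dict], user_message: str) -> str:
    # Placeholder: send `history` plus the new message to your chat
    # model of choice and return the assistant's reply.
    history.append({"role": "user", "content": user_message})
    reply = f"response to: {user_message}"  # swap in a real API call
    history.append({"role": "assistant", "content": reply})
    return reply

history: list[dict] = []
ask(history, "Background: I'm planning a newsletter campaign. I'll ask for "
             "an email first, then a campaign description.")

# One deliverable per turn keeps the model from conflating the requests.
email = ask(history, "Write only the promotional email.")
description = ask(history, "Now write only the campaign description.")
```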

Claude
Good: Claude makes minimal changes to my writing when asked for simple fixes.

I quite often use LLMs to check for spelling, grammar, flow, and clarity issues, but even when I ask them to change as little as possible, the changes are often extensive. Claude is the only exception, and it's become my go-to for simple fixes.

Bad: Claude has a high refusal rate and chastises me for the questions I ask.

For example, I asked Claude to help me come up with some silly names for my nieces and nephews. It refused. Instead, it suggested I should focus on more uplifting or meaningful interactions with them. Very frustrating! I tried arguing with it, but it continued to refuse.

The other LLMs had no such objections, and I eventually used Zooblefritz as the name I called my niece last Christmas. She loved it, Claude!

Bad: Claude occasionally takes me literally when other LLMs do not.

For example, I asked it to help me with a phrase I was trying to write. I told it to keep the format as written, with a first phrase, a comma, and a second phrase. It included the literal words “phrase 1” and “phrase 2” in its reply. None of the other LLMs got confused.

Gemini
Good: Gemini is the best at consistently handling formatting, especially creating tables.

In addition to basic formatting and bullet points, I often need to work with data that requires creating or finishing tables. In these cases, I've found Gemini and Llama to be the most consistently helpful, with Gemini reliably better than Llama at creating tables. While the other LLMs sometimes attempt to format data into tables, the results often don't copy over cleanly into Google Sheets.

Good: Gemini often goes beyond the question I ask.

Gemini often provides additional insights and perspectives that go beyond the question asked. This can sometimes mean too much information, but Gemini's formatting makes it easy to navigate and skip over the superfluous parts. It can also be a source of frustration, since Gemini won't always do exactly what you want. Still, when it's used alongside other LLMs, this tendency is usually a pro rather than a con.

Bad: Gemini is more resistant to adjustments than other LLMs.

Along the same lines, Gemini tends to be more resistant to changes in direction. For instance, I initially asked it to describe something poetically, then realized I wanted a more succinct and clear response instead, so I explicitly told it I no longer wanted a poetic answer. The other LLMs adapted accordingly, but Gemini continued to produce poetic output. Frustrating.

Bad: Gemini also has a high refusal rate, though it rarely chastises me.

For example, I asked which Japanese unit was infamous for committing war crimes during WWII. Gemini refused to respond, but the other LLMs correctly identified Unit 731. It bears mentioning that Gemini did not chastise me. It just said, “Due to the sensitive nature of the content you requested, I am unable to provide assistance on this topic.”

Llama
Good: Llama is good at consistently handling formatting and creating tables.

As mentioned earlier, both Llama and Gemini are particularly useful when it comes to working with tables, though Gemini gets the edge.

Good: Llama's responses often feel intentional and relevant to my questions.

Unlike other LLMs, which sometimes require more guidance or clarification, Llama tends to understand what I'm looking for on the first try. That means I can often get a relevant and accurate answer without needing to rephrase or provide additional context.

Mistral
Good: Mistral is very good at writing code.

Mistral is very effective for writing code, often producing better results than ChatGPT.

Bad: Mistral’s responses can feel a bit generic.

I often feel Mistral’s answers are the weakest outside of GPT-3.5 Turbo. However, I still regularly use Mistral, as it occasionally has great comprehension when the other LLMs fail to understand what I mean or want.

Bad: Mistral does not use much formatting in its responses, so they are hard to skim.

When trying to quickly compare responses, Mistral's is usually the toughest to parse, since it typically replies with unformatted blocks of text. That said, this wouldn't be as noticeable if I were using it on its own rather than through the Mindpool AI interface.

Perplexity
Good: Perplexity is good at simple questions about current topics that it can quickly answer from search results.

I use Perplexity when I want quick answers on current topics or for buying advice. The other LLMs either can’t do this or don’t do it as quickly.

Bad: Perplexity suffers from major contextual drift.

Unless explicitly told otherwise, Perplexity treats each question separately and often needs to be reminded of earlier context. For example, I asked, “Why is September 3, 1783, significant in American history?”

Perplexity said, “September 3, 1783, is significant in American history because it marks the official end of the American Revolutionary War with the signing of the Treaty of Paris.”
I asked, “What were the reactions in Europe to the Treaty of Paris?”
Perplexity responded that the Treaty of Paris (1951) established the European Coal and Steel Community (ECSC), which is not at all related to the prior question. Frustrating.
All other LLMs were able to answer this follow-up question without issue.
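
When I do need Perplexity to hold a thread, the workaround is to restate the context in each prompt. Here's a generic sketch of that pattern, with `send_prompt` as a placeholder for a real client (my own illustration, not Perplexity's actual API):

```python
def send_prompt(prompt: str) -> str:
    # Placeholder: send the prompt to the model and return its reply.
    return f"response to: {prompt[:60]}..."

def ask_with_context(turns: list[tuple[str, str]], question: str) -> str:
    # Restate every prior question/answer pair inside the prompt itself,
    # so a model that treats each query independently still sees the thread.
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in turns)
    return send_prompt(f"{context}\n\nWith that context in mind: {question}")

turns = [("Why is September 3, 1783, significant in American history?",
          "It marks the signing of the Treaty of Paris, ending the Revolutionary War.")]
print(ask_with_context(turns, "What were the reactions in Europe to the Treaty of Paris?"))
```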

Bad: Perplexity often cannot answer questions if its indexed search results do not contain the answer.

I often get responses like “Unfortunately, the search results do not provide information about…” or “there is no specific information about…”, yet I can find the information immediately when searching the internet myself. Frustrating.
These limitations seem largely due to Perplexity's search index, which doesn't always capture the most recent information. It may also misinterpret queries or search results, failing to extract relevant information from its sources.

Overall

Overall, I've found Llama and Gemini to be the most consistently helpful across the range of tasks I use LLMs for. But I use Claude almost exclusively for simple writing tasks and Perplexity exclusively for simple questions on current topics. And while Mistral may be the least frustrating model to use (tied with Llama), I find it mostly helpful as a hedge against the other models when they fail to understand me. As for ChatGPT, it doesn't seem to excel in any particular area compared to the others; this could be because Mindpool AI currently uses GPT-3.5 Turbo, which is older than the other models. Each model has its distinct advantages and limitations, but how they manifest will depend on how you structure your questions, as well as your specific needs and preferences. Use Mindpool AI to experiment with different tasks to find what works best for you, and share your experiences on our Discord!

The TLDR for me:

| LLM | Good | Bad |
| --- | --- | --- |
| ChatGPT | Succinct answers. | Not optimized to answer more than one question at a time without unintuitive prompt engineering. |
| Claude | Good with writing edits. | Highest refusal rate. Sanctimonious. Sometimes takes me very literally. |
| Gemini | Best at formatting, especially tables. Can go beyond the question. | Relatively high refusal rate. Can be verbose. Occasionally resistant to adjustments. |
| Llama | Responses feel intentional. Good at formatting, including tables. Consistently helpful across a range of tasks. | Not as good as Gemini for formatting. Occasional refusals. |
| Mistral | Very good at writing code. Occasionally understands my meaning when other LLMs don’t. | Responses can feel a bit generic. Rarely includes formatting in answers, making it harder to skim. |
| Perplexity | Simple, topical questions. | Suffers from major contextual drift, often needing context repeated in the prompt. Sometimes fails to find relevant answers in its search index. |