0
$\begingroup$

Given an email thread, I am trying to extract the body of the most recent email.

I used to do that with rules. Now I am testing Large Language Models (LLMs) to see whether they provide a less ad hoc solution.

Mistral-7B-Instruct, for instance, seems to understand the task and provides acceptable outputs most of the time.

However, in some cases, it explains the email instead of simply copying the relevant chunk.

I have tried dozens of prompts, for instance:

instruction = 'Given the email thread below the dotted line, extract verbatim the body of the most recent (top) message. Remove all headers, footers and disclaimers. In your response, do not add any text that was not present in the original message.'

And tried to prevent hallucinations by setting the following:

    generation_output = model.generate(
        model_inputs,
        do_sample=True,
        temperature=0.0000001,
        top_p=0.0000001,
        top_k=1,
        max_new_tokens=words,
    )

However, in a few cases, the model still adds explanations and/or hallucinates a bit.

My questions are the following:

  1. Are you aware of any models that could do a better job without fine-tuning? For instance, purely extractive models (as opposed to generative ones).

  2. If generative models are the way to go, is there a way to force the model to just copy/paste?

Best,

Ed

$\endgroup$
1
  • $\begingroup$ If Mistral works properly most of the time, why not detect programmatically if it returns text not present in the email and when it does, rerun with e.g. different temperature? That seems easier than finding an open-source model with near-perfect performance. $\endgroup$ Commented Nov 24, 2023 at 11:55
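
The check suggested in the comment above could look roughly like the sketch below. It assumes that "verbatim" means every non-empty output line also occurs as a line of the thread after whitespace normalization, which may be too strict if the model reflows lines. `generate` is a hypothetical callable standing in for the actual model call, and the temperature values are arbitrary.

```python
from typing import Callable, Optional, Sequence


def _normalize(text: str) -> str:
    """Collapse runs of whitespace so minor spacing differences are ignored."""
    return " ".join(text.split())


def is_verbatim(output: str, thread: str) -> bool:
    """Return True if every non-empty output line also appears in the thread."""
    thread_lines = {_normalize(line) for line in thread.splitlines()}
    return all(
        _normalize(line) in thread_lines
        for line in output.splitlines()
        if line.strip()
    )


def extract_with_retry(
    thread: str,
    generate: Callable[[str, float], str],  # hypothetical: (thread, temperature) -> text
    temperatures: Sequence[float] = (0.0, 0.3, 0.7),  # arbitrary illustrative values
) -> Optional[str]:
    """Try each temperature in turn; return the first verbatim output, else None."""
    for temperature in temperatures:
        candidate = generate(thread, temperature)
        if candidate.strip() and is_verbatim(candidate, thread):
            return candidate
    return None
```

When every attempt fails the check, the function returns `None`, so the caller can fall back to the rule-based extraction.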

1 Answer

2
$\begingroup$

You can prepend each line of the email with a line number and request the LLM to give you the initial and final line numbers of the most recent email, separated by "-". Then, you can parse the output and extract the lines from the original email.
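
A minimal sketch of this idea, leaving out the model call itself: `number_lines` builds the numbered text that goes into the prompt, and `extract_range` parses the reply and slices the original thread. The function names and the `N: ` prefix format are illustrative assumptions, and the parser simply takes the first `start-end` pair it finds in the reply.

```python
import re


def number_lines(thread: str) -> str:
    """Prefix each line of the thread with a 1-based line number."""
    return "\n".join(
        f"{number}: {line}"
        for number, line in enumerate(thread.splitlines(), start=1)
    )


def extract_range(thread: str, model_reply: str) -> str:
    """Parse a reply such as '3-7' and return those lines of the original thread."""
    match = re.search(r"(\d+)\s*-\s*(\d+)", model_reply)
    if match is None:
        raise ValueError(f"no line range found in reply: {model_reply!r}")
    start, end = int(match.group(1)), int(match.group(2))
    lines = thread.splitlines()
    if not 1 <= start <= end <= len(lines):
        raise ValueError(f"line range {start}-{end} is outside the thread")
    # Slice the untouched original lines, so the result is verbatim by construction.
    return "\n".join(lines[start - 1:end])
```

Because the returned text is sliced from the original thread, the model cannot add or alter wording; it can only pick the wrong range.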

$\endgroup$
3
  • 2
    $\begingroup$ It sounds like a viable solution. I will give it a try. Thanks! $\endgroup$ Commented Nov 24, 2023 at 12:00
  • 2
    $\begingroup$ The idea, per se, works. If I add line numbers, the model provides numbers that are fully consistent with the identified email body (which is not 100% correct, but that is a different question). Thanks again! $\endgroup$ Commented Nov 24, 2023 at 15:21
  • $\begingroup$ Interestingly, not all Mistral-based models are able to provide line numbers. Mistral Instruct does, but Mistral OpenInstruct always seems to output text, disregarding the prompt entirely. $\endgroup$ Commented Nov 28, 2023 at 8:33
