Purely extractive Language Model

Question

Given an email thread, I am trying to extract the body of the most recent email.

I used to do that with rules. Now I am testing Large Language Models (LLMs) to see if they provide a less ad hoc solution.

Mistral-7B-Instruct, for instance, seems to understand the task and provides acceptable outputs most of the time.

However, in some cases, it explains the email rather than just copying the relevant chunk verbatim.

I have tried dozens of prompts, for instance:

instruction = 'Given the email thread below the dotted line, extract verbatim the body of the most recent (top) message. Remove all headers, footers and disclaimers. In your response, do not add any text that was not present in the original message'

I also tried to prevent hallucinations with the following generation settings:

 generation_output = model.generate( model_inputs, do_sample=True, temperature=0.0000001, top_p=0.0000001, top_k=1, max_new_tokens=words )

However, in a few cases, the model still adds explanations and/or hallucinates a bit.

My questions are the following:

Are you aware of any models that could do a better job without fine-tuning? For instance, purely extractive models (as opposed to generative ones).
If generative models are the way to go, is there a way to force the model to just copy/paste?

Best,

Ed

If Mistral works properly most of the time, why not detect programmatically when it returns text not present in the email and, when it does, rerun with, e.g., a different temperature? That seems easier than finding an open-source model with near-perfect performance.
– MrMulliner, Commented Nov 24, 2023 at 11:55
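
A minimal sketch of the check-and-retry idea from the comment above, in Python. The `generate` callable is hypothetical: it stands in for whatever wraps the model call and is assumed to take the email text and a temperature and return the model's reply as a string. The temperature schedule is likewise an arbitrary assumption.

```python
def is_verbatim(output, source):
    """Return True if every non-empty output line occurs in the source.

    Whitespace is normalized on both sides, so differences in line
    wrapping or indentation do not count as hallucination. This does not
    check that the lines are contiguous or in their original order.
    """
    def norm(s):
        return " ".join(s.split())

    source_norm = norm(source)
    out_lines = [norm(line) for line in output.splitlines() if line.strip()]
    # An empty reply is treated as a failure, not as a valid extraction.
    return bool(out_lines) and all(line in source_norm for line in out_lines)


def extract_with_retry(generate, source, temperatures=(0.0, 0.3, 0.7)):
    """Call the (hypothetical) generate(source, temperature) until the
    reply passes the verbatim check; return None if no attempt passes."""
    for temperature in temperatures:
        candidate = generate(source, temperature)
        if is_verbatim(candidate, source):
            return candidate
    return None
```

A reply that paraphrases or explains the email fails the check because its lines are not substrings of the original, which triggers the next attempt. Returning `None` when every attempt fails leaves room to fall back to the old rule-based extraction.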

noe · Accepted Answer · 2023-11-24 11:18:48Z


You can prepend each line of the email with a line number and request the LLM to give you the initial and final line numbers of the most recent email, separated by "-". Then, you can parse the output and extract the lines from the original email.
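
The approach above could be sketched roughly as follows in Python. The function names are illustrative, not from any library, and the model call itself is left out: `reply` is assumed to be the raw string returned by the LLM after it was shown the numbered text and asked for a "start-end" range.

```python
import re


def number_lines(text):
    """Split the thread into lines and build the numbered version for the prompt."""
    lines = text.splitlines()
    numbered = "\n".join(f"{i}: {line}" for i, line in enumerate(lines, start=1))
    return lines, numbered


def parse_range(reply, n_lines):
    """Find the first 'start-end' pair in the reply; return None if it is
    missing or falls outside the thread."""
    match = re.search(r"(\d+)\s*-\s*(\d+)", reply)
    if match is None:
        return None
    start, end = int(match.group(1)), int(match.group(2))
    if not (1 <= start <= end <= n_lines):
        return None
    return start, end


def extract_body(text, reply):
    """Slice the original (unnumbered) lines using the range the model gave."""
    lines, _ = number_lines(text)
    line_range = parse_range(reply, len(lines))
    if line_range is None:
        return None
    start, end = line_range
    return "\n".join(lines[start - 1:end])
```

Because the returned text is sliced from the original email, it cannot contain anything the model invented; the model can only pick the wrong lines. The range validation guards against replies that contain no usable range at all.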


It sounds like a viable solution. I will give it a try. Thanks!
– mirix, Commented Nov 24, 2023 at 12:00

The idea, per se, works. If I add line numbers, the model provides numbers that are fully consistent with the identified email body (which is not 100% correct, but that is a different question). Thanks again!
– mirix, Commented Nov 24, 2023 at 15:21

Interestingly, not all Mistral-based models are able to provide line numbers. Mistral Instruct does, but Mistral OpenInstruct always seems to output text, entirely disregarding the prompt.
– mirix, Commented Nov 28, 2023 at 8:33
