Revisions to How do I extract the first integer from text string in a column of a tab-delimited file?

broken link fixed

edit approved May 26, 2019 at 21:59

I work in medical genetics and often have delimited text files in which one column (e.g. column 5) contains a text string describing a "mutation" in our jargon:
c.2458C>T or c.45_46delAA or c.749_754delinsTG

Similarly, in another file it might read:
p.Glu34* or p.Ala78_Arg80del or p.L378Ffs*11

The c. and p. are supposed to be there but might be omitted. There could be any number of non-numeric characters. The numbers are always integers and usually 1-14 or so digits long.

I want to add a new column somewhere in my file, which has only the first integer, like 2458 or 45 or 749 in the first example. Then I want to use this integer as a key value for looking up several values in a lookup table.

Some of my files have 70,000 lines so manual editing is not possible...

The more basic the solution the better. Can it be done with bash, sed, or awk?

An example table would be (as interpreted correctly below):

1 2 3 4 c.2458C>T
a b c d c.45_46delAA
a1 b2 c3 d4 p.Ala78_Arg80del

(Note: the columns are tab-delimited, not space-delimited)

There is a specification to this format by the Human Genome Variation Society. No program uses this format (I hope!) but people use it in publications and medical reports. Newer formats, like the Variant Call Format, have been introduced, which are far more parsable.
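To make the task concrete, here is a minimal sketch of what is being asked for, not a definitive solution. It assumes a POSIX awk, tab-delimited input on standard input, and that the new column should be appended at the end of each line. The function name `add_first_int` and its column-number argument are hypothetical, introduced only for this illustration.

```shell
# Hypothetical helper: read tab-delimited lines on stdin and append, as a
# new last column, the first run of digits found in column number $1.
# The new column is left empty when that field contains no digits.
add_first_int() {
  awk -F '\t' -v OFS='\t' -v col="$1" '
    {
      key = ""
      # match() sets RSTART and RLENGTH for the leftmost run of digits
      if (match($col, /[0-9]+/))
        key = substr($col, RSTART, RLENGTH)
      print $0, key
    }'
}

# Usage on the example table (columns separated by real tabs):
printf '1\t2\t3\t4\tc.2458C>T\na\tb\tc\td\tc.45_46delAA\na1\tb2\tc3\td4\tp.Ala78_Arg80del\n' |
  add_first_int 5
```

Only the chosen column is searched, so digits in other columns (such as the `a1 b2 c3 d4` row) do not affect the extracted key.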

added 6 characters in body; edited tags

edited Apr 7, 2015 at 13:29

Gilles 'SO- stop being evil'

added 523 characters in body

edited Apr 7, 2015 at 2:45

minnimalist

deleted 4 characters in body; edited tags

edited Apr 5, 2015 at 23:09

Gilles 'SO- stop being evil'

asked Apr 5, 2015 at 22:57

minnimalist
