I work in medical genetics and often have delimited text files in which one column (e.g. column 5) contains a text string describing a "mutation" in our jargon:
c.2458C>T or c.45_46delAA or c.749_754delinsTG

Similarly, in another file it might read:
p.Glu34* or p.Ala78_Arg80del or p.L378Ffs*11

The c. and p. are supposed to be there but might be omitted. There could be any number of non-numeric characters. The numbers are always integers and usually 1-14 or so digits long.

I want to add a new column somewhere in my file, which has only the first integer, like 2458 or 45 or 749 in the first example. Then I want to use this integer as a key value for looking up several values in a lookup table.

Some of my files have 70,000 lines so manual editing is not possible...

The more basic the solution the better. Can it be done with bash, sed, or awk?

An example table would be (as interpreted correctly below):

1 2 3 4 c.2458C>T
a b c d c.45_46delAA
a1 b2 c3 d4 p.Ala78_Arg80del

(Note: the columns are tab-delimited, not space-delimited)
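
To make the goal concrete, here is a minimal awk sketch of the extraction step only (not the lookup). It assumes the mutation string sits in column 5 of a tab-delimited file and that the new column may simply be appended at the end. The function name `add_first_int` is hypothetical and used only for illustration.

```shell
# Hypothetical helper: read tab-delimited lines on stdin and append, as a new
# last column, the first run of digits found in the given column.
# $1 = number of the column holding the mutation string (assumed to be 5 here).
add_first_int() {
    awk -F'\t' -v OFS='\t' -v col="$1" '{
        key = ""                                   # stays empty if no digits are found
        if (match($col, /[0-9]+/))                 # locate the first run of digits
            key = substr($col, RSTART, RLENGTH)    # keep it as a string, not a number
        print $0, key
    }'
}

# Usage on the example table above (tabs written as \t):
printf '1\t2\t3\t4\tc.2458C>T\na\tb\tc\td\tc.45_46delAA\na1\tb2\tc3\td4\tp.Ala78_Arg80del\n' |
    add_first_int 5
```

Keeping the match as a string rather than converting it to a number should avoid precision loss with very long integers. Rows without any digits in that column get an empty key.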

There is a specification for this format by the Human Genome Variation Society. No program uses this format (I hope!) but people use it in publications and medical reports. Newer formats, like the Variant Call Format, have been introduced, which are far more parsable.
