Revisions to How do I extract the first integer from text string in a column of a tab-delimited file?

added 293 characters in body

edited Apr 6, 2015 at 19:08

59.4k
10
123
246

<infile cut -f5 | tr -cs '0-9\n' \\t | expand -t1,2,4 | cut -d' ' -f-2

This means that in the output the first integer will be in either the first or second field - because the first field is now either empty (when the line is led by a <tab>) or your digit sequence, depending on whether it was prefixed as you note. So I expand the 1st and 2nd <tab>-stop positions on a line to a single space apiece, and the third to one or two spaces - which effectively pads out a list of space-delimited fields into having either an empty first field or an empty third field. From there I can just cut out the first two fields.

  2458   45   78

<infile cut -f5 | tr -cs '0-9\n' \\t |cut -f-2

This means that in the output the first integer will be in either the first or second field - because the first field is now either empty (when the line is led by a <tab>) or your digit sequence, depending on whether it was prefixed as you note. So I just cut those out, too.

  2458   45   78

<infile cut -f5 | tr -cs '0-9\n' \\t | expand -t1,2,4 | cut -d' ' -f-2

This means that in the output the first integer will be in either the first or second field - because the first field is now either empty (when the line is led by a <tab>) or your digit sequence, depending on whether it was prefixed as you note. So I expand the 1st and 2nd <tab>-stop positions on a line to a single space apiece, and the third to one or two spaces - which effectively pads out a list of space-delimited fields into having either an empty first field or an empty third field. From there I can just cut out the first two fields.

 2458 45 78

added 1413 characters in body

edited Apr 6, 2015 at 17:40

mikeserv

With sed you can substitute by occurrence - so you just ask for the fifth <\tab>-delimited ^[1] field and for any numbers within it by ruling out other possible matches:

(Beware though that the \t escape is not a standard one - especially in the [bracket-expression]. You should probably use a real <tab> in its place.)

<infile cut -f5 | tr -cs '0-9\n' \\t |cut -f-2

...which first cuts the fifth <tab>-delimited ^[2] field of data per line in full (to avoid issues which may be caused by multiple integers per field) and then translates into a single <tab> every squeezed (-s) sequence of characters complementary (-c) to the set of newlines (\n) and 0-9 standard digits ^[3].

This means that in the output the first integer will be in either the first or second field - because the first field is now either empty (when the line is led by a <tab>) or your digit sequence, depending on whether it was prefixed as you note. So I just cut those out, too.

 2458  45  78

...were my results for the example I used, because the values were all led by [cp]. and so all had leading <tab>s - those without would be staggered to the left. To additionally condense all results to a single line with each integer separated by a single space you can just append |xargs to the command and get instead:

2458 45 78

Notes

1. Beware that the \t escape is not a standard one where sed is concerned - and in the context of a [bracket-expression] character class it is arguably even explicitly contrary to the standard, as the backslash and t characters should each represent themselves there. I have used the escape here to more clearly demonstrate a readable intent - but you should probably use a literal <tab> in its place.

2. cut delimits on <tab> characters by default, and so in this case the common -d [delim-char] option is unnecessary - but I have added this note to explain why.

3. As is noted in the link, the POSIX standard requires that the [:digit:] character class include the 0123456789 characters in all locales, in that sorting order, and sorted ahead of any other inclusions in that class. Non-C locales may also include other localized numeral sets - which a GNU tr probably will not handle appropriately, as they are likely represented by multiple bytes. Matching only the standard numeral set is likely the least surprising result in most cases anyway, and so using [:digit:] is probably not advisable unless you definitely want to match characters in both the standard Arabic numeral set and some other locale-dependent set of numerals.

With sed you can substitute by occurrence - so you just ask for the fifth tab-delimited field and for any numbers within it by ruling out other possible matches:

(Beware though that the \t escape is not a standard one - especially in the [bracket-expression]. You should probably use a real <tab> in its place.)

<infile cut -f5 | tr -cs '0-9\n' \\t |cut -f2

...which first cuts the fifth <tab>-delimited (cut is <tab>-delimited by default, but that can be changed with -d) field of data per line in full (to avoid issues which may be caused by multiple integers per field) and then translates into a single <tab> every squeezed (-s) sequence of characters complementary (-c) to the set of newlines (\n) and standard digits (some other locale-specific numeral sets may also be included in the [:digit:] class but will not be in [0-9], which are standardized sequentially as [:digit:]s for all locales).

This means that in the output the first integer will be the second field - because there's a <tab> at the head of each line. So I just cut that out, too.

2458 45 78

With sed you can substitute by occurrence - so you just ask for the fifth <\tab>-delimited ^[1] field and for any numbers within it by ruling out other possible matches:

<infile cut -f5 | tr -cs '0-9\n' \\t |cut -f-2

...which first cuts the fifth <tab>-delimited ^[2] field of data per line in full (to avoid issues which may be caused by multiple integers per field) and then translates into a single <tab> every squeezed (-s) sequence of characters complementary (-c) to the set of newlines (\n) and 0-9 standard digits ^[3].

This means that in the output the first integer will be in either the first or second field - because the first field is now either empty (when the line is led by a <tab>) or your digit sequence, depending on whether it was prefixed as you note. So I just cut those out, too.

 2458  45  78

...were my results for the example I used, because the values were all led by [cp]. and so all had leading <tab>s - those without would be staggered to the left. To additionally condense all results to a single line with each integer separated by a single space you can just append |xargs to the command and get instead:

2458 45 78
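For anyone who wants to try the pipeline without the original file, here is a minimal sketch. The sample rows are made up (six tab-separated columns, with prefixed values in the fifth), and the variable names are only for illustration:

```shell
# Hypothetical sample data: six tab-separated columns per line.
sample=$(printf '1\t2\t3\t4\tcp.2458\t6\na\tb\tc\td\tp.45\tf\na1\tb2\tc3\td4\tc.78x9\tf6\n')

# Keep column 5, turn every run of non-digits into one tab,
# then keep only the first two tab-delimited fields.
first_ints=$(printf '%s\n' "$sample" | cut -f5 | tr -cs '0-9\n' '\t' | cut -f-2)

# xargs (running its default echo) joins the values on one line.
joined=$(printf '%s\n' "$first_ints" | xargs)
printf '%s\n' "$joined"
```

With this particular sample every value is prefixed, so each intermediate line starts with a tab and xargs drops that whitespace when it joins them.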

Notes

1. Beware that the \t escape is not a standard one where sed is concerned - and in the context of a [bracket-expression] character class it is arguably even explicitly contrary to the standard, as the backslash and t characters should each represent themselves there. I have used the escape here to more clearly demonstrate a readable intent - but you should probably use a literal <tab> in its place.

2. cut delimits on <tab> characters by default, and so in this case the common -d [delim-char] option is unnecessary - but I have added this note to explain why.

3. As is noted in the link, the POSIX standard requires that the [:digit:] character class include the 0123456789 characters in all locales, in that sorting order, and sorted ahead of any other inclusions in that class. Non-C locales may also include other localized numeral sets - which a GNU tr probably will not handle appropriately, as they are likely represented by multiple bytes. Matching only the standard numeral set is likely the least surprising result in most cases anyway, and so using [:digit:] is probably not advisable unless you definitely want to match characters in both the standard Arabic numeral set and some other locale-dependent set of numerals.
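One way to follow the advice about a literal <tab> is to keep the character in a shell variable and splice it into the sed expression. This is only a sketch: the sample rows and variable names are invented for the demonstration, and it assumes every field is non-empty.

```shell
# A literal tab held in a variable, so the regex needs no \t escape.
tab=$(printf '\t')

# Hypothetical sample rows with six tab-separated columns.
rows=$(printf '1\t2\t3\t4\tcp.2458x\t6\na1\tb2\tc3\td4\tc.78y9\tf6\n')

# Replace the 5th match (the 5th field) with the first digit run inside it.
result=$(printf '%s\n' "$rows" |
  sed "s/[^${tab}0-9]*\([0-9]*\)[^${tab}]*/\1/5")
printf '%s\n' "$result"
```

The double quotes let the shell expand ${tab}, while the \( \) and \1 sequences pass through to sed unchanged.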

added 31 characters in body

edited Apr 6, 2015 at 16:38

mikeserv

With sed you can substitute by occurrence - so you just ask for the fifth tab-delimited field and for any numbers within it by ruling out other possible matches:

sed 's/[^\t0-9]*\([0-9]*\)[^\t]*/\1/5' <infile

(Beware though that the \t escape is not a standard one - especially in the [bracket-expression]. You should probably use a real <tab> in its place.)

After doing a copy to my clipboard of the other examples here I did:

xsel -bo | unexpand -a | sed ...

...to unexpand all (-a) <tab>-sized space sequences into an actual <tab>. And it printed...

1 2 3 4 2458 6 a b c d 45 a1 b2 c3 d4 78 f6

...which just isolates the first integer in the 5th column. I'm not sure if that's what you want, though. If you just want the first integer from the fifth column on a line all its own, that's far easier (and much faster).

<infile cut -f5 | tr -cs '0-9\n' \\t |cut -f2

...which first cuts the fifth <tab>-delimited (cut is <tab>-delimited by default, but that can be changed with -d) field of data per line in full (to avoid issues which may be caused by multiple integers per field) and then translates into a single <tab> every squeezed (-s) sequence of characters complementary (-c) to the set of newlines (\n) and standard digits (some other locale-specific numeral sets may also be included in the [:digit:] class but will not be in [0-9], which are standardized sequentially as [:digit:]s for all locales).

This means that in the output the first integer will be the second field - because there's a <tab> at the head of each line. So I just cut that out, too.

2458 45 78

With sed you can substitute by occurrence - so you just ask for the fifth tab-delimited field and for any numbers within it by ruling out other possible matches:

sed 's/[^\t0-9]*\([0-9]*\)[^\t]*/\1/5' <infile

(Beware though that the \t escape is not a standard one - especially in the [bracket-expression]. You should probably use a real <tab> in its place.)

After doing a copy to my clipboard of the other examples here I did:

xsel -bo | unexpand -a | sed ...

...to make sure I was working with tabs. And it printed...

1 2 3 4 2458 6 a b c d 45 a1 b2 c3 d4 78 f6

...which just isolates the first integer in the 5th column. I'm not sure if that's what you want, though. If you just want the first integer from the fifth column on a line all its own, that's far easier (and much faster).

<infile cut -f5 | tr -cs '0-9\n' \\t |cut -f2

...which first cuts the fifth <tab>-delimited (cut is <tab>-delimited by default, but that can be changed with -d) field of data per line in full (to avoid issues which may be caused by multiple integers per field) and then translates into a single <tab> every squeezed (-s) sequence of characters complementary (-c) to the set of newlines (\n) and standard digits ([0-9] are required to be recognized as [:digit:]s in all locales - some other locale-specific numeral sets may or may not also be included in the [:digit:] class but will not be in [0-9]).

This means that in the output the first integer will be the second field - because there's a <tab> at the head of each line. So I just cut that out, too.

2458 45 78

With sed you can substitute by occurrence - so you just ask for the fifth tab-delimited field and for any numbers within it by ruling out other possible matches:

sed 's/[^\t0-9]*\([0-9]*\)[^\t]*/\1/5' <infile

(Beware though that the \t escape is not a standard one - especially in the [bracket-expression]. You should probably use a real <tab> in its place.)

After doing a copy to my clipboard of the other examples here I did:

xsel -bo | unexpand -a | sed ...

...to unexpand all (-a) <tab>-sized space sequences into an actual <tab>. And it printed...

1 2 3 4 2458 6 a b c d 45 a1 b2 c3 d4 78 f6

...which just isolates the first integer in the 5th column. I'm not sure if that's what you want, though. If you just want the first integer from the fifth column on a line all its own, that's far easier (and much faster).

<infile cut -f5 | tr -cs '0-9\n' \\t |cut -f2

...which first cuts the fifth <tab>-delimited (cut is <tab>-delimited by default, but that can be changed with -d) field of data per line in full (to avoid issues which may be caused by multiple integers per field) and then translates into a single <tab> every squeezed (-s) sequence of characters complementary (-c) to the set of newlines (\n) and standard digits (some other locale-specific numeral sets may also be included in the [:digit:] class but will not be in [0-9], which are standardized sequentially as [:digit:]s for all locales).

This means that in the output the first integer will be the second field - because there's a <tab> at the head of each line. So I just cut that out, too.
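To see why the integer ends up in the second field, it can help to draw the tabs as a visible character. A small sketch, assuming made-up fifth-column values that all carry a non-digit prefix (the variable names are illustrative only):

```shell
# Hypothetical column-5 values, each with a non-digit prefix.
col5=$(printf 'cp.2458\np.45\nc.78x9\n')

# Show the intermediate stream with each tab drawn as '|'.
visible=$(printf '%s\n' "$col5" | tr -cs '0-9\n' '\t' | tr '\t' '|')
printf '%s\n' "$visible"

# Field 1 is empty on every line, so the first integer is field 2.
second=$(printf '%s\n' "$col5" | tr -cs '0-9\n' '\t' | cut -f2)
printf '%s\n' "$second"
```

A value with no prefix would have its integer in field 1 instead, which is the case this form of the command does not cover.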