You can improve on the solution suggested by @pskocik by reducing the number of calls to read. Reading a 1 GB file in BUFSIZ-sized chunks means a lot of read calls. The usual way to cut that down is to increase the buffer size:

  • just for fun, try increasing the buffer size by a factor of 10, or 100. On my Debian 7 machine, BUFSIZ is 8192, so the original program performs about 120 thousand read operations. You can probably afford a 1 MB input buffer, which cuts that by a factor of roughly 100 (see the sketch after this list).
  • for a better result, an application can allocate a buffer as large as the file, so a single read suffices. That works well enough for "small" files (though some readers will have more than 1 GB of memory on their machine).
  • finally, you could experiment with memory-mapped I/O, which avoids explicit buffer management altogether (a sketch follows further below).
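
As a rough illustration of the larger-buffer approach, here is a minimal read loop using a 1 MB buffer. The buffer size and the newline-counting workload are assumptions for the example, not the original program:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>

    #define MYBUFSIZ (1024 * 1024)  /* 1 MB instead of BUFSIZ (8192 here) */

    int main(int argc, char **argv)
    {
        int fd = (argc > 1) ? open(argv[1], O_RDONLY) : STDIN_FILENO;
        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        char *buffer = malloc(MYBUFSIZ);
        if (buffer == NULL) {
            perror("malloc");
            return EXIT_FAILURE;
        }

        /* each read() now fetches up to 1 MB, so a 1 GB file needs
         * roughly a thousand calls instead of ~120 thousand */
        long lines = 0;
        ssize_t got;
        while ((got = read(fd, buffer, MYBUFSIZ)) > 0) {
            /* illustrative workload: count newlines; the real program's
             * per-chunk work may differ */
            for (ssize_t i = 0; i < got; ++i) {
                if (buffer[i] == '\n')
                    ++lines;
            }
        }

        printf("%ld\n", lines);
        free(buffer);
        return EXIT_SUCCESS;
    }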

When benchmarking the various approaches, you might keep in mind that some systems (such as Linux) use most of your machine's unused memory as a disk cache. A while back (almost 20 years ago, mentioned in the vile FAQ), I was puzzled by unexpectedly good results from a (not very good) paging algorithm which I had developed to handle low-memory conditions in a text editor. It was explained to me that it ran fast because the program was working from the memory buffers used to read the file, and that only if the file were re-read or written would there be a difference in speed.

The same applies to mmap: in another case (still on my to-do list to incorporate into an FAQ), a developer reported very good results in a scenario where the disk cache was the actual reason for the improvement. Developing benchmarks takes time and care to analyze the reasons for good (or bad) performance.
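
For completeness, a minimal sketch of the memory-mapped variant; again, counting newlines stands in for whatever the real program does:

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return EXIT_FAILURE;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        struct stat sb;
        if (fstat(fd, &sb) != 0) {
            perror("fstat");
            return EXIT_FAILURE;
        }

        /* map the whole file; the kernel pages it in as needed,
         * so there is no explicit buffer to allocate or size */
        char *data = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) {
            perror("mmap");
            return EXIT_FAILURE;
        }

        /* illustrative workload: count newlines */
        long lines = 0;
        for (off_t i = 0; i < sb.st_size; ++i) {
            if (data[i] == '\n')
                ++lines;
        }

        printf("%ld\n", lines);
        munmap(data, sb.st_size);
        close(fd);
        return EXIT_SUCCESS;
    }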

Further reading:
