Here's a Python script that does what you want:
```python
#!/usr/bin/env python
# -*- coding: ascii -*-
"""filter.py"""
import sys

# Get the file and the maximum line length as command-line arguments
filepath = sys.argv[1]
maxlen = int(sys.argv[2])

# Collect the lines that pass the filter
lines = []

# Read the data file line by line
with open(filepath, 'r') as jsonfile:
    for line in jsonfile:
        # Only consider non-blank lines
        if line.strip():
            # For "text" lines that are too long, remove the previous
            # line (the opening "{") and skip the next two lines
            # (the "author" line and the closing "},")
            if '"text"' in line and len(line) > maxlen:
                lines.pop()
                next(jsonfile)
                next(jsonfile)
            # Keep all other lines
            else:
                lines.append(line)

# Strip the trailing comma from the last remaining object, in case
# the final object was the one removed
lines[-2] = lines[-2].rstrip().rstrip(',') + '\n'

# Output the filtered lines
for line in lines:
    sys.stdout.write(line)
```
You could run it like this:
```sh
python filter.py data.json 34
```
Suppose you had the following data file:
```json
[
  {
    "text": "blah blah blah one",
    "author": "John Doe"
  },
  {
    "text": "blah blah blah two",
    "author": "John Doe"
  },
  {
    "text": "blah blah blah three",
    "author": "John Doe"
  }
]
```
Then running the script as described would produce the following output:
```json
[
  {
    "text": "blah blah blah one",
    "author": "John Doe"
  },
  {
    "text": "blah blah blah two",
    "author": "John Doe"
  }
]
```
Note that this line-by-line approach depends on the JSON being formatted exactly as shown above; for anything less rigidly formatted, a JSON-aware tool such as jq would be a more robust choice.
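Staying in Python, here is a minimal sketch of a JSON-aware variant using the standard `json` module. It does not depend on the file's formatting at all, but note it filters on the length of the `"text"` value itself rather than the raw line length, so the threshold means something slightly different (the script name and function name here are illustrative):

```python
#!/usr/bin/env python
"""filter_json.py -- a JSON-aware sketch of the same filtering idea."""
import json
import sys


def filter_records(records, maxlen):
    """Keep only the objects whose "text" value fits within maxlen."""
    return [rec for rec in records if len(rec["text"]) <= maxlen]


if __name__ == "__main__":
    # Usage: python filter_json.py data.json 20
    with open(sys.argv[1]) as f:
        records = json.load(f)
    json.dump(filter_records(records, int(sys.argv[2])), sys.stdout, indent=2)
    sys.stdout.write("\n")
```

Because it parses the JSON properly, it handles compact or oddly indented input, commas inside string values, and the trailing-comma cleanup for free.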