Sed script crashing on big file
I have a shell script which is in essence a sed script with some checks. The goal of the script is to convert the header of a file from:
&FCI
NORB=280,
NELEC=78,
MS2=0,
UHF=.FALSE.,
ORBSYM=1,1,1,1,1,1,1,1,<...>
&END
1.48971678130072078261E+01 1 1 1 1
-1.91501428271686324756E+00 1 1 2 1
4.38796949990802698238E+00 1 1 2 2
to
&FCI NORB=280, NELEC=78, MS2=0, UHF=.FALSE.,
ORBSYM=1,1,1,1,1,1,1,1,<...>
ISYM=1,
/
1.48971678130072078261E+01 1 1 1 1
-1.91501428271686324756E+00 1 1 2 1
4.38796949990802698238E+00 1 1 2 2
This is the script:
#!/bin/bash
# $1 : FCIDUMP file to convert from "new format" to "old format"
if [ ${#} -ne 1 ]
then
    echo "Syntaxis: fcidump_new2old FCIDUMPFILE" 1>&2
    exit 1
fi
if egrep '&FCI ([a-zA-Z2 ]*=[0-9 ]*,){2,}' ${1} > /dev/null
then
    echo "The provided file is already in old FCIDUMP format." 1>&2
    exit 2
fi
sed '
1,20 {
 :a; N; $!ba
 s/\(=[^,]*,\)\n/\1 /g
 s/\(&FCI\)\n/\1 /
 s/ORBSYM/\n&/g
 s/&END/ISYM=1,\n\//
}' -i "${1}"
exit 0
This script works for "small" files, but now I encountered a file of approximately 9 GB, and the script crashes with the "super clear error message":
script.sh: line 24: 406089 Killed sed '
1,20 {
 :a; N; $!ba
 s/\(=[^,]*,\)\n/\1 /g
 s/\(&FCI\)\n/\1 /
 s/ORBSYM/\n&/g
 s/&END/ISYM=1,\n\//
}' -i "${1}"
How can I make this sed script really look only at the header, so that it can handle such big files? The ugly hardcoded "20" is there, by the way, because I do not know anything better.
Extra info:
After trying some things I saw that strange files were produced: sedexG4Lg, sedQ5olGZ, sedXVma1Y, sed21enyi, sednzenBn, sedqCeeey, sedzIWMUi. All were empty except sednzenBn, which was like the input file but only about half of it.
Discarding the -i flag and redirecting the output to another file gives an empty file.
command-line text-processing sed
It could be that your stack size isn't large enough to handle a file that size in sed. See: gnu.org/software/sed/manual/html_node/Limitations.html. To view your stack size run ulimit -s, or to see all limits run ulimit -a.
– Terrance
Dec 14 at 21:27
edited Dec 15 at 1:33
muru
asked Dec 14 at 20:43
Josja
2 Answers
General method
- You can split each file into a header and a second file with the data lines
- Then you can easily edit a header separately with your current sed command
- Finally you can concatenate the header and the file with the data lines.
Light-weight tools to manage huge files
- You can use head and tail to create a head file and a data file. You can use cat to concatenate the modified head file and the data file. See: Efficient way to print lines from a massive file using awk, sed, or something else?
- Another method is to use split.
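The bullet points above can be sketched end to end on a tiny stand-in file (all the /tmp file names and the sed edit here are placeholders, not the real conversion):

```shell
#!/bin/sh
set -e

# Tiny stand-in for an FCIDUMP file: 3 header lines, 1 data line
printf '&FCI\nNORB=2,\n&END\n1.0 1 1 1 1\n' > /tmp/demo.in

# 1. Locate the last line of the header
end=$(grep -m 1 -n '&END' /tmp/demo.in | cut -d: -f1)

# 2. Split: a small header file and a (potentially huge) data file
head -n "$end" /tmp/demo.in > /tmp/demo.header
tail -n +"$((end + 1))" /tmp/demo.in > /tmp/demo.data

# 3. Edit only the header (a stand-in edit), then concatenate
sed -i 's/&END/\//' /tmp/demo.header
cat /tmp/demo.header /tmp/demo.data > /tmp/demo.out
cat /tmp/demo.out
```

The point is that sed only ever sees the small header file; tail and cat stream the data part without loading it into memory.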
Test
I tested with your header and a file with 1080000000 numbered lines (size 19 GiB), 1080000007 lines in total, and it worked; the output file (with 1080000004 lines) was written in 5 minutes on my old HP xw8400 workstation (including typing the command to start the shellscript).
$ ls -lh --time-style=full-iso huge*
-rw-r--r-- 1 sudodus sudodus 19G 2018-12-15 19:50:45.278328120 +0100 huge.in
-rw-r--r-- 1 sudodus sudodus 19G 2018-12-15 19:55:46.808798456 +0100 huge.out
The big write operations were between the system partition on an SSD and a data partition on an HDD.
Shellscript
You need enough free space in the file system where you have /tmp for the huge temporary 'data' file: more than 9 GB, according to your original question.
$ LANG=C df -h /tmp
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 106G 32G 69G 32% /
This may seem an awkward way to do things, but it works for huge files without crashing the tools. You may have to store the temporary 'data' file somewhere else, for example on an external drive (but that will probably be slower).
#!/bin/bash
# $1 : FCIDUMP file to convert from "new format" to "old format"
if [ $# -ne 2 ]
then
    echo "Syntaxis: $0 fcidumpfile oldstylefile" 1>&2
    echo "Example:  $0 file.in file.out" 1>&2
    exit 1
fi
if [ "$1" == "$2" ]
then
    echo "The names of the input file and output file must differ"
    exit 2
fi
endheader="$(grep -m 1 -n '&END' "$1" | cut -d: -f1)"
if [ "$endheader" == "" ]
then
    echo "Bad input file: the end marker of the header was not found"
    exit 3
fi
#echo "endheader=$endheader"
< "$1" head -n "$endheader" > /tmp/header
#cat /tmp/header
if egrep '&FCI ([a-zA-Z2 ]*=[0-9 ]*,){2,}' /tmp/header > /dev/null
then
    echo "The provided file is already in old FCIDUMP format." 1>&2
    exit 4
fi
# run sed in-place on /tmp/header
sed '
{
 :a; N; $!ba
 s/\(=[^,]*,\)\n/\1 /g
 s/\(&FCI\)\n/\1 /
 s/ORBSYM/\n&/g
 s/&END/ISYM=1,\n\//
}' -i /tmp/header
if [ $? -ne 0 ]
then
    echo "Failed to convert the header format in /tmp/header"
    exit 5
fi
< "$1" tail -n +$(($endheader+1)) > /tmp/tailer
if [ $? -ne 0 ]
then
    echo "Failed to create the 'data' file /tmp/tailer"
    exit 6
fi
#echo "---"
#cat /tmp/tailer
#echo "---"
cat /tmp/header /tmp/tailer > "$2"
exit 0
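To check what the header conversion does in isolation, the sed program can be run on a sample header alone. This is a sketch: the sample header is shortened from the question, and the sed escapes are written out explicitly.

```shell
#!/bin/sh
set -e

# Sample "new format" header from the question (ORBSYM list shortened)
printf '&FCI\nNORB=280,\nNELEC=78,\nMS2=0,\nUHF=.FALSE.,\nORBSYM=1,1,1,1,\n&END\n' > /tmp/sample.header

# Slurp the header into one pattern space, join "key=value," lines onto
# the &FCI line, break before ORBSYM, and turn &END into "ISYM=1," + "/"
sed '
{
 :a; N; $!ba
 s/\(=[^,]*,\)\n/\1 /g
 s/\(&FCI\)\n/\1 /
 s/ORBSYM/\n&/g
 s/&END/ISYM=1,\n\//
}' -i /tmp/sample.header

cat /tmp/sample.header
```

The output should be the old-format header: the &FCI line with the key=value pairs joined, ORBSYM on its own line, then ISYM=1, and the closing /.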
Thanks for the answer. Concerning the option to use 'split', I also discovered 'csplit', which is more flexible than 'split'.
– Josja
Dec 16 at 21:56
sed is probably NOT the best tool for this; investigate perl. However, you could restate the problem as:
Extract the Old Header from the giant data file, into a file of its own.
Adjust the extracted Old Header, to make it the New Header.
Replace the Old Header with the New Header in the giant data file.
endheader="$(grep -m 1 -n '&END' "$1" | cut -d: -f1)"
head -n "$endheader" "$1" >/tmp/header
trap "/bin/rm -f /tmp/header" EXIT
# do the sed stuff to /tmp/header, I assume it does what you want
sed '
{
 :a; N; $!ba
 s/\(=[^,]*,\)\n/\1 /g
 s/\(&FCI\)\n/\1 /
 s/ORBSYM/\n&/g
 s/&END/ISYM=1,\n\//
}' -i /tmp/header
# Then combine the new header with the rest of the giant data file,
# using `ed` (see `man ed; info Ed`) and a here-document
ed "$1" <<EndOfEd
1,${endheader}d
0r /tmp/header
wq
EndOfEd
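The ed step can be rehearsed on a throwaway file first (the file names here are made up): delete the old header lines, read the new header in at line 0, then write and quit. Note that ed commands, unlike vi's, take no leading colon.

```shell
#!/bin/sh
set -e

# Throwaway file with a two-line "old header" plus data, and a new header
printf 'OLD1\nOLD2\nDATA\n' > /tmp/ed_demo.txt
printf 'NEWHEADER\n'        > /tmp/ed_demo.header

# 1,2d deletes the old header; 0r reads the new header in before line 1;
# wq writes the file back and quits. -s suppresses ed's byte-count chatter.
ed -s /tmp/ed_demo.txt <<'EndOfEd'
1,2d
0r /tmp/ed_demo.header
wq
EndOfEd

cat /tmp/ed_demo.txt
```

Quoting the here-document delimiter ('EndOfEd') keeps the shell from expanding anything inside; in the real script the delimiter is unquoted so that ${endheader} is substituted.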
edited Dec 15 at 19:21
answered Dec 14 at 22:07
sudodus
edited yesterday
answered Dec 14 at 22:00
waltinator