Sed script crashing on big file












I have a shell script which is, in essence, a sed script with some checks. The goal of the script is to convert the header of a file from:



&FCI
NORB=280,
NELEC=78,
MS2=0,
UHF=.FALSE.,
ORBSYM=1,1,1,1,1,1,1,1,<...>
&END
1.48971678130072078261E+01 1 1 1 1
-1.91501428271686324756E+00 1 1 2 1
4.38796949990802698238E+00 1 1 2 2


to



&FCI NORB=280, NELEC=78, MS2=0, UHF=.FALSE., 
ORBSYM=1,1,1,1,1,1,1,1,<...>
ISYM=1,
/
1.48971678130072078261E+01 1 1 1 1
-1.91501428271686324756E+00 1 1 2 1
4.38796949990802698238E+00 1 1 2 2


This is the script:



#!/bin/bash

# $1 : FCIDUMP file to convert from "new format" to "old format"

if [ ${#} -ne 1 ]
then
echo "Syntaxis: fcidump_new2old FCIDUMPFILE" 1>&2
exit 1
fi

if egrep '&FCI ([a-zA-Z2 ]*=[0-9 ]*,){2,}' "${1}" > /dev/null
then
echo "The provided file is already in old FCIDUMP format." 1>&2
exit 2
fi

sed '
1,20 {
:a; N; $!ba
s/\(=[^,]*,\)\n/\1 /g
s/\(&FCI\)\n/\1 /
s/ORBSYM/\n&/g
s/&END/ISYM=1,\n\//
}' -i "${1}"

exit 0


This script works for "small" files, but now I have encountered a file of approximately 9 GB and the script crashes with the "super clear" error message:



script.sh: line 24: 406089 Killed                  sed '
1,20 {
:a; N; $!ba
s/\(=[^,]*,\)\n/\1 /g
s/\(&FCI\)\n/\1 /
s/ORBSYM/\n&/g
s/&END/ISYM=1,\n\//
}' -i "${1}"


How can I make this sed script really look only at the header, so that it can handle such big files? The ugly hardcoded "20" is there, by the way, because I do not know of anything better.
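For context: the ':a; N; $!ba' loop appends every remaining input line to the pattern space before any substitution runs, so sed ends up trying to hold the whole 9 GB file in memory; the 1,20 address does not prevent this, because the N/branch cycle keeps consuming lines past that range. A minimal sketch of extracting only the header without slurping the file, assuming GNU sed and that &END marks the end of the header (the name "file" is a placeholder):

sed '/&END/q' file > header   # print up to and including the first &END line, then quit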



Extra info:




  • After trying some things, I saw that strange files had been produced: sedexG4Lg, sedQ5olGZ, sedXVma1Y, sed21enyi, sednzenBn, sedqCeeey, sedzIWMUi. All were empty except sednzenBn, which looked like the input file but only about half of it.


  • Discarding the -i flag and redirecting the output to another file gives an empty file.











Tags: command-line text-processing sed

asked Dec 14 at 20:43 by Josja, edited Dec 15 at 1:33 by muru

Comments:

  • It could be that your stack size isn't large enough to handle a file that size in sed. See gnu.org/software/sed/manual/html_node/Limitations.html. To view your stack size run ulimit -s, or to see all limits run ulimit -a. – Terrance, Dec 14 at 21:27


2 Answers

Answer by sudodus, score 4 (answered Dec 14 at 22:07, edited Dec 15 at 19:21):

General method




  • You can split the file into a header file and a second file with the data lines.

  • Then you can easily edit the header separately with your current sed command.

  • Finally you can concatenate the modified header and the file with the data lines.


Light-weight tools to manage huge files




  • You can use head and tail to create a head file and a data file.

  • You can use cat to concatenate the modified head file and the data file (see the sketch after this list).


  • Efficient way to print lines from a massive file using awk, sed, or something else?


  • Another method is to use split.
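
A minimal sketch of this split-edit-concatenate idea, assuming (as in the question) that the first line containing &END is the last header line; the file names here are placeholders:

n=$(grep -m 1 -n '&END' FCIDUMP | cut -d: -f1)  # line number where the header ends
head -n "$n" FCIDUMP > header                   # small header file: safe to edit with sed
tail -n +"$((n + 1))" FCIDUMP > data            # huge data part, streamed straight to disk
sed -i 's/&END/ISYM=1,\n\//' header             # edit only the small header file
cat header data > FCIDUMP.old                   # reassemble the converted file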



Test





  • I tested with your header and a file with 1080000000 numbered lines (size 19 GiB), 1080000007 lines in total, and it worked: the output file (with 1080000004 lines) was written in 5 minutes on my old HP xw8400 workstation (including the time to type the command that starts the shellscript).



    $ ls -lh --time-style=full-iso huge*
    -rw-r--r-- 1 sudodus sudodus 19G 2018-12-15 19:50:45.278328120 +0100 huge.in
    -rw-r--r-- 1 sudodus sudodus 19G 2018-12-15 19:55:46.808798456 +0100 huge.out


  • The big write operations were between the system partition on an SSD and a data partition on an HDD.



Shellscript



You need enough free space in the file system where you have /tmp for the huge temporary 'data' file, more than 9 GB according to your original question.



$ LANG=C df -h /tmp
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 106G 32G 69G 32% /


This may seem an awkward way to do things, but it works for huge files without crashing the tools. You may need to store the temporary 'data' file somewhere else, for example on an external drive (but it will probably be slower).



#!/bin/bash

# $1 : FCIDUMP file to convert from "new format" to "old format"

if [ $# -ne 2 ]
then
echo "Syntaxis: $0 fcidumpfile oldstylefile " 1>&2
echo "Example: $0 file.in file.out" 1>&2
exit 1
fi

if [ "$1" == "$2" ]
then
echo "The names of the input file and output file must differ"
exit 2
fi

endheader="$(grep -m 1 -n '&END' "$1" | cut -d: -f1)"
if [ "$endheader" == "" ]
then
echo "Bad input file: the end marker of the header was not found"
exit 3
fi
#echo "endheader=$endheader"

< "$1" head -n "$endheader" > /tmp/header
#cat /tmp/header

if egrep '&FCI ([a-zA-Z2 ]*=[0-9 ]*,){2,}' /tmp/header > /dev/null
then
echo "The provided file is already in old FCIDUMP format." 1>&2
exit 4
fi

# run sed inline on /tmp/header
sed '
{
:a; N; $!ba
s/\(=[^,]*,\)\n/\1 /g
s/\(&FCI\)\n/\1 /
s/ORBSYM/\n&/g
s/&END/ISYM=1,\n\//
}' -i /tmp/header

if [ $? -ne 0 ]
then
echo "Failed to convert the header format in /tmp/header"
exit 5
fi

< "$1" tail -n +$(($endheader+1)) > /tmp/tailer

if [ $? -ne 0 ]
then
echo "Failed to create the 'data' file /tmp/tailer"
exit 6
fi

#echo "---"
#cat /tmp/tailer
#echo "---"

cat /tmp/header /tmp/tailer > "$2"

exit 0
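
A hypothetical invocation of the script above, assuming it is saved as fcidump_new2old_big (both file names are placeholders):

$ bash fcidump_new2old_big FCIDUMP.new FCIDUMP.old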





Comments:

  • Thanks for the answer. Concerning the option to use 'split' I also discovered 'csplit', which is more flexible than 'split'. – Josja, Dec 16 at 21:56



















Answer by waltinator, score 0 (answered Dec 14 at 22:00):

sed is probably NOT the best tool for this; investigate perl. However, you could restate the problem as:




  1. Extract the Old Header from the giant data file, into a file of its own.


  2. Adjust the extracted Old Header, to make it the New Header.



  3. Replace the Old Header with the New Header in the giant data file.



    endheader="$(grep -m 1 -n '&END' "$1" | cut -d: -f1)"
    head -n "$endheader" "$1" >/tmp/header
    trap "/bin/rm -f /tmp/header" EXIT
    # do the sed stuff to /tmp/header, I assume it does what you want
    sed '
    {
    :a; N; $!ba
    s/\(=[^,]*,\)\n/\1 /g
    s/\(&FCI\)\n/\1 /
    s/ORBSYM/\n&/g
    s/&END/ISYM=1,\n\//
    }' -i /tmp/header

    # Then combine the new header with the rest of the giant data file,
    # using `ed` (see `man ed; info ed`) and a here-document
    ed "$1" <<EndOfEd
    1,${endheader}d
    0r /tmp/header
    wq
    EndOfEd






